# Buffers and dataflow

This page describes the dataflow in Perfetto when recording traces. It describes
all the buffering stages, explains how to size the buffers and how to debug
data losses.

## Concepts

Tracing in Perfetto is an asynchronous, multiple-writer, single-reader pipeline.
In many ways its architecture is similar to the command buffers of modern GPUs.

The design principles of the tracing dataflow are:

* The tracing fastpath is based on direct writes into a shared memory buffer.
* Highly optimized for low-overhead writing. NOT optimized for low-latency
  reading.
* Trace data is eventually committed in the central trace buffer by the end
  of the trace or when explicit flush requests are issued via the IPC channel.
* Producers are untrusted and should not be able to see each other's trace data,
  as that would leak sensitive information.

In the general case, there are two types of buffers involved in a trace. When
pulling data from the Linux kernel's ftrace infrastructure, a third stage of
buffering (one buffer per CPU) is involved:

![Buffers](/docs/images/buffers.png)

#### Tracing service's central buffers

These buffers (yellow, in the picture above) are defined by the user in the
`buffers` section of the [trace config](config.md). In the simplest case,
one tracing session = one buffer, regardless of the number of data sources and
producers.

This is the place where the tracing data is ultimately kept, while in memory,
whether it comes from the kernel ftrace infrastructure, from some other data
source in `traced_probes` or from another userspace process using the
[Perfetto SDK](/docs/instrumentation/tracing-sdk.md).
At the end of the trace (or during, if in [streaming mode]) these buffers are
written into the output trace file.

These buffers can contain a mixture of trace packets coming from different data
sources and even different producer processes. What goes where is defined in the
[buffers mapping section](config.md#dynamic-buffer-mapping) of the trace config.
Because of this, the central buffers are never shared directly with producer
processes, to avoid cross-talk and information leaks across producers.
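For illustration, here is a minimal sketch of how such a mapping can be
expressed when building the config programmatically with the Perfetto Client
Library (the data source names, buffer sizes and the helper function are just
examples, not a canonical config):

```c++
#include "perfetto.h"  // Amalgamated Perfetto SDK header.

// Sketch: two central buffers, with each data source routed to one of them via
// DataSourceConfig.target_buffer (indices follow the order of add_buffers()).
perfetto::TraceConfig BuildExampleConfig() {
  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(16 * 1024);  // Buffer 0: high-bandwidth data.
  cfg.add_buffers()->set_size_kb(1 * 1024);   // Buffer 1: infrequent data.

  auto* ftrace_cfg = cfg.add_data_sources()->mutable_config();
  ftrace_cfg->set_name("linux.ftrace");
  ftrace_cfg->set_target_buffer(0);

  auto* track_event_cfg = cfg.add_data_sources()->mutable_config();
  track_event_cfg->set_name("track_event");
  track_event_cfg->set_target_buffer(1);
  return cfg;
}
```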

#### Shared memory buffers

Each producer process has one memory buffer shared 1:1 with the tracing service
(blue, in the picture above), regardless of the number of data sources it hosts.
This buffer is a temporary staging buffer and has two purposes:

1. Zero-copy on the writer path. This buffer allows direct serialization of the
   tracing data from the writer fastpath into a memory region directly readable
   by the tracing service.

2. Decoupling writes from reads of the tracing service. The tracing service has
   the job of moving trace packets from the shared memory buffer (blue) into the
   central buffer (yellow) as fast as it can.
   The shared memory buffer hides the scheduling and response latencies of the
   tracing service, allowing the producer to keep writing without losing data
   when the tracing service is temporarily blocked.

#### Ftrace buffer

When the `linux.ftrace` data source is enabled, the kernel will have its own
per-CPU buffers. These are unavoidable because the kernel cannot write directly
into user-space buffers. The `traced_probes` process periodically reads those
buffers, converts the data into binary protos and follows the same dataflow as
userspace tracing. These buffers need to be just large enough to hold the data
produced between two ftrace read cycles
(`TraceConfig.FtraceConfig.drain_period_ms`).
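For illustration, a hedged sketch of tuning these two knobs when building the
config with the Client Library, assuming the generated
`perfetto::protos::gen::FtraceConfig` bindings are available in your build (the
event list and sizes are arbitrary examples):

```c++
#include "perfetto.h"  // Amalgamated Perfetto SDK header.

// Sketch: shrink the drain period and enlarge the per-CPU kernel buffer.
void AddFtraceDataSource(perfetto::TraceConfig& cfg) {
  perfetto::protos::gen::FtraceConfig ftrace_cfg;
  ftrace_cfg.add_ftrace_events("sched/sched_switch");
  ftrace_cfg.set_buffer_size_kb(2 * 1024);  // Per-CPU kernel buffer size.
  ftrace_cfg.set_drain_period_ms(250);      // How often traced_probes reads it.

  auto* ds_cfg = cfg.add_data_sources()->mutable_config();
  ds_cfg->set_name("linux.ftrace");
  ds_cfg->set_ftrace_config_raw(ftrace_cfg.SerializeAsString());
}
```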

## Life of a trace packet

Here is a summary of the dataflow of trace packets across buffers.
Consider the case of a producer process hosting two data sources writing packets
at different rates, both targeting the same central buffer.

1. When each data source starts writing, it will grab a free page of the shared
   memory buffer and directly serialize proto-encoded tracing data onto it.

2. When a page of the shared memory buffer is filled, the producer will send an
   async IPC to the service, asking it to copy the shared memory page just
   written. Then, the producer will grab the next free page in the shared memory
   buffer and keep writing.

3. When the service receives the IPC, it copies the shared memory page into
   the central buffer and marks the shared memory buffer page as free again. Data
   sources within the producer are able to reuse that page at this point.

4. When the tracing session ends, the service sends a `Flush` request to all
   data sources. In reaction to this, data sources commit all outstanding
   shared memory pages, even if not completely full. The service copies these
   pages into the central buffer.

![Dataflow animation](/docs/images/dataflow.svg)
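To make the flow above concrete, here is a hedged consumer-side sketch using
the Perfetto Client Library (it assumes `perfetto::Tracing::Initialize()` has
already been called and that a `TraceConfig` is available): stopping the
session triggers the final flush of step 4, after which the central buffers are
read back and written to a file.

```c++
#include <fstream>
#include <vector>

#include "perfetto.h"  // Amalgamated Perfetto SDK header.

void RecordTrace(const perfetto::TraceConfig& cfg) {
  auto session = perfetto::Tracing::NewTrace();
  session->Setup(cfg);
  session->StartBlocking();

  // ... data sources serialize packets into their shared memory buffers and
  // the service copies filled pages into the central buffers (steps 1-3) ...

  // Stopping implies a final Flush request to all data sources (step 4).
  session->StopBlocking();

  // Read the central buffers back and save them as a trace file.
  std::vector<char> trace_data(session->ReadTraceBlocking());
  std::ofstream("example.perfetto-trace", std::ios::binary)
      .write(trace_data.data(), trace_data.size());
}
```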

## Buffer sizing

#### Central buffer sizing

The math for sizing the central buffer is quite straightforward: in the default
case of tracing without `write_into_file` (when the trace file is written only
at the end of the trace), the buffer will end up holding all the data written
by the various data sources, up to its capacity.

The time span covered by the trace will be `(buffer size) / (aggregated write rate)`.
If all producers write at a combined rate of 2 MB/s, a 16 MB buffer will hold
~8 seconds of tracing data.

The write rate depends heavily on the data sources enabled and on the activity
of the system. 1-2 MB/s is a typical figure for Android traces with scheduler
tracing, but it can easily grow by one or more orders of magnitude if chattier
data sources are enabled (e.g., syscall or pagefault tracing).

When using [streaming mode] the buffer needs to be able to hold the data
produced between two consecutive `file_write_period_ms` periods (default: 5s).
For instance, if `file_write_period_ms = 5000` and the write rate is 2 MB/s,
the central buffer needs to be at least 5 * 2 = 10 MB to avoid data losses.
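The same arithmetic, written out as a plain back-of-the-envelope check (no
Perfetto API involved; the rates are the example figures used above):

```c++
#include <cstdio>

int main() {
  const double write_rate_mb_per_sec = 2.0;  // Aggregated write rate.

  // Non-streaming case: how many seconds a 16 MB central buffer can retain.
  const double buffer_size_mb = 16.0;
  std::printf("retained: ~%.0f s\n", buffer_size_mb / write_rate_mb_per_sec);

  // Streaming case: minimum buffer size for file_write_period_ms = 5000.
  const double file_write_period_sec = 5.0;
  std::printf("min streaming buffer: %.0f MB\n",
              write_rate_mb_per_sec * file_write_period_sec);
  return 0;
}
```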

#### Shared memory buffer sizing

The sizing of the shared memory buffer depends on:

* The scheduling characteristics of the underlying system, i.e. for how long the
  tracing service can be blocked on the scheduler queues. This is a function of
  the kernel configuration and the niceness level of the `traced` process.
* The max write rate of all data sources within a producer process.

Suppose that a producer writes at a max rate of 8 MB/s. If `traced` gets
blocked for 10 ms, the shared memory buffer needs to be at least
8 MB/s * 0.01 s = 80 KB to avoid losses.

Empirical measurements suggest that on most Android systems a shared memory
buffer size of 128-512 KB is good enough.

The default shared memory buffer size is 256 KB. When using the Perfetto Client
Library, this value can be tweaked by setting `TracingInitArgs.shmem_size_hint_kb`.
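For example, a minimal sketch of raising the hint at initialization time with
the Client Library (512 KB is just an example value):

```c++
#include "perfetto.h"  // Amalgamated Perfetto SDK header.

void InitTracingWithLargerShmem() {
  perfetto::TracingInitArgs args;
  args.backends = perfetto::kSystemBackend;
  // Hint, in KB, for the size of this process's shared memory buffer.
  args.shmem_size_hint_kb = 512;
  perfetto::Tracing::Initialize(args);
}
```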

WARNING: if a data source writes very large trace packets in a single batch,
either the shared memory buffer needs to be big enough to handle that or
`BufferExhaustedPolicy::kStall` must be employed.

For instance, consider a data source that emits a 2 MB screenshot every 10s.
Its (simplified) code would look like:
```c++
// Grab2MBScreenshot() and set_bitmap() are illustrative placeholders.
for (;;) {
  ScreenshotDataSource::Trace([](ScreenshotDataSource::TraceContext ctx) {
    auto packet = ctx.NewTracePacket();
    packet->set_bitmap(Grab2MBScreenshot());  // Serializes ~2 MB in one packet.
  });
  std::this_thread::sleep_for(std::chrono::seconds(10));
}
```

Its average write rate is 2 MB / 10 s = 200 KB/s. However, the data source
writes each 2 MB burst back-to-back without yielding; it is limited only by the
tracing serialization overhead. In practice, it will write the 2 MB buffer at
O(GB/s). If the shared memory buffer is < 2 MB, the tracing service is unlikely
to keep up with that rate and data losses will be experienced.

In a case like this, the options are:

* Increase the size of the shared memory buffer in the producer that hosts the
  data source.
* Split the write into chunks spaced by some delay (a sketch follows the code
  block below).
* Adopt `BufferExhaustedPolicy::kStall` when defining the data source:

```c++
class ScreenshotDataSource : public perfetto::DataSource<ScreenshotDataSource> {
 public:
  constexpr static perfetto::BufferExhaustedPolicy kBufferExhaustedPolicy =
      perfetto::BufferExhaustedPolicy::kStall;
 ...
};
```
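For the second option (splitting the write into chunks), a hedged sketch is
shown below. `ScreenshotDataSource` is the hypothetical data source from the
example above and `set_bitmap_chunk()` is a made-up field; a real consumer
would also need a way to reassemble the chunks.

```c++
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

void EmitScreenshotInChunks(const std::vector<uint8_t>& bitmap) {
  constexpr size_t kChunkSize = 256 * 1024;  // 256 KB per trace packet.
  for (size_t off = 0; off < bitmap.size(); off += kChunkSize) {
    const size_t len = std::min(kChunkSize, bitmap.size() - off);
    ScreenshotDataSource::Trace([&](ScreenshotDataSource::TraceContext ctx) {
      auto packet = ctx.NewTracePacket();
      // Hypothetical bytes field carrying one chunk of the bitmap.
      packet->set_bitmap_chunk(bitmap.data() + off, len);
    });
    // Yield between chunks so the service can drain the shared memory buffer.
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
  }
}
```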

## Debugging data losses

#### Ftrace kernel buffer losses

When using the Linux kernel ftrace data source, losses can occur in the
kernel -> userspace path if the `traced_probes` process gets blocked for too
long.

At the trace proto level, losses in this path are recorded:
* In the [`FtraceCpuStats`][FtraceCpuStats] messages, emitted both at the
  beginning and end of the trace. If the `overrun` field is non-zero, data has
  been lost.
* In the [`FtraceEventBundle.lost_events`][FtraceEventBundle] field. This allows
  locating precisely the point where the data loss happened.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name like 'ftrace_cpu_overrun_end'
name                 idx                  severity             source value
-------------------- -------------------- -------------------- ------ ------
ftrace_cpu_overrun_e                    0 data_loss            trace       0
ftrace_cpu_overrun_e                    1 data_loss            trace       0
ftrace_cpu_overrun_e                    2 data_loss            trace       0
ftrace_cpu_overrun_e                    3 data_loss            trace       0
ftrace_cpu_overrun_e                    4 data_loss            trace       0
ftrace_cpu_overrun_e                    5 data_loss            trace       0
ftrace_cpu_overrun_e                    6 data_loss            trace       0
ftrace_cpu_overrun_e                    7 data_loss            trace       0
```

These losses can be mitigated either by increasing
[`TraceConfig.FtraceConfig.buffer_size_kb`][FtraceConfig]
or by decreasing
[`TraceConfig.FtraceConfig.drain_period_ms`][FtraceConfig].

#### Shared memory losses

Tracing data can be lost in the shared memory buffer due to write bursts
happening while `traced` is blocked.

At the trace proto level, losses in this path are recorded:

* In [`TraceStats.BufferStats.trace_writer_packet_loss`][BufferStats].
* In [`TracePacket.previous_packet_dropped`][TracePacket].
  Caveat: the very first packet emitted by every data source is also marked as
  `previous_packet_dropped=true`. This is because the service has no way to
  tell whether that was truly the first packet or whether everything before it
  was lost.

At the TraceProcessor SQL level, this data is available in the `stats` table:
```sql
> select * from stats where name = 'traced_buf_trace_writer_packet_loss'
name                 idx                  severity             source    value
-------------------- -------------------- -------------------- --------- -----
traced_buf_trace_wri                    0 data_loss            trace         0
```

#### Central buffer losses

Data losses in the central buffer can happen for two different reasons:

1. When using `fill_policy: RING_BUFFER`, older tracing data is overwritten by
   virtue of wrapping in the ring buffer.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_overwritten`][BufferStats].

2. When using `fill_policy: DISCARD`, newer tracing data committed after the
   buffer is full is dropped.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_discarded`][BufferStats].

At the TraceProcessor SQL level, this data is available in the `stats` table,
one entry per central buffer:

```sql
> select * from stats where name = 'traced_buf_chunks_overwritten' or name = 'traced_buf_chunks_discarded'
name                 idx                  severity             source  value
-------------------- -------------------- -------------------- ------- -----
traced_buf_chunks_di                    0 info                 trace       0
traced_buf_chunks_ov                    0 data_loss            trace       0
```

Summary: the best way to detect and debug data losses is to use Trace Processor
and issue the query:
`select * from stats where severity = 'data_loss' and value != 0`

## Atomicity and ordering guarantees

A "writer sequence" is the sequence of trace packets emitted by a given
TraceWriter from a data source. In almost all cases 1 data source ==
1+ TraceWriter(s). Some data sources that support writing from multiple threads
typically create one TraceWriter per thread.

* Trace packets written from a sequence are emitted in the trace file in the
  same order they have been written.

* There is no ordering guarantee between packets written by different sequences.
  Sequences are, by design, concurrent and more than one linearization is
  possible. The service does NOT respect global timestamp ordering across
  different sequences. If two packets from two sequences were emitted in
  global timestamp order, the service can still emit them in the trace file in
  the opposite order.

* Trace packets are atomic. If a trace packet is emitted in the trace file, it
  is guaranteed to contain all the fields that the data source wrote. If a
  trace packet is large and spans across several shared memory buffer pages, the
  service will save it in the trace file only if it can observe that all
  fragments have been committed without gaps.

* If a trace packet is lost (e.g. because of wrapping in the ring buffer
  or losses in the shared memory buffer), no further trace packet will be
  emitted for that sequence until all the packets before the gap are dropped as
  well.
  In other words, if the tracing service ends up in a situation where it sees
  packets 1, 2, 5, 6 for a sequence, it will only emit 1, 2. If, however, new
  packets (e.g., 7, 8, 9) are written and they overwrite 1, 2, clearing the gap,
  the full sequence 5, 6, 7, 8, 9 will be emitted.
  This behavior, however, doesn't hold when using [streaming mode] because,
  in that case, the periodic read consumes the packets in the buffer and
  clears the gaps, allowing the sequence to restart.

## Incremental state in trace packets

In many cases trace packets are fully independent of each other and can be
processed and interpreted without further context.
In some cases, however, they can have _incremental state_ and behave similarly
to inter-frame video encoding techniques, where some frames require the keyframe
to be present to be meaningfully decoded.

Here are two concrete examples:

1. Ftrace scheduling slices and /proc/pid scans. Ftrace scheduling events are
   keyed by thread ID. In most cases users want to map those events back to the
   parent process (the thread-group). To solve this, when both the
   `linux.ftrace` and the `linux.process_stats` data sources are enabled in a
   Perfetto trace, the latter captures process<>thread associations from the
   /proc pseudo-filesystem whenever a new thread ID is seen by ftrace.
   A typical trace in this case looks as follows:
   ```
    # From process_stats's /proc scanner.
    pid: 610; ppid: 1; cmdline: "/system/bin/surfaceflinger"

    # From ftrace
    timestamp: 95054961131912; sched_wakeup: pid: 610;     target_cpu: 2;
    timestamp: 95054977528943; sched_switch: prev_pid: 610 prev_prio: 98
   ```
   The /proc entry is emitted only once per process to avoid bloating the size
   of the trace. In the absence of data losses this is enough to reconstruct
   all scheduling events for that pid. If, however, the process_stats packet
   gets dropped in the ring buffer, there will be no way left to work out the
   process details for all the other ftrace events that refer to that PID.

2. The [Track Event library](/docs/instrumentation/track-events) in the Perfetto
   SDK makes extensive use of string interning. Most strings and descriptors
   (e.g. details about processes / threads) are emitted only once and later
   referred to using a monotonic ID. If the descriptor packet is lost, it is
   not possible to fully make sense of those events.

Trace Processor has a built-in mechanism that detects the loss of interning data
and skips ingesting packets that refer to missing interned strings or
descriptors.

When using tracing in ring-buffer mode, these types of losses are very likely to
happen.

There are two mitigations for this:

1. Issuing periodic invalidations of the incremental state via
   [`TraceConfig.IncrementalStateConfig.clear_period_ms`][IncrStateConfig].
   This causes the data sources that make use of incremental state to
   periodically drop their interning / process-mapping tables and re-emit the
   descriptors / strings on the next occurrence. This mitigates the problem
   quite well for ring-buffer traces, as long as `clear_period_ms` is an order
   of magnitude lower than the time span of data that the central trace buffer
   can hold.

2. Recording the incremental state into a dedicated buffer (via
   `DataSourceConfig.target_buffer`). This technique is quite commonly used in
   the ftrace + process_stats example mentioned before, recording the
   process_stats packets in a dedicated buffer that is less likely to wrap
   (ftrace events are much more frequent than descriptors for new processes).
   A config sketch follows this list.
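As referenced above, a hedged sketch of how both mitigations can be expressed
when building the trace config with the Client Library (the buffer sizes and
the 5 s clearing period are arbitrary examples):

```c++
#include "perfetto.h"  // Amalgamated Perfetto SDK header.

perfetto::TraceConfig BuildConfigWithIncrementalStateMitigations() {
  perfetto::TraceConfig cfg;

  // Mitigation 1: periodically clear and re-emit incremental state.
  cfg.mutable_incremental_state_config()->set_clear_period_ms(5000);

  // Mitigation 2: keep rarely-emitted descriptors in their own buffer.
  cfg.add_buffers()->set_size_kb(32 * 1024);  // Buffer 0: ftrace events.
  cfg.add_buffers()->set_size_kb(2 * 1024);   // Buffer 1: process descriptors.

  auto* ftrace = cfg.add_data_sources()->mutable_config();
  ftrace->set_name("linux.ftrace");
  ftrace->set_target_buffer(0);

  auto* process_stats = cfg.add_data_sources()->mutable_config();
  process_stats->set_name("linux.process_stats");
  process_stats->set_target_buffer(1);
  return cfg;
}
```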

## Flushes and windowed trace importing

Another common problem experienced in traces that involve multiple data sources
is the non-synchronous nature of trace commits. As explained in the
[Life of a trace packet](#life-of-a-trace-packet) section above, trace data is
committed only when a full memory page of the shared memory buffer is filled (or
when the tracing session ends). In most cases, if data sources produce events
at a regular cadence, pages fill up quite quickly and events are committed
into the central buffers within seconds.

In some other cases, however, a data source can emit events only sporadically.
Imagine the case of a data source that emits events when the display is turned
on/off. Such an infrequent event might end up being staged in the shared memory
buffer for a very long time and can end up being committed into the trace buffer
hours after it happened.

Another scenario where this can happen is when using ftrace and a particular
CPU is idle most of the time or gets hot-unplugged (ftrace uses per-CPU
buffers). In this case a CPU might record little or no data for several
minutes while the other CPUs pump out thousands of new trace events per second.

This causes two side effects that end up breaking user expectations or causing
bugs:

* The UI can show an abnormally long timeline with a huge gap in the middle.
  The packet ordering of events doesn't matter for the UI because events are
  sorted by timestamp at import time. The trace in this case will contain very
  recent events plus a handful of stale events that happened hours before. The
  UI, for correctness, will try to display all events, showing a handful of
  early events, followed by a huge temporal gap when nothing happened,
  followed by the stream of recent events.

* When recording long traces, Trace Processor can show import errors of the form
  "XXX event out-of-order". This is because, in order to limit the memory usage
  at import time, Trace Processor sorts events using a sliding window. If trace
  packets are too out-of-order (trace file order vs timestamp order), the
  sorting will fail and some packets will be dropped.

#### Mitigations

The best mitigation for these sorts of problems is to specify a
[`flush_period_ms`][TraceConfig] in the trace config (10-30 seconds is usually
good enough for most cases), especially when recording long traces.

This will cause the tracing service to issue periodic flush requests to data
sources. A flush request causes data sources to commit their shared memory
buffer pages into the central buffer, even if they are not completely full.
By default, a flush is issued only at the end of the trace.
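For instance, a minimal sketch of enabling periodic flushes when building the
config with the Client Library (30 s is just an example value):

```c++
#include "perfetto.h"  // Amalgamated Perfetto SDK header.

void EnablePeriodicFlushes(perfetto::TraceConfig& cfg) {
  // Ask the service to send a Flush request to all data sources every 30 s,
  // so sporadic events don't sit in shared memory buffers for hours.
  cfg.set_flush_period_ms(30 * 1000);
}
```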

In case of long traces recorded without `flush_period_ms`, another option is to
pass the `--full-sort` option to `trace_processor_shell` when importing the
trace. Doing so will disable the windowed sorting at the cost of a higher
memory usage (the trace file will be fully buffered in memory before parsing).

[streaming mode]: /docs/concepts/config#long-traces
[TraceConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig
[FtraceConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig
[IncrStateConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig.IncrementalStateConfig
[FtraceCpuStats]: /docs/reference/trace-packet-proto.autogen#FtraceCpuStats
[FtraceEventBundle]: /docs/reference/trace-packet-proto.autogen#FtraceEventBundle
[TracePacket]: /docs/reference/trace-packet-proto.autogen#TracePacket
[BufferStats]: /docs/reference/trace-packet-proto.autogen#TraceStats.BufferStats