Contract between NNAPI systrace code and parser
===============================================

This text files documents how tracing in the NNAPI manifests in systrace output
and how that output is interpreted by the systrace parser.

Please view in a 160 character window.

Special cases
-------------

- Execution is exposed as an asynchronous event from the runtime. Time taken
  by the runtime is calculated as the time between start of
  ANeuralNetworksExecution_startCompute and end of
  ANeuralNetworksEvent_wait. This special case is not reflected in the cases
  presented below.

Notation
--------
- ...: elided code
- t_m_w: tracing_mark_write
- tX: timestamps
- T1, T2: thread ids


Cases for the parser
====================

Source code                                    Systrace                                                   Interpretation for timing statistics - all
                                                                                                          times are wallclock
------------------------------------------------------------------------------------------------------------------------------------------------------

*Baseline case*

... funcP(...) {                               t0: t_m_w:B|T1|[NN_LR_PP]funcP                             Add (t1-t0) to total time spent in Layer
  NNTRACE_RT(NNTRACE_PHASE_PREPARATION,        t1: t_m_w:E|T1                                             Runtime, Phase Preparation
             "funcP);
  ...
}

------------------------------------------------------------------------------------------------------------------------------------------------------

*Local call to other layer*

... funcA1(...) {                              t0:   t_m_w:B|T1|[NN_LA_PP]funcE1                          Add (t3-t0) to total time spent in Layer
  NNTRACE_APP(NNTRACE_PHASE_PREPARATION,       t1:   t_m_w:B|T1|[NN_LR_PP]funcC1                          Application, Phase Preparation
              "funcA1);                        t2:   t_m_w:E|T1
  ... funcR1(...);                             t3:   t_m_w:E|T1                                           Add (t2-t1) to total time spent in Layer
  ...                                                                                                     Runtime, Phase Preparation
}
... funcR1(...) {                                                                                         Note: Self-time of Layer Application,
  NNTRACE_RT(NNTRACE_PHASE_PREPARATION,                                                                   Phase Preparation will be calculated as
             "funcR1"); ...                                                                               total time in Layer Application - total time
}                                                                                                         in Layer Runtime

                                                                                                          Note: These can be nested as per rows
                                                                                                          below ("Switch phase ...", "Subphases ...",
                                                                                                          "Additional detail...")

                                                                                                          Note: If the called function specifies a
                                                                                                          phase that is not supposed to be nested,
                                                                                                          the parser will emit a diagnostic.

------------------------------------------------------------------------------------------------------------------------------------------------------

*Switch phase during function*

... funcC1(...) {                              t0:   t_m_w:B|T1|[NN_LC_PTR]funcC1                         Add (t1-t0) to total time spent in Layer
  NNTRACE_TRANS("funcC1");                     t1:   t_m_w:B|T1|[SW][NN_LC_PCO]funcC1                     CPU, Phase Transformation
  ...                                          t2:   t_m_w:E|T1
  NNTRACE_COMP_SWITCH("funcC1");               t3:   t_m_w:E|T1                                           Add (t2-t1) to total time spent in Layer
  ...                                                                                                     CPU, Phase Computation
}
                                                                                                          (t3-t2 treated as negligible - only the
                                                                                                          destructors of objects created between
                                                                                                          the tracepoints)

------------------------------------------------------------------------------------------------------------------------------------------------------

*Subphases of execution*

... funcR2(...) {                              t0:   t_m_w:B|T1|[NN_LR_PE]funcR2                          Add (t2-t1) to total time spent in Layer
  NNTRACE_RT(NNTRACE_PHASE_EXECUTION,          t1:   t_m_w:B|T1|[NN_LC_PCO]funcC2                         CPU, (sub)Phase Computation and to
             "funcR2);                         t2:   t_m_w:E|T1                                           total time in Phase Execution
  ... funcC2(...);                             t3:   t_m_w:E|T1
  ...                                                                                                     Add (t3-t0) to total time spent in Layer
}                                                                                                         Runtime, Phase Execution
... funcC2(...) {
  NNTRACE_COMP("funcC2");
  ...
}

------------------------------------------------------------------------------------------------------------------------------------------------------

*Additional detail in the same layer*

... funcR3(...) {                              t0:   t_m_w:B|T1|[NN_LR_PE]funcR3                          Add (t3-t0) to total time spent in Layer
  NNTRACE_RT(NNTRACE_PHASE_EXECUTION,          t1:   t_m_w:B|T1|[NN_LR_PE]funcR4                          Runtime, Phase Execution
             "funcR3);                         t2:   t_m_w:E|T1
  ... funcR4(...);                             t3:   t_m_w:E|T1                                           Note: funcR4 will be visible in the systrace
  ...                                                                                                     visualization
}
... funcR4(...) {
  NNTRACE_RT(NNTRACE_PHASE_EXECUTION,
             "funcR4");
  ...
}


------------------------------------------------------------------------------------------------------------------------------------------------------

*Synchronous IPC call*

... funcR5(...) {                              t0:   t_m_w:B|T1|[NN_LR_PC]funcR5                          Add (t5-t0) - (t4-r1) to total time spent in
  NNTRACE_RT(NNTRACE_PHASE_COMPILATION,        t1:   t_m_w:B|T1|[NN_LI_PI]getCapabilities                 Layer Runtime, Phase Compilation; see
             "funcR5");                        t2:   t_m_w:B|T1|HIDL::IDevice::getCapabilities::client    "Onetime initialization code".
  ... device->getCapabilities()                t3:   t_m_w:E|T1
  ...                                          t4:   t_m_w:E|T1                                           Add (t4-t1) to total time spent in Layer
}                                              t5:   t_m_w:E|T1                                           IPC, Phase Initialization
... VersionedIDevice::getCapabilities(...) {
   NTRACE_FULL(NNTRACE_LAYER_IPC,                                                                         Note: Self-time of Layer Runtime, Phase
               NNTRACE_PHASE_COMPILATION,                                                                 Compilation will be calculated as total
              "getCapabilities");                                                                         time in Layer Runtime - total time in Layer
                                                                                                          IPC
}
                                                                                                          Note: Tracepoints are needed for the
                                                                                                          client IPC calls. The HIDL tracing isn't
                                                                                                          guaranteed to wait for the server - it just
                                                                                                          sends the transaction even if the call is
                                                                                                          synchronous.

------------------------------------------------------------------------------------------------------------------------------------------------------

*Asynchronous IPC call that is synchronously waited for by the runtime*

// Runtime code                                t0: t_m_w:B|T1|[NN_LI_PC]prepareModel                      Add (t10-t0) to total time spent in Layer
... funcRC(...) {                              t1: t_m_w:B|T1|HIDL::IDevice::prepareModel_1_1::client     IPC, Phase Compilation
   ...                                         t2: t_m_w:B|T2|HIDL::IDevice::prepareModel_1_1::server
   NTRACE_FULL(NNTRACE_LAYER_IPC,              t3: t_m_w:B|T2|[NN_LD_PC]SampleDriver::prepareModel        Add (t6-t2) to total time spent in Layer
               NNTRACE_PHASE_COMPILATION,      t4: t_m_w:B|T2|HIDL::IPreparedModelCallback::notify::clie  Driver, Phase Compilation. This includes
              "prapareModel");                 t5: t_m_w:E|T2                                             the generated HIDL stub code, which is
   ...                                         t6: t_m_w:E|T2                                             <0.05ms.
   device->prepareModel(...);                  t7: t_m_w:E|T2
   ...                                         t8: t_m_w:B|T1|HIDL::IPreparedModelCallback::notify::serv  Note: the HIDL trace rows are added by
   cb->wait();                                 t9: t_m_w:E|T1                                             the automatically generated proxy and
   ...                                         t10: t_m_w:E|T1                                            stub code. For the driver side, the
}                                              t11: t_m_w:E|T1                                            mapping of the HIDL functions to layers
                                                                                                          and phases is done in the systrace
// Driver code                                                                                            parser.
... SampleDriver::prepareModel(...) {
  NNTRACE_FULL(NNTRACE_LAYER_DRIVER,                                                                      Note: the SampleDriver::prepareModel is
               NNTRACE_PHASE_COMPILATION,                                                                 treated as additional detail for Layer
               "SampleDriver::prepareModel");                                                             Driver, Phase Compilation.
}
                                                                                                          Note: the "t_m_w" output of
                                                                                                          systrace uses thread ids, so that the call
                                                                                                          stack can be reconstructed. The systrace
                                                                                                          rows are also annotated with process ids.
                                                                                                          The parser uses the process ids to
                                                                                                          distinguish between the application
                                                                                                          process from the driver process (used for
                                                                                                          diagnostics and for distinguishing CPU
                                                                                                          fallback from sample driver).

                                                                                                          Note: the next row in this table gives more
                                                                                                          detail for prepareModel specifically

                                                                                                          Note: the driver-side HIDL traces get us
                                                                                                          the time spent in sample and hvx drivers.
                                                                                                          With a different driver threading model
                                                                                                          this may not be the case - future drivers
                                                                                                          should add manual tracing.

                                                                                                          TODO: attribute driver process IPC call
                                                                                                          (callback) overhead to IPC layer.

------------------------------------------------------------------------------------------------------------------------------------------------------

*Subtracting time when nesting is violated*

// Runtime code                                t0:   t_m_w:B|T1|[NN_LI_PC]prepareModel                    Add (t3 - t0) - (t2 - t1) to total time spent
... funcRC(...) {                              t1:   t_m_w:B|T1|[SUB][NN_LR_PC]VersionedIDevice::prepareM in Layer IPC, Phase compilation
   ...                                         t2:   t_m_w:E|T1
   NTRACE_FULL(NNTRACE_LAYER_IPC,              t3:   t_m_w:E|T1                                           Add (t2-t1) to total time spent in Layer
               NNTRACE_PHASE_COMPILATION,                                                                 Runtime, Phase compilation
              "prapareModel");
   ...
   device->prepareModel(...);
   ...
   cb->wait();
   ...
}

... VersionedIDevice::prepareModel(...) {
    // IPC work
    {
        NNTRACE_FULL_SUBTRACT(
            NNTRACE_LAYER_RUNTIME,
            NNTRACE_PHASE_COMPILATION,
            "VersionedIDevice::prepareModel");
        // Runtime work
    }
    // IPC work
}

------------------------------------------------------------------------------------------------------------------------------------------------------

*Onetime initialization code*

... funcR5(...) {                              t0:   t_m_w:B|T1|[NN_LR_PP]funcR5                          Add (t2-t1) to total time spent in Layer
  NNTRACE_RT(NNTRACE_PHASE_PREPARATION,        t1:   t_m_w:B|T1|[NN_LR_PI]funcI                           Runtime, Phase Initialization
             "runcR5);                         t2:   t_m_w:E|T1
  ... funcI(...);                              t3:   t_m_w:E|T1                                           Add (t3 - t0) - (t2 - t1) to total time spent
  ...                                                                                                     in Layer Runtime, Phase Preparation.
}
... funcI(...) {
  NNTRACE_RT(NNTRACE_PHASE_INITIALIZATION,
             "funcI")
  ...
}

------------------------------------------------------------------------------------------------------------------------------------------------------

*Utility code*

... funcR6(...) {                              t0:   t_m_w:B|T1|[NN_LR_PP]funcR6                          Add (t3-t0) to total time spent in Layer
  NNTRACE_RT(NNTRACE_PHASE_PREPARATION,        t1:   t_m_w:B|T1|[NN_LU_PU]funcU                           Runtime, Phase Preparation
             "funcR6");                        t2:   t_m_w:E|T1
  ... funcU(...);                              t3:   t_m_w:E|T1                                           Note: the funcU is treated as additional
  ...                                                                                                     detail.
}
... funcU(...) {
  NNTRACE_FULL(NNTRACE_LAYER_UTILITY,
               NNTRACE_PHASE_UNSPECIFIED,
               "funcU")
  ...
}