perf-arm-spe(1) — Linux manual page

PERF-ARM-SPE(1)                perf Manual               PERF-ARM-SPE(1)

NAME

       perf-arm-spe - Support for Arm Statistical Profiling Extension
       within Perf tools

SYNOPSIS

       perf record -e arm_spe//

DESCRIPTION

       The SPE (Statistical Profiling Extension) feature provides
       accurate attribution of latencies and events down to individual
       instructions. Rather than being interrupt-driven, it picks an
       instruction to sample and then captures data for it during
       execution. Data includes execution time in cycles. For loads and
       stores it also includes data address, cache miss events, and data
       origin.

       The sampling has 5 stages:

        1. Choose an operation

        2. Collect data about the operation

        3. Optionally discard the record based on a filter

        4. Write the record to memory

        5. Interrupt when the buffer is full

   Choose an operation
       The operation is chosen from a sample population. For SPE, the
       population is an IMPLEMENTATION DEFINED choice of either all
       architectural instructions or all micro-ops. Sampling happens at
       a programmable interval.
       The architecture provides a mechanism for the SPE driver to infer
       the minimum interval at which it should sample. This minimum
       interval is used by the driver if no interval is specified. A
       pseudo-random perturbation is also added to the sampling interval
       by default.

   Collect data about the operation
       Program counter, PMU events, timings and data addresses related
       to the operation are recorded. Sampling ensures that only one
       sampled operation is in flight at a time.

   Optionally discard the record based on a filter
       Based on programmable criteria, choose whether to keep the record
       or discard it. If the record is discarded then the flow stops
       here for this sample.

   Write the record to memory
       The record is appended to a memory buffer.

   Interrupt when the buffer is full
       When the buffer fills, an interrupt is sent and the driver
       signals Perf to collect the records. Perf saves the raw data in
       the perf.data file.
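
       The five stages above can be illustrated with a toy software
       model. This is only a sketch: the function name, the record
       layout, the jitter range and the buffer size are hypothetical
       and do not reflect the hardware record format or the driver.

```python
import random


def simulate_spe(ops, interval, buffer_size, keep, jitter=0, seed=0):
    """Toy model of the five SPE stages (not the real hardware algorithm).

    ops         -- list of dicts describing executed operations
    interval    -- number of operations between samples
    buffer_size -- records the buffer holds before an "interrupt"
    keep        -- filter predicate; returning False discards the record
    jitter      -- maximum random perturbation added to the interval
    Returns the list of full buffers handed over at each "interrupt".
    Records left in a partially filled buffer are not returned.
    """
    rng = random.Random(seed)
    flushed, buffer = [], []
    countdown = interval + rng.randint(0, jitter)
    for index, op in enumerate(ops):
        countdown -= 1
        if countdown > 0:
            continue
        countdown = interval + rng.randint(0, jitter)
        # 1. Choose an operation: `op` has been selected.
        # 2. Collect data about the operation.
        record = {"index": index, **op}
        # 3. Optionally discard the record based on a filter.
        if not keep(record):
            continue
        # 4. Write the record to memory.
        buffer.append(record)
        # 5. Interrupt when the buffer is full.
        if len(buffer) == buffer_size:
            flushed.append(buffer)
            buffer = []
    return flushed
```

       With 100 operations, an interval of 10 and no jitter, the model
       samples every tenth operation, applies the filter, and hands
       over the buffer each time it reaches buffer_size records.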

OPENING THE FILE

       Up to this point, neither the kernel nor Perf has decoded the SPE
       data. Decoding happens only when the recorded file is opened with
       perf report or perf script. While decoding, Perf generates
       "synthetic samples" as if they had been generated at recording
       time. These samples are equivalent to those produced by normal
       Perf sampling without SPE, although they may carry more
       attributes.
       For example a normal sample may have just the instruction
       pointer, but an SPE sample can have data addresses and latency
       attributes.

WHY SAMPLING?

       •   Sampling, rather than tracing, cuts down the profiling
           problem to something more manageable for hardware. Only one
           sampled operation is in flight at a time.

       •   Allows precise attribution data, including the full PC of
           the instruction and the data virtual and physical addresses.

       •   Allows correlation between an instruction and events, such as
           TLB and cache misses. (Data source indicates which particular
           cache was hit, but the meaning is implementation defined
           because different implementations can have different cache
           configurations.)

       However, SPE does not provide any call-graph information, and
       relies on statistical methods.

COLLISIONS

       When an operation is sampled while a previous sampled operation
       has not finished, a collision occurs. The new sample is dropped.
       Collisions affect the integrity of the data, so the sample rate
       should be set to avoid collisions.

       The sample_collision PMU event can be used to determine the
       number of lost samples. However, this count is based on
       collisions before filtering occurs, so it cannot be used as an
       exact count of dropped samples that would have passed the
       filter. It can only serve as a rough guide.
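
       As a rough guide, the collision count can be turned into a loss
       ratio. The sketch below is an assumption about how one might do
       this, not something Perf provides: samples_taken is a
       hypothetical input, the number of operations sampled without
       collision before filtering, which matches the number of records
       only when no filter is configured.

```python
def collision_ratio(sample_collision, samples_taken):
    """Fraction of attempted samples lost to collisions, before filtering.

    Both arguments are counts supplied by the caller; samples_taken is a
    hypothetical pre-filter count, not a value reported by this page.
    """
    attempted = sample_collision + samples_taken
    if attempted == 0:
        return 0.0
    return sample_collision / attempted
```

       For example, 50 collisions against 950 samples taken gives a
       ratio of 0.05. A ratio that is not close to zero suggests the
       sample interval is too short.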

THE EFFECT OF MICROARCHITECTURAL SAMPLING

       If an implementation samples micro-operations instead of
       instructions, the results of sampling must be weighted
       accordingly.

       For example, if a given instruction A is always converted into
       two micro-operations, A0 and A1, it becomes twice as likely to
       appear in the sample population.

       The coarse effect of conversions, and, if applicable, sampling of
       speculative operations, can be estimated from the sample_pop and
       inst_retired PMU events.
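
       The weighting can be sketched as follows. The per-instruction
       micro-operation counts are hypothetical inputs, since they are
       generally not published, and reading the coarse effect as the
       ratio of sample_pop to inst_retired is one plausible
       interpretation rather than a documented formula.

```python
def reweight(sample_counts, uops_per_instruction):
    """Divide each instruction's sample count by its micro-op count.

    Instructions missing from uops_per_instruction are assumed to be a
    single micro-operation.
    """
    return {
        name: count / uops_per_instruction.get(name, 1)
        for name, count in sample_counts.items()
    }


def coarse_expansion(sample_pop, inst_retired):
    """Average number of sampled-population operations per retired instruction."""
    return sample_pop / inst_retired
```

       If instruction A is split into two micro-operations and shows
       200 samples while B shows 100, reweight({"A": 200, "B": 100},
       {"A": 2}) gives both instructions a weight of 100.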

KERNEL REQUIREMENTS

       The ARM_SPE_PMU kernel config option must be enabled, built
       either as a module or statically.

       Depending on CPU model, the kernel may need to be booted with
       page table isolation disabled (kpti=off). If KPTI needs to be
       disabled but is still enabled, SPE setup fails with the console
       message "profiling buffer inaccessible. Try passing kpti=off on
       the kernel command line".

       For the full criteria that determine whether KPTI needs to be
       forced off or not, see function unmap_kernel_at_el0() in the
       kernel sources. Common cases where it’s not required are on the
       CPUs in kpti_safe_list, or on Arm v8.5+ where FEAT_E0PD is
       mandatory.

       The SPE interrupt must also be described by the firmware. If the
       module is loaded and KPTI is disabled (or isn’t required to be
       disabled) but the SPE PMU still doesn’t show in
       /sys/bus/event_source/devices/, then it’s possible that the SPE
       interrupt isn’t described by ACPI or DT. In this case no warning
       will be printed by the driver.
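
       Because no warning is printed in that case, it can help to check
       the directory directly. The sketch below assumes that SPE PMU
       entries have names starting with "arm_spe" (for example
       arm_spe_0); the function name is hypothetical.

```python
import os


def find_spe_pmus(devices_dir="/sys/bus/event_source/devices"):
    """Return the sorted entries in devices_dir whose name starts with 'arm_spe'.

    The 'arm_spe' prefix is an assumption about how the PMU is named.
    An empty list means no SPE PMU is visible (or the directory is missing).
    """
    try:
        names = os.listdir(devices_dir)
    except FileNotFoundError:
        return []
    return sorted(name for name in names if name.startswith("arm_spe"))
```

       An empty result on a machine that is expected to support SPE
       points back to the requirements in this section.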

CAPTURING SPE WITH PERF COMMAND-LINE TOOLS

       You can record a session with SPE samples:

           perf record -e arm_spe// -- ./mybench

       The sample period is set with the -c option. Because the minimum
       interval is used by default, it is recommended to set a higher
       value. The value is written to PMSIRR.INTERVAL.

   Config parameters
       These are placed between the // in the event and separated by
       commas. For example: -e arm_spe/load_filter=1,min_latency=10/

           branch_filter=1     - collect branches only (PMSFCR.B)
           event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
           jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
           load_filter=1       - collect loads only (PMSFCR.LD)
           min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
           pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
           pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
           store_filter=1      - collect stores only (PMSFCR.ST)
           ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

       * Latency is the total latency from the point at which sampling
       started on that instruction, rather than only the execution
       latency.
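
       When perf is driven from a script, the event string and the -c
       period can be assembled programmatically. This is a minimal
       sketch: the helper names are hypothetical, the parameter names
       are the ones listed above, and the period of 10000 in the usage
       note is an arbitrary illustration, not a recommended value.

```python
def spe_event(**params):
    """Format config parameters as an arm_spe event string.

    With no parameters this returns 'arm_spe//'.
    """
    body = ",".join(f"{key}={value}" for key, value in params.items())
    return f"arm_spe/{body}/"


def record_argv(command, period=None, **params):
    """Build a 'perf record' argument list for an SPE session.

    command -- the profiled program and its arguments, as a list
    period  -- optional sample period passed with -c
    """
    argv = ["perf", "record", "-e", spe_event(**params)]
    if period is not None:
        argv += ["-c", str(period)]
    return argv + ["--"] + list(command)
```

       For instance, spe_event(load_filter=1, min_latency=10) returns
       "arm_spe/load_filter=1,min_latency=10/", and
       record_argv(["./mybench"], period=10000) yields a list that can
       be passed to subprocess.run().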

       Only some events can be filtered on; these include:

           bit 1     - instruction retired (i.e. omit speculative instructions)
           bit 3     - L1D refill
           bit 5     - TLB refill
           bit 7     - mispredict
           bit 11    - misaligned access

       So to sample just retired instructions:

           perf record -e arm_spe/event_filter=2/ -- ./mybench

       or just mispredicted branches:

           perf record -e arm_spe/event_filter=0x80/ -- ./mybench
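
       The mask values in these commands follow from the bit numbers
       listed above: each event contributes 1 shifted left by its bit
       number. The sketch below shows the arithmetic; the short event
       names are hypothetical labels, and how the hardware treats a
       mask with several bits set is not covered by this page.

```python
# Bit positions taken from the list of filterable events above.
# The dictionary keys are illustrative labels only.
EVENT_BITS = {
    "retired": 1,
    "l1d_refill": 3,
    "tlb_refill": 5,
    "mispredict": 7,
    "misaligned": 11,
}


def event_filter(*names):
    """Return the event_filter mask with the bit for each named event set."""
    mask = 0
    for name in names:
        mask |= 1 << EVENT_BITS[name]
    return mask
```

       event_filter("retired") evaluates to 2 and
       event_filter("mispredict") to 0x80, matching the two commands
       above.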

   Viewing the data
       By default perf report and perf script will assign samples to
       separate groups depending on the attributes/events of the SPE
       record. Because instructions can have multiple events associated
       with them, the samples in these groups are not necessarily
       unique. For example perf report shows these groups:

           Available samples
           0 arm_spe//
           0 dummy:u
           21 l1d-miss
           897 l1d-access
           5 llc-miss
           7 llc-access
           2 tlb-miss
           1K tlb-access
           36 branch-miss
           0 remote-access
           900 memory

       The arm_spe// and dummy:u events are implementation details and
       are expected to be empty.

       To get a full list of unique samples that are not sorted into
       groups, set the itrace option to generate instruction samples.
       The period option is also taken into account, so set it to 1
       instruction unless you want to further downsample the already
       sampled SPE data:

           perf report --itrace=i1i

       Memory access details are also stored on the samples and can be
       viewed with:

           perf report --mem-mode

   Common errors
       •   "Cannot find PMU ‘arm_spe’. Missing kernel support?"

               Module not built or loaded, KPTI not disabled, interrupt not described by firmware,
               or running on a VM. See 'Kernel Requirements' above.

       •   "Arm SPE CONTEXT packets not found in the traces."

                Root privilege is required to collect context packets, but these only increase the accuracy of
                assigning PIDs to kernel samples. For userspace sampling this message can be ignored.

       •   Excessively large perf.data file size

                Increase the sampling interval (see above).

COLOPHON

       This page is part of the perf (Performance analysis tools for
       Linux (in Linux source tree)) project.  Information about the
       project can be found at 
       ⟨https://perf.wiki.kernel.org/index.php/Main_Page⟩.  If you have a
       bug report for this manual page, send it to
       linux-kernel@vger.kernel.org.  This page was obtained from the
       project's upstream Git repository
       ⟨http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git⟩
       on 2024-06-14.  (At that time, the date of the most recent commit
       that was found in the repository was 2024-06-13.)  If you
       discover any rendering problems in this HTML version of the page,
       or you believe there is a better or more up-to-date source for
       the page, or you have corrections or improvements to the
       information in this COLOPHON (which is not part of the original
       manual page), send a mail to man-pages@man7.org

perf                           2024-03-21                PERF-ARM-SPE(1)

Pages that refer to this page: perf(1), perf-c2c(1), perf-mem(1)
