NAME
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf toolsSYNOPSIS
perf record -e arm_spe//
DESCRIPTION
The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and events down to individual instructions. Rather than being interrupt-driven, it picks an instruction to sample and then captures data for it during execution. Data includes execution time in cycles. For loads and stores it also includes data address, cache miss events, and data origin. 1.Choose an operation
2.Collect data about the operation
3.Optionally discard the record based on a
filter
4.Write the record to memory
5.Interrupt when the buffer is full
Choose an operation
This is chosen from a sample population, for SPE this is an IMPLEMENTATION DEFINED choice of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random perturbation is also added to the sampling interval by default.Collect data about the operation
Program counter, PMU events, timings and data addresses related to the operation are recorded. Sampling ensures there is only one sampled operation is in flight.Optionally discard the record based on a filter
Based on programmable criteria, choose whether to keep the record or discard it. If the record is discarded then the flow stops here for this sample.Write the record to memory
The record is appended to a memory bufferInterrupt when the buffer is full
When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records. Perf saves the raw data in the perf.data file.OPENING THE FILE
Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the recorded file is opened with perf report or perf script does the decoding happen. When decoding the data, Perf generates "synthetic samples" as if these were generated at the time of the recording. These samples are the same as if normal sampling was done by Perf without using SPE, although they may have more attributes associated with them. For example a normal sample may have just the instruction pointer, but an SPE sample can have data addresses and latency attributes.WHY SAMPLING?
•Sampling, rather than tracing, cuts
down the profiling problem to something more manageable for hardware. Only one
sampled operation is in flight at a time.
•Allows precise attribution data,
including: Full PC of instruction, data virtual and physical addresses.
•Allows correlation between an
instruction and events, such as TLB and cache miss. (Data source indicates
which particular cache was hit, but the meaning is implementation defined
because different implementations can have different cache
configurations.)
COLLISIONS
When an operation is sampled while a previous sampled operation has not finished, a collision occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate should be set to avoid collisions.THE EFFECT OF MICROARCHITECTURAL SAMPLING
If an implementation samples micro-operations instead of instructions, the results of sampling must be weighted accordingly.KERNEL REQUIREMENTS
The ARM_SPE_PMU config must be set to build as either a module or statically.CAPTURING SPE WITH PERF COMMAND-LINE TOOLS
You can record a session with SPE samples:perf record -e arm_spe// -- ./mybench
Config parameters
These are placed between the // in the event and comma separated. For example -e arm_spe/load_filter=1,min_latency=10/branch_filter=1 - collect branches only (PMSFCR.B) event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND) load_filter=1 - collect loads only (PMSFCR.LD) min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR) pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege store_filter=1 - collect stores only (PMSFCR.ST) ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS)
bit 1 - instruction retired (i.e. omit speculative instructions) bit 3 - L1D refill bit 5 - TLB refill bit 7 - mispredict bit 11 - misaligned access
perf record -e arm_spe/event_filter=2/ -- ./mybench
perf record -e arm_spe/event_filter=0x80/ -- ./mybench
Viewing the data
By default perf report and perf script will assign samples to separate groups depending on the attributes/events of the SPE record. Because instructions can have multiple events associated with them, the samples in these groups are not necessarily unique. For example perf report shows these groups:Available samples 0 arm_spe// 0 dummy:u 21 l1d-miss 897 l1d-access 5 llc-miss 7 llc-access 2 tlb-miss 1K tlb-access 36 branch-miss 0 remote-access 900 memory
perf report --itrace=i1i
perf report --mem-mode
Common errors
•"Cannot find PMU
‘arm_spe’. Missing kernel support?"
Module not built or loaded, KPTI not disabled (see above), or running on a VM
•"Arm SPE CONTEXT packets not
found in the traces."
Root privilege is required to collect context packets. But these only increase the accuracy of assigning PIDs to kernel samples. For userspace sampling this can be ignored.
•Excessively large perf.data file size
Increase sampling interval (see above)
SEE ALSO
perf-record(1), perf-script(1), perf-report(1), perf-inject(1)2024-06-21 | perf |