NAME
perf-intel-pt - Support for Intel Processor Trace within perf toolsSYNOPSIS
perf record -e intel_pt//
DESCRIPTION
Intel Processor Trace (Intel PT) is an extension of Intel Architecture that collects information about software execution such as control flow, execution modes and timings and formats it into highly compressed binary packets. Technical details are documented in the Intel 64 and IA-32 Architectures Software Developer Manuals, Chapter 36 Intel Processor Trace.QUICKSTART
It is important to start small. That is because it is easy to capture vastly more data than can possibly be processed.perf record -e intel_pt//u ls
perf report
sudo perf record -o pt_ls --kcore -e intel_pt// -- ls
sudo perf report -i pt_ls
sudo perf report pt_ls --itrace=i1usge
perf record -e intel_pt//u ls perf script --itrace=ibxwpe
perf script --itrace=ibxwpe -F+flags
perf script --insn-trace --xed
perf script --call-trace
perf script --call-ret-trace
perf script --time starttime,stoptime --insn-trace --xed
perf script --time starttime,stoptime --insn-trace --xed -C 1
perf script --itrace=be -F+ipc
--insn-trace - instruction trace --src-trace - source trace
PERF RECORD
new event
The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are selected by providing the PMU name followed by the "config" separated by slashes. An enhancement has been made to allow default "config" e.g. the option-e intel_pt//
-e intel_pt/tsc,noretcomp=0/
-e intel_pt/tsc=1,noretcomp=0/
$ grep -H . /sys/bus/event_source/devices/intel_pt/format/* /sys/bus/event_source/devices/intel_pt/format/cyc:config:1 /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22 /sys/bus/event_source/devices/intel_pt/format/mtc:config:9 /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17 /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11 /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27 /sys/bus/event_source/devices/intel_pt/format/tsc:config:10
-e intel_pt/noretcomp=0/
-e intel_pt/tsc=1,noretcomp=0/
-e intel_pt/tsc=0/
-e intel_pt/config=0x400/
u userspace k kernel h hypervisor G guest H host p precise ip
------------------------------------------------------------ perf_event_attr: type 6 size 112 config 0x400 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|CPU|IDENTIFIER read_format ID disabled 1 inherit 1 exclude_kernel 1 exclude_hv 1 enable_on_exec 1 sample_id_all 1 ------------------------------------------------------------ sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 ------------------------------------------------------------
config terms
The June 2015 version of Intel 64 and IA-32 Architectures Software Developer Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features. Some of the features are reflect in new config terms. All the config terms are described below.The default config selects tsc (i.e. tsc=1).
The default config does not select noretcomp (i.e. noretcomp=0).
The PSB packet is a synchronization packet that provides a starting point for decoding or recovery from errors.
Support for psb_period is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
which contains "1" if the feature is supported and "0" otherwise.
Valid values are given by:
/sys/bus/event_source/devices/intel_pt/caps/psb_periods
which contains a hexadecimal value, the bits of which represent valid values e.g. bit 2 set means value 2 is valid.
The psb_period value is converted to the approximate number of trace bytes between PSB packets as:
2 ^ (value + 11)
e.g. value 3 means 16KiB bytes between PSBs
If an invalid value is entered, the error message will give a list of valid values e.g.
$ perf record -e intel_pt/psb_period=15/u uname Invalid psb_period for intel_pt. Valid values are: 0-5
If MTC packets are selected, the default config selects a value of 3 (i.e. psb_period=3) or the nearest lower value that is supported (0 is always supported). Otherwise the default is 0.
If decoding is expected to be reliable and the buffer is large then a large PSB period can be used.
Because a TSC packet is produced with PSB, the PSB period can also affect the granularity to timing information in the absence of MTC or CYC.
MTC packets provide finer grain timestamp information than TSC packets. MTC packets record time using the hardware crystal clock (CTC) which is related to TSC packets using a TMA packet.
Support for this feature is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/mtc
which contains "1" if the feature is supported and "0" otherwise.
The frequency of MTC packets can also be specified - see mtc_period below.
Valid values are given by:
/sys/bus/event_source/devices/intel_pt/caps/mtc_periods
which contains a hexadecimal value, the bits of which represent valid values e.g. bit 2 set means value 2 is valid.
The mtc_period value is converted to the MTC frequency as:
CTC-frequency / (2 ^ value)
e.g. value 3 means one eighth of CTC-frequency
Where CTC is the hardware crystal clock, the frequency of which can be related to TSC via values provided in cpuid leaf 0x15.
If an invalid value is entered, the error message will give a list of valid values e.g.
$ perf record -e intel_pt/mtc_period=15/u uname Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
The default value is 3 or the nearest lower value that is supported (0 is always supported).
CYC packets provide even finer grain timestamp information than MTC and TSC packets. A CYC packet contains the number of CPU cycles since the last CYC packet. Unlike MTC and TSC packets, CYC packets are only sent when another packet is also sent.
Support for this feature is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
which contains "1" if the feature is supported and "0" otherwise.
The number of CYC packets produced can be reduced by specifying a threshold - see cyc_thresh below.
Valid cyc_thresh values are given by:
/sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
which contains a hexadecimal value, the bits of which represent valid values e.g. bit 2 set means value 2 is valid.
The cyc_thresh value represents the minimum number of CPU cycles that must have passed before a CYC packet can be sent. The number of CPU cycles is:
2 ^ (value - 1)
e.g. value 4 means 8 CPU cycles must pass before a CYC packet can be sent. Note a CYC packet is still only sent when another packet is sent, not at, e.g. every 8 CPU cycles.
If an invalid value is entered, the error message will give a list of valid values e.g.
$ perf record -e intel_pt/cyc,cyc_thresh=15/u uname Invalid cyc_thresh for intel_pt. Valid values are: 0-12
CYC packets are not requested by default.
The default config selects 'pt' if it is available, so a user will never need to specify this term.
The default config selects 'branch' if it is available.
Support for this feature is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/ptwrite
which contains "1" if the feature is supported and "0" otherwise.
As an alternative, refer to "Emulated PTWRITE" further below.
Support for this feature is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/power_event_trace
which contains "1" if the feature is supported and "0" otherwise.
Support for this feature is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/event_trace
which contains "1" if the feature is supported and "0" otherwise.
Support for this feature is indicated by:
/sys/bus/event_source/devices/intel_pt/caps/tnt_disable
which contains "1" if the feature is supported and "0" otherwise.
AUX area sampling option
To select Intel PT "sampling" the AUX area sampling option can be used:--aux-sample
--aux-sample=8192
-e intel_pt//u
perf record --aux-sample -e '{intel_pt//u,branch-misses:u}'
perf record -e intel_pt//u -e branch-misses/aux-sample-size=8192/u
perf record -e '{intel_pt//u,branch-misses/aux-sample-size=8192/u}'
perf record -e intel_pt//u --filter 'filter * @/bin/ls' -e branch-misses/aux-sample-size=8192/u -- ls
Intel PT sample size (%zu) may be too small for PSB period (%zu)
new snapshot option
The difference between full trace and snapshot from the kernel’s perspective is that in full trace we don’t overwrite trace data that the user hasn’t collected yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let the trace run and overwrite older data in the buffer so that whenever something interesting happens, we can stop it and grab a snapshot of what was going on around that interesting moment.-S
-S0x100000
Intel PT snapshot size: %zu
new auxtrace mmap size option
Intel PT buffer size is specified by an addition to the -m option e.g.-m,16
mmap length 528384 auxtrace mmap length 4198400
Intel PT modes of operation
Intel PT can be used in 3 modes: full-trace mode sample mode snapshot modeperf record -e intel_pt//u uname
perf record --aux-sample -e intel_pt//u -e branch-misses:u
perf record -v -e intel_pt//u -S ./loopy 1000000000 & [1] 11435 kill -USR2 11435 Recording AUX area tracing snapshot
$ mkfifo perf.control $ mkfifo perf.ack $ cat perf.ack & [1] 15235 $ sudo ~/bin/perf record --control fifo:perf.control,perf.ack -S -e intel_pt//u -- sleep 60 & [2] 15243 $ ps -e | grep perf 15244 pts/1 00:00:00 perf $ kill -USR2 15244 bash: kill: (15244) - Operation not permitted $ echo snapshot > perf.control ack
Buffer handling
There may be buffer limitations (i.e. single ToPa entry) which means that actual buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to provide other sizes, and in particular an arbitrarily large size, multiple buffers are logically concatenated. However an interrupt must be used to switch between buffers. That has two potential problems: a) the interrupt may not be handled in time so that the current buffer becomes full and some trace data is lost. b) the interrupts may slow the system and affect the performance results.Intel PT and build ids
By default "perf record" post-processes the event stream to find all build ids for executables for all addresses sampled. Deliberately, Intel PT is not decoded for that purpose (it would take too long). Instead the build ids for all executables encountered (due to mmap, comm or task events) are included in the perf.data file.perf buildid-list
perf buildid-list --with-hits
Snapshot mode and event disabling
In order to make a snapshot, the intel_pt event is disabled using an IOCTL, namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the collection of side-band information. In order to prevent that, a dummy software event has been introduced that permits tracking events (like mmaps) to continue to be recorded while intel_pt is disabled. That is important to ensure there is complete side-band information to allow the decoding of subsequent snapshots.perf test list ... 23: Test using a dummy software event to keep tracking
perf test 23 23: Test using a dummy software event to keep tracking : Ok
perf record modes (nothing new here)
perf record essentially operates in one of three modes: per thread per cpu workload onlyPrivileged vs non-privileged users
Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users have memory limits imposed upon them. That affects what buffer sizes they can have as outlined above.sched_switch tracepoint
The sched_switch tracepoint is used to provide side-band data for Intel PT decoding in kernels where the PERF_RECORD_SWITCH metadata event isn’t available.$ perf record -vv -e intel_pt//u uname ------------------------------------------------------------ perf_event_attr: type 6 size 112 config 0x400 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|CPU|IDENTIFIER read_format ID disabled 1 inherit 1 exclude_kernel 1 exclude_hv 1 enable_on_exec 1 sample_id_all 1 ------------------------------------------------------------ sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 ------------------------------------------------------------ perf_event_attr: type 2 size 112 config 0x108 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER read_format ID inherit 1 sample_id_all 1 exclude_guest 1 ------------------------------------------------------------ sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 ------------------------------------------------------------ perf_event_attr: type 1 size 112 config 0x9 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|IDENTIFIER read_format ID disabled 1 inherit 1 exclude_kernel 1 exclude_hv 1 mmap 1 comm 1 enable_on_exec 1 task 1 sample_id_all 1 mmap2 1 comm_exec 1 ------------------------------------------------------------ sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 mmap size 528384B AUX area mmap length 4194304 perf event ring buffer mmapped per cpu Synthesizing auxtrace information Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.042 MB perf.data ]
PERF SCRIPT
By default, perf script will decode trace data found in the perf.data file. This can be further controlled by new option --itrace.New --itrace option
Having no option is the same as--itrace
--itrace=cepwx
i synthesize "instructions" events b synthesize "branches" events x synthesize "transactions" events w synthesize "ptwrite" events p synthesize "power" events (incl. PSB events) c synthesize branches events (calls only) r synthesize branches events (returns only) o synthesize PEBS-via-PT events I synthesize Event Trace events e synthesize tracing error events d create a debug log g synthesize a call chain (use with i or x) G synthesize a call chain on existing event records l synthesize last branch entries (use with i or x) L synthesize last branch entries on existing event records s skip initial number of events q quicker (less detailed) decoding A approximate IPC Z prefer to ignore timestamps (so-called "timeless" decoding)
cbr: cbr: 22 freq: 2189 MHz (200%) mwait: hints: 0x60 extensions: 0x1 pwre: hw: 0 cstate: 2 sub-cstate: 0 exstop: ip: 1 pwrx: deepest cstate: 2 last cstate: 2 wake reason: 0x4
"cbr" includes the frequency and the percentage of maximum non-turbo "mwait" shows mwait hints and extensions "pwre" shows C-state transitions (to a C-state deeper than C0) and whether initiated by hardware "exstop" indicates execution stopped and whether the IP was recorded exactly, "pwrx" indicates return to C0
-o Suppress overflow errors -l Suppress trace data lost errors
--itrace=e-o-l
-a Suppress logging of perf events +a Log all perf events +e Output only on decoding errors (size configurable) +o Output to stdout instead of "intel_pt.log"
--itrace=i10us
--itrace=ig32 --itrace=xg32
--itrace=il10 --itrace=xl10
perf record --aux-sample -e '{intel_pt//u,branch-misses:u}' -- ls perf report --itrace=Ge
perf report
--itrace=i0nss1000000
•direct calls and jmps
•conditional branches
•non-branch instructions
•asynchronous branches such as
interrupts
•indirect branches
•function return target address
if the noretcomp config term (refer config terms section) was
used
•start of (control-flow) tracing
•end of (control-flow) tracing, if it
is not out of context
•power events, ptwrite, transaction
start and abort
•instruction pointer associated with
PSB packets
•everything except instruction pointer
associated with PSB packets
•instruction pointer associated with
PSB packets
dlfilter-show-cycles.so
Cycles can be displayed using dlfilter-show-cycles.so in which case the itrace A option can be useful to provide higher granularity cycle information:perf script --itrace=A --call-trace --dlfilter dlfilter-show-cycles.so
perf script -v --list-dlfilters
dump option
perf script has an option (-D) to "dump" the events i.e. display the binary data.PERF REPORT
By default, perf report will decode trace data found in the perf.data file. This can be further controlled by new option --itrace exactly the same as perf script, with the exception that the default is --itrace=igxe.PERF INJECT
perf inject also accepts the --itrace option in which case tracing data is removed and replaced with the synthesized events. e.g.perf inject --itrace -i perf.data -o perf.data.new
$ gcc-5 -O3 sort.c -o sort_optimized $ ./sort_optimized 30000 Bubble sorting array of 30000 elements 2254 ms
$ cat ~/.perfconfig [intel-pt] mispred-all = on
$ perf record -e intel_pt//u ./sort 3000 Bubble sorting array of 3000 elements 58 ms [ perf record: Woken up 2 times to write data ] [ perf record: Captured and wrote 3.939 MB perf.data ] $ perf inject -i perf.data -o inj --itrace=i100usle --strip $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1 $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo $ ./sort_autofdo 30000 Bubble sorting array of 30000 elements 2155 ms
PEBS VIA INTEL PT
Some hardware has the feature to redirect PEBS records to the Intel PT trace. Recording is selected by using the aux-output config term e.g.perf record -c 10000 -e '{intel_pt/branch=0/,cycles/aux-output/ppp}' uname
perf script -D --no-itrace | grep PERF_RECORD_AUX_OUTPUT_HW_ID
perf script --itrace=oe
XED
For --xed the xed tool is needed. Here is how to install it:$ git clone https://github.com/intelxed/mbuild.git mbuild $ git clone https://github.com/intelxed/xed $ cd xed $ ./mfile.py --share $ ./mfile.py examples $ sudo ./mfile.py --prefix=/usr/local install $ sudo ldconfig $ sudo cp obj/examples/xed /usr/local/bin
$ xed | head -3 ERROR: required argument(s) were missing Copyright (C) 2017, Intel Corporation. All rights reserved. XED version: [v10.0-328-g7d62c8c49b7b] $
TRACING VIRTUAL MACHINES (KERNEL ONLY)
Currently, kernel tracing is supported with either "timeless" decoding (i.e. no TSC timestamps) or VM Time Correlation. VM Time Correlation is an extra step using perf inject and requires unchanging VMX TSC Offset and no VMX TSC Scaling.VMX controls may suppress packets needed for decoding resulting in decoding errors VMX controls may block the perf NMI to the host potentially resulting in lost trace data Guest kernel self-modifying code (e.g. jump labels or JIT-compiled eBPF) will result in decoding errors Guest thread information is unknown Guest VCPU is unknown but may be able to be inferred from the host thread Callchains are not supported
$ sudo virsh start kubuntu20.04 Domain kubuntu20.04 started
$ mkdir vm0 $ sshfs -o direct_io root@vm0:/ vm0
$ perf buildid-cache -v --kcore vm0/proc/kcore kcore added to build-id cache directory /home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306 $ KALLSYMS=/home/user/.debug/[kernel.kcore]/9600f316a53a0f54278885e8d9710538ec5f6a08/2021021807494306/kallsyms
$ ps -eLl | grep 'KVM\|PID' F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD 3 S 64055 1430 1 1440 1 80 0 - 1921718 - ? 00:02:47 CPU 0/KVM 3 S 64055 1430 1 1441 1 80 0 - 1921718 - ? 00:02:41 CPU 1/KVM 3 S 64055 1430 1 1442 1 80 0 - 1921718 - ? 00:02:38 CPU 2/KVM 3 S 64055 1430 1 1443 2 80 0 - 1921718 - ? 00:03:18 CPU 3/KVM
$ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/tsc=0,mtc=0,cyc=1/k -p 1430 --per-thread ^C [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 5.829 MB ]
$ perf script --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21 CPU 0/KVM 1440 ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9 CPU 0/KVM 1440 ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10 CPU 0/KVM 1440 ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11 CPU 0/KVM 1440 ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12 CPU 0/KVM 1440 ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13 CPU 0/KVM 1440 ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14 CPU 0/KVM 1440 ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15 CPU 0/KVM 1440 ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax CPU 0/KVM 1440 ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40 CPU 0/KVM 1440 ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46 CPU 0/KVM 1440 ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.11 (50/445) :1440 1440 ffffffffbb678b06 native_write_msr+0x6 ([guest.kernel.kallsyms]) nopl %eax, (%rax,%rax,1) :1440 1440 ffffffffbb678b0b native_write_msr+0xb ([guest.kernel.kallsyms]) retq IPC: 0.04 (2/41) :1440 1440 ffffffffbb666646 lapic_next_deadline+0x26 ([guest.kernel.kallsyms]) data16 nop :1440 1440 ffffffffbb666648 lapic_next_deadline+0x28 ([guest.kernel.kallsyms]) xor %eax, %eax :1440 1440 ffffffffbb66664a lapic_next_deadline+0x2a ([guest.kernel.kallsyms]) popq %rbp :1440 1440 ffffffffbb66664b lapic_next_deadline+0x2b ([guest.kernel.kallsyms]) retq IPC: 0.16 (4/25) :1440 1440 ffffffffbb74607f clockevents_program_event+0x8f ([guest.kernel.kallsyms]) test %eax, %eax :1440 1440 ffffffffbb746081 clockevents_program_event+0x91 ([guest.kernel.kallsyms]) jz 0xffffffffbb74603c IPC: 0.06 (2/30) :1440 1440 ffffffffbb74603c clockevents_program_event+0x4c ([guest.kernel.kallsyms]) popq %rbx :1440 1440 ffffffffbb74603d clockevents_program_event+0x4d ([guest.kernel.kallsyms]) popq %r12
$ sudo virsh start kubuntu20.04 Domain kubuntu20.04 started
$ mkdir -p vm0 $ sshfs -o direct_io root@vm0:/ vm0
$ perf buildid-cache -v --kcore vm0/proc/kcore same kcore found in /home/user/.debug/[kernel.kcore]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777 $ KALLSYMS=/home/user/.debug/\[kernel.kcore\]/cc9c55a98c5e4ec0aeda69302554aabed5cd6491/2021021312450777/kallsyms
$ ps -eLl | grep 'KVM\|PID' F S UID PID PPID LWP C PRI NI ADDR SZ WCHAN TTY TIME CMD 3 S 64055 16998 1 17005 13 80 0 - 1818189 - ? 00:00:16 CPU 0/KVM 3 S 64055 16998 1 17006 4 80 0 - 1818189 - ? 00:00:05 CPU 1/KVM 3 S 64055 16998 1 17007 3 80 0 - 1818189 - ? 00:00:04 CPU 2/KVM 3 S 64055 16998 1 17008 4 80 0 - 1818189 - ? 00:00:05 CPU 3/KVM
$ sudo perf kvm --guest --host --guestkallsyms $KALLSYMS record --kcore -e intel_pt/cyc=1/k -p 16998 ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 9.041 MB perf.data.kvm ]
$ perf inject -i perf.data.kvm --vm-time-correlation=dry-run ERROR: Unknown TSC Offset for VMCS 0x1bff6a VMCS: 0x1bff6a TSC Offset 0xffffe42722c64c41 ERROR: Unknown TSC Offset for VMCS 0x1cbc08 VMCS: 0x1cbc08 TSC Offset 0xffffe42722c64c41 ERROR: Unknown TSC Offset for VMCS 0x1c3ce8 VMCS: 0x1c3ce8 TSC Offset 0xffffe42722c64c41 ERROR: Unknown TSC Offset for VMCS 0x1cbce9 VMCS: 0x1cbce9 TSC Offset 0xffffe42722c64c41
$ perf inject -i perf.data.kvm --vm-time-correlation="dry-run 0xffffe42722c64c41"
[ dry-run ] [ <TSC Offset> [ : <VMCS> [ , <VMCS> ]... ] ]...
$ perf inject -i perf.data.kvm --vm-time-correlation=0xffffe42722c64c41 --force
$ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --itrace=e-o
$ perf script -i perf.data.kvm --guestkallsyms $KALLSYMS --insn-trace --xed -F+ipc | grep -C10 vmresume | head -21 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cdd __vmx_vcpu_run+0x3d ([kernel.kallsyms]) movq 0x48(%rax), %r9 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce1 __vmx_vcpu_run+0x41 ([kernel.kallsyms]) movq 0x50(%rax), %r10 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce5 __vmx_vcpu_run+0x45 ([kernel.kallsyms]) movq 0x58(%rax), %r11 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ce9 __vmx_vcpu_run+0x49 ([kernel.kallsyms]) movq 0x60(%rax), %r12 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133ced __vmx_vcpu_run+0x4d ([kernel.kallsyms]) movq 0x68(%rax), %r13 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf1 __vmx_vcpu_run+0x51 ([kernel.kallsyms]) movq 0x70(%rax), %r14 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf5 __vmx_vcpu_run+0x55 ([kernel.kallsyms]) movq 0x78(%rax), %r15 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cf9 __vmx_vcpu_run+0x59 ([kernel.kallsyms]) movq (%rax), %rax CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133cfc __vmx_vcpu_run+0x5c ([kernel.kallsyms]) callq 0xffffffff82133c40 CPU 1/KVM 17006 [001] 11500.262865593: ffffffff82133c40 vmx_vmenter+0x0 ([kernel.kallsyms]) jz 0xffffffff82133c46 CPU 1/KVM 17006 [001] 11500.262866075: ffffffff82133c42 vmx_vmenter+0x2 ([kernel.kallsyms]) vmresume IPC: 0.05 (40/769) :17006 17006 [001] 11500.262869216: ffffffff82200cb0 asm_sysvec_apic_timer_interrupt+0x0 ([guest.kernel.kallsyms]) clac :17006 17006 [001] 11500.262869216: ffffffff82200cb3 asm_sysvec_apic_timer_interrupt+0x3 ([guest.kernel.kallsyms]) pushq $0xffffffffffffffff :17006 17006 [001] 11500.262869216: ffffffff82200cb5 asm_sysvec_apic_timer_interrupt+0x5 ([guest.kernel.kallsyms]) callq 0xffffffff82201160 :17006 17006 [001] 11500.262869216: ffffffff82201160 error_entry+0x0 ([guest.kernel.kallsyms]) cld :17006 17006 [001] 11500.262869216: ffffffff82201161 error_entry+0x1 ([guest.kernel.kallsyms]) pushq %rsi :17006 17006 [001] 11500.262869216: ffffffff82201162 error_entry+0x2 ([guest.kernel.kallsyms]) movq 0x8(%rsp), %rsi :17006 17006 [001] 11500.262869216: ffffffff82201167 error_entry+0x7 ([guest.kernel.kallsyms]) movq %rdi, 0x8(%rsp) :17006 17006 [001] 11500.262869216: ffffffff8220116c error_entry+0xc ([guest.kernel.kallsyms]) pushq %rdx :17006 17006 [001] 11500.262869216: ffffffff8220116d error_entry+0xd ([guest.kernel.kallsyms]) pushq %rcx :17006 17006 [001] 11500.262869216: ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms]) pushq %rax
TRACING VIRTUAL MACHINES (INCLUDING USER SPACE)
It is possible to use perf record to record sideband events within a virtual machine, so that an Intel PT trace on the host can be decoded. Sideband events from the guest perf.data file can be injected into the host perf.data file using perf inject.$ cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-5.10.0-16-amd64 root=UUID=cb49c910-e573-47e0-bce7-79e293df8e1d ro no-kvmclock
$ echo 0 | sudo tee /proc/sys/net/core/bpf_jit_enable 0
$ sudo perf record -o guest-sideband-testing-guest-perf.data --sample-identifier --buildid-all --switch-events --kcore -a -e dummy
$ sudo perf record -o guest-sideband-testing-host-perf.data -m,64M --kcore -a -e intel_pt/cyc/
$ uname Linux
^C [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 76.122 MB guest-sideband-testing-host-perf.data ]
^C [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.247 MB guest-sideband-testing-guest-perf.data ]
$ perf inject -i guest-sideband-testing-host-perf.data --vm-time-correlation=dry-run VMCS: 0x103fc6 TSC Offset 0xfffffa6ae070cb20 VMCS: 0x103ff2 TSC Offset 0xfffffa6ae070cb20 VMCS: 0x10fdaa TSC Offset 0xfffffa6ae070cb20 VMCS: 0x24d57c TSC Offset 0xfffffa6ae070cb20
$ perf inject -i guest-sideband-testing-host-perf.data --vm-time-correlation=0xfffffa6ae070cb20 --force
$ perf script -i guest-sideband-testing-host-perf.data --no-itrace --show-task-events | grep KVM CPU 0/KVM 0 [000] 0.000000: PERF_RECORD_COMM: CPU 0/KVM:13376/13381 CPU 1/KVM 0 [000] 0.000000: PERF_RECORD_COMM: CPU 1/KVM:13376/13382 CPU 2/KVM 0 [000] 0.000000: PERF_RECORD_COMM: CPU 2/KVM:13376/13383 CPU 3/KVM 0 [000] 0.000000: PERF_RECORD_COMM: CPU 3/KVM:13376/13384
$ mkdir -p ~/guestmount/13376 $ sshfs -o direct_io vm_to_test:/ ~/guestmount/13376
$ perf inject -i guest-sideband-testing-host-perf.data -o inj --guestmount ~/guestmount --guest-data=guest-sideband-testing-guest-perf.data,13376,0xfffffa6ae070cb20
•the CPU displayed, [002] in this case,
is always the host CPU
•events happening in the virtual
machine start with VM:13376 VCPU:003, which shows the hypervisor PID 13376 and
the VCPU number
•only calls and errors are displayed
i.e. --itrace=ce
•branches entering and exiting the
virtual machine are split, and show as 2 branches to/from "0 [unknown]
([unknown])"
$ perf script -i inj --itrace=ce -F+machine_pid,+vcpu,+addr,+pid,+tid,-period --ns --time 7919.408803365,7919.408804631 -C 2 CPU 3/KVM 13376/13384 [002] 7919.408803365: branches: ffffffffc0f8ebe0 vmx_vcpu_enter_exit+0xc0 ([kernel.kallsyms]) => ffffffffc0f8edc0 __vmx_vcpu_run+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803365: branches: ffffffffc0f8edd5 __vmx_vcpu_run+0x15 ([kernel.kallsyms]) => ffffffffc0f8eca0 vmx_update_host_rsp+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803365: branches: ffffffffc0f8ee1b __vmx_vcpu_run+0x5b ([kernel.kallsyms]) => ffffffffc0f8ed60 vmx_vmenter+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803461: branches: ffffffffc0f8ed62 vmx_vmenter+0x2 ([kernel.kallsyms]) => 0 [unknown] ([unknown]) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408803461: branches: 0 [unknown] ([unknown]) => 7f851c9b5a5c init_cacheinfo+0x3ac (/usr/lib/x86_64-linux-gnu/libc-2.31.so) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408803567: branches: 7f851c9b5a5a init_cacheinfo+0x3aa (/usr/lib/x86_64-linux-gnu/libc-2.31.so) => 0 [unknown] ([unknown]) CPU 3/KVM 13376/13384 [002] 7919.408803567: branches: 0 [unknown] ([unknown]) => ffffffffc0f8ed80 vmx_vmexit+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803596: branches: ffffffffc0f6619a vmx_vcpu_run+0x26a ([kernel.kallsyms]) => ffffffffb2255c60 x86_virt_spec_ctrl+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803801: branches: ffffffffc0f66445 vmx_vcpu_run+0x515 ([kernel.kallsyms]) => ffffffffb2290b30 native_write_msr+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803850: branches: ffffffffc0f661f8 vmx_vcpu_run+0x2c8 ([kernel.kallsyms]) => ffffffffc1092300 kvm_load_host_xsave_state+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803850: branches: ffffffffc1092327 kvm_load_host_xsave_state+0x27 ([kernel.kallsyms]) => ffffffffc1092220 kvm_load_host_xsave_state.part.0+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803862: branches: ffffffffc0f662cf vmx_vcpu_run+0x39f ([kernel.kallsyms]) => ffffffffc0f63f90 vmx_recover_nmi_blocking+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803862: branches: ffffffffc0f662e9 vmx_vcpu_run+0x3b9 ([kernel.kallsyms]) => ffffffffc0f619a0 __vmx_complete_interrupts+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803872: branches: ffffffffc109cfb2 vcpu_enter_guest+0x752 ([kernel.kallsyms]) => ffffffffc0f5f570 vmx_handle_exit_irqoff+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803881: branches: ffffffffc109d028 vcpu_enter_guest+0x7c8 ([kernel.kallsyms]) => ffffffffb234f900 __srcu_read_lock+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803897: branches: ffffffffc109d06f vcpu_enter_guest+0x80f ([kernel.kallsyms]) => ffffffffc0f72e30 vmx_handle_exit+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803897: branches: ffffffffc0f72e3d vmx_handle_exit+0xd ([kernel.kallsyms]) => ffffffffc0f727c0 __vmx_handle_exit+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803897: branches: ffffffffc0f72b15 __vmx_handle_exit+0x355 ([kernel.kallsyms]) => ffffffffc0f60ae0 vmx_flush_pml_buffer+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803903: branches: ffffffffc0f72994 __vmx_handle_exit+0x1d4 ([kernel.kallsyms]) => ffffffffc10b7090 kvm_emulate_cpuid+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803903: branches: ffffffffc10b70f1 kvm_emulate_cpuid+0x61 ([kernel.kallsyms]) => ffffffffc10b6e10 kvm_cpuid+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803941: branches: ffffffffc10b7125 kvm_emulate_cpuid+0x95 ([kernel.kallsyms]) => ffffffffc1093110 kvm_skip_emulated_instruction+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803941: branches: ffffffffc109311f kvm_skip_emulated_instruction+0xf ([kernel.kallsyms]) => ffffffffc0f5e180 vmx_get_rflags+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803951: branches: ffffffffc109312a kvm_skip_emulated_instruction+0x1a ([kernel.kallsyms]) => ffffffffc0f5fd30 vmx_skip_emulated_instruction+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803951: branches: ffffffffc0f5fd79 vmx_skip_emulated_instruction+0x49 ([kernel.kallsyms]) => ffffffffc0f5fb50 skip_emulated_instruction+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803956: branches: ffffffffc0f5fc68 skip_emulated_instruction+0x118 ([kernel.kallsyms]) => ffffffffc0f6a940 vmx_cache_reg+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803964: branches: ffffffffc0f5fc11 skip_emulated_instruction+0xc1 ([kernel.kallsyms]) => ffffffffc0f5f9e0 vmx_set_interrupt_shadow+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803980: branches: ffffffffc109f8b1 vcpu_run+0x71 ([kernel.kallsyms]) => ffffffffc10ad2f0 kvm_cpu_has_pending_timer+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803980: branches: ffffffffc10ad2fb kvm_cpu_has_pending_timer+0xb ([kernel.kallsyms]) => ffffffffc10b0490 apic_has_pending_timer+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803991: branches: ffffffffc109f899 vcpu_run+0x59 ([kernel.kallsyms]) => ffffffffc109c860 vcpu_enter_guest+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803993: branches: ffffffffc109cd4c vcpu_enter_guest+0x4ec ([kernel.kallsyms]) => ffffffffc0f69140 vmx_prepare_switch_to_guest+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803996: branches: ffffffffc109cd7d vcpu_enter_guest+0x51d ([kernel.kallsyms]) => ffffffffb234f930 __srcu_read_unlock+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803996: branches: ffffffffc109cd9c vcpu_enter_guest+0x53c ([kernel.kallsyms]) => ffffffffc0f609b0 vmx_sync_pir_to_irr+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408803996: branches: ffffffffc0f60a6d vmx_sync_pir_to_irr+0xbd ([kernel.kallsyms]) => ffffffffc10adc20 kvm_lapic_find_highest_irr+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804010: branches: ffffffffc0f60abd vmx_sync_pir_to_irr+0x10d ([kernel.kallsyms]) => ffffffffc0f60820 vmx_set_rvi+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804019: branches: ffffffffc109ceca vcpu_enter_guest+0x66a ([kernel.kallsyms]) => ffffffffb2249840 fpregs_assert_state_consistent+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804021: branches: ffffffffc109cf10 vcpu_enter_guest+0x6b0 ([kernel.kallsyms]) => ffffffffc0f65f30 vmx_vcpu_run+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804024: branches: ffffffffc0f6603b vmx_vcpu_run+0x10b ([kernel.kallsyms]) => ffffffffb229bed0 __get_current_cr3_fast+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804024: branches: ffffffffc0f66055 vmx_vcpu_run+0x125 ([kernel.kallsyms]) => ffffffffb2253050 cr4_read_shadow+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804030: branches: ffffffffc0f6608d vmx_vcpu_run+0x15d ([kernel.kallsyms]) => ffffffffc10921e0 kvm_load_guest_xsave_state+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804030: branches: ffffffffc1092207 kvm_load_guest_xsave_state+0x27 ([kernel.kallsyms]) => ffffffffc1092110 kvm_load_guest_xsave_state.part.0+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804032: branches: ffffffffc0f660c6 vmx_vcpu_run+0x196 ([kernel.kallsyms]) => ffffffffb22061a0 perf_guest_get_msrs+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804032: branches: ffffffffb22061a9 perf_guest_get_msrs+0x9 ([kernel.kallsyms]) => ffffffffb220cda0 intel_guest_get_msrs+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804039: branches: ffffffffc0f66109 vmx_vcpu_run+0x1d9 ([kernel.kallsyms]) => ffffffffc0f652c0 clear_atomic_switch_msr+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804040: branches: ffffffffc0f66119 vmx_vcpu_run+0x1e9 ([kernel.kallsyms]) => ffffffffc0f73f60 intel_pmu_lbr_is_enabled+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804042: branches: ffffffffc0f73f81 intel_pmu_lbr_is_enabled+0x21 ([kernel.kallsyms]) => ffffffffc10b68e0 kvm_find_cpuid_entry+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804045: branches: ffffffffc0f66454 vmx_vcpu_run+0x524 ([kernel.kallsyms]) => ffffffffc0f61ff0 vmx_update_hv_timer+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804057: branches: ffffffffc0f66142 vmx_vcpu_run+0x212 ([kernel.kallsyms]) => ffffffffc10af100 kvm_wait_lapic_expire+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804057: branches: ffffffffc0f66156 vmx_vcpu_run+0x226 ([kernel.kallsyms]) => ffffffffb2255c60 x86_virt_spec_ctrl+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804057: branches: ffffffffc0f66161 vmx_vcpu_run+0x231 ([kernel.kallsyms]) => ffffffffc0f8eb20 vmx_vcpu_enter_exit+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804057: branches: ffffffffc0f8eb44 vmx_vcpu_enter_exit+0x24 ([kernel.kallsyms]) => ffffffffb2353e10 rcu_note_context_switch+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804057: branches: ffffffffb2353e1c rcu_note_context_switch+0xc ([kernel.kallsyms]) => ffffffffb2353db0 rcu_qs+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804066: branches: ffffffffc0f8ebe0 vmx_vcpu_enter_exit+0xc0 ([kernel.kallsyms]) => ffffffffc0f8edc0 __vmx_vcpu_run+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804066: branches: ffffffffc0f8edd5 __vmx_vcpu_run+0x15 ([kernel.kallsyms]) => ffffffffc0f8eca0 vmx_update_host_rsp+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804066: branches: ffffffffc0f8ee1b __vmx_vcpu_run+0x5b ([kernel.kallsyms]) => ffffffffc0f8ed60 vmx_vmenter+0x0 ([kernel.kallsyms]) CPU 3/KVM 13376/13384 [002] 7919.408804162: branches: ffffffffc0f8ed62 vmx_vmenter+0x2 ([kernel.kallsyms]) => 0 [unknown] ([unknown]) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804162: branches: 0 [unknown] ([unknown]) => 7f851c9b5a5c init_cacheinfo+0x3ac (/usr/lib/x86_64-linux-gnu/libc-2.31.so) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804273: branches: 7f851cb7c0e4 _dl_init+0x74 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) => 7f851cb7bf50 call_init.part.0+0x0 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804526: branches: 55e0c00136f0 _start+0x0 (/usr/bin/uname) => ffffffff83200ac0 asm_exc_page_fault+0x0 ([kernel.kallsyms]) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804526: branches: ffffffff83200ac3 asm_exc_page_fault+0x3 ([kernel.kallsyms]) => ffffffff83201290 error_entry+0x0 ([kernel.kallsyms]) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804534: branches: ffffffff832012fa error_entry+0x6a ([kernel.kallsyms]) => ffffffff830b59a0 sync_regs+0x0 ([kernel.kallsyms]) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804631: branches: ffffffff83200ad9 asm_exc_page_fault+0x19 ([kernel.kallsyms]) => ffffffff830b8210 exc_page_fault+0x0 ([kernel.kallsyms]) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804631: branches: ffffffff830b82a4 exc_page_fault+0x94 ([kernel.kallsyms]) => ffffffff830b80e0 __kvm_handle_async_pf+0x0 ([kernel.kallsyms]) VM:13376 VCPU:003 uname 3404/3404 [002] 7919.408804631: branches: ffffffff830b80ed __kvm_handle_async_pf+0xd ([kernel.kallsyms]) => ffffffff830b80c0 kvm_read_and_reset_apf_flags+0x0 ([kernel.kallsyms])
TRACING VIRTUAL MACHINES - GUEST CODE
A common case for KVM test programs is that the test program acts as the hypervisor, creating, running and destroying the virtual machine, and providing the guest object code from its own object code. In this case, the VM is not running an OS, but only the functions loaded into it by the hypervisor test program, and conveniently, loaded at the same virtual addresses. To support that, option "--guest-code" has been added to perf script and perf kvm report.# perf record --kcore -e intel_pt/cyc/ -- tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.280 MB perf.data ] # perf script --guest-code --itrace=bep --ns -F-period,+addr,+flags [SNIP] tsc_msrs_test 18436 [007] 10897.962087733: branches: call ffffffffc13b2ff5 __vmx_vcpu_run+0x15 (vmlinux) => ffffffffc13b2f50 vmx_update_host_rsp+0x0 (vmlinux) tsc_msrs_test 18436 [007] 10897.962087733: branches: return ffffffffc13b2f5d vmx_update_host_rsp+0xd (vmlinux) => ffffffffc13b2ffa __vmx_vcpu_run+0x1a (vmlinux) tsc_msrs_test 18436 [007] 10897.962087733: branches: call ffffffffc13b303b __vmx_vcpu_run+0x5b (vmlinux) => ffffffffc13b2f80 vmx_vmenter+0x0 (vmlinux) tsc_msrs_test 18436 [007] 10897.962087836: branches: vmentry ffffffffc13b2f82 vmx_vmenter+0x2 (vmlinux) => 0 [unknown] ([unknown]) [guest/18436] 18436 [007] 10897.962087836: branches: vmentry 0 [unknown] ([unknown]) => 402c81 guest_code+0x131 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) [guest/18436] 18436 [007] 10897.962087836: branches: call 402c81 guest_code+0x131 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) => 40dba0 ucall+0x0 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) [guest/18436] 18436 [007] 10897.962088248: branches: vmexit 40dba0 ucall+0x0 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) => 0 [unknown] ([unknown]) tsc_msrs_test 18436 [007] 10897.962088248: branches: vmexit 0 [unknown] ([unknown]) => ffffffffc13b2fa0 vmx_vmexit+0x0 (vmlinux) tsc_msrs_test 18436 [007] 10897.962088248: branches: jmp ffffffffc13b2fa0 vmx_vmexit+0x0 (vmlinux) => ffffffffc13b2fd2 vmx_vmexit+0x32 (vmlinux) tsc_msrs_test 18436 [007] 10897.962088256: branches: return ffffffffc13b2fd2 vmx_vmexit+0x32 (vmlinux) => ffffffffc13b3040 __vmx_vcpu_run+0x60 (vmlinux) tsc_msrs_test 18436 [007] 10897.962088270: branches: return ffffffffc13b30b6 __vmx_vcpu_run+0xd6 (vmlinux) => ffffffffc13b2f2e vmx_vcpu_enter_exit+0x4e (vmlinux) [SNIP] tsc_msrs_test 18436 [007] 10897.962089321: branches: call ffffffffc13b2ff5 __vmx_vcpu_run+0x15 (vmlinux) => ffffffffc13b2f50 vmx_update_host_rsp+0x0 (vmlinux) tsc_msrs_test 18436 [007] 10897.962089321: branches: return ffffffffc13b2f5d vmx_update_host_rsp+0xd (vmlinux) => ffffffffc13b2ffa __vmx_vcpu_run+0x1a (vmlinux) tsc_msrs_test 18436 [007] 10897.962089321: branches: call ffffffffc13b303b __vmx_vcpu_run+0x5b (vmlinux) => ffffffffc13b2f80 vmx_vmenter+0x0 (vmlinux) tsc_msrs_test 18436 [007] 10897.962089424: branches: vmentry ffffffffc13b2f82 vmx_vmenter+0x2 (vmlinux) => 0 [unknown] ([unknown]) [guest/18436] 18436 [007] 10897.962089424: branches: vmentry 0 [unknown] ([unknown]) => 40dba0 ucall+0x0 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) [guest/18436] 18436 [007] 10897.962089701: branches: jmp 40dc1b ucall+0x7b (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) => 40dc39 ucall+0x99 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) [guest/18436] 18436 [007] 10897.962089701: branches: jcc 40dc3c ucall+0x9c (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) => 40dc20 ucall+0x80 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) [guest/18436] 18436 [007] 10897.962089701: branches: jcc 40dc3c ucall+0x9c (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) => 40dc20 ucall+0x80 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) [guest/18436] 18436 [007] 10897.962089701: branches: jcc 40dc37 ucall+0x97 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) => 40dc50 ucall+0xb0 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) [guest/18436] 18436 [007] 10897.962089878: branches: vmexit 40dc55 ucall+0xb5 (/home/user/git/work/tools/testing/selftests/kselftest_install/kvm/tsc_msrs_test) => 0 [unknown] ([unknown]) tsc_msrs_test 18436 [007] 10897.962089878: branches: vmexit 0 [unknown] ([unknown]) => ffffffffc13b2fa0 vmx_vmexit+0x0 (vmlinux) tsc_msrs_test 18436 [007] 10897.962089878: branches: jmp ffffffffc13b2fa0 vmx_vmexit+0x0 (vmlinux) => ffffffffc13b2fd2 vmx_vmexit+0x32 (vmlinux) tsc_msrs_test 18436 [007] 10897.962089887: branches: return ffffffffc13b2fd2 vmx_vmexit+0x32 (vmlinux) => ffffffffc13b3040 __vmx_vcpu_run+0x60 (vmlinux) tsc_msrs_test 18436 [007] 10897.962089901: branches: return ffffffffc13b30b6 __vmx_vcpu_run+0xd6 (vmlinux) => ffffffffc13b2f2e vmx_vcpu_enter_exit+0x4e (vmlinux) [SNIP]
# perf kvm --guest-code --guest --host report -i perf.data --stdio | head -20
# To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 12 of event 'instructions' # Event count (approx.): 2274583 # # Children Self Command Shared Object Symbol # ........ ........ ............. .................... ........................................... # 54.70% 0.00% tsc_msrs_test [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe | ---entry_SYSCALL_64_after_hwframe do_syscall_64 | |--29.44%--syscall_exit_to_user_mode | exit_to_user_mode_prepare | task_work_run | __fput
EVENT TRACE
Event Trace records information about asynchronous events, for example interrupts, faults, VM exits and entries. The information is recorded in CFE and EVD packets, and also the Interrupt Flag is recorded on the MODE.Exec packet. The CFE packet contains a type field to identify one of the following:1 INTR interrupt, fault, exception, NMI 2 IRET interrupt return 3 SMI system management interrupt 4 RSM resume from system management mode 5 SIPI startup interprocessor interrupt 6 INIT INIT signal 7 VMENTRY VM-Entry 8 VMEXIT VM-Entry 9 VMEXIT_INTR VM-Exit due to interrupt 10 SHUTDOWN Shutdown
perf record -e intel_pt/event/u uname
perf script --itrace=Ie
evt: hw int (t) cfe: INTR IP: 1 vector: 3 PFA: 0x8877665544332211
iflag: t IFLAG: 1->0 via branch
no flag, interrupts enabled IF=1 t interrupts become disabled IF=1 -> IF=0 D interrupts are disabled IF=0 Dt interrupts become enabled IF=0 -> IF=1
TNT DISABLE
TNT packets are disabled using the "notnt" config term. e.g.perf record -e intel_pt/notnt/u uname
EMULATED PTWRITE
Later perf tools support a method to emulate the ptwrite instruction, which can be useful if hardware does not support the ptwrite instruction.#include <stdint.h>
void perf_emulate_ptwrite(uint64_t x) __attribute__((externally_visible, noipa, no_instrument_function, naked));
#define PERF_EMULATE_PTWRITE_8_BITS \ "1: shl %rax\n" \ " jc 1f\n" \ "1: shl %rax\n" \ " jc 1f\n" \ "1: shl %rax\n" \ " jc 1f\n" \ "1: shl %rax\n" \ " jc 1f\n" \ "1: shl %rax\n" \ " jc 1f\n" \ "1: shl %rax\n" \ " jc 1f\n" \ "1: shl %rax\n" \ " jc 1f\n" \ "1: shl %rax\n" \ " jc 1f\n"
/* Undefined instruction */ #define PERF_EMULATE_PTWRITE_UD2 ".byte 0x0f, 0x0b\n"
#define PERF_EMULATE_PTWRITE_MAGIC PERF_EMULATE_PTWRITE_UD2 ".ascii \"perf,ptwrite \"\n"
void perf_emulate_ptwrite(uint64_t x __attribute__ ((__unused__))) { /* Assumes SysV ABI : x passed in rdi */ __asm__ volatile ( "jmp 1f\n" PERF_EMULATE_PTWRITE_MAGIC "1: mov %rdi, %rax\n" PERF_EMULATE_PTWRITE_8_BITS PERF_EMULATE_PTWRITE_8_BITS PERF_EMULATE_PTWRITE_8_BITS PERF_EMULATE_PTWRITE_8_BITS PERF_EMULATE_PTWRITE_8_BITS PERF_EMULATE_PTWRITE_8_BITS PERF_EMULATE_PTWRITE_8_BITS PERF_EMULATE_PTWRITE_8_BITS "1: ret\n" ); }
#include <stdio.h> #include <stdint.h> #include <stdlib.h>
#include "perf_emulate_ptwrite.h"
int main(int argc, char *argv[]) { uint64_t x = 0;
if (argc > 1) x = strtoull(argv[1], NULL, 0); perf_emulate_ptwrite(x); return 0; }
$ gcc -Wall -Wextra -O3 -g -o eg_ptw eg_ptw.c $ perf record -e intel_pt//u ./eg_ptw 0x1234567890abcdef [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.017 MB perf.data ] $ perf script --itrace=ew eg_ptw 19875 [007] 8061.235912: ptwrite: IP: 0 payload: 0x1234567890abcdef 55701249a196 perf_emulate_ptwrite+0x16 (/home/user/eg_ptw) $
PIPE MODE
Pipe mode is a problem for Intel PT and possibly other auxtrace users. It’s not recommended to use a pipe as data output with Intel PT because of the following reason.EXAMPLE
Examples can be found on perf wiki page "Perf tools support for Intel® Processor Trace":SEE ALSO
perf-record(1), perf-script(1), perf-report(1), perf-inject(1)2024-06-21 | perf |