stapprobes - systemtap probe points
The following sections enumerate the variety of probe points supported by the
systemtap translator, and some of the additional aliases defined by standard
tapset scripts. Many are individually documented in the 3stap manual section,
with the probe:: prefix.
probe PROBEPOINT [, PROBEPOINT] { [STMT ...] }
A probe declaration may list multiple comma-separated probe points in order to
attach a handler to all of the named events. Normally, the handler statements
are run whenever any of the events occur. Depending on the type of probe point,
the handler statements may refer to context variables (denoted with a
dollar-sign prefix like $foo) to read or write state. This may include
function parameters for function probes, or local variables for statement
probes.
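As a minimal sketch of a declaration with two comma-separated probe points, the following script attaches one handler to two kernel functions. The function names vfs_read and vfs_write are assumptions about the running kernel, and resolving them may require kernel debuginfo.

```systemtap
# Attach one handler to two events; it runs whenever either one occurs.
probe kernel.function("vfs_read"), kernel.function("vfs_write") {
  # pp() returns the fully expanded probe point that fired.
  printf("%s (pid %d) hit %s\n", execname(), pid(), pp())
}

# Stop the session after five seconds.
probe timer.s(5) { exit() }
```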
The syntax of a single probe point is a general dotted-symbol sequence. This
allows a breakdown of the event namespace into parts, somewhat like the Domain
Name System does on the Internet. Each component identifier may be
parametrized by a string or number literal, with a syntax like a function
call. A component may include a "*" character, to expand to a set of
matching probe points. It may also include "**" to match multiple
sequential components at once. Probe aliases likewise expand to other probe
points.
Probe aliases can be given on their own, or with a suffix. The suffix attaches
to the underlying probe point that the alias is expanded to. For example,
syscall.read.return.maxactive(10)
expands to
kernel.function("sys_read").return.maxactive(10)
with the component maxactive(10) being recognized as a suffix.
Normally, each and every probe point resulting from wildcard- and
alias-expansion must be resolved to some low-level system instrumentation
facility (e.g., a kprobe address, marker, or a timer configuration), otherwise
the elaboration phase will fail.
However, a probe point may be followed by a "?" character, to indicate
that it is optional, and that no error should result if it fails to resolve.
Optionalness passes down through all levels of alias/wildcard expansion.
Alternately, a probe point may be followed by a "!" character, to
indicate that it is both optional and sufficient. (Think vaguely of the Prolog
cut operator.) If it does resolve, then no further probe points in the same
comma-separated list will be resolved. Therefore, the "!"
sufficiency mark only makes sense in a list of probe point alternatives.
Additionally, a probe point may be followed by an "if (expr)"
statement, in order to enable/disable the probe point on-the-fly. With the
"if" statement, if the "expr" is false when the probe
point is hit, the whole probe body, including the bodies of any aliases, is
skipped. The condition is stacked up through all levels of alias/wildcard
expansion, so the final condition becomes the logical-and of the conditions of
all expanded aliases/wildcards. The expressions are necessarily restricted to
global variables.
These are all syntactically valid probe points. (They are generally
semantically invalid, depending on the contents of the tapsets, and the
versions of kernel/user software installed.)
kernel.function("foo").return
process("/bin/vi").statement(0x2222)
end
syscall.*
syscall.*.return.maxactive(10)
syscall.{open,close}
sys**open
kernel.function("no_such_function") ?
module("awol").function("no_such_function") !
signal.*? if (switch)
kprobe.function("foo")
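The optional, sufficient, and conditional forms can be combined in a script roughly as follows. This is only a sketch: hypothetical_new_name and hypothetical_old_name are made-up function names used to show the fallback pattern, vfs_read is assumed to exist in the running kernel, and the global variable tracing is an illustrative name.

```systemtap
global tracing = 0

# "!" marks the first alternative as sufficient: if it resolves, the second
# one is not resolved. "?" keeps the second from being an error if missing.
probe kernel.function("hypothetical_new_name") !,
      kernel.function("hypothetical_old_name") ?
{
  printf("reached %s\n", pp())
}

# The handler body is skipped whenever the global "tracing" is zero.
probe kernel.function("vfs_read") if (tracing) {
  printf("%s called vfs_read\n", execname())
}

# Toggle the condition once per second, and stop after ten seconds.
probe timer.s(1) { tracing = !tracing }
probe timer.s(10) { exit() }
```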
Probes may be broadly classified into "synchronous" and
"asynchronous". A "synchronous" event is deemed to occur
when any processor executes an instruction matched by the specification. This
gives these probes a reference point (instruction address) from which more
contextual data may be available. Other families of probe points refer to
"asynchronous" events such as timers/counters rolling over, where
there is no fixed reference point that is related. Each probe point
specification may match multiple locations (for example, using wildcards or
aliases), and all of them are then probed. A probe declaration may also contain
several comma-separated specifications, all of which are probed.
Brace expansion is a mechanism which allows a list of probe points to be
generated. It is very similar to shell expansion. A component may be
surrounded by a pair of curly braces to indicate that the comma-separated
sequence of one or more subcomponents will each constitute a new probe point.
The braces may be arbitrarily nested. The ordering of expanded results is
based on product order.
The question mark (?), exclamation mark (!) indicators and probe point
conditions may not be placed in any expansions that are before the last
component.
The following is an example of brace expansion.
syscall.{write,read}
# Expands to
syscall.write, syscall.read
{kernel,module("nfs")}.function("nfs*")!
# Expands to
kernel.function("nfs*")!, module("nfs").function("nfs*")!
Resolving some probe points requires DWARF debuginfo or "debug symbols" for
the specific program being instrumented. For some
others, DWARF is automatically synthesized on the fly from source code header
files. For others, it is not needed at all. Since a systemtap script may use
any mixture of probe points together, the union of their DWARF requirements
has to be met on the computer where script compilation occurs. (See the
--use-server option and the
stap-server(8) man page for
information about the remote compilation facility, which allows these
requirements to be met on a different machine.)
The following table lists many of the available probe point families, to
classify them with respect to their need for DWARF debuginfo for the specific
program for that probe point.
DWARF                          NON-DWARF                    SYMBOL-TABLE

kernel.function, .statement    kernel.mark                  kernel.function*
module.function, .statement    process.mark, process.plt    module.function*
process.function, .statement   begin, end, error, never     process.function*
process.mark*                  timer
python2, python3               procfs
                               kernel.statement.absolute
AUTO-GENERATED-DWARF           kernel.data
                               kprobe.function
kernel.trace                   process.statement.absolute
                               process.begin, .end
                               netfilter
                               java
The probe types marked with * asterisks are fallbacks, where systemtap
can sometimes infer subset or substitute information. In general, the more
symbolic / debugging information available, the higher the quality of probing
will be.
The following types of probe points may be armed/disarmed on-the-fly to save
overheads during uninteresting times. Arming conditions may also be added to
other types of probes, but will be treated as a wrapping conditional and won't
benefit from overhead savings.
DISARMABLE                                 exceptions

kernel.function, kernel.statement
module.function, module.statement
process.*.function, process.*.statement
process.*.plt, process.*.mark
timer.*                                    timer.profile
java
The probe points begin and end are defined by the translator to refer to the
time of session startup and shutdown. All "begin" probe handlers are run, in
some sequence, during the startup of the session. All global variables will
have been initialized prior to this point. All "end" probes are run, in some
sequence, during the normal shutdown of a session, such as in the aftermath of
an exit() function call, or an interruption from the user. In the case of an
error-triggered shutdown, "end" probes are not run. There are no target
variables available in either context.
If the order of execution among "begin" or "end" probes is
significant, then an optional sequence number may be provided:
begin(N)
end(N)
The number N may be positive or negative. The probe handlers are run in
increasing order, and the order between handlers with the same sequence number
is unspecified. When "begin" or "end" are given without a
sequence, they are effectively sequence zero.
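A short sketch of sequencing, using arbitrary sequence numbers chosen only for illustration. The handlers run in increasing order of their sequence number, with unnumbered probes counting as zero.

```systemtap
probe begin(-1) { println("begin(-1): runs first") }
probe begin     { println("begin: sequence zero, runs second") }
probe begin(1)  { println("begin(1): runs third"); exit() }

probe end       { println("end: sequence zero, runs at shutdown") }
probe end(1)    { println("end(1): runs after the unnumbered end probe") }
```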
The error probe point is similar to the end probe, except that each such probe
handler is run when the session ends after errors have occurred. In such
cases, "end" probes are skipped, but each "error" probe is still attempted.
This kind of probe can be used to clean up or emit a "final gasp". It may also
be numerically parametrized to set a sequence.
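The difference between end and error probes can be seen with a script along these lines. It is a sketch that assumes the default error limit, under which the first call to the error() tapset function puts the session into its error-triggered shutdown; the messages are illustrative.

```systemtap
# Force an error-triggered shutdown one second into the session.
probe timer.s(1) { error("simulated failure") }

# Skipped, because the session ends after an error.
probe end { println("normal shutdown") }

# Still attempted, so it can clean up or report.
probe error { println("final gasp: session ended after an error") }
```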
The probe point never is specially defined by the translator to mean
"never". Its probe handler is never run, though its statements are
analyzed for symbol / type correctness as usual. This probe point may be
useful in conjunction with optional probes.
The syscall.* and nd_syscall.* aliases define several hundred probes, too
many to detail here. They are of the general form:
syscall.NAME
nd_syscall.NAME
syscall.NAME.return
nd_syscall.NAME.return
Generally, a pair of probes is defined for each normal system call as listed
in the syscalls(2) manual page, one for entry and one for return. Those
system calls that never return do not have a corresponding .return probe.
The nd_* family of probes is about the same, except that it uses non-DWARF
based searching mechanisms, which may result in a lower quality of symbolic
context data (parameters), and may miss some system calls. You may want to
try them first, in case kernel debugging information is not immediately
available.
Each probe alias provides a variety of variables. Looking at the tapset source
code is the most reliable way to find them. Generally, each variable listed in
the standard manual page is made available as a script-level variable, so
syscall.open exposes filename, flags, and mode. In addition, a standard suite
of variables is available at most aliases:
- argstr
- A pretty-printed form of the entire argument list, without
parentheses.
- name
- The name of the system call.
- retval
- For return probes, the raw numeric system-call result.
- retstr
- For return probes, a pretty-printed string form of the
system-call result.
As usual for probe aliases, these variables are all initialized once from the
underlying $context variables, so that later changes to $context variables are
not automatically reflected. Not all probe aliases obey all of these general
guidelines. Please report any bothersome ones you encounter as a bug. Note
that on some kernel/userspace architecture combinations (e.g., 32-bit
userspace on 64-bit kernel), the underlying $context variables may need
explicit sign extension / masking. When this is an issue, consider using the
tapset-provided variables instead of raw $context variables.
If debuginfo availability is a problem, you may try using the non-DWARF syscall
probe aliases instead. Use the nd_syscall. prefix instead of the syscall.
prefix. The same context variables are available, as far as possible.
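The standard variables can be used as in the sketch below. It assumes the installed tapset defines a syscall.openat alias; if kernel debuginfo is unavailable, the same handlers should work with the nd_syscall. prefix.

```systemtap
# Entry probe: name and argstr are set by the alias.
probe syscall.openat {
  printf("%s[%d] %s(%s)\n", execname(), pid(), name, argstr)
}

# Return probe: retstr is the pretty-printed result.
probe syscall.openat.return {
  printf("%s[%d] %s = %s\n", execname(), pid(), name, retstr)
}

# Stop after five seconds.
probe timer.s(5) { exit() }
```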
nd_syscall probes on kernels that use syscall wrappers to pass arguments
via pt_regs (currently 4.17+ on x86_64 and 4.19+ on aarch64) support syscall
argument writing when guru mode is enabled. If a probe syscall parameter is
modified in the probe body then immediately before the probe exits the
parameter's current value will be written to pt_regs. This overwrites the
previous value.
nd_syscall probes also include two parameters for each
of the syscall's string parameters. One holds a quoted version of the string
passed to the syscall. The other holds an unquoted version of the string
intended to be used when modifying the parameter. If the probe modifies the
unquoted string variable then as the probe is about to exit the contents of
this variable will be written to the user space buffer passed to the syscall.
It is the user's responsibility to ensure that this buffer is large enough to
hold the modified string and that it is located in a writable memory segment.
There are two main types of timer probes: "jiffies" timer probes and
time interval timer probes.
Intervals defined by the standard kernel "jiffies" timer may be used
to trigger probe handlers asynchronously. Two probe point variants are
supported by the translator:
timer.jiffies(N)
timer.jiffies(N).randomize(M)
The probe handler is run every N jiffies (a kernel-defined unit of time,
typically between 1 and 60 ms). If the "randomize" component is
given, a linearly distributed random value in the range [-M..+M] is added to N
every time the handler is run. N is restricted to a reasonable range (1 to
around a million), and M is restricted to be smaller than N. There are no
target variables provided in either context. It is possible for such probes to
be run concurrently on a multi-processor computer.
Alternatively, intervals may be specified in units of time. There are two probe
point variants similar to the jiffies timer:
timer.ms(N)
timer.ms(N).randomize(M)
Here, N and M are specified in milliseconds, but the full options for units are
seconds (s/sec), milliseconds (ms/msec), microseconds (us/usec), nanoseconds
(ns/nsec), and hertz (hz). Randomization is not supported for hertz timers.
The actual resolution of the timers depends on the target kernel. For kernels
prior to 2.6.17, timers are limited to jiffies resolution, so intervals are
rounded up to the nearest jiffies interval. After 2.6.17, the implementation
uses hrtimers for tighter precision, though the actual resolution will be
arch-dependent. In either case, if the "randomize" component is
given, then the random value will be added to the interval before any rounding
occurs.
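A small sketch combining several timer variants; the intervals and the global names jittered and steady are arbitrary choices for illustration.

```systemtap
global jittered, steady

# Fires about every 100 ms, with a random offset in [-10..+10] ms each time.
probe timer.ms(100).randomize(10) { jittered++ }

# Fires 50 times per second; randomize() is not supported on hertz timers.
probe timer.hz(50) { steady++ }

# Report and stop after five seconds.
probe timer.s(5) {
  printf("randomized ms timer fired %d times, 50 Hz timer fired %d times\n",
         jittered, steady)
  exit()
}
```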
Profiling timers are also available to provide probes that execute on all CPUs
at the rate of the system tick (CONFIG_HZ) or at a given frequency (hz). On
some kernels, this is a one-concurrent-user-only or disabled facility,
resulting in error -16 (EBUSY) during probe registration.
timer.profile.tick
timer.profile.freq.hz(N)
Full context information of the interrupted process is available, making this
probe suitable for a time-based sampling profiler.
It is recommended to use the tapset probe timer.profile rather than
timer.profile.tick. This probe point behaves identically to timer.profile.tick
when the underlying functionality is available, and falls back to using
perf.sw.cpu_clock on some recent kernels which lack the corresponding profile
timer facility.
Profiling timers with specified frequencies are only accurate up to around 100
hz. You may need to provide a larger value to achieve the desired rate.
Note that if a timer probe is set to fire at a very high rate and if the probe
body is complex, succeeding timer probes can get skipped, since the time for
them to run has already passed. Normally systemtap reports missed probes, but
it will not report these skipped probes.
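As a sketch of a time-based sampling profiler built on timer.profile, the script below counts samples per executable name and prints the ten busiest after ten seconds. The duration, the limit, and the array name samples are arbitrary.

```systemtap
global samples

# Runs on every CPU at the profiling rate, in the context of whatever
# was interrupted.
probe timer.profile {
  samples[execname()]++
}

probe timer.s(10) {
  # Iterate in decreasing order of sample count, top ten only.
  foreach (name in samples- limit 10)
    printf("%-20s %d\n", name, samples[name])
  exit()
}
```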
This family of probe points uses symbolic debugging information for the target
kernel/module/program, as may be found in unstripped executables, or the
separate debuginfo packages. They allow placement of probes logically
into the execution path of the target program, by specifying a set of points
in the source or object code. When a matching statement executes on any
processor, the probe handler is run in that context.
Probe points in the DWARF family can be identified by the target kernel module
(or user process), source file, line number, function name, or some
combination of these.
Here is a list of DWARF probe points currently supported:
kernel.function(PATTERN)
kernel.function(PATTERN).call
kernel.function(PATTERN).callee(PATTERN)
kernel.function(PATTERN).callee(PATTERN).return
kernel.function(PATTERN).callee(PATTERN).call
kernel.function(PATTERN).callees(DEPTH)
kernel.function(PATTERN).return
kernel.function(PATTERN).inline
kernel.function(PATTERN).label(LPATTERN)
module(MPATTERN).function(PATTERN)
module(MPATTERN).function(PATTERN).call
module(MPATTERN).function(PATTERN).callee(PATTERN)
module(MPATTERN).function(PATTERN).callee(PATTERN).return
module(MPATTERN).function(PATTERN).callee(PATTERN).call
module(MPATTERN).function(PATTERN).callees(DEPTH)
module(MPATTERN).function(PATTERN).return
module(MPATTERN).function(PATTERN).inline
module(MPATTERN).function(PATTERN).label(LPATTERN)
kernel.statement(PATTERN)
kernel.statement(PATTERN).nearest
kernel.statement(ADDRESS).absolute
module(MPATTERN).statement(PATTERN)
process("PATH").function("NAME")
process("PATH").statement("*@FILE.c:123")
process("PATH").library("PATH").function("NAME")
process("PATH").library("PATH").statement("*@FILE.c:123")
process("PATH").library("PATH").statement("*@FILE.c:123").nearest
process("PATH").function("*").return
process("PATH").function("myfun").label("foo")
process("PATH").function("foo").callee("bar")
process("PATH").function("foo").callee("bar").return
process("PATH").function("foo").callee("bar").call
process("PATH").function("foo").callees(DEPTH)
process(PID).function("NAME")
process(PID).function("myfun").label("foo")
process(PID).plt("NAME")
process(PID).plt("NAME").return
process(PID).statement("*@FILE.c:123")
process(PID).statement("*@FILE.c:123").nearest
process(PID).statement(ADDRESS).absolute
(See the USER-SPACE section below for more information on the process probes.)
The list above includes multiple variants and modifiers which provide additional
functionality or filters. They are:
- .function
- Places a probe near the beginning of the named function, so
that parameters are available as context variables.
- .return
- Places a probe at the moment after the return from
the named function, so the return value is available as the
"$return" context variable.
- .inline
- Filters the results to include only instances of inlined
functions. Note that inlined functions do not have an identifiable return
point, so .return is not supported on .inline probes.
- .call
- Filters the results to include only non-inlined functions
(the opposite set of .inline)
- .exported
- Filters the results to include only exported
functions.
- .statement
- Places a probe at the exact spot, exposing those local
variables that are visible there.
- .statement.nearest
- Places a probe at the nearest available line number for
each line number given in the statement.
- .callee
- Places a probe on the callee function given in the
.callee modifier, where the callee must be a function called by the
target function given in .function. The advantage of doing this
over directly probing the callee function is that this probe point is run
only when the callee is called from the target function (add the
-DSTAP_CALLEE_MATCHALL directive to override this when calling
stap(1)).
Note that only callees that can be statically determined are available. For
example, calls through function pointers are not available. Additionally,
calls to functions located in other objects (e.g. libraries) are not
available (instead use another probe point). This feature will only work
for code compiled with GCC 4.7+.
- .callees
- Shortcut for .callee("*"), which places a
probe on all callees of the function.
- .callees(DEPTH)
- Recursively places probes on callees. For example,
.callees(2) will probe both callees of the target function, as well
as callees of those callees. And .callees(3) goes one level deeper,
etc... A callee probe at depth N is only triggered when the N callers in
the callstack match those that were statically determined during analysis
(this also may be overridden using -DSTAP_CALLEE_MATCHALL).
In the above list of probe points, MPATTERN stands for a string literal that
aims to identify the loaded kernel module of interest. For in-tree kernel
modules, the name suffices (e.g. "btrfs"). The name may also include
the "*", "[]", and "?" wildcards to match
multiple in-tree modules. Out-of-tree modules are also supported by specifying
the full path to the ko file. Wildcards are not supported. The file must
follow the convention of being named <module_name>.ko (characters ','
and '-' are replaced by '_').
LPATTERN stands for a source program label. It may also contain "*",
"[]", and "?" wildcards. PATTERN stands for a string
literal that aims to identify a point in the program. It is made up of three
parts:
- The first part is the name of a function, as would appear
in the nm program's output. This part may use the "*" and
"?" wildcarding operators to match multiple names.
- The second part is optional and begins with the
"@" character. It is followed by the path to the source file
containing the function, which may include a wildcard pattern, such as
mm/slab*. If it does not match as is, an implicit "*/" is
optionally added before the pattern, so that a script need only
name the last few components of a possibly long source directory
path.
- Finally, the third part is optional if the file name part
was given, and identifies the line number in the source file preceded by a
":" or a "+". The line number is assumed to be an
absolute line number if preceded by a ":", or relative to the
declaration line of the function if preceded by a "+". All the
lines in the function can be matched with ":*". A range of lines
x through y can be matched with ":x-y". Ranges and specific
lines can be mixed using commas, e.g. ":x,y-z".
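The three parts of PATTERN might be combined as in this sketch. It assumes kernel debuginfo is installed and that the running kernel defines vfs_read and vfs_write in fs/read_write.c; the relative-line probe is marked optional because that line may not correspond to a probeable statement.

```systemtap
# Function part with a wildcard, plus a source file part.
probe kernel.function("vfs_*@fs/read_write.c").call {
  printf("entered %s\n", pp())
}

# Every probeable line of one function (":*").
probe kernel.statement("vfs_read@fs/read_write.c:*") {
  printf("at %s\n", pp())
}

# A line relative to the function's declaration line ("+").
probe kernel.statement("vfs_write@fs/read_write.c+1") ? {
  printf("at %s\n", pp())
}

# Stop after two seconds.
probe timer.s(2) { exit() }
```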
As an alternative, PATTERN may be a numeric constant, indicating an address.
Such an address may be found from symbol tables of the appropriate kernel /
module object file. It is verified against known statement code boundaries,
and will be relocated for use at run time.
In guru mode only, absolute kernel-space addresses may be specified with the
".absolute" suffix. Such an address is considered already relocated,
as if it came from
/proc/kallsyms, so it cannot be checked against
statement/instruction boundaries.
Many of the source-level context variables, such as function parameters, locals,
globals visible in the compilation unit, may be visible to probe handlers.
They may refer to these variables by prefixing their name with "$"
within the scripts. In addition, a special syntax allows limited traversal of
structures, pointers, and arrays. More syntax allows pretty-printing of
individual variables or their groups. See also @cast. Note that
variables may be inaccessible due to them being paged out, or for a few other
reasons. See also man error::fault(7stap).
Functions called from DWARF class probe points and from process.mark probes may
also refer to context variables.
- $var
- refers to an in-scope variable or thread local storage
variable "var". If it's an integer-like type, it will be cast to
a 64-bit int for systemtap script use. String-like pointers (char *) may
be copied to systemtap string values using the kernel_string or
user_string functions.
- @var("varname")
- an alternative syntax for $varname
- @var("varname","module")
- The global variable or global thread local storage variable
in scope of the given module already loaded into the current probed
process. Useful to get an exported variable in a shared library loaded
into the process being probed, or a global variable in a process while a
shared library probe is being executed. For user-space modules only. For
example:
@var("_r_debug","/lib/ld-linux.so.2")
- @var("varname@src/file.c")
- refers to the global (either file local or external)
variable varname defined when the file src/file.c was
compiled. The CU in which the variable is resolved is the first CU in the
module of the probe point which matches the given file name at the end and
has the shortest file name path (e.g. given
@var("foo@bar/baz.c") and CUs with file name paths
src/sub/module/bar/baz.c and src/bar/baz.c the second CU
will be chosen to resolve the (file) global variable foo).
- @var("varname@src/file.c","module")
- The global variable in scope of the given CU, defined in
the given module, even if the variable is static (so the name is not
unique without the CU name).
- $var->field
- traversal via a structure's or a pointer's field. This generalized
indirection operator may be repeated to follow more levels. Note that the
. operator is not used for plain structure members, only -> for both
purposes. (This is because "." is reserved for string concatenation.) Also
note that for direct dereferencing of a $var pointer,
{kernel,user}_{char,int,...}($var) should be used. (Refer to stapfuncs(5)
for more details.)
- $return
- is available in return probes only for functions that are
declared with a return value, which can be determined using
@defined($return).
- $var[N]
- indexes into an array. The index may be given as a literal
number or even an arbitrary numeric expression.
A number of operators exist for such basic context variable expressions:
- $$vars
- expands to a character string that is equivalent to
sprintf("parm1=%x ... parmN=%x var1=%x ... varN=%x",
parm1, ..., parmN, var1, ..., varN)
for each variable in scope at the probe point. Some values may be printed as
=? if their run-time location cannot be found.
- $$locals
- expands to a subset of $$vars for only local
variables.
- $$parms
- expands to a subset of $$vars for only function
parameters.
- $$return
- is available in return probes only. It expands to a string
that is equivalent to sprintf("return=%x", $return) if the
probed function has a return value, or else an empty string.
- & $EXPR
- expands to the address of the given context variable
expression, if it is addressable.
- @defined($EXPR)
- expands to 1 or 0 iff the given context variable expression
is resolvable, for use in conditionals such as
@defined($foo->bar) ? $foo->bar : 0
- @probewrite($VAR)
- see the PROBES section of stap(1).
- $EXPR$
- expands to a string with all of $EXPR's members, equivalent
to
sprintf("{.a=%i, .b=%u, .c={...}, .d=[...]}",
$EXPR->a, $EXPR->b)
- $EXPR$$
- expands to a string with all of $EXPR's members and
submembers, equivalent to
sprintf("{.a=%i, .b=%u, .c={.x=%p, .y=%c}, .d=[%i, ...]}",
$EXPR->a, $EXPR->b, $EXPR->c->x, $EXPR->c->y, $EXPR->d[0])
- @errno
- expands to the last value the C library global variable
errno was set to.
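Several of these expressions can be seen together in the sketch below. It assumes kernel debuginfo is installed and that vfs_read takes parameters named file and count, with the struct file layout of typical recent kernels; the @defined guard covers kernels where the member path differs.

```systemtap
probe kernel.function("vfs_read") {
  # $$parms pretty-prints every function parameter.
  printf("%s: %s\n", execname(), $$parms)

  # Guard a member access that may not resolve on every kernel version.
  if (@defined($file->f_path->dentry)) {
    # -> is used for both pointer and embedded-structure members.
    printf("  file name: %s\n",
           kernel_string($file->f_path->dentry->d_name->name))
  }

  # $count is integer-like, so it arrives as a 64-bit script integer.
  printf("  bytes requested: %d\n", $count)
}

# Stop after two seconds.
probe timer.s(2) { exit() }
```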
For the kernel ".return" probes, only a certain fixed number of
returns may be outstanding. The default is a relatively small number, on the
order of a few times the number of physical CPUs. If many different threads
concurrently call the same blocking function, such as
futex(2) or
read(2),
this limit could be exceeded, and skipped "kretprobes" would be
reported by "stap -t". To work around this, specify a
probe FOO.return.maxactive(NNN)
suffix, with a large enough NNN to cover all expected concurrently blocked
threads. Alternately, use the stap command line macro setting to override the
default for all ".return" probes.
For ".return" probes, context variables other than the "$return" may be
accessible, as a convenience for a script programmer wishing to access
function parameters. These values are snapshots taken at the time of function
entry. (Local variables within the function are not generally accessible,
since those variables did not exist in allocated/initialized form at the
snapshot moment.) These entry-snapshot variables should be accessed via
@entry($var).
In addition, arbitrary entry-time expressions can also be saved for
".return" probes using the @entry(expr) operator. For
example, one can compute the elapsed time of a function:
probe kernel.function("do_filp_open").return {
println( get_timeofday_us() - @entry(get_timeofday_us()) )
}
The following table summarizes how values related to a function parameter
context variable, a pointer named addr, may be accessed from a .return probe.

at-entry value    past-exit value

$addr             not available
$addr->x->y       @cast(@entry($addr),"struct zz")->x->y
$addr[0]          {kernel,user}_{char,int,...}(& $addr[0])
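A sketch of reading an entry-time parameter next to the return value in a .return probe. It assumes kernel debuginfo and a vfs_read function with a count parameter; the maxactive value of 100 is arbitrary and only shows where the suffix goes.

```systemtap
probe kernel.function("vfs_read").return.maxactive(100) {
  # @entry($count) is the snapshot taken at function entry;
  # $return is the value being returned.
  printf("%s asked for %d bytes, got %d\n",
         execname(), @entry($count), $return)
}

# Stop after two seconds.
probe timer.s(2) { exit() }
```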
In the absence of debugging information, entry & exit points of kernel &
module functions can be probed using the "kprobe" family of probes.
However, these do not permit looking up the arguments / local variables of the
function. The following constructs are supported:
kprobe.function(FUNCTION)
kprobe.function(FUNCTION).call
kprobe.function(FUNCTION).return
kprobe.module(NAME).function(FUNCTION)
kprobe.module(NAME).function(FUNCTION).call
kprobe.module(NAME).function(FUNCTION).return
kprobe.statement(ADDRESS).absolute
Probes of type function are recommended for kernel functions, whereas probes
of type module are recommended for probing functions of the specified module.
In case the absolute address of a kernel or module function is known,
statement probes can be utilized.
Note that FUNCTION and MODULE names must not contain wildcards, or the probe
will not be registered. Also, statement probes may be used in guru mode only.
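Because kprobe probes expose no arguments or locals, a typical use is simple counting, as in this sketch. The function name vfs_read is an assumption about the running kernel, and the global names are illustrative.

```systemtap
global entries, returns

# No debuginfo is needed, but no $variables are available either.
probe kprobe.function("vfs_read") { entries++ }
probe kprobe.function("vfs_read").return { returns++ }

probe timer.s(5) {
  printf("vfs_read: %d entries, %d returns in 5 seconds\n", entries, returns)
  exit()
}
```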
Support for user-space probing is available for kernels that are configured with
the utrace extensions, or have the uprobes facility in Linux 3.5. (Various
kernel build configuration options need to be enabled; systemtap will advise
if these are missing.)
There are several forms. First, a non-symbolic probe point:
process(PID).statement(ADDRESS).absolute
is analogous to kernel.statement(ADDRESS).absolute in that both use raw
(unverified) virtual addresses and provide no $variables. The target PID
parameter must identify a running process, and ADDRESS should identify a valid
instruction address. All threads of that process will be probed.
Second, non-symbolic user-kernel interface events handled by utrace may be
probed:
process(PID).begin
process("FULLPATH").begin
process.begin
process(PID).thread.begin
process("FULLPATH").thread.begin
process.thread.begin
process(PID).end
process("FULLPATH").end
process.end
process(PID).thread.end
process("FULLPATH").thread.end
process.thread.end
process(PID).syscall
process("FULLPATH").syscall
process.syscall
process(PID).syscall.return
process("FULLPATH").syscall.return
process.syscall.return
process(PID).insn
process("FULLPATH").insn
process(PID).insn.block
process("FULLPATH").insn.block
A process.begin probe gets called when a new process described by PID or
FULLPATH gets created. In addition, it is called once from the context of each
preexisting process, at systemtap script startup. This is useful to track live
processes. A process.thread.begin probe gets called when a new thread
described by PID or FULLPATH gets created. A process.end probe gets called
when the process described by PID or FULLPATH dies. A process.thread.end probe
gets called when a thread described by PID or FULLPATH dies. A
process.syscall probe gets called when a thread described by PID or FULLPATH
makes a system call. The system call number is available in the $syscall
context variable, and the first 6 arguments of the system call are available
in the $argN (e.g. $arg1, $arg2, ...) context variables. A
process.syscall.return probe gets called when a thread described by PID or
FULLPATH returns from a system call. The system call number is available in
the $syscall context variable, and the return value of the system call is
available in the $return context variable. A process.insn probe gets called
for every single-stepped instruction of the process described by PID or
FULLPATH. A process.insn.block probe gets called for every block-stepped
instruction of the process described by PID or FULLPATH.
If a process probe is specified without a PID or FULLPATH, all user threads will
be probed. However, if systemtap was invoked with the -c or -x options, then
process probes are restricted to the process hierarchy associated with the
target process. If a process probe is unspecified (i.e. without a PID or
FULLPATH), but with the -c option, the PATH of the -c cmd will be
heuristically filled into the process PATH. In that case, only command
parameters are allowed in the -c command (i.e. no command substitution allowed
and no occurrences of any of these characters: '|&;<>(){}').
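The utrace-style events might be traced with a script like this sketch. It names no PID or FULLPATH, so it is meant to be run with the -c or -x option to restrict it to one process hierarchy; without those it would probe every user thread on the system.

```systemtap
probe process.begin {
  printf("begin   %s[%d]\n", execname(), pid())
}

probe process.end {
  printf("end     %s[%d]\n", execname(), pid())
}

# $syscall is the system call number; $arg1 is its first argument.
probe process.syscall {
  printf("syscall %s[%d] number=%d arg1=0x%x\n",
         execname(), tid(), $syscall, $arg1)
}

# $return is the value returned by the system call.
probe process.syscall.return {
  printf("return  %s[%d] number=%d result=%d\n",
         execname(), tid(), $syscall, $return)
}
```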
Third, symbolic static instrumentation compiled into programs and shared
libraries may be probed:
process("PATH").mark("LABEL")
process("PATH").provider("PROVIDER").mark("LABEL")
process(PID).mark("LABEL")
process(PID).provider("PROVIDER").mark("LABEL")
A
.mark probe gets called via a static probe which is defined in the
application by STAP_PROBE1(PROVIDER,LABEL,arg1), which are macros defined in
sys/sdt.h. The PROVIDER is an arbitrary application identifier, LABEL
is the marker site identifier, and arg1 is the integer-typed argument.
STAP_PROBE1 is used for probes with 1 argument, STAP_PROBE2 is used for probes
with 2 arguments, and so on. The arguments of the probe are available in the
context variables $arg1, $arg2, ... An alternative to using the STAP_PROBE
macros is to use the dtrace script to create custom macros. Additionally, the
variables $$name and $$provider are available as parts of the probe point
name. The
sys/sdt.h macro names DTRACE_PROBE* are available as aliases
for STAP_PROBE*.
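A minimal sketch of the script side of such a marker follows. The program path /usr/bin/myapp, the provider myapp, and the label request_start are hypothetical; the sketch assumes the application contains a matching STAP_PROBE1(myapp, request_start, id) call.

```systemtap
# Fires whenever the application reaches the (hypothetical) marker site.
probe process("/usr/bin/myapp").provider("myapp").mark("request_start") {
  # $$provider and $$name come from the probe point name;
  # $arg1 is the single integer argument passed to STAP_PROBE1.
  printf("%s:%s arg1=%d\n", $$provider, $$name, $arg1)
}
```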
Finally, full symbolic source-level probes in user-space programs and shared
libraries are supported. These are exactly analogous to the symbolic
DWARF-based kernel/module probes described above. They expose the same sorts
of context $variables for function parameters, local variables, and so on.
process("PATH").function("NAME")
process("PATH").statement("*@FILE.c:123")
process("PATH").plt("NAME")
process("PATH").library("PATH").plt("NAME")
process("PATH").library("PATH").function("NAME")
process("PATH").library("PATH").statement("*@FILE.c:123")
process("PATH").function("*").return
process("PATH").function("myfun").label("foo")
process("PATH").function("foo").callee("bar")
process("PATH").plt("NAME").return
process(PID).function("NAME")
process(PID).statement("*@FILE.c:123")
process(PID).plt("NAME")
Note that for all process probes,
PATH names refer to executables that
are searched for the same way shells do: relative to the working directory if they
contain a "/" character, otherwise in
$PATH. If PATH names
refer to scripts, the actual interpreters (specified in the script in the
first line after the #! characters) are probed.
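The source-level probe points listed above might be used roughly as follows. This is a sketch only: /usr/bin/myapp is a hypothetical program, and debuginfo for it is assumed to be available so that $$parms and $$return can be resolved.

```systemtap
# Print the parameters on entry to every function of the target program.
probe process("/usr/bin/myapp").function("*") {
  printf("-> %s(%s)\n", ppfunc(), $$parms)
}

# Print the pretty-printed return value on exit from each function.
probe process("/usr/bin/myapp").function("*").return {
  printf("<- %s %s\n", ppfunc(), $$return)
}
```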
Tapset process probes placed in the special directory
$prefix/share/systemtap/tapset/PATH/ with relative paths will have their
process parameter prefixed with the location of the tapset. For example,
process("foo").function("NAME")
expands to
process("/usr/bin/foo").function("NAME")
when placed in $prefix/share/systemtap/tapset/PATH/usr/bin/
If PATH is a process component parameter referring to shared libraries then all
processes that map it at runtime would be selected for probing. If PATH is a
library component parameter referring to shared libraries then the process
specified by the process component would be selected. Note that the PATH
pattern in a library component will always apply to libraries statically
determined to be in use by the process. However, you may also specify the full
path to any library file even if not statically needed by the process.
A .plt probe will probe functions in the program linkage table corresponding to
the rest of the probe point. .plt can be specified as a shorthand for
.plt("*"). The symbol name is available as a $$name context
variable; function arguments are not available, since PLTs are processed
without debuginfo. A .plt.return probe places a probe at the moment
after the return from the named function.
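For instance, a sketch along these lines counts calls made through the program linkage table and prints the ten most frequent symbols. The target /bin/ls is just an assumed example; no debuginfo is needed because only $$name is used.

```systemtap
global plt_calls

# .plt is shorthand for .plt("*"): every PLT entry of the target binary.
probe process("/bin/ls").plt {
  plt_calls[$$name]++
}

# Print the ten most frequently called symbols.
probe end {
  foreach (sym in plt_calls- limit 10)
    printf("%-24s %d\n", sym, plt_calls[sym])
}
```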
If the PATH string contains wildcards as in the MPATTERN case, then standard
globbing is performed to find all matching paths. In this case, the
$PATH environment variable is not used.
If systemtap was invoked with the
-c or
-x options, then process
probes are restricted to the process hierarchy associated with the target
process.
Support for probing Java methods is available using Byteman as a backend.
Byteman is an instrumentation tool from the JBoss project which systemtap can
use to monitor invocations for a specific method or line in a Java program.
Systemtap does so by generating a Byteman script listing the probes to
instrument and then invoking the Byteman
bminstall utility.
This Java instrumentation support is currently a prototype feature with major
limitations. Moreover, Java probing currently does not work across users; the
stap script must run (with appropriate permissions) under the same user as
the Java process being probed. (Thus a stap script under root currently cannot
probe Java methods in a non-root-user Java process.)
The first probe type refers to Java processes by the name of the Java process:
java("PNAME").class("CLASSNAME").method("PATTERN")
java("PNAME").class("CLASSNAME").method("PATTERN").return
The PNAME argument must name a pre-existing JVM process, and be identifiable via a jps
listing.
The PATTERN parameter specifies the signature of the Java method to probe. The
signature must consist of the exact name of the method, followed by a
bracketed list of the types of the arguments, for instance
"myMethod(int,double,Foo)". Wildcards are not supported.
The probe can be set to trigger at a specific line within the method by
appending a line number with colon, just as in other types of probes:
"myMethod(int,double,Foo):245".
The CLASSNAME parameter identifies the Java class the method belongs to, either
with or without the package qualification. By default, the probe only triggers
on descendants of the class that do not override the method definition of the
original class. However, CLASSNAME can take an optional caret prefix, as in
^org.my.MyClass, which specifies that the probe should also trigger on
all descendants of MyClass that override the original method. For instance,
every method with signature foo(int) in program org.my.MyApp can be probed at
once using
java("org.my.MyApp").class("^java.lang.Object").method("foo(int)")
The second probe type works analogously, but refers to Java processes by PID:
java(PID).class("CLASSNAME").method("PATTERN")
java(PID).class("CLASSNAME").method("PATTERN").return
(PIDs for an already running process can be obtained using the
jps(1)
utility.)
Context variables defined within java probes include
$arg1 through
$arg10 (for up to the first 10 arguments of a method), represented as
character-pointers for the
toString() form of each actual argument. The
arg1 through
arg10 script variables provide access to these as
ordinary strings, fetched via
user_string_warn().
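A short sketch of a handler using these variables, reusing the org.my.MyApp and foo(int) names from the example above (both are placeholders for a real application):

```systemtap
# Fires on entry to foo(int) in any class of the (hypothetical) application.
probe java("org.my.MyApp").class("^java.lang.Object").method("foo(int)") {
  # arg1 is the toString() form of the first argument, as an ordinary string.
  printf("foo called with %s\n", arg1)
}
```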
Prior to systemtap version 3.1,
$arg1 through
$arg10 could contain
either integers or character pointers, depending on the types of the objects
being passed to each particular java method. This previous behaviour may be
invoked with the
stap --compatible=3.0 flag.
These probe points allow procfs "files" in /proc/systemtap/MODNAME to
be created, read and written using a permission that may be modified using the
proper umask value. (MODNAME is the name of the systemtap module.) Default
permissions are 0400 for read probes, and 0200 for
write probes. If both a read and write probe are being used on the same file,
a default permission of 0600 will be used. Using procfs.umask(0040).read would
result in a 0404 permission set for the file. The
proc filesystem is a pseudo-filesystem which
is used as an interface to kernel data structures. There are several probe
point variants supported by the translator:
procfs("PATH").read
procfs("PATH").umask(UMASK).read
procfs("PATH").read.maxsize(MAXSIZE)
procfs("PATH").umask(UMASK).read.maxsize(MAXSIZE)
procfs("PATH").write
procfs("PATH").umask(UMASK).write
procfs.read
procfs.umask(UMASK).read
procfs.read.maxsize(MAXSIZE)
procfs.umask(UMASK).read.maxsize(MAXSIZE)
procfs.write
procfs.umask(UMASK).write
Note that there are a few differences when procfs probes are used in the stapbpf
runtime. FIFO special files are used instead of proc filesystem files. These
files are created in /var/tmp/systemtap-USER/MODNAME. (USER is the name of the
user). Additionally, users cannot create both read and write probes on the
same file.
PATH is the file name (relative to /proc/systemtap/MODNAME or
/var/tmp/systemtap-USER/MODNAME) to be created. If no
PATH is specified
(as in the last six variants above),
PATH defaults to
"command". The file name "__stdin" is used internally by
systemtap for input probes and should not be used as a
PATH for procfs
probes; see the input probe section below.
When a user reads /proc/systemtap/MODNAME/PATH (normal runtime) or
/var/tmp/systemtap-USER/MODNAME/PATH (stapbpf runtime), the corresponding procfs
read probe is triggered. The string data to be read should be assigned
to a variable named
$value, like this:
procfs("PATH").read { $value = "100\n" }
When a user writes into /proc/systemtap/MODNAME/PATH (normal runtime) or
/var/tmp/systemtap-USER/MODNAME/PATH (stapbpf runtime), the corresponding procfs
write probe is triggered. The data the user wrote is available in the
string variable named
$value, like this:
procfs("PATH").write { printf("user wrote: %s", $value) }
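The read and write probes can be combined on one file to expose a tunable value. The following is a sketch for the normal (non-stapbpf) runtime; the file name threshold and the global variable are illustrative choices, not anything defined by systemtap.

```systemtap
global threshold = 100

# Reading the file reports the current value.
probe procfs("threshold").read {
  $value = sprintf("%d\n", threshold)
}

# Writing a decimal number to the file updates the value.
probe procfs("threshold").write {
  threshold = strtol($value, 10)
}
```

If the script is started with a fixed module name (for example stap -m demo script.stp), the file appears as /proc/systemtap/demo/threshold, and can then be read with cat or written with echo.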
MAXSIZE is the size of the procfs read buffer. Specifying
MAXSIZE
allows larger procfs output. If no
MAXSIZE is specified, the procfs
read buffer defaults to
STP_PROCFS_BUFSIZE (which defaults to
MAXSTRINGLEN, the maximum length of a string). If setting the procfs
read buffers for more than one file is needed, it may be easiest to override
the
STP_PROCFS_BUFSIZE definition. Here's an example of using
MAXSIZE:
procfs.read.maxsize(1024) {
$value = "long string..."
$value .= "another long string..."
$value .= "another long string..."
$value .= "another long string..."
}
These probe points make input from stdin available to the script during runtime.
The translator currently supports two variants of this family:
input.char is triggered each time a character is read from stdin. The
current character is available in the string variable named
char. There
is no newline buffering; the next character is read from stdin as soon as it
becomes available.
input.line causes all characters read from stdin to be buffered until a
newline is read, at which point the probe will be triggered. The current line
of characters (including the newline) is made available in a string variable
named
line. Note that no more than MAXSTRINGLEN characters will be
buffered. Any additional characters will not be included in
line.
Input probes are aliases for
procfs("__stdin").write. Systemtap
reconfigures stdin if the presence of this procfs probe is detected, therefore
"__stdin" should not be used as a path argument for procfs probes.
Additionally, input probes will not work with the -F and --remote options.
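As a small sketch of the line-buffered variant, the script below echoes each line with a running count and stops when the user types quit. The quit convention is an assumption of this example only.

```systemtap
global lines

# Triggered once per newline-terminated line read from stdin.
probe input.line {
  lines++
  if (line == "quit\n")
    exit()
  else
    printf("%d: %s", lines, line)   # line already ends with a newline
}
```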
These probe points allow observation of network packets using the netfilter
mechanism. A netfilter probe in systemtap corresponds to a netfilter hook
function in the original netfilter probes API. It is probably more convenient
to use
tapset::netfilter(3stap), which wraps the primitive netfilter
hooks and does the work of extracting useful information from the context
variables.
There are several probe point variants supported by the translator:
netfilter.hook("HOOKNAME").pf("PROTOCOL_F")
netfilter.pf("PROTOCOL_F").hook("HOOKNAME")
netfilter.hook("HOOKNAME").pf("PROTOCOL_F").priority("PRIORITY")
netfilter.pf("PROTOCOL_F").hook("HOOKNAME").priority("PRIORITY")
PROTOCOL_F is the protocol family to listen for, currently one of
NFPROTO_IPV4, NFPROTO_IPV6, NFPROTO_ARP, or
NFPROTO_BRIDGE.
HOOKNAME is the point, or 'hook', in the protocol stack at which to
intercept the packet. The available hook names for each protocol family are
taken from the kernel header files <linux/netfilter_ipv4.h>,
<linux/netfilter_ipv6.h>, <linux/netfilter_arp.h> and
<linux/netfilter_bridge.h>. For instance, allowable hook names for
NFPROTO_IPV4 are
NF_INET_PRE_ROUTING, NF_INET_LOCAL_IN,
NF_INET_FORWARD, NF_INET_LOCAL_OUT, and
NF_INET_POST_ROUTING.
PRIORITY is an integer priority giving the order in which the probe point
should be triggered relative to any other netfilter hook functions which
trigger on the same packet. Hook functions execute on each packet in order
from smallest priority number to largest priority number. If no
PRIORITY is specified (as in the first two probe point variants above),
PRIORITY defaults to "0".
There are a number of predefined priority names of the form
NF_IP_PRI_*
and
NF_IP6_PRI_* which are defined in the kernel header files
<linux/netfilter_ipv4.h> and <linux/netfilter_ipv6.h>
respectively. The script is permitted to use these instead of specifying an
integer priority. (The probe points for
NFPROTO_ARP and
NFPROTO_BRIDGE currently do not expose any named hook priorities to the
script writer.) Thus, allowable ways to specify the priority include:
priority("255")
priority("NF_IP_PRI_SELINUX_LAST")
A script using guru mode is permitted to specify any identifier or number as the
parameter for hook, pf, and priority. This feature should be used with
caution, as the parameter is inserted verbatim into the C code generated by
systemtap.
The netfilter probe points define the following context variables:
- $hooknum
- The hook number.
- $skb
- The address of the sk_buff struct representing the packet.
See <linux/skbuff.h> for details on how to use this struct, or
alternatively use the tapset tapset::netfilter(3stap) for easy
access to key information.
- $in
- The address of the net_device struct representing the
network device on which the packet was received (if any). May be 0 if the
device is unknown or undefined at that stage in the protocol stack.
- $out
- The address of the net_device struct representing the
network device on which the packet will be sent (if any). May be 0 if the
device is unknown or undefined at that stage in the protocol stack.
- $verdict
- (Guru mode only.) Assigning one of the verdict values
defined in <linux/netfilter.h> to this variable alters the further
progress of the packet through the protocol stack. For instance, the
following guru mode script forces all ipv6 network packets to be dropped:
probe netfilter.pf("NFPROTO_IPV6").hook("NF_IP6_PRE_ROUTING") {
$verdict = 0 /* nf_drop */
}
For convenience, unlike the primitive probe points discussed here, the probes
defined in
tapset::netfilter(3stap) export the lowercase names of the
verdict constants (e.g. NF_DROP becomes nf_drop) as local variables.
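A non-guru-mode sketch using the primitive probe points might simply count packets at one hook. The choice of hook and the one-second reporting interval are arbitrary.

```systemtap
global packets

# Count IPv4 packets as they enter the stack, before routing.
probe netfilter.pf("NFPROTO_IPV4").hook("NF_INET_PRE_ROUTING") {
  packets++
}

# Report and reset the count once per second.
probe timer.s(1) {
  printf("%d packets/s\n", packets)
  packets = 0
}
```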
This family of probe points hooks up to static probing tracepoints inserted into
the kernel or modules. As with markers, these tracepoints are special macro
calls inserted by kernel developers to make probing faster and more reliable
than with DWARF-based probes, and DWARF debugging information is not required
to probe tracepoints. Tracepoints have an extra advantage of more
strongly-typed parameters than markers.
Tracepoint probes look like:
kernel.trace("name"). The
tracepoint name string, which may contain the usual wildcard characters, is
matched against the names defined by the kernel developers in the tracepoint
header files. To restrict the search to specific subsystems (e.g. sched or ext3),
the following syntax can be used:
kernel.trace("system:name"). The tracepoint system string may
also contain the usual wildcard characters.
The handler associated with a tracepoint-based probe may read the optional
parameters specified at the macro call site. These are named according to the
declaration by the tracepoint author. For example, the tracepoint probe
kernel.trace("sched:sched_switch") provides the parameters
$prev and
$next. If the parameter is a complex type, as in a
struct pointer, then a script can access fields with the same syntax as DWARF
$target variables. Also, tracepoint parameters cannot be modified, but in
guru-mode a script may modify fields of parameters.
The subsystem and name of the tracepoint are available in
$$system and
$$name, and a string of name=value pairs for all parameters of the
tracepoint is available in
$$vars or
$$parms.
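The following sketch shows a handler for the sched_switch tracepoint mentioned above. It assumes that $prev and $next point to the kernel's task_struct and reads its comm field, which is a detail of the kernel and not of systemtap.

```systemtap
# Print the names of the tasks being switched out and in.
probe kernel.trace("sched:sched_switch") {
  printf("%s:%s %s -> %s\n", $$system, $$name,
         kernel_string($prev->comm), kernel_string($next->comm))
}
```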
This family of probe points hooks up to an older style of static probing markers
inserted into older kernels or modules. These markers are special STAP_MARK
macro calls inserted by kernel developers to make probing faster and more
reliable than with DWARF-based probes. Further, DWARF debugging information is
not required to probe markers.
Marker probe points begin with
kernel. The next part names the marker
itself:
mark("name"). The marker name string, which may
contain the usual wildcard characters, is matched against the names given to
the marker macros when the kernel and/or module was compiled. Optionally, you
can specify
format("format"). Specifying the marker format
string allows differentiation between two markers with the same name but
different marker format strings.
The handler associated with a marker-based probe may read the optional
parameters specified at the macro call site. These are named
$arg1
through
$argNN, where NN is the number of parameters supplied by the
macro. Number and string parameters are passed in a type-safe manner.
The marker format string associated with a marker is available in
$format, and the marker name string is available in
$name.
This family of probes is used to set hardware watchpoints for a given
(global) kernel symbol. The probes take three components as inputs:
1. The virtual address or name of the kernel symbol to be traced, supplied as
the argument to this class of probes. (Only data segment variables are
supported. Local variables of a function cannot be probed.)
2. The nature of the access to be probed: a
.write probe is triggered when a write happens at the specified address or
symbol name; a
.rw probe is triggered when either a read or a write happens.
3. An optional
.length component, which specifies the address interval to be probed. The
user-specified length is approximated to the closest possible address length
that the architecture can support. If the specified length exceeds the limits
imposed by the architecture, an error message is flagged and probe
registration fails. Wherever the length is not specified, the translator
requests a hardware breakpoint probe of length 1. Note that the length
construct is not valid with symbol names.
The following constructs are supported:
probe kernel.data(ADDRESS).write
probe kernel.data(ADDRESS).rw
probe kernel.data(ADDRESS).length(LEN).write
probe kernel.data(ADDRESS).length(LEN).rw
probe kernel.data("SYMBOL_NAME").write
probe kernel.data("SYMBOL_NAME").rw
This set of probes makes use of the debug registers of the processor, which are a
scarce resource (4 on x86, 1 on powerpc). The script translation flags a
warning if a user requests more hardware breakpoint probes than the limits set
by the architecture. For example, a pass-2 warning is issued when an input script
requests 5 hardware breakpoint probes on an x86 system, since the x86 architecture
supports a maximum of 4 breakpoints. Users are cautioned to set probes
judiciously.
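A minimal sketch of a symbol-based watchpoint, using the kernel's global pid_max variable as the example symbol:

```systemtap
# Fires on every write to the kernel's global pid_max variable.
probe kernel.data("pid_max").write {
  printf("pid_max written by %s[%d]\n", execname(), pid())
}
```

Since this consumes one debug register, a script of this kind should normally contain only a handful of such probes.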
This family of probe points interfaces to the kernel "perf event"
infrastructure for controlling hardware performance counters. The events being
attached to are described by the "type" and "config" fields
of the
perf_event_attr structure, and are sampled at an interval
governed by the "sample_period" and "sample_freq" fields.
These fields are made available to systemtap scripts using the following syntax:
probe perf.type(NN).config(MM).sample(XX)
probe perf.type(NN).config(MM).hz(XX)
probe perf.type(NN).config(MM)
probe perf.type(NN).config(MM).process("PROC")
probe perf.type(NN).config(MM).counter("COUNTER")
probe perf.type(NN).config(MM).process("PROC").counter("NAME")
The systemtap probe handler is called once per XX increments of the underlying
performance counter when using the .sample field or at a frequency in hertz
when using the .hz field. When not specified, the default behavior is to
sample at a count of 1000000. The range of valid type/config is described by
the
perf_event_open(2) system call, and/or the
linux/perf_event.h file. Invalid combinations or exhausted hardware
counter resources result in errors during systemtap script startup. Systemtap
does not sanity-check the values: it merely passes them through to the kernel
for error- and safety-checking. By default the perf event probe is systemwide
unless .process is specified, which will bind the probe to a specific task. If
the name is omitted then it is inferred from the stap -c argument. A perf
event can be read on demand using .counter. The body of the perf probe handler
will not be invoked for a .counter probe; instead, the counter is read in a
user space probe via:
process("PROC").statement("func@file") { stat <<< @perf("NAME") }
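As a sketch of the sampling form, the script below attributes CPU-cycle samples to process names systemwide. It assumes that type 0 and config 0 correspond to PERF_TYPE_HARDWARE and PERF_COUNT_HW_CPU_CYCLES in linux/perf_event.h; the five-second run time is arbitrary.

```systemtap
global samples

# One sample per 1000000 CPU cycles (type 0 / config 0, see lead-in).
probe perf.type(0).config(0).sample(1000000) {
  samples[execname()]++
}

# After five seconds, print the ten busiest process names and stop.
probe timer.s(5) {
  foreach (comm in samples- limit 10)
    printf("%-16s %d\n", comm, samples[comm])
  exit()
}
```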
Support for probing python 2 and python 3 functions is available with the help of
an extra python support module. Note that the debuginfo for the version of
python being probed is required. To run a python script with the extra python
support module, add the '-m HelperSDT' option to your python command,
like this:
stap foo.stp -c "python -m HelperSDT foo.py"
Python probes look like the following:
python2.module("MPATTERN").function("PATTERN")
python2.module("MPATTERN").function("PATTERN").call
python2.module("MPATTERN").function("PATTERN").return
python3.module("MPATTERN").function("PATTERN")
python3.module("MPATTERN").function("PATTERN").call
python3.module("MPATTERN").function("PATTERN").return
The list above includes multiple variants and modifiers which provide additional
functionality or filters. They are:
- .function
- Places a probe at the beginning of the named function by
default, unless modified by PATTERN. Parameters are available as context
variables.
- .call
- Places a probe at the beginning of the named function.
Parameters are available as context variables.
- .return
- Places a probe at the moment before the return from
the named function. Parameters and local/global python variables are
available as context variables.
PATTERN stands for a string literal that aims to identify a point in the python
program. It is made up of three parts:
- The first part is the name of a function (e.g.
"foo") or class method (e.g. "bar.baz"). This part may
use the "*" and "?" wildcarding operators to match
multiple names.
- The second part is optional and begins with the
"@" character. It is followed by the path to the source file
containing the function, which may include a wildcard pattern. The python
path is searched for a matching filename.
- Finally, the third part is optional if the file name part
was given, and identifies the line number in the source file preceded by a
":" or a "+". The line number is assumed to be an
absolute line number if preceded by a ":", or relative to the
declaration line of the function if preceded by a "+". All the
lines in the function can be matched with ":*". A range of lines
x through y can be matched with ":x-y". Ranges and specific
lines can be mixed using commas, e.g. ":x,y-z".
In the above list of probe points, MPATTERN stands for a python module or script
name that names the python module of interest. This part may use the
"*" and "?" wildcarding operators to match multiple names.
The python path is searched for a matching filename.
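A minimal sketch of a pair of python probes follows. The module name foo and the function name bar are hypothetical and stand for a module and function in the script being traced.

```systemtap
# Entry to function "bar" in the (hypothetical) python 3 module "foo".
probe python3.module("foo").function("bar").call {
  printf("bar called\n")
}

# Just before "bar" returns.
probe python3.module("foo").function("bar").return {
  printf("bar returning\n")
}
```

The script would be run with the HelperSDT module loaded, in the same way as the stap command shown earlier in this section.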
Here are some example probe points, defining the associated events.
- begin, end, end
- refers to the startup and normal shutdown of the session.
In this case, the handler would run once during startup and twice during
shutdown.
- timer.jiffies(1000).randomize(200)
- refers to a periodic interrupt, every 1000 +/- 200
jiffies.
- kernel.function("*init*"),
kernel.function("*exit*")
- refers to all kernel functions with "init" or
"exit" in the name.
- kernel.function("*@kernel/time.c:240")
- refers to any functions within the
"kernel/time.c" file that span line 240. Note that this is
not a probe at the statement at that line number. Use the
kernel.statement probe instead.
- kernel.trace("sched_*")
- refers to all scheduler-related (really, prefixed)
tracepoints in the kernel.
- kernel.mark("getuid")
- refers to an obsolete STAP_MARK(getuid, ...) macro call in
the kernel.
- module("usb*").function("*sync*").return
- refers to the moment of return from all functions with
"sync" in the name in any of the USB drivers.
- kernel.statement(0xc0044852)
- refers to the first byte of the statement whose compiled
instructions include the given address in the kernel.
- kernel.statement("*@kernel/time.c:296")
- refers to the statement of line 296 within
"kernel/time.c".
- kernel.statement("bio_init@fs/bio.c+3")
- refers to the statement at line bio_init+3 within
"fs/bio.c".
- kernel.data("pid_max").write
- refers to a hardware breakpoint of type "write"
set on pid_max
- syscall.*.return
- refers to the group of probe aliases with any name in the
second position
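Several of the example probe points above can be combined into one small script. This is only a sketch; the tick limit of 5 is an arbitrary choice to make the session end on its own.

```systemtap
global ticks

probe begin {
  printf("session started\n")
}

# Periodic handler, every 1000 +/- 200 jiffies.
probe timer.jiffies(1000).randomize(200) {
  if (++ticks >= 5)
    exit()
}

# Listing "end" twice attaches this handler to the shutdown event twice.
probe end, end {
  printf("session ended after %d ticks\n", ticks)
}
```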
stap(1),
probe::*(3stap),
tapset::*(3stap)