lamssi_rpi - overview of LAM's RPI SSI modules
The "kind" for RPI SSI modules is "rpi". Specifically, the
string "rpi" (without the quotes) should be used to specify which
RPI should be used on the mpirun command line with the -ssi switch.
For example:
- mpirun -ssi rpi tcp C my_mpi_program
- Specifies to use the tcp RPI (and to launch a single copy
of the executable "my_mpi_program" on each node).
The "rpi" string is also used as a prefix to send parameters to specific
RPI modules. For example:
- mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
- Specifies to use the tcp RPI, and to pass in the value of
131072 (128K) as the short message length for TCP messages. See each RPI
section below for a full description of parameters that are accepted by
each RPI.
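The arithmetic behind the 131072 value can be made explicit with a small shell sketch. The helper name tcp_rpi_command is hypothetical (it is not part of LAM), and it only prints the mpirun command line rather than running it:

```shell
# Hypothetical helper (not part of LAM): print an mpirun command line that
# selects the tcp RPI and sets rpi_tcp_short from a size given in kilobytes.
tcp_rpi_command() {
    short_kb=$1
    program=$2
    # Convert kilobytes to bytes, e.g. 128 * 1024 = 131072.
    short_bytes=$((short_kb * 1024))
    echo "mpirun -ssi rpi tcp -ssi rpi_tcp_short $short_bytes C $program"
}

# Prints the same command line as the example above.
tcp_rpi_command 128 my_mpi_program
```

The printed line can be pasted into a shell on a system where LAM is booted.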
This page documents six RPI SSI modules: crtcp, gm, lamd, tcp, sysv, and
usysv.
Only one RPI module may be selected per command execution. The module is
selected during MPI_INIT and is used for the duration of the MPI process.
It is erroneous to select different RPI modules for different processes.
The kind for selecting an RPI is "rpi". For example:
- mpirun -ssi rpi tcp C my_mpi_program
- Selects the tcp RPI and runs a single copy of the my_mpi_program
executable on each node.
As with all SSI modules, it is possible to pass parameters at run time. This
section discusses the built-in LAM RPI modules, as well as the run-time
parameters that they accept.
In the discussion below, the parameters are discussed in terms of kind and
name. The kind and name may be specified as command line arguments to the
mpirun command with the -ssi switch, or they may be set in environment
variables of the form LAM_MPI_SSI_name=value. Note that using the -ssi
command line switch will take precedence over any environment variables.
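As an illustration of the environment variable form, the following sketch sets the same two values used in the earlier tcp example. The variable names simply follow the LAM_MPI_SSI_name=value pattern stated above; the final echo is only there to show what was set:

```shell
# Select the tcp RPI and set its short message length through the
# environment, following the LAM_MPI_SSI_name=value form.
LAM_MPI_SSI_rpi=tcp
LAM_MPI_SSI_rpi_tcp_short=131072
export LAM_MPI_SSI_rpi LAM_MPI_SSI_rpi_tcp_short

# With these exported, "mpirun C my_mpi_program" would pick up both values.
# "mpirun -ssi rpi usysv C my_mpi_program" would select usysv instead,
# because the -ssi switch takes precedence over the environment.
echo "rpi=$LAM_MPI_SSI_rpi rpi_tcp_short=$LAM_MPI_SSI_rpi_tcp_short"
```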
If the RPI that is selected is unable to run (e.g., attempting to use the gm RPI
when gm support was not compiled into LAM, or if no gm hardware is available
on the nodes), an appropriate error message will be printed and execution will
abort.
The crtcp RPI is a checkpoint/restart-able version of the tcp RPI (see below).
It is separate from the tcp RPI because the current implementation imposes a
slight performance penalty to enable the ability to checkpoint and restart MPI
jobs. Its tunable parameters are the same as the tcp RPI. This RPI probably
only needs to be used when the ability to checkpoint and restart MPI jobs is
required.
See the LAM/MPI User's Guide for more details on the crtcp RPI as well as the
checkpoint/restart capabilities of LAM/MPI. The lamssi_cr(7) manual page also
contains additional information.
The gm RPI is used with native Myrinet networks. Please note that the gm RPI
exists, but has not yet been optimized. It gives significantly better
performance than TCP over Myrinet networks, but has not yet been properly
tuned and instrumented in LAM.
That being said, there are several tunable parameters in the gm RPI:
- rpi_gm_maxport N
- If rpi_gm_port is not specified, LAM will attempt to
find an open GM port to use for MPI communications starting with port 1
and ending with the N value specified by the rpi_gm_maxport
parameter. If unspecified, LAM will try all existing GM ports.
- rpi_gm_port N
- LAM will attempt to use gm port N for MPI
communications.
- rpi_gm_tinymsglen N
- Specifies the maximum message size (in bytes) for
"tiny" messages (i.e., messages that are sent entirely in one gm
message). Tiny messages are memcpy'ed into the header before it is sent to
the destination, and memcpy'ed out of the header into the destination
buffer on the receiver. Hence, it is not advisable to make this value too
large.
- rpi_gm_fast 1
- Specifies to use the "fast" protocol for sending
short gm messages. This protocol is unreliable in the presence of GM errors
or timeouts, so this parameter is not advised for MPI applications that do
not make continual progress within MPI.
- rpi_gm_cr 1
- Enable checkpoint/restart behavior for gm. This can
only be enabled if the gm rpi module was compiled with support for
the gm_get() function, which is disabled by default. See the LAM
Installation and User's Guides for more information on this parameter
before you use it.
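Several gm parameters can be combined on one command line by repeating the -ssi switch. The sketch below uses a hypothetical helper, ssi_args, to assemble the switches and then prints the resulting command. The port number 2 and the tiny message length of 1024 bytes are illustrative values only, not recommendations:

```shell
# Hypothetical helper (not part of LAM): turn "name value" pairs into a
# sequence of "-ssi name value" switches.
ssi_args() {
    args=""
    while [ "$#" -ge 2 ]; do
        args="$args -ssi $1 $2"
        shift 2
    done
    # Strip the leading space before printing.
    echo "${args# }"
}

# Select the gm RPI, request a specific gm port, and set the tiny message
# size. Because rpi_gm_port is given, rpi_gm_maxport is not needed.
echo "mpirun $(ssi_args rpi gm rpi_gm_port 2 rpi_gm_tinymsglen 1024) C my_mpi_program"
```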
The lamd RPI uses LAM's "out-of-band" communication mechanism for
passing MPI messages. Specifically, MPI messages are sent from the user
process to the local LAM daemon, then to the remote LAM daemon (if the
destination process is on a different node), and then to the destination
process.
While this adds latency to message passing because of the extra hops that each
message must travel, it allows for true asynchronous message passing. Since
the LAM daemon is running in its own execution space, it can make progress on
message passing regardless of the state / status of the user's program. This
can be an overall net savings in performance and execution time for some
classes of MPI programs.
It is expected that this RPI will someday become obsolete when LAM becomes
multi-threaded and allows progress to be made on message passing in separate
threads rather than in separate processes.
The lamd RPI has no tunable parameters.
The tcp RPI uses pure TCP for all MPI message passing. TCP sockets are opened
between MPI processes and are used for all MPI traffic.
The tcp RPI has one tunable parameter:
- rpi_tcp_short <bytes>
- Tells the tcp RPI the smallest size (in bytes) for a
message to be considered "long". Short messages are sent eagerly
(even if the receiving side is not expecting them). Long messages use a
rendezvous protocol (i.e., a three-way handshake) such that the message is
not actually sent until the receiver is expecting it. This value defaults
to 64k.
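The threshold rule can be sketched as a small shell function. This is only an illustration of the comparison described above, not LAM's code: the function name tcp_protocol_for and the RPI_TCP_SHORT shell variable are hypothetical, and the 64k default is assumed here to mean 65536 bytes:

```shell
# Assumed default: 64k taken as 65536 bytes. Override to model a different
# rpi_tcp_short value.
RPI_TCP_SHORT=${RPI_TCP_SHORT:-65536}

# Hypothetical helper: report which protocol a message of the given size
# (in bytes) would use. rpi_tcp_short is the smallest "long" size, so a
# message exactly at the threshold is long.
tcp_protocol_for() {
    size=$1
    if [ "$size" -ge "$RPI_TCP_SHORT" ]; then
        echo "long (rendezvous)"
    else
        echo "short (eager)"
    fi
}

tcp_protocol_for 1024     # below the threshold
tcp_protocol_for 65536    # exactly at the threshold
```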
The sysv RPI uses shared memory for communication between MPI processes on the
same node, and TCP sockets for communication between MPI processes on
different nodes. System V semaphores are used to lock the shared memory pools.
This RPI is best used when running multiple MPI processes on uniprocessors (or
oversubscribed SMPs) because of the blocking / yielding nature of semaphores.
The sysv RPI has the following tunable parameters:
- rpi_tcp_short <bytes>
- Since the sysv RPI uses parts of the tcp RPI for off-node
communication, this parameter also has relevance to the sysv RPI. The
meaning of this parameter is discussed in the tcp RPI section.
- rpi_sysv_short <bytes>
- Tells the sysv RPI the smallest size (in bytes) for a
message to be considered "long". Short shared memory messages
are sent using a small "postbox" protocol; long messages use a
more general shared memory pool method. This value defaults to 8k.
- rpi_sysv_pollyield <bool>
- If set to a nonzero number, force the use of a system call
to yield the processor. The system call will be yield(), sched_yield(), or
select() (with a 1ms timeout), depending on what LAM's configure script finds
at configuration time. This value defaults to 1.
- rpi_sysv_shmpoolsize <bytes>
- The size of the shared memory pool that is used for long
message transfers. It is allocated once on each node for each MPI parallel
job. Specifically, if multiple MPI processes from the same parallel job
are spawned on a single node, this pool will only be allocated once.
The configure script will try to determine a default size for the pool if
none is explicitly specified (you should always check this to see if it is
reasonable). Larger values should improve performance especially when an
application passes large messages, but will also increase the system
resources used by each task.
- rpi_sysv_shmmaxalloc <bytes>
- To prevent a single large message transfer from
monopolizing the global pool, allocations from the pool are actually
restricted to a maximum of rpi_sysv_shmmaxalloc bytes each. Even
with this restriction, it is possible for the global pool to temporarily
become exhausted. In this case, the transport will fall back to using the
postbox area to transfer the message. Performance will be degraded, but
the application will progress.
The configure script will try to determine a default size for the maximum
atomic transfer size if none is explicitly specified (you should always
check this to see if it is reasonable). Larger values should improve
performance especially when an application passes large messages, but will
also increase the system resources used by each task.
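To get a rough feel for how rpi_sysv_shmmaxalloc interacts with message size, the sketch below counts how many pool allocations a long message would need. It assumes that a message larger than rpi_sysv_shmmaxalloc is moved in pieces of at most that size, which is an inference from the description above rather than a documented algorithm. The helper name and the 65536-byte limit are hypothetical:

```shell
# Hypothetical helper: number of pool allocations needed for a message,
# assuming each allocation holds at most max_alloc bytes.
pool_allocations_needed() {
    msg_bytes=$1
    max_alloc=$2
    # Integer ceiling division.
    echo $(( (msg_bytes + max_alloc - 1) / max_alloc ))
}

# A 1 MB message with an assumed 64 KB per-allocation limit.
pool_allocations_needed 1048576 65536
# A message that does not divide evenly needs one extra allocation.
pool_allocations_needed 100000 65536
```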
The usysv RPI uses shared memory for communication between MPI processes on the
same node, and TCP sockets for communication between MPI processes on
different nodes. Spin locks are used to lock the shared memory pools. This RPI
is best used when the number of MPI processes on a single node is less than
or equal to the number of processors because it allows LAM to fully occupy the
processor while waiting for a message and never be swapped out.
The usysv RPI has many of the same tunable parameters as the sysv RPI:
- rpi_tcp_short <bytes>
- Same meaning as in the sysv RPI.
- rpi_usysv_short <bytes>
- Same meaning as rpi_sysv_short in the sysv RPI.
- rpi_usysv_pollyield <bool>
- Same meaning as rpi_sysv_pollyield in the sysv
RPI.
- rpi_usysv_shmpoolsize <bytes>
- Same meaning as rpi_sysv_shmpoolsize in the sysv
RPI.
- rpi_usysv_shmmaxalloc <bytes>
- Same meaning as rpi_sysv_shmmaxalloc in the sysv
RPI.
- rpi_usysv_readlockpoll <iterations>
- Number of iterations to spin before yielding the processor
while waiting to read. This value defaults to 10,000.
- rpi_usysv_writelockpoll <iterations>
- Number of iterations to spin before yielding the processor
while waiting to write. This value defaults to 10.
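The guidance on when to prefer sysv or usysv can be summarized as a simple rule of thumb. The function below is a hypothetical sketch of that rule, not a LAM facility, and the caller supplies the process and processor counts:

```shell
# Hypothetical helper: suggest a shared memory RPI for a node, given the
# number of MPI processes placed on it and its number of processors.
choose_shmem_rpi() {
    procs_per_node=$1
    cpus_per_node=$2
    if [ "$procs_per_node" -le "$cpus_per_node" ]; then
        # Each process can keep a processor busy, so spin locks pay off.
        echo usysv
    else
        # Oversubscribed node: blocking semaphores let other processes run.
        echo sysv
    fi
}

# Print a command line for 2 processes per node on 4-processor nodes.
echo "mpirun -ssi rpi $(choose_shmem_rpi 2 4) C my_mpi_program"
```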
lamssi(7),
lamssi_cr(7),
mpirun(1), LAM User's Guide