samtools-sort - sorts SAM/BAM/CRAM files
samtools sort [-l level] [-u] [-m maxMem] [-o out.bam] [-O format] [-M]
[-K kmerLen] [-n] [-t tag] [-T tmpprefix] [-@ threads]
[in.sam|in.bam|in.cram]
Sort alignments by leftmost coordinates, or by read name when -n is used.
An appropriate @HD-SO sort order header tag will be added or an
existing one updated if necessary.
The sorted output is written to standard output by default, or to the
specified file (out.bam) when -o is used. This command will also create
temporary files tmpprefix.%d.bam as needed when the entire alignment data
cannot fit into memory (as controlled via the -m option).
Consider using samtools collate instead if you need name-collated data
without a full lexicographical sort.
Note that if the sorted output file is to be indexed with samtools index,
the default coordinate sort must be used. Thus the -n and -t options are
incompatible with samtools index.
- -K INT
- Sets the kmer size to be used in the -M option. [20]
- -l INT
- Set the desired compression level for the final output
file, ranging from 0 (uncompressed) or 1 (fastest but minimal compression)
to 9 (best compression but slowest to write), similarly to
gzip(1)'s compression level setting.
- If -l is not used, the default compression level
will apply.
- -u
- Set the compression level to 0, for uncompressed output.
This is a synonym for -l 0.
- -m INT
- Approximately the maximum required memory per thread,
specified either in bytes or with a K, M, or G
suffix. [768 MiB]
- To prevent sort from creating a huge number of temporary
files, it enforces a minimum value of 1M for this setting.
- -M
- Sort unmapped reads (those in chromosome "*") by
their sequence minimiser (Schleimer et al., 2003; Roberts et al., 2004),
also reverse complementing as appropriate. This has the effect of
collating some similar data together, improving the compressibility of the
unmapped sequence. The minimiser kmer size is adjusted using the -K
option. Note data compressed in this manner may need to be name collated
prior to conversion back to fastq.
- Mapped sequences are sorted by chromosome and
position.
- -n
- Sort by read names (i.e., the QNAME field) rather
than by chromosomal coordinates.
- -t TAG
- Sort first by the value in the alignment tag TAG, then by
position or name (if also using -n).
- -o FILE
- Write the final sorted output to FILE, rather than
to standard output.
- -O FORMAT
- Write the final output as sam, bam, or cram.
By default, samtools tries to select a format based on the -o
filename extension; if output is to standard output or no format can be
deduced, bam is selected.
- -T PREFIX
- Write temporary files to PREFIX.nnnn.bam, or if the specified PREFIX is
an existing directory, to PREFIX/samtools.mmm.mmm.tmp.nnnn.bam, where mmm
is unique to this invocation of the sort command.
- By default, any temporary files are written alongside the
output file, as out.bam.tmp.nnnn.bam, or if
output is to standard output, in the current directory as
samtools.mmm.mmm.tmp.nnnn.bam.
- -@ INT
- Set number of sorting and compression threads. By default,
operation is single-threaded.
- --no-PG
- Do not add a @PG line to the header of the output
file.
- --template-coordinate
- Sorts by template-coordinate, whereby the sort order (@HD
SO) is unsorted, the group order (GO) is query, and the
sub-sort (SS) is template-coordinate.
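To illustrate the idea behind the -M option described above, the following Python sketch computes a strand-independent minimiser for each read and sorts unmapped sequences by it. This is only an illustration and not the samtools implementation: the helper names revcomp and minimiser are hypothetical, and ranking k-mers in plain lexicographic order is an assumption made here for simplicity.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (characters such as N are kept)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def minimiser(seq: str, k: int = 20) -> str:
    """Smallest k-mer found on either strand.

    Lexicographic ordering of k-mers is an assumption of this sketch.
    """
    strands = (seq, revcomp(seq))
    kmers = [s[i:i + k] for s in strands for i in range(len(s) - k + 1)]
    # Reads shorter than k have no k-mer; fall back to the smaller strand.
    return min(kmers) if kmers else min(strands)


# The second and third reads are reverse complements of each other, so they
# share a minimiser and end up next to each other after sorting.
reads = ["CCGGCCGG", "GGGTTTAC", "GTAAACCC"]
ordered = sorted(reads, key=lambda s: minimiser(s, k=3))
```

Because both strands are considered, a read and its reverse complement receive the same key, which is what brings similar sequences together and helps compression.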
Ordering Rules
The following rules are used for ordering records.
If option -t is in use, records are first sorted by the value of the given
alignment tag, and then by position or name (if using -n). For example,
“-t RG” will make read group the primary sort key. The rules for ordering
by tag are:
- Records that do not have the tag are sorted before ones that do.
- If the types of the tags are different, they will be sorted so that
single character tags (type A) come before array tags (type B), then
string tags (types H and Z), then numeric tags (types f and i).
- Numeric tags (types f and i) are compared by value. Note that
comparisons of floating-point values are subject to issues of rounding
and precision.
- String tags (types H and Z) are compared based on the binary contents of
the tag using the C strcmp(3) function.
- Character tags (type A) are compared by binary character value.
- No attempt is made to compare tags of other types; notably, type B array
values will not be compared.
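The tag rules above can be sketched as a Python sort key. This is a minimal illustration under stated assumptions, not the samtools code: the names TYPE_RANK and tag_sort_key are hypothetical, a tag is modelled as a (type letter, value) pair or None when absent, and Python's stable sort simply leaves type B arrays in input order relative to each other.

```python
from typing import Optional, Tuple, Union

TagValue = Union[str, int, float, list]
# A tag is (SAM type letter, value), or None when the record lacks the tag.
Tag = Optional[Tuple[str, TagValue]]

# Rank of each tag type; lower ranks sort first. Rank 0 is "tag missing".
TYPE_RANK = {"A": 1, "B": 2, "H": 3, "Z": 3, "f": 4, "i": 4}


def tag_sort_key(tag: Tag) -> tuple:
    """Build a sort key following the tag ordering rules listed above."""
    if tag is None:
        return (0,)                    # records without the tag come first
    tag_type, value = tag
    rank = TYPE_RANK[tag_type]         # other types are not handled here
    if tag_type == "A":
        return (rank, ord(value))      # binary character value
    if tag_type in ("H", "Z"):
        return (rank, value.encode())  # byte-wise comparison, like strcmp(3)
    if tag_type in ("f", "i"):
        return (rank, value)           # numeric comparison by value
    return (rank,)                     # type B: array contents not compared


records = [
    ("r1", ("Z", "b")),
    ("r2", None),
    ("r3", ("i", 5)),
    ("r4", ("A", "x")),
    ("r5", ("Z", "a")),
    ("r6", ("f", 1.5)),
]
ordered_names = [name for name, tag in sorted(records, key=lambda r: tag_sort_key(r[1]))]
```

In samtools the remaining ties are then broken by position or name; the sketch omits that secondary key.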
When the -n option is present, records are sorted by name. Names are
compared so as to give a “natural” ordering, i.e. sections consisting of
digits are compared numerically while all other sections are compared
based on their binary representation. This means “a1” will come before
“b1” and “a9” will come before “a10”. Records with the same name will be
ordered according to the values of the READ1 and READ2 flags (see flags).
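The “natural” name comparison can be sketched in Python as follows. This is an illustrative comparator rather than the samtools implementation: the function name natural_cmp is hypothetical, numbers that differ only in leading zeros are treated as equal here (an assumption), and the READ1/READ2 tie-break is left out.

```python
from functools import cmp_to_key


def _is_digit(byte: int) -> bool:
    return 0x30 <= byte <= 0x39  # ASCII '0'..'9'


def natural_cmp(a: str, b: str) -> int:
    """Compare names: digit runs numerically, everything else byte by byte."""
    x, y = a.encode(), b.encode()
    i = j = 0
    while i < len(x) and j < len(y):
        if _is_digit(x[i]) and _is_digit(y[j]):
            # Find the end of the digit run in each name.
            i_end, j_end = i, j
            while i_end < len(x) and _is_digit(x[i_end]):
                i_end += 1
            while j_end < len(y) and _is_digit(y[j_end]):
                j_end += 1
            num_x, num_y = int(x[i:i_end]), int(y[j:j_end])
            if num_x != num_y:
                return -1 if num_x < num_y else 1
            i, j = i_end, j_end
        else:
            if x[i] != y[j]:
                return -1 if x[i] < y[j] else 1
            i += 1
            j += 1
    # If one name is a prefix of the other, the shorter name sorts first.
    return int(i < len(x)) - int(j < len(y))


names = ["a10", "b1", "a9", "a1"]
ordered = sorted(names, key=cmp_to_key(natural_cmp))
```

A plain string sort would place “a10” before “a9”; comparing the digit runs as numbers is what reverses that.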
When the --template-coordinate option is in use, the reads are sorted by:
1. The earlier unclipped 5' coordinate of the template.
2. The higher unclipped 5' coordinate of the template.
3. The library (from the read group).
4. The molecular identifier (MI tag if present).
5. The read name.
6. If unpaired, or if R1 has the lower coordinates of the pair.
When none of the above options are in use, reads are sorted by reference
(according to the order of the @SQ header records), then by position in the
reference, and then by the REVERSE flag.
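The default ordering can be sketched as a three-part Python sort key. The Record type and coordinate_key function are hypothetical names used only for illustration; the sketch assumes that forward-strand records precede reverse-strand ones at the same position, and it does not handle unmapped reads.

```python
from typing import Dict, List, NamedTuple, Tuple


class Record(NamedTuple):
    rname: str     # reference sequence name
    pos: int       # leftmost position on the reference
    reverse: bool  # True when the REVERSE flag is set


def coordinate_key(rec: Record, sq_rank: Dict[str, int]) -> Tuple[int, int, bool]:
    """Key: reference order from the @SQ lines, then position, then strand."""
    return (sq_rank[rec.rname], rec.pos, rec.reverse)


# Order of the @SQ header lines; note it need not be alphabetical.
sq_names: List[str] = ["chr2", "chr1"]
sq_rank = {name: index for index, name in enumerate(sq_names)}

records = [
    Record("chr1", 100, False),
    Record("chr2", 500, True),
    Record("chr2", 500, False),
    Record("chr2", 10, False),
]
ordered = sorted(records, key=lambda r: coordinate_key(r, sq_rank))
```

Here chr2 records come first because chr2 is listed first in the header, which shows that the reference order follows the @SQ lines and not the reference names.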
Note
Historically samtools sort also accepted a less flexible way of specifying
the final and temporary output filenames:
- samtools sort [-f] [-o] in.bam out.prefix
This has now been removed. The previous out.prefix argument (and -f
option, if any) should be changed to an appropriate combination of
-T PREFIX and -o FILE. The previous -o option should be removed, as output
defaults to standard output.
Written by Heng Li from the Sanger Institute with numerous subsequent
modifications.
samtools(1), samtools-collate(1), samtools-merge(1)
Samtools website: <http://www.htslib.org/>