bamsort - sort BAM files by coordinate or query name
bamsort [options]
bamsort reads a BAM, SAM or CRAM file, sorts it by coordinate (lexicographically
by reference sequence id and position on the reference sequence), by query name
(optionally using the HI aux tag to order alignments sharing the same query
name), by a hash value computed from the query name, or by an aux tag value,
and writes the sorted file in BAM, SAM or CRAM format.
Lexicographical order denotes that pairs (a,b) and (c,d) will be ordered such
that (a,b) < (c,d) if either a < c or a = c and b < d. For
coordinates this means that the alignments are first grouped by reference
sequence id (i.e. all alignments for one chromosome appear in one block) and
within the block for each reference sequence the alignments are ordered by the
start position on this sequence.
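The coordinate order described above can be illustrated with a small Python sketch. It assumes each alignment is reduced to a (reference sequence id, start position) pair; the values are made up for illustration and this is not bamsort's own code.

```python
# Each alignment is reduced to a (reference id, start position) pair.
# The values are invented for illustration.
alignments = [(1, 500), (0, 900), (1, 20), (0, 15)]

# Python compares tuples lexicographically: the first elements decide,
# and the second elements are only consulted when the first are equal.
by_coordinate = sorted(alignments)

print(by_coordinate)
```

All pairs with reference id 0 end up in one block ahead of those with reference id 1, and each block is ordered by start position.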
The order by query name decomposes each read name into numeric and non-numeric
parts. The read name A15_30_C50, for instance, is split into the components A,
15, _, 30, _C and 50. Read names are compared lexicographically along this
decomposition, with number fields compared as numbers. For example, A15<B12
because A<B, and A9<A12 because A=A and 9<12 (where 9 and 12 are treated as
numbers and not as sequences of digits).
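A minimal Python sketch of this decomposition and comparison follows. The helper names split_name and name_key are hypothetical, and the rule that a number field sorts before a text field at the same index is an assumption of the sketch; the text above does not specify that case.

```python
import re

def split_name(name):
    """Split a read name into alternating non-numeric and numeric parts."""
    return re.findall(r"[0-9]+|[^0-9]+", name)

def name_key(name):
    """Sort key: number fields compare as integers, other fields as strings."""
    key = []
    for part in split_name(name):
        if "0" <= part[0] <= "9":
            # Number field: compared by value, not digit by digit.
            key.append((0, int(part), ""))
        else:
            # Text field. Placing numbers before text when the two meet at
            # the same index is an assumption made only for this sketch.
            key.append((1, 0, part))
    return key

print(split_name("A15_30_C50"))
print(sorted(["A12", "B12", "A9", "A15"], key=name_key))
```

A plain string comparison would place A12 before A9; the key above places A9 first because 9 < 12.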
The order by hash value computes a hash value (effectively a random number) for
each read name and orders the alignments by this number in increasing order.
Alignments assigned the same hash value are ordered by query name.
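The following Python sketch shows the idea of ordering by a hash with the query name as tie-breaker. It uses zlib.crc32 purely as a stand-in, since the Python standard library has no Murmur3, and it breaks ties with a plain string comparison; both are simplifications and not what bamsort itself does.

```python
import zlib

def hash_key(name):
    """Sort key: hash of the read name first, the name itself as tie-breaker."""
    # crc32 is only a stand-in for the Murmur3 hash named in this manual.
    return (zlib.crc32(name.encode()), name)

names = ["read3", "read1", "read2", "read1"]
shuffled = sorted(names, key=hash_key)

print(shuffled)
```

The resulting order looks arbitrary but is reproducible, and alignments sharing a read name stay next to each other because they share a hash value.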
The order by aux tag compares alignments by the value of a given aux field
storing a string value. This string comparison follows the same order used for
comparing query names stated above. Alignments with the same aux value are
sorted by coordinate order.
If the memory buffer given is not sufficiently large to process the input file,
then the program writes intermediate results to a temporary file. This file
can be large and, depending on the compression of the input file, larger than
the input itself.
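The general technique of spilling sorted runs to temporary files and merging them can be sketched in Python as follows. This is a generic external merge sort over strings, given only to illustrate why temporary files appear; the function external_sort and its parameters are hypothetical, and the sketch makes no claim about how bamsort organises its temporary file internally.

```python
import heapq
import tempfile

def external_sort(items, buffer_size):
    """Sort strings (without newlines) holding at most buffer_size in memory."""
    runs = []   # temporary files, each containing one sorted run
    buf = []    # in-memory buffer

    def spill():
        # Sort the buffer and write it out as one run.
        buf.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(item + "\n" for item in buf)
        run.seek(0)
        runs.append(run)
        buf.clear()

    for item in items:
        buf.append(item)
        if len(buf) >= buffer_size:
            spill()

    if not runs:
        # Everything fit into the buffer; no temporary file is needed.
        return sorted(buf)
    if buf:
        spill()

    # Merge the sorted runs while reading them back line by line.
    streams = [(line.rstrip("\n") for line in run) for run in runs]
    merged = list(heapq.merge(*streams))
    for run in runs:
        run.close()
    return merged

print(external_sort(["d", "a", "c", "b", "e"], buffer_size=2))
```

A larger buffer means fewer runs and less temporary data, which corresponds to raising blockmb in bamsort.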
The following key=value pairs can be given:
SO=<coordinate|queryname|hash|tag|tagonly|queryname_HI|queryname_lexicographic>:
set the sort order. Valid values are
- coordinate: sort alignments by coordinate
- queryname: sort alignments by query name
- hash: sort alignments by (Murmur3) hash of query name. This effectively puts
them in a random order.
- tag: sort alignments by string aux field. The tag of the aux field needs to
be provided using the sorttag key. Entries with identical tag are sorted by
coordinate.
- tagonly: sort alignments by string aux field. The tag of the aux field needs
to be provided using the sorttag key. Entries with identical tag are left in
the same order as they were in the input.
- queryname_HI: sort alignments by query name. Alignments with identical query
name are sorted by the value of their HI aux field.
- queryname_lexicographic: sort alignments by query name using a purely
lexicographic comparison instead of the more sophisticated version described
above.
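The difference between SO=tag and SO=tagonly can be illustrated with a short Python sketch. The record type Aln and its field values are hypothetical, and tag values are compared as plain strings here rather than with the query name comparison, which is a simplification.

```python
from collections import namedtuple

# Hypothetical minimal record: aux tag value, reference id, start position.
Aln = namedtuple("Aln", "tag refid pos")

records = [
    Aln("BC2", 0, 300),
    Aln("BC1", 1, 50),
    Aln("BC2", 0, 100),
    Aln("BC1", 0, 700),
]

# SO=tag: ties on the tag value are broken by coordinate.
so_tag = sorted(records, key=lambda a: (a.tag, a.refid, a.pos))

# SO=tagonly: only the tag is compared. sorted() is stable, so records
# with the same tag keep their input order.
so_tagonly = sorted(records, key=lambda a: a.tag)

for a in so_tag:
    print("tag    ", a)
for a in so_tagonly:
    print("tagonly", a)
```

Both results group the records by tag; they differ only in how the records inside each group are arranged.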
level=<-1|0|1|9|11>: set compression level of the output BAM file.
Valid values are
- -1: zlib/gzip default compression level
- 0: uncompressed
- 1: zlib/gzip level 1 (fast) compression
- 9: zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is
- 11: igzip compression
verbose=<1>: Valid values are
- 1: print progress report on standard error
- 0: do not print progress report
blockmb=<1024>: set size of the internal memory sorting buffer in
megabytes. The default buffer size is one gigabyte.
tmpfile=<filename>: set the prefix for temporary file names
disablevalidation=<0|1>: sets whether input validation is
performed. Valid values are
- 0: validation is enabled (default)
- 1: validation is disabled
md5=<0|1>: md5 checksum creation for output file. This option can
only be given if outputformat=bam. Then valid values are
- 0: do not compute checksum. This is the default.
- 1: compute checksum. If the md5filename key is set, then the checksum is
written to the given file. If md5filename is unset, then no checksum will be
computed.
md5filename=<filename>: file name for the md5 checksum if md5=1.
index=<0|1>: compute BAM index for output file. This option can
only be given if outputformat=bam. Then valid values are
- 0: do not compute BAM index. This is the default.
- 1: compute BAM index. If the indexfilename key is set, then the BAM index is
written to the given file. If indexfilename is unset, then no BAM index will
be computed.
indexfilename=<filename>: file name for the output BAM index if index=1.
inputformat=<bam>: input file format. All versions of bamsort come
with support for the BAM input format. If the program in addition is linked to
the io_lib package, then the following options are valid:
- bam: BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam: SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram: CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)
outputformat=<bam>: output file format. All versions of bamsort
come with support for the BAM output format. If the program in addition is
linked to the io_lib package, then the following options are valid:
- bam: BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam: SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram: CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit). This format is
not advisable for data sorted by query name.
I=<[stdin]>: input filename, standard input if unset.
O=<[stdout]>: output filename, standard output if unset.
inputthreads=<[1]>: input helper threads, only valid for
inputformat=bam.
sortthreads=<[1]>: number of threads used for sorting.
outputthreads=<[1]>: output helper threads, only valid for
outputformat=bam.
reference=<[]>: reference FastA file for inputformat=cram and
outputformat=cram. An index file (.fai) is required.
range=<>: input range to be processed. This option is only valid if
the input is a coordinate-sorted and indexed BAM file.
fixmates=<0|1>: fix mate information as bamfixmateinformation would
do. Input is assumed to be collated by query name (no changes will be applied
to mates which are not adjacent in the input stream). By default this option
is disabled.
calmdnm=<0|1>: calculate the MD and NM fields as a side effect. By
default the fields are not calculated. Calculation is only performed if
sorting is performed by coordinate. If calmdnm=1, then the parameter
calmdnmreference is required. The supported file formats can be found in the
manual page for bammdnm.
calmdnmreference=<[]>: name of reference sequence file if
calmdnm=1.
calmdnmrecompindetonly=<0|1>: compute MD/NM fields in the presence
of indeterminate (N) bases only. This option is only relevant if calmdnm=1. By
default the fields are computed for all mapped alignments if calmdnm=1.
calmdnmwarnchange=<0|1>: warn if a computed MD/NM field differs from a
previously existing field. By default no warnings are produced.
adddupmarksupport=<0|1>: add information required for streaming
duplicate marking in the aux fields MS and MC. Input is assumed to be collated
by query name. This option is ignored unless fixmates=1. By default it is
disabled.
markduplicates=<[0]>: mark duplicate read pairs and reads. This
option can only be used when a name collated file (all reads for a name are
consecutive in the input) is sorted into coordinate order. In addition the
input is required not to contain orphan reads (pair ends such that the other
end of the pair is not contained in the file). Setting markduplicates=1
implies adddupmarksupport=1. The temporarily added auxiliary fields are
removed during output generation. The markduplicates option is disabled by
default.
rmdup=<[0]>: remove the duplicates marked by the markduplicates
option. As this requires markduplicates=1, the requirements stated for
markduplicates also apply for rmdup.
tag=<tag>: name of the auxiliary field storing tag information for
duplicate marking in string form. Read fragments or pairs with different tags
will not be considered duplicates, even if they would be according to their
mapping coordinates. For pairs the tag field information of the first and
second mate are concatenated to obtain the tag of the pair.
nucltag=<tag>: this option works like the tag option but is
restricted to sequences of nucleotides (A,C,G or T) as tags. The length of
each tag sequence is not allowed to exceed 15 bases. All tags are required to
have the same length. Each non nucleotide symbol is mapped to A. In contrast
to the tag option, nucltag uses less memory for processing and can be expected
to be faster.
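One way such short nucleotide tags can be stored compactly is a 2-bit code per base, so that 15 bases fit into 30 bits of a single integer. The Python sketch below illustrates this idea only. The 2-bit packing, the function pack_nucl_tag and the code table are assumptions made for the example; this manual does not state how bamsort stores nucltag values.

```python
# Assumed 2-bit code per base. Any other symbol is mapped to A, as the
# nucltag description requires.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
MAX_LEN = 15  # length limit stated for nucltag

def pack_nucl_tag(tag):
    """Pack a nucleotide tag into an integer, two bits per base."""
    if len(tag) > MAX_LEN:
        raise ValueError("tag longer than 15 bases")
    value = 0
    for base in tag:
        # Unknown symbols fall back to the code for A (0).
        value = (value << 2) | CODES.get(base, 0)
    return value

print(pack_nucl_tag("ACGT"))
print(pack_nucl_tag("NCGT"))
```

Because N is mapped to A, the tags ACGT and NCGT become indistinguishable in this representation. Comparing small integers instead of strings is one plausible reason for the lower memory use and higher speed mentioned above.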
M=<stderr>: name of the metrics file for duplicate marking (metrics
are written to standard error if not set)
streaming=<0|1>: do not open input file(s) multiple times if set to
1. When given multiple input files bamsort concatenates the files on the fly
and computes a merged header before starting the data processing. Computing
the header of the output file requires opening each input file. If each input
file can only be opened once (as it may take the form of a pipe or socket
connection), then bamsort will keep all the files open at the same time.
Otherwise the files will be opened only as needed to keep the number of open
file descriptors lower.
sorttag=<>: tag of the aux field used for comparison when SO=tag or SO=tagonly.
hash=<crc32prod>: hash used for producing bamseqchksum type header
fields in sorted output.
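As a usage sketch of the key=value syntax documented above, the following Python snippet assembles a bamsort command line. The helper bamsort_argv and the file names in.bam and out.bam are hypothetical; the keys SO, I, O and blockmb are the ones described in this manual. The snippet only builds and prints the argument list and does not run bamsort.

```python
def bamsort_argv(**options):
    """Build an argument vector for bamsort from key=value options."""
    return ["bamsort"] + [f"{key}={value}" for key, value in options.items()]

# Sort by query name with a 2048 MB in-memory buffer.
argv = bamsort_argv(SO="queryname", I="in.bam", O="out.bam", blockmb=2048)

print(" ".join(argv))
# To execute it (requires bamsort on the PATH):
#   import subprocess; subprocess.run(argv, check=True)
```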
Written by German Tischler.
Report bugs to <[email protected]>
Copyright © 2009-2016 German Tischler, © 2011-2014 Genome Research
Limited. License GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.