bamsormadup - sort name collated SAM or BAM file by coordinate and mark
duplicates or sort SAM or BAM file by query name
bamsormadup [options]
bamsormadup has two modes of operation depending on the value of the SO
parameter. If SO=coordinate or if the SO key is not given, then bamsormadup
reads a name collated BAM or SAM file from standard input, runs a fix mate
process, sorts the contained alignments by coordinate, marks duplicate
alignments and writes the sorted alignments to standard output in BAM format.
An alignment file is name collated if all the alignments for one read name
appear consecutively in the file. If SO=queryname then the program reads a BAM
or SAM file from standard input, sorts it by queryname and writes the sorted
file to standard output in BAM format.
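The name collation requirement for the coordinate mode can be illustrated with a short Python sketch. It is not part of bamsormadup: the function name is_name_collated is hypothetical, and it works on a plain sequence of read names rather than on a BAM or SAM file.

```python
from typing import Iterable


def is_name_collated(read_names: Iterable[str]) -> bool:
    """Return True if every read name occupies one consecutive run."""
    finished = set()   # names whose run has already ended
    previous = None    # name of the alignment seen last
    for name in read_names:
        if name != previous:
            if name in finished:
                # The name reappears after other names: not collated.
                return False
            if previous is not None:
                finished.add(previous)
            previous = name
    return True
```

A file that passes such a check does not need to be sorted by name; the names only have to be grouped.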
The following key=value pairs can be given:
level=<-1|0|1|9|11>: set compression level of the output BAM file.
Valid values are
- -1: zlib/gzip default compression level
- 0: uncompressed
- 1: zlib/gzip level 1 (fast) compression
- 9: zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is
- 11: igzip compression
inputformat=<bam>: set the input file format. This can be either
bam or sam (see
http://samtools.sourceforge.net/SAM1.pdf)
threads=<[1]>: number of threads used
M=<stderr>: name of the metrics file for duplicate marking (metrics
are written to standard error if not set)
tmpfile=<bamsormadup_hostname_pid_starttime>: prefix for temporary
files. By default the temporary files are created in the current directory.
Set tmpfile=mem:tmp_ to store temporary files in RAM instead of on disk. Note
that this may require very large amounts of RAM depending on the input.
SO=<coordinate|queryname>: set the sort order. Valid values are
- coordinate: sort alignments by coordinate. Input is assumed to be name
collated.
- queryname: sort alignments by query name. No assumption is made on the
order of the input.
reference=<>: name of reference FastA file when writing CRAM. This
file will be used for filling missing UR and M5 fields of SQ header lines. It
may refer to a local file or a file stored on an http or ftp server. The file
is decompressed on the fly if the file name ends in .gz. If the REF_CACHE
environment variable is set to the name of an existing directory, then
normalised cache files will be written to this directory for each reference
sequence. The file names are constructed from the directory name and the MD5
checksum of each reference sequence. Cache files are not written, however,
if an existing file is found in one of the read-only cache locations given
by the REF_PATH environment variable.
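As a rough illustration of how a cache file name could be derived from the directory name and the MD5 checksum, consider the Python sketch below. It makes two assumptions that this page does not confirm: that normalisation follows the M5 rule of the SAM specification (characters outside '!' to '~' are dropped, the rest is converted to upper case), and that the checksum is appended directly to the directory name. The function names are hypothetical.

```python
import hashlib
import os


def normalise_sequence(sequence: str) -> str:
    # Assumption: SAM specification M5 rule. Keep only characters in the
    # range '!'..'~' and convert them to upper case.
    return "".join(c for c in sequence if "!" <= c <= "~").upper()


def cache_file_name(cache_dir: str, sequence: str) -> str:
    normalised = normalise_sequence(sequence).encode("ascii")
    digest = hashlib.md5(normalised).hexdigest()
    # Assumption: the digest is used as the file name inside cache_dir.
    return os.path.join(cache_dir, digest)
```

Under these assumptions, two FastA entries that differ only in case or line breaks map to the same cache file.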
optminpixeldif=<100>: maximum pixel distance (in x and y, within the same
tile) within which reads are considered optical duplicates
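The distance test can be pictured with the following Python sketch. It is a guess at the comparison, not the program's actual code: it assumes that x and y are compared separately, that the threshold is inclusive, and that tile and pixel coordinates have already been extracted from the read names. ReadPosition and within_optical_distance are hypothetical names, and other conditions for duplicate marking are not modelled.

```python
from typing import NamedTuple


class ReadPosition(NamedTuple):
    tile: int
    x: int
    y: int


def within_optical_distance(a: ReadPosition, b: ReadPosition,
                            optminpixeldif: int = 100) -> bool:
    # Reads on different tiles are never compared by pixel distance.
    if a.tile != b.tile:
        return False
    # Assumption: per-axis comparison with an inclusive threshold.
    return (abs(a.x - b.x) <= optminpixeldif
            and abs(a.y - b.y) <= optminpixeldif)
```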
rcsupport=<0>: if 1 then create rc aux field (unclipped
coordinate) for mapped reads when sorting from query to coordinate order
numerical=<>: store numerical index in the given file. By default
no numerical index is stored.
numericalindexmod=<1024>: block size used for producing the numerical index
fragmergepar=<1>: number of threads used for merging fragment lists in
duplicate marking. The run-time will generally benefit from an increased
number here, but parallel merging requires a large number of simultaneously
open files, which will cause problems on some systems.
crammode=<>: CRAM encoding profile. See the documentation for the scramble
program for possible options.
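The key=value pairs above can be assembled into a command line programmatically. The Python sketch below is one possible wrapper, not something shipped with bamsormadup: build_argv and run_bamsormadup are hypothetical helper names, the key list is copied from this page, and the program is assumed to be on the PATH.

```python
import subprocess
from typing import Dict, List

# Keys documented on this page.
KNOWN_KEYS = {
    "level", "inputformat", "threads", "M", "tmpfile", "SO", "reference",
    "optminpixeldif", "rcsupport", "numerical", "numericalindexmod",
    "fragmergepar", "crammode",
}


def build_argv(options: Dict[str, object]) -> List[str]:
    """Turn a dict of options into a bamsormadup argument vector."""
    unknown = set(options) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"undocumented keys: {sorted(unknown)}")
    return ["bamsormadup"] + [f"{key}={value}" for key, value in options.items()]


def run_bamsormadup(options: Dict[str, object],
                    input_path: str, output_path: str) -> None:
    # Input is read from standard input, output is written to standard output.
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        subprocess.run(build_argv(options), stdin=src, stdout=dst, check=True)
```

For example, build_argv({"SO": "coordinate", "threads": 4, "M": "metrics.txt"}) yields the arguments for a coordinate sort with duplicate marking on four threads, with metrics written to the illustrative file name metrics.txt.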
Written by German Tischler.
Report bugs to <[email protected]>
Copyright © 2009-2015 German Tischler, © 2011-2015 Genome Research
Limited. License GPLv3+: GNU GPL version 3
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.