NAME

bamsormadup - sort a name collated SAM or BAM file by coordinate and mark duplicates, or sort a SAM or BAM file by query name

SYNOPSIS

bamsormadup [options]

DESCRIPTION

bamsormadup has two modes of operation, selected by the value of the SO parameter. If SO=coordinate, or if the SO key is not given, bamsormadup reads a name collated BAM or SAM file from standard input, runs a fix mate process, sorts the contained alignments by coordinate, marks duplicate alignments and writes the sorted alignments to standard output in BAM format. An alignment file is name collated if all the alignments for one read name appear consecutively in the file. If SO=queryname, the program reads a BAM or SAM file from standard input, sorts it by query name and writes the sorted file to standard output in BAM format.
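The name collation requirement can be illustrated with a small sketch. This is not part of bamsormadup; the function name is_name_collated is hypothetical, and the input is assumed to be the sequence of read names in file order.

```python
from typing import Iterable


def is_name_collated(read_names: Iterable[str]) -> bool:
    """Return True if all entries for each read name appear consecutively."""
    finished = set()  # names whose consecutive run has already ended
    current = None    # name of the run currently being read
    for name in read_names:
        if name == current:
            continue
        if name in finished:
            # The name reappears after a different name intervened.
            return False
        if current is not None:
            finished.add(current)
        current = name
    return True


# Collated input does not have to be sorted by name.
print(is_name_collated(["r2", "r2", "r1", "r1", "r3"]))  # True
# Here the entries for r1 are separated by r2.
print(is_name_collated(["r1", "r2", "r1"]))              # False
```

Input that fails a check like this does not meet the stated assumption of the SO=coordinate mode.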
The following key=value pairs can be given:
level=<-1|0|1|9|11>: set compression level of the output BAM file. Valid values are
-1: zlib/gzip default compression level
0: uncompressed
1: zlib/gzip level 1 (fast) compression
9: zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data) then an additional valid value is
11: igzip compression
inputformat=<bam>: set the input file format. This can be either bam or sam (see http://samtools.sourceforge.net/SAM1.pdf)
threads=<[1]>: number of threads used
M=<stderr>: name of the metrics file for duplicate marking (metrics are written to standard error if not set)
tmpfile=<bamsormadup_hostname_pid_starttime>: prefix for temporary files. By default the temporary files are created in the current directory. Set tmpfile=mem:tmp_ to store temporary files in RAM instead of on disk. Note that this may require very large amounts of RAM depending on the input.
SO=<coordinate|queryname>: set the sort order. Valid values are
coordinate: sort alignments by coordinate. Input is assumed to be name collated.
queryname: sort alignments by query name. No assumption is made on the order of the input.
reference=<>: name of reference FastA file when writing CRAM. This file will be used for filling missing UR and M5 fields of SQ header lines. It may refer to a local file or a file stored on an http or ftp server. The file is uncompressed on the fly if the file name ends in .gz. If the REF_CACHE environment variable is set to the name of an existing directory, then normalised cache files will be written to this directory for each reference sequence. The file names are constructed from the directory name and the MD5 checksum of each reference sequence. This writing of cached files is omitted, however, if a previously existing file is found in the list of read only cache locations given by the REF_PATH environment variable.
optminpixeldif=<100>: distance (x and y inside same tile) inside which reads are considered as optical duplicates
rcsupport=<0>: if 1 then create rc aux field (unclipped coordinate) for mapped reads when sorting from query to coordinate order
numerical=<>: store numerical index in the given file. By default no numerical index is stored.
numericalindexmod=<1024>: use this block size for producing numerical index
fragmergepar=<1>: number of threads used for merging fragment lists in duplicate marking. The run-time will generally benefit from an increased number here, but parallel merging requires a large number of simultaneously open files, which will cause problems on some systems.
crammode=<>: CRAM encoding profile. See the documentation for the scramble program for possible options.
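As an illustration of how the key=value pairs above are combined, the following sketch assembles a bamsormadup argument list from a dictionary and prints it. The helper build_bamsormadup_argv, the file names and the /tmp/bsm_ prefix are hypothetical choices made for this example. Only keys listed above are accepted.

```python
import shlex

# Keys taken from the option list above; anything else is rejected early.
KNOWN_KEYS = {
    "level", "inputformat", "threads", "M", "tmpfile", "SO", "reference",
    "optminpixeldif", "rcsupport", "numerical", "numericalindexmod",
    "fragmergepar", "crammode",
}


def build_bamsormadup_argv(options: dict) -> list:
    """Turn a dict of documented keys into a bamsormadup argument list."""
    unknown = sorted(set(options) - KNOWN_KEYS)
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(unknown))
    return ["bamsormadup"] + [f"{key}={value}" for key, value in options.items()]


# Hypothetical settings: coordinate sort with duplicate marking, four threads,
# metrics written to a file instead of standard error.
argv = build_bamsormadup_argv({
    "SO": "coordinate",
    "threads": 4,
    "M": "metrics.txt",
    "tmpfile": "/tmp/bsm_",
})
print(shlex.join(argv))
```

Because the program reads from standard input and writes to standard output, such a list would typically be run with both streams redirected, for example through subprocess.run with stdin and stdout set to open binary files.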

AUTHOR

Written by German Tischler.

REPORTING BUGS

Report bugs to <[email protected]>

COPYRIGHT

Copyright © 2009-2015 German Tischler, © 2011-2015 Genome Research Limited. License GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
 
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.