bamstreamingmarkduplicates - mark duplicate reads
bamstreamingmarkduplicates [options]
bamstreamingmarkduplicates reads a coordinate-sorted BAM, SAM or CRAM file
that has previously been processed by bamsort with the options fixmates=1
and adddupmarksupport=1, marks duplicate reads and read pairs, and writes the
result in BAM, SAM or CRAM format. Preprocessing with bamsort using these
options is mandatory; bamstreamingmarkduplicates will fail without it. In
contrast to bammarkduplicates and bammarkduplicates2, the streaming variant
processes the file in a single pass. bamstreamingmarkduplicates cannot handle
files containing orphan pair ends (pairs where one of the two ends is missing
from the file).
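The mandatory preprocessing step and the marking step can be chained in one pipeline. The following is a minimal sketch, not part of the original documentation. It assumes that both programs are on the PATH and that bamsort reads standard input, writes standard output and sorts by coordinate when no further keys are given (check the bamsort manual to confirm). The helper names build_commands and run_pipeline are hypothetical.

```python
import subprocess


def build_commands(metrics_path=None):
    """Return the argv lists for bamsort and bamstreamingmarkduplicates."""
    # Both bamsort options are mandatory according to the description above.
    sort_cmd = ["bamsort", "fixmates=1", "adddupmarksupport=1"]
    mark_cmd = ["bamstreamingmarkduplicates"]
    if metrics_path is not None:
        # M= redirects the metrics data from standard error to a file.
        mark_cmd.append(f"M={metrics_path}")
    return sort_cmd, mark_cmd


def run_pipeline(in_path, out_path, metrics_path=None):
    """Pipe in_path through bamsort into bamstreamingmarkduplicates."""
    sort_cmd, mark_cmd = build_commands(metrics_path)
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        sorter = subprocess.Popen(sort_cmd, stdin=fin, stdout=subprocess.PIPE)
        marker = subprocess.Popen(mark_cmd, stdin=sorter.stdout, stdout=fout)
        # Close our copy of the pipe so bamsort notices if the marker exits early.
        sorter.stdout.close()
        marker_rc = marker.wait()
        sorter_rc = sorter.wait()
    if sorter_rc != 0 or marker_rc != 0:
        raise RuntimeError(
            f"pipeline failed: bamsort={sorter_rc}, "
            f"bamstreamingmarkduplicates={marker_rc}"
        )
```

A call such as run_pipeline("in.bam", "out.bam", "metrics.txt") would then run both steps; the file names are placeholders.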
The following key=value pairs can be given:
M=<>: file name for metrics data. By default the metrics data is
written to the standard error channel.
level=<-1|0|1|9|11>: set compression level of the output BAM file.
Valid values are
- -1: zlib/gzip default compression level
- 0: uncompressed
- 1: zlib/gzip level 1 (fast) compression
- 9: zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is
- 11: igzip compression
verbose=<1>: Valid values are
- 1: print progress report on standard error
- 0: do not print progress report
tmpfile=<filename>: set the prefix for temporary file names
disablevalidation=<0|1>: sets whether input validation is
performed. Valid values are
- 0: validation is enabled (default)
- 1: validation is disabled
md5=<0|1>: md5 checksum creation for output file. This option can
only be given if outputformat=bam. In that case valid values are
- 0: do not compute checksum. This is the default.
- 1: compute checksum. If the md5filename key is set, then the
checksum is written to the given file. If md5filename is unset, then no
checksum will be computed.
md5filename=<filename>: file name for the md5 checksum if md5=1.
index=<0|1>: compute BAM index for output file. This option can
only be given if outputformat=bam. In that case valid values are
- 0: do not compute BAM index. This is the default.
- 1: compute BAM index. If the indexfilename key is set, then
the BAM index is written to the given file. If indexfilename is unset,
then no BAM index will be computed.
indexfilename=<filename>: file name for the output BAM index if index=1.
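The md5 and index keys only have an effect when the matching file name key is also set, and both are restricted to outputformat=bam. The following is a minimal sketch of a helper that assembles a consistent set of key=value arguments under those rules. The function name side_file_args is hypothetical, and the check is performed by this sketch, not by the program.

```python
def side_file_args(outputformat="bam", md5filename=None, indexfilename=None):
    """Build key=value arguments for the optional checksum and index files."""
    args = [f"outputformat={outputformat}"]
    if md5filename is None and indexfilename is None:
        return args
    # md5 and index can only be given if outputformat=bam.
    if outputformat != "bam":
        raise ValueError("md5 and index require outputformat=bam")
    if md5filename is not None:
        # md5=1 alone computes nothing; the file name must accompany it.
        args += ["md5=1", f"md5filename={md5filename}"]
    if indexfilename is not None:
        # Likewise, index=1 needs indexfilename to produce an index.
        args += ["index=1", f"indexfilename={indexfilename}"]
    return args
```

The returned list can be appended to the program name to form a complete argument vector.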
inputformat=<bam>: input file format. All versions of
bamstreamingmarkduplicates support the BAM input format. If the program is
additionally linked against the io_lib package, then the following values are
valid:
- bam: BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam: SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram: CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)
outputformat=<bam>: output file format. All versions of
bamstreamingmarkduplicates support the BAM output format. If the program is
additionally linked against the io_lib package, then the following values are
valid:
- bam: BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam: SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram: CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit).
This format is not advisable for data sorted by query name.
I=<[stdin]>: input filename, standard input if unset.
O=<[stdout]>: output filename, standard output if unset.
inputthreads=<[1]>: input helper threads, only valid for
inputformat=bam.
outputthreads=<[1]>: output helper threads, only valid for
outputformat=bam.
reference=<[]>: reference FastA file for inputformat=cram and
outputformat=cram. An index file (.fai) is required.
tag=<tag>: name of the auxiliary field storing tag information in string
form. Read fragments or pairs with different tags will not be considered
duplicates, even if they would be according to their mapping coordinates. For
pairs, the tag field information of the first and second mate is concatenated
to obtain the tag of the pair.
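The effect of the tag option can be pictured as adding the tag to the key under which duplicate candidates are grouped. The following is an illustrative sketch only and does not reflect the program's internal data structures. The coordinate key is treated as an opaque value, and the names pair_tag and group_duplicate_candidates are hypothetical.

```python
from collections import defaultdict


def pair_tag(first_mate_tag, second_mate_tag):
    """Tag of a pair: the tags of first and second mate, concatenated."""
    return first_mate_tag + second_mate_tag


def group_duplicate_candidates(pairs):
    """Group pairs by (coordinate key, pair tag).

    pairs is an iterable of (name, coord_key, first_tag, second_tag).
    Only pairs that share both the coordinates and the tag land in the
    same group and can be considered duplicates of each other.
    """
    groups = defaultdict(list)
    for name, coord_key, first_tag, second_tag in pairs:
        groups[(coord_key, pair_tag(first_tag, second_tag))].append(name)
    return dict(groups)
```

With this grouping, two pairs at identical coordinates but with different tags end up in different groups, so neither is marked as a duplicate of the other.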
nucltag=<tag>: this option works like the tag option but is
restricted to sequences of nucleotides (A, C, G or T) as tags. The length of
each tag sequence must not exceed 15 bases. All tags are required to
have the same length. Each non-nucleotide symbol is mapped to A. In contrast
to the tag option, nucltag uses less memory for processing and can be expected
to be faster.
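The constraints on nucltag values can be checked before running the program. The following is a minimal sketch of such a check under the rules stated above. The names normalise_nucltag and check_nucltags are hypothetical, and treating lower-case letters as non-nucleotide symbols is an assumption of this sketch, not documented behaviour.

```python
MAX_NUCLTAG_LEN = 15  # upper limit on the tag length stated above


def normalise_nucltag(tag):
    """Map every symbol other than A, C, G, T to A, enforcing the length limit."""
    if len(tag) > MAX_NUCLTAG_LEN:
        raise ValueError(f"tag longer than {MAX_NUCLTAG_LEN} bases: {tag!r}")
    # Assumption: only upper-case A, C, G, T count as nucleotides here.
    return "".join(c if c in "ACGT" else "A" for c in tag)


def check_nucltags(tags):
    """Normalise all tags and require that they share one length."""
    normalised = [normalise_nucltag(t) for t in tags]
    if len({len(t) for t in normalised}) > 1:
        raise ValueError("all tags are required to have the same length")
    return normalised
```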
filterdupmarktags=<[0]>: remove the auxiliary fields MC, MQ, MS,
and MT used for streaming duplicate marking when producing the output file. By
default the fields are not removed.
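What the filter removes can be illustrated on a SAM text record, where auxiliary fields follow the eleven mandatory columns as tab-separated TAG:TYPE:VALUE entries. The following is a sketch of the equivalent operation on a single line, not the program's implementation. The function name strip_dupmark_fields is hypothetical.

```python
DUPMARK_TAGS = {"MC", "MQ", "MS", "MT"}  # fields listed for filterdupmarktags


def strip_dupmark_fields(sam_line):
    """Drop the MC, MQ, MS and MT auxiliary fields from one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first eleven columns of a SAM alignment line are mandatory.
    mandatory, optional = fields[:11], fields[11:]
    kept = [f for f in optional if f.split(":", 1)[0] not in DUPMARK_TAGS]
    return "\t".join(mandatory + kept)
```

All other auxiliary fields, and the mandatory columns, pass through unchanged.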
Written by German Tischler.
Report bugs to <[email protected]>
Copyright © 2009-2014 German Tischler, © 2011-2014 Genome Research
Limited. License GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.