bamcollate2 - collate reads in a SAM, BAM or CRAM file by name
bamcollate2 [options]
bamcollate2 reads a SAM, BAM or CRAM file from standard input, collates the
contained reads/alignments by name and writes the resulting data to standard
output in BAM format.
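Options are passed on the command line as key=value pairs. As a minimal sketch, the following Python helper assembles such a command line from a dictionary. The helper name build_argv and the file names in.bam and out.bam are hypothetical; the keys collate, filename and O are the ones described on this page.

```python
def build_argv(options):
    """Return an argv list for bamcollate2 from a dict of key/value options."""
    argv = ["bamcollate2"]
    for key, value in options.items():
        # Each option is a single key=value token on the command line.
        argv.append(f"{key}={value}")
    return argv


# Hypothetical file names, used for illustration only.
argv = build_argv({"collate": 1, "filename": "in.bam", "O": "out.bam"})
print(" ".join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) on a system where bamcollate2 is installed.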
The following key=value pairs can be given:
collate=<0|1|2|3>: Valid values are
- 3:
- collate read pairs and attach post ranks (line numbers of
alignments in the output file) to each read. For pairs this adds the prefix
a_b_ to the read name when the first read of the pair appears in line a and
the second one in line b of the output file, e.g. the name HS5 is changed to
20_21_HS5 for both ends if read 1 appears in line 20 and read 2 in line
21. For single end reads it adds the prefix a_ to the name, where a is the
rank (line number) of the read in the output file. The pre rank (line
number in the input file) is attached to each read by putting it in the zz
auxiliary field as an eight byte number array, similar to the functionality
of bamrank.
- 2:
- collate read pairs and attach ranks (line numbers of
alignments in the source file) to each read. For pairs this adds the prefix
a_b_ to the read name when the first read of the pair appears in line a and
the second one in line b of the source file, e.g. the name HS5 is changed to
25_32_HS5 for both ends if read 1 appears in line 25 and read 2 in line
32. For single end reads it adds the prefix a_ to the name, where a is the
rank (line number) of the read in the source file.
- 1:
- collate read pairs
- 0:
- do not collate, keep reads in the original order
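The renaming scheme used by collate=2 and collate=3 can be sketched in Python as follows. This is only an illustration of the naming rule described above, not code taken from bamcollate2; the function name rank_prefixed_name is hypothetical.

```python
def rank_prefixed_name(name, *ranks):
    """Prefix a read name with one rank (single end) or two ranks (pair)."""
    if len(ranks) not in (1, 2):
        raise ValueError("expected one rank (single end) or two ranks (pair)")
    # Every rank is followed by an underscore, then the original name.
    return "".join(f"{rank}_" for rank in ranks) + name


print(rank_prefixed_name("HS5", 20, 21))  # pair, as in the collate=3 example
print(rank_prefixed_name("HS5", 7))       # single end read
```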
filename=<stdin>: input file name (data is read from standard input
if this option is not given)
inputformat=<bam>: input file format. All versions of bamcollate2
come with support for the BAM input format. If the program in addition is
linked to the io_lib package, then the following options are valid:
- bam:
- BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam:
- SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram:
- CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)
level=<-1|0|1|9|11>: set compression level of the output BAM file.
Valid values are
- -1:
- zlib/gzip default compression level
- 0:
- uncompressed
- 1:
- zlib/gzip level 1 (fast) compression
- 9:
- zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is
- 11:
- igzip compression
exclude=<SECONDARY>: Do not include reads in the output that have
any of the given flags set. The flags are given separated by commas. Valid
flags are:
- PAIRED:
- read was paired in sequencing
- PROPER_PAIR:
- read has been mapped as part of a proper pair
- UNMAP:
- read was not mapped
- MUNMAP:
- mate of read was not mapped
- REVERSE:
- read was mapped to the reverse strand
- MREVERSE:
- mate of read was mapped to the reverse strand
- READ1:
- read was first read of a pair during sequencing
- READ2:
- read was second read of a pair during sequencing
- SECONDARY:
- alignment is secondary, i.e. an alternative mapping to the
primary alignment in the same file
- QCFAIL:
- read is marked as having failed quality control
- DUP:
- read is marked as a duplicate of another read in the same
file (see bammarkduplicates)
- SUPPLEMENTARY:
- read is marked as supplementary alignment
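The flag names above correspond to bits of the SAM FLAG field. The sketch below shows how a comma separated exclude list could be turned into a bit mask and applied to a record. The bit values are taken from the SAM specification rather than from this page, and the function names exclude_mask and is_excluded are hypothetical; how bamcollate2 implements the filter internally is not described here.

```python
# FLAG bit values as defined in the SAM specification.
SAM_FLAGS = {
    "PAIRED": 0x1,
    "PROPER_PAIR": 0x2,
    "UNMAP": 0x4,
    "MUNMAP": 0x8,
    "REVERSE": 0x10,
    "MREVERSE": 0x20,
    "READ1": 0x40,
    "READ2": 0x80,
    "SECONDARY": 0x100,
    "QCFAIL": 0x200,
    "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}


def exclude_mask(spec):
    """Combine a comma separated list of flag names into one bit mask."""
    mask = 0
    for name in spec.split(","):
        mask |= SAM_FLAGS[name.strip()]
    return mask


def is_excluded(flag, mask):
    """A record is dropped if any of the masked bits is set in its FLAG."""
    return (flag & mask) != 0


mask = exclude_mask("SECONDARY,QCFAIL")
print(hex(mask), is_excluded(0x110, mask))
```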
disablevalidation=<0>: Valid values are
- 0:
- run input file validation on alignments (this is the
default)
- 1:
- do not check the validity of the input file (this may help
for some broken input files, but it is a security risk as it can lead to
the execution of arbitrary code through a forged input file).
colhlog=<18>: base two logarithm of the size of the hash table used
for collation. The default value of 18 should work reasonably well for
most input files; see the biobambam paper at arxiv.org/abs/1306.0836
for details.
colsbs=<128M>: size of the hash table overflow list in bytes. The
default of 128MB should work reasonably well for most input files; see
the biobambam paper at arxiv.org/abs/1306.0836 for details.
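To get a feeling for the magnitudes involved, the following sketch converts the two settings into plain numbers. It assumes that the suffixes K, M and G denote binary multiples (1024, 1024^2, 1024^3); this page does not state how bamcollate2 interprets them, nor the unit in which the hash table size is counted. Both function names are hypothetical.

```python
def hash_table_size(colhlog=18):
    """colhlog is a base two logarithm, so the table size is 2**colhlog."""
    return 1 << colhlog


def parse_size(text):
    """Parse a size such as '128M'. Assumption: suffixes are binary multiples."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


print(hash_table_size(18))
print(parse_size("128M"))
```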
T=<bamcollate2_hostname_pid_time>: file name of temporary file used
for collation
ranges=<>: coordinate ranges selected from input. This option is
only available for input files in BAM and CRAM format which have a
corresponding index file (.bai for BAM, .crai for CRAM) and if input is via
file (i.e. the filename argument is set). Valid ranges consist of either
- whole reference sequence:
- a whole reference sequence (e.g. "chr1")
- half open interval on reference sequence:
- an interval on a reference sequence half open on the right
(e.g. "chr1:50000" which means alignments overlapping chr1 from
position 50000 to the end of chr1)
- interval on reference sequence:
- an interval on a reference sequence (e.g.
"chr1:50000-60000" which means alignments overlapping positions
50000 to 60000 on chr1)
For BAM input multiple ranges are separated by space characters (e.g.
ranges="chr1:10000-20000 chr1:30000-40000"). CRAM input supports a
single range only.
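The three range forms can be told apart with a few string operations. The sketch below is an illustration only and assumes that reference sequence names contain neither a colon nor a space; the function names parse_range and parse_ranges are hypothetical.

```python
def parse_range(text):
    """Split one range into (reference, start, end); missing parts are None."""
    ref, colon, interval = text.partition(":")
    if not colon:
        # Whole reference sequence, e.g. "chr1".
        return (ref, None, None)
    start, dash, end = interval.partition("-")
    # Without a dash the interval is open on the right, e.g. "chr1:50000".
    return (ref, int(start), int(end) if dash else None)


def parse_ranges(value):
    """Split a space separated list of ranges (the BAM input form)."""
    return [parse_range(item) for item in value.split()]


print(parse_ranges("chr1:10000-20000 chr1:30000-40000"))
```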
reference=: file name of the reference for CRAM input files. If this key
is unset, then the CRAM file header will be scanned for obtaining a reference
file name.
md5=<0|1>: md5 checksum creation for output file. Valid values are
- 0:
- do not compute checksum. This is the default.
- 1:
- compute checksum. If the md5filename key is set, then the
checksum is written to the given file. If md5filename is unset, then no
checksum will be computed.
md5filename: file name for md5 checksum if md5=1.
index=<0|1>: compute BAM index for output file. Valid values are
- 0:
- do not compute BAM index. This is the default.
- 1:
- compute BAM index. If the indexfilename key is set, then
the BAM index is written to the given file. If indexfilename is unset,
then no BAM index will be computed.
indexfilename: file name for BAM index if index=1.
readgroups: comma separated list of read group identifiers to be kept. If
not given, all records will be kept. Read group filtering is only available
for collate=0 and collate=1 (i.e. this key is ignored for collate=2 and
collate=3).
mapqthres: mapping quality threshold. This option is only available for
collate=1 (i.e. it is ignored for collate=0 and collate>1). If this key is
set, reads are kept if the mapping quality field is at least the given value.
For paired end reads it is sufficient for a read or its mate to have a mapping
quality above the threshold.
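The keep rule can be sketched as a single predicate over the mapping qualities of a read and, for pairs, its mate. The text says "at least" for the general rule and "above" for pairs; the sketch assumes "at least" (>=) in both cases, which may differ from the actual behaviour at the boundary. The function name keep_by_mapq is hypothetical.

```python
def keep_by_mapq(mapqs, threshold):
    """Keep a single read or a pair if any mapping quality reaches the threshold.

    mapqs holds one value for a single end read and two values for a pair.
    Assumption: the comparison is >= for both single reads and pairs.
    """
    return any(mapq >= threshold for mapq in mapqs)


print(keep_by_mapq([30], 20))     # single end read
print(keep_by_mapq([5, 40], 20))  # pair, one mate reaches the threshold
```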
reset: reduce alignments to an unmapped state (see bamreset). This key is
only valid for collate=0, collate=1 or collate=3. The default value is 0 for
collate=0 and collate=1 and 1 for collate=3.
classes: types of alignment lines to be kept. This key is only valid for
collate=1. By default all alignments are kept. The value for this key is a
comma separated list consisting of a subset of the following options:
- F:
- keep first mates of complete pairs
- F2:
- keep second mates of complete pairs
- O:
- keep first mates of orphaned pairs (i.e. such that the
other mate is not in the input file)
- O2:
- keep second mates of orphaned pairs (i.e. such that the
other mate is not in the input file)
- S:
- keep single end reads
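The five classes partition the reads by pairing status. The sketch below assigns a class to a read from three properties and checks it against a classes value. It illustrates the definitions above and is not the tool's implementation; the function names and the boolean parameters paired, first and mate_present are hypothetical.

```python
def alignment_class(paired, first, mate_present):
    """Return the class code (F, F2, O, O2 or S) for one read."""
    if not paired:
        return "S"                       # single end read
    if mate_present:
        return "F" if first else "F2"    # mate of a complete pair
    return "O" if first else "O2"        # orphan, other mate not in the input


def is_kept(cls, classes="F,F2,O,O2,S"):
    """Check a class code against a comma separated classes value."""
    return cls in classes.split(",")


# Keep only complete pairs: orphans and single end reads are dropped.
print(is_kept(alignment_class(True, False, True), "F,F2"))
print(is_kept(alignment_class(True, True, False), "F,F2"))
```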
resetheadertext: file name for replacement SAM header. By default the
header of the input SAM/BAM/CRAM file is used (and filtered in case of
reset=1).
resetaux=<0|1>: remove auxiliary fields if resetaux=1. This key is
only available for reset=1. If reset=1 then the default is to remove all aux
fields.
auxfilter=<>: comma separated list of aux tags to be kept if
reset=1 and resetaux=0. If the key is not set then all tags are kept.
outputformat=<bam>: output file format. All versions of bamcollate2
come with support for the BAM output format. If the program in addition is
linked to the io_lib package, then the following options are valid:
- bam:
- BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam:
- SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram:
- CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit).
This format is not advisable for data not sorted by coordinate.
O=<[stdout]>: output filename, standard output if unset.
outputthreads=<[1]>: output helper threads, only valid for
outputformat=bam.
verbose=<1>: Valid values are
- 1:
- print progress report on standard error
- 0:
- do not print progress report
replacereadgroupnames=<>: file name containing a list of read group
mappings. Each line in the file corresponds to one read group ID replacement
and contains two columns separated by the tab symbol (ASCII code 9). The first
column contains the source identifier which will be replaced by the value of
the second column in the output file. This option is only valid for
collate<2. By default no read group identifier mapping is performed.
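A mapping file in the two column layout described above can be produced and read back with a few lines of Python. The read group identifiers in the example and the function names format_mapping and parse_mapping are hypothetical; only the layout (source identifier, tab, replacement, one mapping per line) is taken from this page.

```python
def format_mapping(mapping):
    """Render {source_id: replacement_id} as tab separated lines."""
    return "".join(f"{old}\t{new}\n" for old, new in mapping.items())


def parse_mapping(text):
    """Read the two column format back into a dict, skipping empty lines."""
    result = {}
    for line in text.splitlines():
        if line:
            old, new = line.split("\t")
            result[old] = new
    return result


text = format_mapping({"rg1": "sampleA", "rg2": "sampleB"})
print(text, end="")
```

The text can be written to a file whose name is then passed as the value of replacereadgroupnames.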
Written by German Tischler.
Report bugs to <[email protected]>
Copyright © 2009-2015 German Tischler, © 2011-2015 Genome Research
Limited. License GPLv3+: GNU GPL version 3
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.