bamdownsamplerandom - downsample a SAM, BAM or CRAM file
bamdownsamplerandom [options]
bamdownsamplerandom reads a SAM, BAM or CRAM file from standard input, randomly
discards reads and writes the remaining reads to standard output, in BAM
format by default. For a pair of reads either both ends are discarded or both ends are
kept. The order of reads in the output file may be different from the order in
the input if the reads in the input file are not collated by their read name.
The following key=value pairs can be given:
p=<1>: probability for a pair of reads or a single end read to be
kept. By default all reads are kept.
seed=<>: seed used for the random number generator. By default the
current time is used, i.e. each run of the program will select a different
subset of reads from an input file. If the behaviour of the program needs to
be reproducible a fixed number can be used as the random seed.
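The interaction of p and seed can be illustrated with a minimal Python sketch. It is not the program's implementation: the function name downsample_by_name and the (read_name, payload) record model are hypothetical, and the input is assumed to be collated so that both ends of a pair are adjacent.

```python
import random

def downsample_by_name(records, p=1.0, seed=None):
    """Keep each read name (single read or pair) with probability p.

    records: iterable of (read_name, payload) tuples, assumed to be
    collated by read name (hypothetical input model for illustration).
    """
    # seed=None lets Python pick a time/OS based seed, so each run differs;
    # a fixed integer makes the selection reproducible.
    rng = random.Random(seed)
    kept = []
    current_name = None
    keep_current = False
    for name, payload in records:
        if name != current_name:
            # One random draw per read name, so both ends of a pair
            # share the same keep/discard decision.
            current_name = name
            keep_current = rng.random() < p
        if keep_current:
            kept.append((name, payload))
    return kept
```

Calling the function twice with the same seed returns the same subset, which mirrors the reproducibility described for a fixed seed value.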
I=<stdin>: input file name (data is read from standard input if
this option is not given)
inputformat=<bam>: input file format. All versions of bamdownsamplerandom
come with support for the BAM input format. If the program in addition is
linked to the io_lib package, then the following options are valid:
- bam:
- BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam:
- SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram:
- CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)
level=<-1|0|1|9|11>: set compression level of the output BAM file.
Valid values are
- -1:
- zlib/gzip default compression level
- 0:
- uncompressed
- 1:
- zlib/gzip level 1 (fast) compression
- 9:
- zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is
- 11:
- igzip compression
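As a rough illustration of what the zlib level numbers mean, the sketch below compresses a buffer with Python's zlib module at levels -1, 0, 1 and 9. It only demonstrates the level values: BAM output is written as BGZF blocks rather than one zlib stream, level 11 (igzip) has no counterpart here, and the helper name compressed_sizes and the sample payload are hypothetical.

```python
import zlib

def compressed_sizes(data, levels=(-1, 0, 1, 9)):
    """Map each zlib level to the size in bytes of the compressed data."""
    return {level: len(zlib.compress(data, level)) for level in levels}

# Hypothetical, highly repetitive payload used only for demonstration.
sample = b"ACGTACGTTTGACCA" * 2000
sizes = compressed_sizes(sample)

for level in sorted(sizes):
    print("level", level, "->", sizes[level], "bytes")
```

Level 0 stores the data without compression, so its output is slightly larger than the input, while the other levels trade speed for size.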
exclude=<SECONDARY,SUPPLEMENTARY>: Do not include reads in the
output that have any of the given flags set. The flags are given separated by
commas. Valid flags are:
- PAIRED:
- read was paired in sequencing
- PROPER_PAIR:
- read has been mapped as part of a proper pair
- UNMAP:
- read was not mapped
- MUNMAP:
- mate of read was not mapped
- REVERSE:
- read was mapped to the reverse strand
- MREVERSE:
- mate of read was mapped to the reverse strand
- READ1:
- read was first read of a pair during sequencing
- READ2:
- read was second read of a pair during sequencing
- SECONDARY:
- alignment is secondary, i.e. an alternative mapping to the
primary alignment in the same file
- QCFAIL:
- read is marked as having failed quality control
- DUP:
- read is marked as a duplicate of another read in the same
file (see bammarkduplicates)
- SUPPLEMENTARY:
- read is marked as supplementary alignment
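The flag names above correspond to the FLAG bits defined in the SAM specification. The following sketch shows how a comma separated exclude value could be turned into a bit mask and tested against a read's FLAG field. The helper names exclude_mask and is_excluded are hypothetical and the program's own parsing may differ.

```python
# FLAG bit values as defined in the SAM specification.
SAM_FLAGS = {
    "PAIRED": 0x1,
    "PROPER_PAIR": 0x2,
    "UNMAP": 0x4,
    "MUNMAP": 0x8,
    "REVERSE": 0x10,
    "MREVERSE": 0x20,
    "READ1": 0x40,
    "READ2": 0x80,
    "SECONDARY": 0x100,
    "QCFAIL": 0x200,
    "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}

def exclude_mask(spec):
    """Combine a comma separated list of flag names into one bit mask."""
    mask = 0
    for name in spec.split(","):
        mask |= SAM_FLAGS[name.strip()]  # KeyError for an unknown flag name
    return mask

def is_excluded(flag, mask):
    """True if the read's FLAG has any of the masked bits set."""
    return (flag & mask) != 0
```

With the default exclude=SECONDARY,SUPPLEMENTARY the mask is 0x900, so a primary alignment with FLAG 99 passes while any record carrying bit 0x100 or 0x800 is dropped.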
disablevalidation=<0>: Valid values are
- 0:
- run input file validation on alignments (this is the
default)
- 1:
- do not check the validity of the input file (this may help
for some broken input files, but it is a security risk as it can lead to
the execution of arbitrary code through a forged input file).
colhlog=<18>: base two logarithm of the size of the hash table used
for collation (the default value is 18 and should work reasonably well for
most input files. Please see the biobambam paper at arxiv.org/abs/1306.0836
for details).
colsbs=<128M>: size of hash table overflow list in bytes (the
default is 128MB and should work reasonably well for most input files. Please
see the biobambam paper at arxiv.org/abs/1306.0836 for details).
T=<bamdownsamplerandom_hostname_pid_time>: file name of temporary
file used for collation
ranges=<>: coordinate ranges selected from input. This option is
only available for input files in BAM format which have a corresponding index
(.bai file) and if input is via file (i.e. the I argument is set). Valid
ranges consist either of
- whole reference sequence:
- a whole reference sequence (e.g. "chr1")
- half open interval on reference sequence:
- an interval on a reference sequence half open on the right
(e.g. "chr1:50000" which means alignments overlapping chr1 from
position 50000 to the end of chr1)
- interval on reference sequence:
- an interval on a reference sequence (e.g.
"chr1:50000-60000" which means alignments overlapping positions
50000 to 60000 on chr1)
Multiple ranges are separated by space characters (e.g.
ranges="chr1:10000-20000 chr1:30000-40000").
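The three range forms can be told apart with a small parser. The sketch below is an illustration under stated assumptions rather than the program's own parser: the function name parse_ranges is hypothetical, reference names are assumed to contain no colon, and positions are returned as given without applying any coordinate convention.

```python
def parse_ranges(spec):
    """Split a ranges value into (reference, start, end) tuples.

    start and end are None where the range leaves them open.
    """
    result = []
    for item in spec.split():
        ref, sep, interval = item.partition(":")
        if not sep:
            # Whole reference sequence, e.g. chr1
            result.append((ref, None, None))
        elif "-" in interval:
            # Closed interval, e.g. chr1:50000-60000
            start, end = interval.split("-", 1)
            result.append((ref, int(start), int(end)))
        else:
            # Open to the right, e.g. chr1:50000
            result.append((ref, int(interval), None))
    return result
```

For example, parse_ranges("chr1:10000-20000 chr1:30000-40000") yields two tuples, one per space separated range.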
reference=<>: file name of the reference for CRAM input files. If this key
is unset, then the CRAM file header will be scanned for obtaining a reference
file name.
tmpfile=<filename>: prefix for temporary files. By default the
temporary files are created in the current directory
outputformat=<bam>: output file format. All versions of bamdownsamplerandom
come with support for the BAM output format. If the program in addition is
linked to the io_lib package, then the following options are valid:
- bam:
- BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam:
- SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram:
- CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit).
This format is not advisable for data sorted by query name.
O=<[stdout]>: output filename, standard output if unset.
outputthreads=<[1]>: output helper threads, only valid for
outputformat=bam.
md5=<0|1>: md5 checksum creation for output file. This option can
only be given if outputformat=bam. Then valid values are
- 0:
- do not compute checksum. This is the default.
- 1:
- compute checksum. If the md5filename key is set, then the
checksum is written to the given file. If md5filename is unset, then no
checksum will be computed.
md5filename=<filename>: file name for md5 checksum if md5=1.
index=<0|1>: compute BAM index for output file. This option can
only be given if outputformat=bam. Then valid values are
- 0:
- do not compute BAM index. This is the default.
- 1:
- compute BAM index. If the indexfilename key is set, then
the BAM index is written to the given file. If indexfilename is unset,
then no BAM index will be computed.
indexfilename=<filename>: file name for output BAM index if index=1.
hash=<0|1>: use a hash of the query name instead of a random number for
selection. This makes the output depend on how random the hashes produced for
the query names are, but it has the advantage of not requiring collation to
keep pairs together. In contrast to hash=0, the order of retained reads does
not change for hash=1.
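Why hashing removes the need for collation can be seen in a short sketch: both ends of a pair carry the same query name, so a decision derived only from the name is identical for both, wherever they appear in the file. The hash function used by the program is not stated here; MD5 via Python's hashlib and the function name keep_by_hash are assumptions made for illustration.

```python
import hashlib

def keep_by_hash(read_name, p):
    """Decide from the query name alone whether a read is kept.

    The first 8 bytes of the MD5 digest are read as an integer in
    [0, 2**64) and compared against the threshold p * 2**64.
    """
    digest = hashlib.md5(read_name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return value < p * 2**64

def downsample_stream(records, p):
    """Filter (read_name, payload) tuples in input order, no collation."""
    return [rec for rec in records if keep_by_hash(rec[0], p)]
```

Because each record is judged independently, the retained reads come out in their input order, and the fraction kept is close to p only if the hash values are spread evenly over the query names.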
Written by German Tischler.
Report bugs to <[email protected]>
Copyright © 2009-2014 German Tischler, © 2011-2014 Genome Research
Limited. License GPLv3+: GNU GPL version 3
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.