bamadapterfind - find adapter contamination in sequencing reads
bamadapterfind [options]
bamadapterfind scans a BAM file for contamination by sequencing adapters. It
uses two separate methods for this detection:
- list:
- each read is matched against a predefined list of adapter
sequences. A sequence is considered a match if three conditions hold: the
overlap is at least adpmatchminscore bases long, the overlap covers at
least a fraction adpmatchminfrac of the adapter's length, and the
indel-free local alignment between the adapter and the read covers at
least a fraction adpmatchminpfrac of the length of the possible overlap
between the two. If such a match is found, the auxiliary field
as is set to the length of the match, af is set to
the fraction of the adapter sequence matched and aa is set to
the name of the matched adapter sequence.
- overlap:
- the two mates need to have a match similar to the following
two lines
        s0s1s2s3s4s5s6s7s8s9s10s11s12s13s14s15s16t0t1t2t3
x3x2x1x0s0s1s2s3s4s5s6s7s8s9s10s11s12s13s14s15s16
where an infix s0s1s2... of the first read matches a suffix of the reverse
complement of the second read. In this case it is likely that the first
read has been sequenced beyond the end of the payload sequence and into
the attached adapter. This overlap needs to be at least MIN_OVERLAP
bases long to be considered. If such an overlap is found, then the
adjacent sequences are checked for a match, where in the example x3x2x1x0
needs to be the reverse complement of t0t1t2t3. The adjacent sequences are
checked up to a limit of ADAPTER_MATCH base pairs. If such a match
is found then the auxiliary field ah is set to 1 and a3 is
used to store the length of the suspected adapter sequence.
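The two detection methods above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the program's actual implementation: the function names are hypothetical, the list check takes the relevant lengths as precomputed inputs, and the overlap check skips the seed step, assumes the overlap starts at the first base of the first read, tries overlap lengths from MIN_OVERLAP upward, and requires an exact match for the adjacent bases.

```python
# Illustrative sketch only; all function names here are hypothetical.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def accept_list_match(overlap_len, adapter_len, aligned_len, possible_overlap_len,
                      adpmatchminscore=16, adpmatchminfrac=0.75,
                      adpmatchminpfrac=0.8):
    """Apply the three list-matching thresholds to precomputed lengths.

    Assumption: how the lengths are measured is left to the caller.
    """
    return (overlap_len >= adpmatchminscore
            and overlap_len >= adpmatchminfrac * adapter_len
            and aligned_len >= adpmatchminpfrac * possible_overlap_len)


def find_overlap_adapter(read1, read2, min_overlap=32, adapter_match=12,
                         pct_mismatch=10):
    """Return {'ah': 1, 'a3': n} if an overlap plus adapter is found, else None."""
    rc2 = revcomp(read2)
    for length in range(min_overlap, min(len(read1), len(rc2)) + 1):
        # Assumption: the overlap is a prefix of read1 and a suffix of rc2.
        s1 = read1[:length]
        s2 = rc2[len(rc2) - length:]
        mismatches = sum(a != b for a, b in zip(s1, s2))
        if mismatches * 100 > pct_mismatch * length:
            continue
        # Adjacent bases: t... follows the overlap in read1,
        # x... precedes it in rc2; check at most adapter_match of them.
        k = min(adapter_match, len(read1) - length, len(rc2) - length)
        if k == 0:
            continue  # nothing beyond the overlap, so no adapter evidence
        t = read1[length:length + k]
        x = rc2[len(rc2) - length - k:len(rc2) - length]
        if x == revcomp(t):
            # Assumption: a3 is everything in read1 past the overlap.
            return {"ah": 1, "a3": len(read1) - length}
    return None
```

With a 32-base payload followed by a 12-base adapter on both mates, `find_overlap_adapter` reports `ah` as 1 and `a3` as 12. A pair that overlaps completely without any trailing bases yields `None`.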
The following key=value pairs can be given at the program start:
level=<-1|0|1|9|11>: set compression level of the output BAM file.
Valid values are
- -1:
- zlib/gzip default compression level
- 0:
- uncompressed
- 1:
- zlib/gzip level 1 (fast) compression
- 9:
- zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is
- 11:
- igzip compression
verbose=<1>: Valid values are
- 1:
- print progress report on standard error
- 0:
- do not print progress report
mod=<1048576>: if verbose=1, this sets the frequency of
progress reports, i.e. a report is printed for every mod-th input read/alignment.
adaptersbam=<>: file name of the BAM file containing the list of
adapters used for the adapter matching described above under list. The program
contains an internal list which is used if this key is not given.
SEED_LENGTH=<12>: length of the seed used for detecting overlaps in
overlap based matching (see overlap above, default value is 12 base pairs).
PCT_MISMATCH=<10>: percentage of mismatches allowed for overlap
matching. This only includes the overlap, not the suspected attached adapter
sequence. The default value is 10.
MAX_SEED_MISMATCHES=<SEED_LENGTH*PCT_MISMATCH>: maximum number of
mismatches allowed in the seed. By default this value is computed as
SEED_LENGTH*PCT_MISMATCH.
MIN_OVERLAP=<32>: minimum length of overlap for overlap matching in
base pairs (see above). The default value is 32.
ADAPTER_MATCH=<12>: maximum number of base pairs to check for
matching adapters in overlap based matching. The default value is 12.
adpmatchminscore=<16>: minimum score for list based adapter matching
(see above, default value is 16).
adpmatchminfrac=<0.75>: minimum fraction of adapter sequence which
needs to match (see above, default value is 0.75=75%).
adpmatchminpfrac=<0.8>: minimum fraction of overlap for adapter list
matching (see above, default value is 0.8=80%).
clip=<0>: clip the adapters off and move the corresponding sequence
part to the qs auxiliary field and the corresponding quality string part to
the qq auxiliary field.
reflen=<3000000000>: length of reference sequence/genome.
pA=<0.25>: relative frequency of base A in reference sequence/genome.
pC=<0.25>: relative frequency of base C in reference sequence/genome.
pG=<0.25>: relative frequency of base G in reference sequence/genome.
pT=<0.25>: relative frequency of base T in reference sequence/genome.
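As a small illustration of the key=value convention, the following Python sketch assembles an argument list for the program. The helper build_command is hypothetical and not part of the tool. It only checks the level key against the values listed above, and it leaves out how the input and output BAM files are supplied, since that is not described in this section.

```python
# Hypothetical helper; it only formats key=value arguments.

# Level 11 is only accepted if libmaus was built with igzip support.
VALID_LEVELS = {-1, 0, 1, 9, 11}


def build_command(program="bamadapterfind", **options):
    """Return an argument list such as ['bamadapterfind', 'level=1', 'clip=1']."""
    level = options.get("level")
    if level is not None and level not in VALID_LEVELS:
        raise ValueError(f"unsupported compression level: {level}")
    # Keyword arguments keep their order, so keys appear as given.
    return [program] + [f"{key}={value}" for key, value in options.items()]
```

The resulting list can be passed to a process launcher of your choice once the input and output handling for your installation is known.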
Written by German Tischler.
Report bugs to <[email protected]>
Copyright © 2009-2013 German Tischler, © 2011-2013 Genome Research
Limited. License GPLv3+: GNU GPL version 3
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.