NAME
axe - axe Documentation
AXE TUTORIAL
In this tutorial, we'll use Axe to demultiplex some paired-end, combinatorially-indexed Genotyping-by-Sequencing reads. The data for this tutorial is available from figshare: https://figshare.com/articles/axe-tutorial_tar/6143720
Step 0: Download the trial data
This will download the trial data and extract it on the fly:
    curl -LS https://ndownloader.figshare.com/files/11094782 | tar xv
Step 1: prepare a key file
The key file associates index sequences with sample names. A key file can be prepared in a spreadsheet editor, like LibreOffice Calc or Excel. The format is quite strict, and is described in detail in the online usage documentation.
    head axe-keyfile.tsv
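Because the format is strict, it can help to check a key file before running Axe. The following is a minimal Python sketch based only on the rules given under "The index file" (tab-separated, optional header beginning with Barcode or index, lines starting with ';' or '#' ignored, ASCII without a byte-order mark). It is not Axe's own parser. The function name read_key_file is hypothetical, and three behaviours are assumptions: the header is taken to be the first non-comment line, the header check is case-insensitive, and blank lines are skipped.

```python
COMMENT_PREFIXES = (";", "#")
# Assumption: the header prefix is compared case-insensitively.
HEADER_PREFIXES = ("barcode", "index")


def read_key_file(text):
    """Return the data rows of a key file as lists of tab-separated fields.

    `text` is the decoded content of the key file. The column layout is
    not interpreted here, since it depends on the indexing mode.
    """
    if text.startswith("\ufeff"):
        raise ValueError("key file starts with a byte-order mark")
    if not text.isascii():
        raise ValueError("key file contains non-ASCII characters")

    rows = []
    first_data_line = True
    for line in text.splitlines():
        # Assumption: blank lines are skipped along with comment lines.
        if not line.strip() or line.startswith(COMMENT_PREFIXES):
            continue
        if first_data_line:
            first_data_line = False
            if line.lower().startswith(HEADER_PREFIXES):
                continue  # optional header line
        rows.append(line.split("\t"))
    return rows
```

To check the tutorial's key file, one could call read_key_file(open("axe-keyfile.tsv", encoding="utf-8").read()) and inspect the returned rows.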
Step 2: Demultiplex with Axe
In this step, we will demultiplex our interleaved input file to per-sample interleaved output files. To see the full range of Axe's options, please run axe-demux -h, or inspect the online usage documentation.
    zcat axe-tutorial.fastq.gz | head -n 8
    mkdir -p output
    axe-demux -i axe-tutorial.fastq.gz -I output/ \
        -c -b axe-keyfile.tsv -t demux-stats.tsv -z 1
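If the demultiplexing step needs to be scripted, the same invocation can be assembled programmatically. This is a minimal Python sketch that uses only the flags shown in the tutorial command (-i, -I, -c, -b, -t, -z). The helper names build_demux_command and run_demux are hypothetical, and the sketch assumes axe-demux is on the PATH.

```python
import os
import subprocess


def build_demux_command(interleaved_in, out_prefix, keyfile, stats_table,
                        ziplevel=1):
    """Assemble the axe-demux argument list used in Step 2."""
    return [
        "axe-demux",
        "-i", interleaved_in,   # interleaved paired input
        "-I", out_prefix,       # interleaved paired output prefix
        "-c",                   # combinatorial barcode matching
        "-b", keyfile,          # key (barcode) file
        "-t", stats_table,      # demultiplexing statistics table
        "-z", str(ziplevel),    # gzip compression level
    ]


def run_demux(cmd, out_dir):
    """Create the output directory (like `mkdir -p`) and run axe-demux."""
    os.makedirs(out_dir, exist_ok=True)
    # check=True raises CalledProcessError if axe-demux exits non-zero.
    subprocess.run(cmd, check=True)
```

For the tutorial data, build_demux_command("axe-tutorial.fastq.gz", "output/", "axe-keyfile.tsv", "demux-stats.tsv") reproduces the command above, and run_demux(cmd, "output") executes it.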
AXE USAGE
NOTE: For arcane reasons, the name of the axe
binary changed to axe-demux with version 0.3.0. Apologies for the
inconvenience; this was required to make axe installable in Debian and
its derivatives. Command-line usage did not change.
USAGE:
    axe-demux [-mzc2pt] -b (-f [-r] | -i) (-F [-R] | -I)
    axe-demux -h
    axe-demux -v

OPTIONS:
    -m, --mismatch       Maximum hamming distance mismatch. [int, default 1]
    -z, --ziplevel       Gzip compression level, or 0 for plain text [int, default 0]
    -c, --combinatorial  Use combinatorial barcode matching. [flag, default OFF]
    -p, --permissive     Don't error on barcode mismatch conflict, matching only
                         exactly for conflicting barcodes. [flag, default OFF]
    -2, --trim-r2        Trim barcode from R2 read as well as R1. [flag, default OFF]
    -b, --barcodes       Barcode file. See --help for example. [file]
    -f, --fwd-in         Input forward read. [file]
    -F, --fwd-out        Output forward read prefix. [file]
    -r, --rev-in         Input reverse read. [file]
    -R, --rev-out        Output reverse read prefix. [file]
    -i, --ilfq-in        Input interleaved paired reads. [file]
    -I, --ilfq-out       Output interleaved paired reads prefix. [file]
    -t, --table-file     Output a summary table of demultiplexing statistics to file. [file]
    -h, --help           Print this usage plus additional help.
    -V, --version        Print version string.
    -v, --verbose        Be more verbose. Additive, -vv is more verbose than -v.
    -q, --quiet          Be very quiet.
Inputs and Outputs
Regardless of read mode, three input and output schemes are supported: single-end reads, paired reads (separate R1 and R2 files) and interleaved paired reads (one file, with R1 and R2 as consecutive reads). If single-end reads are input, they must be output as single-end reads. If either paired or interleaved paired reads are read, they can be output as either paired reads or interleaved paired reads. This applies to both successfully demultiplexed reads and reads that could not be demultiplexed.
The corresponding CLI flags are:
• -f and -F: Single-end or paired R1 file input and output respectively.
• -r and -R: Paired R2 file input and output.
• -i and -I: Interleaved paired input and output.
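The input and output rules above can be summarised as a small compatibility check. The sketch below is written in Python and is derived only from the paragraph above and the usage synopsis; it is not Axe's own argument validation. The names scheme, compatible and ALLOWED_OUTPUTS are hypothetical.

```python
# Which output schemes each input scheme may be written as.
ALLOWED_OUTPUTS = {
    "single": {"single"},
    "paired": {"paired", "interleaved"},
    "interleaved": {"paired", "interleaved"},
}


def scheme(flags):
    """Classify a collection of flag letters, e.g. "fr" or {"I"}.

    Input flags (f, r, i) and output flags (F, R, I) follow the same
    pattern, so the letters are compared case-insensitively.
    """
    letters = {flag.lower() for flag in flags}
    if letters == {"i"}:
        return "interleaved"
    if letters == {"f", "r"}:
        return "paired"
    if letters == {"f"}:
        return "single"
    raise ValueError("unsupported flag combination: %r" % sorted(letters))


def compatible(input_flags, output_flags):
    """Return True if reads given by input_flags may be written per output_flags."""
    return scheme(output_flags) in ALLOWED_OUTPUTS[scheme(input_flags)]
```

For example, compatible("fr", "I") is True (paired input written as interleaved output), while compatible("f", "I") is False because single-end input must be written as single-end output.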
The index file
The index file is a tab-separated file with an optional header. It is mandatory, and is always supplied using the -b command line flag. The exact format depends on the indexing mode, and is described further in the sections below. If a header is present, the header line must start with either Barcode or index, or it will be interpreted as an index line, leading to a parsing error. Any line starting with ';' or '#' is ignored, allowing comments to be added in line with indexes. Please ensure that the software used to produce the index file uses ASCII encoding and does not insert a byte-order mark (BOM), as many text editors silently use Unicode-based encoding schemes. I recommend the use of LibreOffice Calc (part of a free and open source office suite) to generate index tables; Microsoft Excel can also be used.
Mismatch level selection
Independent of index mode, the -m flag is used to select the maximum allowable hamming distance between a read's prefix and an index for the two to be considered a match. As "mutated" indexes must be unique, a hamming distance of one is the default, as indexes are typically designed to differ by a hamming distance of at least two. Optionally (using the -p flag), axe will allow selective mismatch levels, where, if clashes are observed, the affected indexes will only be matched exactly. This allows one to process datasets with indexes that don't have a sufficiently high distance between them.
Single index mode
Single index mode is the default mode of operation. Barcodes are matched against read one (hereafter the forward read), and the index is trimmed from only the forward read, unless the -2 command line flag is given, in which case a prefix the same length as the matched index is also trimmed from the second or reverse read. Note that the sequence of this second read is not checked before trimming.
Combinatorial index mode
Combinatorial index mode is activated by giving the -c flag on the command line. Forward read indexes are matched against the forward read, and reverse read indexes are matched against the reverse read. The optimal indexes are selected independently, and the index pair is selected from these two indexes. The respective indexes are trimmed from both reads; the -2 command line flag has no effect in combinatorial index mode.
The Demultiplexing Statistics File
The -t option allows the output of per-sample read counts to a tab-separated file. The file will have a header describing its format, and includes a line for reads which could not be demultiplexed.
AXE'S MATCHING ALGORITHM
Axe uses an algorithm based on longest-prefix-in-trie matching to match a variable length from the start of each read against a set of 'mutated' indexes.
Hamming distance matching
While hamming distance is a frowned-upon metric for most applications in high-throughput sequencing (HTS), it is typical for HTS read indexes to be designed to tolerate a certain level of hamming mismatches. Given these sequences are short and typically occur at the 5' end of reads, insertions and deletions rarely need be considered, and the increased rate of assignment of reads with many errors is offset by the risk of falsely assigning indexes to an incorrect sample. In any case, reads with more than 1-2 sequencing errors in their first several bases are likely to be of poor quality, and will simply be filtered out during downstream quality control.
Hamming mismatch tries
Typically, reads are matched to a set of indexes by calculating the hamming distance between each index and the first l bases of the read, for an index of length l. The "correct" index is then selected either by recording the index with the lowest hamming distance to the read (competitive matching) or by simply accepting the first index with a hamming distance below a certain threshold. These approaches are both very computationally expensive, and can have lower accuracy than the algorithm I propose. Additionally, implementations of these methods rarely handle indexes of differing length and combinatorial indexing well, if at all.
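To make the contrast concrete, the Python sketch below shows the naive competitive matching described above next to a lookup over precomputed 'mutated' indexes. It is an illustration under stated assumptions, not Axe's implementation: a plain dictionary stands in for the trie, mutations are restricted to the bases A, C, G and T, and all function names are hypothetical. The max_dist parameter plays the role of the -m flag. A clash between mutated indexes raises an error here, and the exact-match fallback of -p is not modelled.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: only these bases are used for mutations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def competitive_match(read, indexes, max_dist=1):
    """Naive approach: compare the read prefix against every index."""
    best, best_dist, tie = None, max_dist + 1, False
    for index in indexes:
        if len(read) < len(index):
            continue
        dist = hamming(read[:len(index)], index)
        if dist < best_dist:
            best, best_dist, tie = index, dist, False
        elif dist == best_dist and dist <= max_dist:
            tie = True  # two indexes are equally close: ambiguous
    return None if tie else best


def mutations(index, max_dist):
    """All sequences within max_dist substitutions of index (itself included)."""
    seen = {index}
    for k in range(1, max_dist + 1):
        for positions in combinations(range(len(index)), k):
            for bases in product(ALPHABET, repeat=k):
                seq = list(index)
                for pos, base in zip(positions, bases):
                    seq[pos] = base
                seen.add("".join(seq))
    return seen


def build_lookup(indexes, max_dist=1):
    """Map every mutated index to its original; clashes are an error."""
    table = {}
    for index in indexes:
        for mutant in mutations(index, max_dist):
            if table.get(mutant, index) != index:
                raise ValueError("mismatch conflict at %s" % mutant)
            table[mutant] = index
    return table


def lookup_match(read, table):
    """Longest-prefix match: try the longest index length first."""
    for length in sorted({len(key) for key in table}, reverse=True):
        hit = table.get(read[:length])
        if hit is not None:
            return hit
    return None
```

With the made-up indexes ACGT and TTGCA, both competitive_match("ACGAGGGG", ["ACGT", "TTGCA"]) and lookup_match("ACGAGGGG", build_lookup(["ACGT", "TTGCA"])) return "ACGT". The lookup does its hamming-distance work once, when the table is built, rather than once per read, and it handles indexes of differing length by trying each prefix length in turn.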
AUTHOR
Kevin Murray
COPYRIGHT
2022, Kevin Murray
September 18, 2022 | 0.3.3