CollapseSeq.py

Removes duplicate sequences from FASTA/FASTQ files

Updated on October 29, 2024 · 2 minute read

NAME

CollapseSeq.py - removes duplicate sequences from FASTA/FASTQ files

DESCRIPTION

usage: CollapseSeq.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR] [--outname OUT_NAME] [--log LOG_FILE] [--failed] [--fasta] [--delim DELIMITER DELIMITER DELIMITER] [-n MAX_MISSING] [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]] [--cf COPY_FIELDS [COPY_FIELDS ...]] [--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner] [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]

Removes duplicate sequences from FASTA/FASTQ files

help:

--version: show program's version number and exit

-h, --help: show this help message and exit

standard arguments:

-s SEQ_FILES [SEQ_FILES ...]: A list of FASTA/FASTQ files containing sequences to process. (default: None)

-o OUT_FILES [OUT_FILES ...]: Explicit output file name(s). Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s). (default: None)

--outdir OUT_DIR: Changes the output directory to the location specified. The input file directory is used if this is not specified. (default: None)

--outname OUT_NAME: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files. (default: None)

--log LOG_FILE: Specify to write verbose logging to a file. May not be specified with multiple input files. (default: None)

--failed: If specified, create files containing records that fail processing. (default: False)

--fasta: Specify to force output as FASTA rather than FASTQ. (default: None)

--delim DELIMITER DELIMITER DELIMITER: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively. (default: ('|', '=', ','))
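
To illustrate how the three --delim delimiters partition a sequence header, here is a minimal sketch. It assumes the header is a sequence ID followed by FIELD=VALUE blocks; the function name parse_header and the example header are hypothetical and are not part of CollapseSeq.py.

```python
def parse_header(header, delimiter=("|", "=", ",")):
    """Split an annotated header into an ID and a dict of field -> list of values.

    Assumption: the first block is the sequence ID and every later block
    has the form FIELD=VALUE[,VALUE...].
    """
    block_sep, field_sep, value_sep = delimiter
    blocks = header.split(block_sep)
    seq_id = blocks[0]
    fields = {}
    for block in blocks[1:]:
        # Split only on the first field separator so values may contain it
        name, _, value = block.partition(field_sep)
        fields[name] = value.split(value_sep)
    return seq_id, fields


# Hypothetical header using the default delimiters ('|', '=', ',')
seq_id, fields = parse_header("SEQ1|PRIMER=IGHC,IGKC|DUPCOUNT=3")
print(seq_id)   # SEQ1
print(fields)   # {'PRIMER': ['IGHC', 'IGKC'], 'DUPCOUNT': ['3']}
```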

collapse arguments:

-n MAX_MISSING: Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides. (default: 0)

--uf UNIQ_FIELDS [UNIQ_FIELDS ...]: Specifies a set of annotation fields that must match for sequences to be considered duplicates. (default: None)

--cf COPY_FIELDS [COPY_FIELDS ...]: Specifies a set of annotation fields to copy into the unique sequence output. (default: None)

--act {min,max,sum,set} [{min,max,sum,set} ...]: List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions "min", "max", "sum" perform the corresponding mathematical operation on numeric annotations. The action "set" collapses annotations into a comma delimited list of unique values. (default: None)

--inner: If specified, exclude consecutive missing characters at either end of the sequence. (default: False)

--keepmiss: If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file. (default: False)

--maxf MAX_FIELD: Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf. (default: None)

--minf MIN_FIELD: Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf. (default: None)
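
The interplay of --cf, --act, and --maxf/--minf can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: the helper names (combine_copy_fields, pick_representative), the field names, and the values are hypothetical; numeric annotations are parsed as floats; and "set" keeps values in first-seen order, which the option text does not specify.

```python
# Map each --act choice to a function over a list of string annotation values.
ACTIONS = {
    "min": lambda values: min(float(v) for v in values),
    "max": lambda values: max(float(v) for v in values),
    "sum": lambda values: sum(float(v) for v in values),
    # Assumption: unique values are kept in first-seen order.
    "set": lambda values: ",".join(dict.fromkeys(values)),
}


def combine_copy_fields(records, copy_fields, actions):
    """Apply one action per copy field across a set of duplicate records."""
    combined = {}
    for field, action in zip(copy_fields, actions):
        values = [rec[field] for rec in records if field in rec]
        combined[field] = ACTIONS[action](values)
    return combined


def pick_representative(records, field, use_max=True):
    """Choose the retained record by the largest (or smallest) field value."""
    chooser = max if use_max else min
    return chooser(records, key=lambda rec: float(rec[field]))


# Hypothetical annotations from three reads with identical sequences
duplicates = [
    {"ID": "r1", "CONSCOUNT": "4", "PRIMER": "IGHM"},
    {"ID": "r2", "CONSCOUNT": "7", "PRIMER": "IGHG"},
    {"ID": "r3", "CONSCOUNT": "2", "PRIMER": "IGHM"},
]

combined = combine_copy_fields(duplicates, ["CONSCOUNT", "PRIMER"], ["sum", "set"])
retained = pick_representative(duplicates, "CONSCOUNT", use_max=True)
print(combined)        # {'CONSCOUNT': 13.0, 'PRIMER': 'IGHM,IGHG'}
print(retained["ID"])  # r2
```

In this sketch each copy field is paired with one action by position, mirroring the way the --cf and --act lists are described as corresponding to each other.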

output files:

collapse-unique: unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user defined criteria.

collapse-duplicate: raw reads which are duplicates of the sequences retained in the collapse-unique file.

collapse-undetermined: raw reads which were excluded from consideration due to having too many N characters in the sequence.

output annotation fields:

DUPCOUNT: total number of sequences within the set of duplicates for each retained unique sequence, that is, the copy number of each unique sequence within the data file.

<user defined>: annotation fields specified by the --cf parameter.
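
As a rough illustration of how reads map onto the three output files and how DUPCOUNT arises, the sketch below handles only the simplest case: exact matching with the default -n 0, where any read containing an N is treated as undetermined. The function name collapse_exact and the read data are hypothetical, and keeping the first-seen read as the representative is an assumption; the program itself selects the representative by user defined criteria and can match sequences that contain missing characters.

```python
def collapse_exact(reads):
    """Partition (read_id, sequence) pairs into unique, duplicate, undetermined.

    Simplified to exact matching with a missing-character threshold of 0.
    """
    unique = {}         # sequence -> [representative_id, dupcount]
    duplicate = []      # IDs of reads matching an already retained sequence
    undetermined = []   # IDs of reads with too many N characters (here: any)
    for read_id, seq in reads:
        seq = seq.upper()
        if "N" in seq:
            undetermined.append(read_id)
        elif seq in unique:
            unique[seq][1] += 1
            duplicate.append(read_id)
        else:
            # Assumption: the first read seen becomes the representative
            unique[seq] = [read_id, 1]
    headers = [f"{rid}|DUPCOUNT={count}" for rid, count in unique.values()]
    return headers, duplicate, undetermined


# Hypothetical reads
reads = [
    ("r1", "ACGT"),
    ("r2", "ACGT"),
    ("r3", "ACNT"),
    ("r4", "TTGA"),
    ("r5", "acgt"),
]
headers, duplicate, undetermined = collapse_exact(reads)
print(headers)       # ['r1|DUPCOUNT=3', 'r4|DUPCOUNT=1']
print(duplicate)     # ['r2', 'r5']
print(undetermined)  # ['r3']
```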

AUTHOR

This manpage was written by Andreas Tille for the Debian distribution and
can be used for any other usage of the program.

May 2020

CollapseSeq.py 0.6.0
