NAME

leaff - sequence library utilities and applications

SYNOPSIS

leaff [-f fasta-file] [options]

DESCRIPTION

LEAFF (Let's Extract Anything From Fasta) is a utility program for working with multi-fasta files. In addition to providing random access to the base level, it includes several analysis functions.

OPTIONS

SOURCE FILES
-f file: use sequence in 'file' (-F is also allowed for historical reasons)
-A file: read actions from 'file'
 
SOURCE FILE EXAMINATION
-d: print the number of sequences in the fasta
-i name: print an index, labelling the source 'name'
 
OUTPUT OPTIONS
-6 <#>: insert a newline every 60 letters
(if the next arg is a number, newlines are inserted every
n letters, e.g., -6 80. Disable line breaks with -6 0,
or just don't use -6!)
-e beg end: Print only the bases from position 'beg' to position 'end'
(space based, relative to the FORWARD sequence!) If
beg == end, then the entire sequence is printed. It is an
error to specify beg > end, or beg > len, or end > len.
-ends n: Print n bases from each end of the sequence. One input
sequence generates two output sequences, with '_5' or '_3'
appended to the ID. If 2n >= length of the sequence, the
sequence itself is printed, no ends are extracted (they
overlap).
-C: complement the sequences
-H: DON'T print the defline
-h: Use the next word as the defline ("-H -H" will reset to the
original defline).
-R: reverse the sequences
-u: uppercase all bases
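
The coordinate rules for -e and -ends can be sketched in Python. This is an illustration of the documented behavior, not leaff's own code: the function names are hypothetical, "space based" is read as positions between bases (so -e 0 10 yields the first 10 bases, as in the EXAMPLES section), and printing the '_3' end in forward orientation is an assumption.

```python
def extract_range(seq, beg, end):
    """Mimic '-e beg end' on the forward sequence (space-based positions)."""
    if beg > end or beg > len(seq) or end > len(seq):
        raise ValueError("beg > end, beg > len, or end > len")
    if beg == end:
        return seq              # documented special case: entire sequence
    return seq[beg:end]         # positions count the gaps between bases


def extract_ends(seq_id, seq, n):
    """Mimic '-ends n': return a list of (id, bases) pairs."""
    if 2 * n >= len(seq):
        return [(seq_id, seq)]  # ends would overlap, keep the sequence whole
    return [
        (seq_id + "_5", seq[:n]),
        (seq_id + "_3", seq[len(seq) - n:]),  # forward orientation (assumed)
    ]
```

With a 12-base sequence, extract_range(seq, 0, 10) returns its first 10 bases, and extract_ends("id", seq, 2) returns two records named id_5 and id_3.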
 
SEQUENCE SELECTION
-G n s l: print n randomly generated sequences, 0 < s <= length <= l
-L s l: print all sequences such that s <= length < l
-N l h: print all sequences such that l <= % N composition < h
(NOTE 0.0 <= l < h < 100.0)
(NOTE that you cannot print sequences with 100% N.
This is a useful bug.)
-q file: print sequences from the seqid list in 'file'
-r num: print 'num' randomly picked sequences
-s seqid: print the single sequence 'seqid'
-S f l: print all the sequences from ID 'f' to 'l' (inclusive)
-W: print all sequences (do the whole file)
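
The half-open ranges used by -L and -N can be illustrated with a short Python sketch. The function names are hypothetical, and counting lowercase 'n' as N is an assumption. The sketch also shows why a sequence that is 100% N can never be selected: the upper bound h is exclusive and must be below 100.0.

```python
def percent_n(seq):
    """Percentage of bases that are N (case-insensitive: an assumption)."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def select_by_length(seqs, s, l):
    """Mimic '-L s l': keep sequences with s <= length < l."""
    return [x for x in seqs if s <= len(x) < l]


def select_by_n(seqs, lo, hi):
    """Mimic '-N l h': keep sequences with lo <= %N < hi."""
    if not (0.0 <= lo < hi < 100.0):
        raise ValueError("need 0.0 <= l < h < 100.0")
    return [x for x in seqs if lo <= percent_n(x) < hi]
```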
 
LONGER HELP
-help analysis
-help examples
 
ANALYSIS FUNCTIONS
--findduplicates a.fasta
Reports sequences that are present more than once. Output
is a list of pairs of deflines, separated by a newline.
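
As a rough sketch of the idea, duplicate detection can be done by remembering the first defline seen for each distinct sequence. This is not leaff's implementation: the function name is hypothetical, and pairing every later copy with the first occurrence is an assumption about how more than two copies are reported.

```python
def find_duplicates(records):
    """records: iterable of (defline, sequence) pairs.

    Returns (first_defline, later_defline) pairs for repeated sequences.
    """
    first_seen = {}
    pairs = []
    for defline, seq in records:
        if seq in first_seen:
            pairs.append((first_seen[seq], defline))
        else:
            first_seen[seq] = defline
    return pairs
```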
 

--mapduplicates a.fasta b.fasta
Builds a map of IIDs from a.fasta and b.fasta that have
identical sequences. Format is "IIDa <-> IIDb"
 

--md5 a.fasta
Don't print the sequence, but print the md5 checksum
(of the entire sequence) followed by the entire defline.
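
A comparable checksum line can be produced with Python's hashlib, which is handy for cross-checking. This is a sketch under assumptions: the function name is hypothetical, the bases are hashed exactly as given (no case folding or newline handling), and the single space between checksum and defline is a guess at the layout.

```python
import hashlib


def md5_line(defline, seq):
    """Return 'md5-of-sequence defline' for one record."""
    digest = hashlib.md5(seq.encode("ascii")).hexdigest()
    return f"{digest} {defline}"
```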
 

--partition prefix [ n[gmk]bp | n ] a.fasta
--partitionmap [ n[gmk]bp | n ] a.fasta
Partition the sequences into roughly equal size pieces of
size nbp, nkbp, nmbp or ngbp; or into n roughly equal sized
partitions. Sequences larger than the partition size are
in a partition by themselves. --partitionmap writes a
description of the partition to stdout; --partition creates
a fasta file 'prefix-###.fasta' for each partition.
Example: -F some.fasta --partition parts 130mbp
-F some.fasta --partition parts 16
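
The size argument and the "oversized sequences get their own partition" rule can be sketched as follows. This is only an illustration: the function names are hypothetical, decimal units (k = 1000) are an assumption, and the longest-first, first-fit packing is a stand-in for whatever strategy leaff actually uses.

```python
import re

UNITS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9}  # decimal units (assumed)


def parse_spec(spec):
    """Return ('size', bases) for n[gmk]bp, or ('count', n) for a bare n."""
    match = re.fullmatch(r"(\d+)([gmk]?)bp", spec)
    if match:
        return ("size", int(match.group(1)) * UNITS[match.group(2)])
    return ("count", int(spec))


def partition_by_size(lengths, limit):
    """Pack sequence indices into partitions of at most 'limit' bases."""
    bins = []
    # Visit sequences longest first, placing each in the first bin with room.
    for i in sorted(range(len(lengths)), key=lambda k: -lengths[k]):
        target = None
        if lengths[i] <= limit:
            target = next(
                (b for b in bins if b["total"] + lengths[i] <= limit), None
            )
        if target is None:
            # New partition; an oversized sequence ends up here alone.
            target = {"total": 0, "members": []}
            bins.append(target)
        target["total"] += lengths[i]
        target["members"].append(i)
    return [b["members"] for b in bins]
```

For lengths 50, 120, 30 and 40 with a limit of 100, the 120-base sequence is placed alone, and the remaining sequences share two partitions.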
 

--segment prefix n a.fasta
Splits the sequences into n files, prefix-###.fasta.
Sequences are not reordered; the first n sequences are in
the first file, the next n in the second file, etc.
 

--gccontent a.fasta
Reports the GC content over a sliding window of
3, 5, 11, 51, 101, 201, 501, 1001, 2001 bp.
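
For readers who want to check a value by hand, a sliding-window GC fraction can be computed as in the sketch below. The function name is hypothetical, and reporting only full windows that slide one base at a time is an assumption; leaff may center windows or treat sequence ends differently.

```python
def gc_windows(seq, window):
    """GC fraction of every full window of 'window' bases, sliding by one."""
    if window <= 0 or window > len(seq):
        return []
    is_gc = [1 if c in "GC" else 0 for c in seq.upper()]
    count = sum(is_gc[:window])
    fractions = [count / window]
    for i in range(window, len(seq)):
        # Add the base entering the window, drop the base leaving it.
        count += is_gc[i] - is_gc[i - window]
        fractions.append(count / window)
    return fractions
```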
 

--testindex a.fasta
Test the index of 'a.fasta'. If the index is up to date, leaff
exits successfully; otherwise, leaff exits with code 1. If an
index file is supplied, that one is tested; otherwise, the
default index file name is used.
 

--dumpblocks a.fasta
Generates a list of the blocks of N and non-N. Output
format is 'base seq# beg end len'. 'N 84 483 485 2' means
that a block of 2 N's starts at space-based position 483
in sequence ordinal 84. A '.' is the end of sequence
marker.
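
The block coordinates can be reproduced with a small Python sketch. It is an illustration only: the function name is hypothetical, lowercase 'n' is treated as N by assumption, and the '.' end-of-sequence marker is not modeled. Each block is reported as (is_n, beg, end, len) with space-based positions, so the documented 'N 84 483 485 2' corresponds to (True, 483, 485, 2) in sequence 84.

```python
from itertools import groupby


def n_blocks(seq):
    """Return (is_n, beg, end, length) for maximal runs of N and non-N."""
    blocks = []
    pos = 0
    for is_n, run in groupby(seq.upper(), key=lambda c: c == "N"):
        length = sum(1 for _ in run)
        blocks.append((is_n, pos, pos + length, length))
        pos += length
    return blocks
```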
 

--errors L N C P a.fasta
For every sequence in the input file, generate new
sequences including simulated sequencing errors.
L -- length of the new sequence. If zero, the length
of the original sequence will be used.
N -- number of subsequences to generate. If L=0, all
subsequences will be the same, and you should use
C instead.
C -- number of copies to generate. Each of the N
subsequences will have C copies, each with different
errors.
P -- probability of an error.
 

HINT: to simulate ESTs from genes, use L=500, N=10, C=10
-- make C=10 sequencer runs of N=10 EST sequences
of length 500bp each.
to simulate mRNA from genes, use L=0, N=10, C=10
to simulate reads from genomes, use L=800, N=10, C=1
-- of course, N should be increased to give the
appropriate depth of coverage
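
How L, N, C and P interact can be sketched in Python. This is a simplified model, not leaff's error simulator: the function names are hypothetical, only substitution errors are generated, subsequence start positions are drawn uniformly, and a request longer than the source falls back to the whole sequence. All of these are assumptions.

```python
import random


def mutate(seq, p, rng):
    """Replace each base with a different base with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(seq, L, N, C, P, rng=None):
    """Return N * C sequences: N subsequences of length L, C noisy copies each."""
    if rng is None:
        rng = random.Random()
    results = []
    for _ in range(N):
        if L == 0 or L >= len(seq):
            sub = seq                      # L=0: use the original length
        else:
            start = rng.randrange(len(seq) - L + 1)
            sub = seq[start:start + L]
        for _ in range(C):
            results.append(mutate(sub, P, rng))
    return results
```

With L=0 every subsequence is the full input, which is why the text suggests raising C rather than N in that case.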
 

--stats a.fasta
Reports size statistics; number, N50, sum, largest.
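
For reference, the four statistics can be computed as in the Python sketch below. The function name is hypothetical. N50 is taken in its usual sense: the length of the sequence at which the running total, counted from the largest sequence down, first reaches half of the sum. That this matches leaff's definition exactly is an assumption.

```python
def size_stats(lengths):
    """Return number, N50, sum and largest for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:   # reached half of all bases
            n50 = length
            break
    return {
        "number": len(ordered),
        "n50": n50,
        "sum": total,
        "largest": ordered[0] if ordered else 0,
    }
```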
 

--seqstore out.seqStore
Converts the input file (-f) to a seqStore file (for instance,
for use with the Celera assembler or sim4db).

NOTES

Please note that options are ORDER DEPENDENT. Sequences are printed whenever a SEQUENCE SELECTION option occurs on the command line. OUTPUT OPTIONS are not reset when a sequence is printed.
SEQUENCES are numbered starting at ZERO, not one!
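
The order dependence can be pictured as a left-to-right pass over the command line that keeps output state and prints whenever a selection option is reached. The Python sketch below models only -R, -C and -s. It is not leaff's parser: the function name is hypothetical, -R and -C are treated as toggles (as the EXAMPLES section describes), and -s is assumed to take a zero-based sequence number.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def run(args, sequences):
    """Process args in order; return the sequences that would be printed."""
    reverse = complement = False
    printed = []
    i = 0
    while i < len(args):
        opt = args[i]
        if opt == "-R":
            reverse = not reverse          # output state persists
        elif opt == "-C":
            complement = not complement
        elif opt == "-s":
            i += 1
            seq = sequences[int(args[i])]  # zero-based numbering
            if reverse:
                seq = seq[::-1]
            if complement:
                seq = seq.translate(COMPLEMENT)
            printed.append(seq)            # printed at this point in the line
        i += 1
    return printed
```

Running it on "-R -C -s 0 -s 1 -R -C -s 2" reverse-complements the first two sequences and leaves the third unchanged, because the second -R -C pair switches the state back off before the last -s.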

EXAMPLES

1. Print the first 10 bases of the fourth sequence in file 'genes':
 
leaff -f genes -e 0 10 -s 3
 
2. Print the first 10 bases of the fourth and fifth sequences:
 
leaff -f genes -e 0 10 -s 3 -s 4
 
3. Print the fourth and fifth sequences reverse complemented, and the sixth
sequence forward. The second set of -R -C toggle off reverse-complement:
 
leaff -f genes -R -C -s 3 -s 4 -R -C -s 5
 
4. Convert file 'genes' to a seqStore 'genes.seqStore'.
 
leaff -f genes --seqstore genes.seqStore

SEE ALSO

README.leaff
 
http://kmer.sourceforge.net/wiki/index.php/LEAFF_User%27s_Guide
 
http://kmer.sourceforge.net/wiki/index.php/LEAFF_Programming_Example