psa - biological sequence alignment file format
psa is an output format used by the pftools package to describe
alignments between biological sequences (DNA or protein) and PROSITE
profiles.
psa is related to the widely used biological sequence file format
fasta. However, it does more than describe a biological sequence: it
records the alignment between a motif descriptor, such as a PROSITE
profile, and a given sequence. This information is included in the header
and reflected in the structure of the sequence data following the header
line.
Each sequence in a psa alignment file or output must be preceded by a
fasta header line.
The general syntax of such a fasta header line is as follows:
>seq_id [ free_text ]
The header must start with a '>' character, which is directly followed
by the seq_id field. Most programs interpret this field as the sequence's
identifier and/or accession number. It ends at the first whitespace
character.
The pftools programs use the free_text to add information about the
match score, position and description of the sequence or motif.
Please refer to the man pages of the corresponding programs for further
information about the output formats.
The header cannot extend over more than one line. All following lines, up to
the next line starting with a '>' character or the end of the file, are
interpreted as sequence data.
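The header and record rules above can be sketched as a small reader. This is a minimal Python sketch under stated assumptions, not code from the pftools package: the function names read_psa and split_header are hypothetical, and blank lines inside a record are assumed to be ignorable.

```python
import re
from typing import Iterable, Iterator, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a fasta header line into (seq_id, free_text).

    seq_id starts directly after '>' and ends at the first whitespace;
    everything after that whitespace is treated as free_text.
    """
    match = re.match(r">(\S*)\s*(.*)", header)
    if match is None:
        raise ValueError("a header line must start with '>'")
    return match.group(1), match.group(2).rstrip()


def read_psa(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (seq_id, free_text, alignment) for each record.

    The alignment data may span several lines of different lengths,
    so the lines of one record are concatenated.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield (*split_header(header), "".join(chunks))
            header, chunks = line, []
        elif header is not None and line.strip():
            # Assumption: blank lines carry no alignment data.
            chunks.append(line.strip())
    if header is not None:
        yield (*split_header(header), "".join(chunks))
```

Because read_psa accepts any iterable of lines, it can be given an open file object directly, for example `for seq_id, free_text, alignment in read_psa(open(path)):`, where path is whatever psa file is at hand.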
The line following the header starts the alignment data between a sequence and
a PROSITE profile. This data can span several lines of different lengths.
The data consists of upper- or lower-case characters of the corresponding
sequence alphabet (DNA or protein). The gap characters '.' and '-' are also
supported.
The alignment is always at least as long as the matching profile. Insertions
or deletions detected during the motif/sequence alignment step change the
length of the reported data, and can be identified using the following
conventions:
- upper-case character: Any upper-case character of the sequence alphabet
  identifies a match position between the sequence and the motif
  descriptor.
- lower-case character: A lower-case character of the sequence alphabet
  symbolizes an insertion in the sequence compared to the motif
  descriptor.
- '-' (dash) character: A '-' character in the output identifies a
  deletion in the sequence compared to the motif descriptor.
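These conventions translate directly into a per-character classification. The following Python sketch is illustrative only; the function name classify_columns is hypothetical. It treats '.' like '-', following the note that psa2msa(1) reads both the same way, and it assumes that matches plus deletions together count the profile positions covered by the alignment.

```python
def classify_columns(alignment: str) -> dict:
    """Count match, insertion and deletion positions in psa alignment data."""
    counts = {"match": 0, "insertion": 0, "deletion": 0}
    for ch in alignment:
        if ch in "-.":
            # Gap character: the motif position is deleted in the sequence.
            counts["deletion"] += 1
        elif ch.isalpha() and ch.isupper():
            # Upper case: the residue matches a motif position.
            counts["match"] += 1
        elif ch.isalpha() and ch.islower():
            # Lower case: the residue is inserted relative to the motif.
            counts["insertion"] += 1
        else:
            raise ValueError(f"unexpected character {ch!r} in alignment data")
    return counts


counts = classify_columns("ACgt-T")
# Assumption: profile positions covered = matches + deletions.
profile_positions = counts["match"] + counts["deletion"]
```

Insertions lengthen the reported data without consuming profile positions, which is why the alignment can be longer than the profile but never shorter.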
- (1)
>YD28_SCHPO 556 pos. 291 - 332 sp|Q10256|YD28_SCHPO
PTDPGlnsKIAQLVSMGFDPLEAAQALDAANGDLDVAASFLL--
This is an example of the output produced by pfsearch(1) using the
'-x' (i.e. psa output) option. The first line starting with the
'>' character is the fasta header. It also contains
information about the raw score of the alignment as well as its position
in the input sequence.
The next line holds the alignment proper. Starting at position 6, the
residues 'lns' are inserted in the sequence relative to the motif. The last
two positions of the motif are not present in the sequence (i.e. they are
deleted), as indicated by the two '-' (dash) characters at the end of the
alignment.
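The observations made about this example can be checked mechanically. The Python sketch below is a hedged illustration: the regular expression for the header is an assumption that fits this one example line, not a specification of the free_text layout, for which the pfsearch(1) man page remains the reference.

```python
import re

header = ">YD28_SCHPO 556 pos. 291 - 332 sp|Q10256|YD28_SCHPO"
alignment = "PTDPGlnsKIAQLVSMGFDPLEAAQALDAANGDLDVAASFLL--"

# Assumed layout for this example: >seq_id score pos. start - end ...
hit = re.match(r">(\S+)\s+(-?\d+)\s+pos\.\s+(\d+)\s*-\s*(\d+)", header)
seq_id = hit.group(1)
score = int(hit.group(2))
start, end = int(hit.group(3)), int(hit.group(4))

# Runs of lower-case characters are insertions; report 1-based positions.
insertions = [(run.start() + 1, run.group())
              for run in re.finditer(r"[a-z]+", alignment)]

# Dashes at the end of the data mark deleted trailing motif positions.
trailing_deletions = len(alignment) - len(alignment.rstrip("-"))

# Residues actually present in the sequence (gap characters excluded).
residues = sum(ch.isalpha() for ch in alignment)

print(seq_id, score, insertions, trailing_deletions)
print(residues == end - start + 1)
```

The last line compares the number of residues in the alignment data with the span given in the header (291 to 332), which is a cheap consistency check when reading psa output.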
- (1)
- The xpsa(5) format defines a stricter syntax for
the header line, allowing the exchange of information between different
sequence analysis tools. It uses keyword=value pairs to
annotate the current match between a sequence and a motif descriptor. This
syntax can be easily parsed and extended, according to the needs of
bioinformatics tools.
- (2)
- The current implementation of the pftools package
does not use the '.' (dot) character in the psa output.
Nevertheless, psa2msa(1) reads it and interprets it in the same
way as the '-' (dash) character.
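A reader that wants to accept both gap characters can normalize them before further processing. The following Python sketch shows one way to do so; the helper names normalize_gaps and ungapped are hypothetical and are not part of pftools or psa2msa(1).

```python
def normalize_gaps(alignment: str) -> str:
    """Rewrite '.' gap characters as '-' so both are handled alike."""
    return alignment.replace(".", "-")


def ungapped(alignment: str) -> str:
    """Return the residues of the matched sequence segment.

    Gap characters are dropped and insertions (lower case) are
    converted to upper case.
    """
    return normalize_gaps(alignment).replace("-", "").upper()
```

For instance, normalize_gaps("AC..gt-") gives "AC--gt-", and ungapped("AC..gt-") gives "ACGT".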
See also: xpsa(5), pfsearch(1), pfscan(1), pfw(1), pfmake(1), psa2msa(1)
This manual page was originally written by Volker Flegel.
The pftools package was developed by Philipp Bucher.
Any comments or suggestions should be addressed to <[email protected]>.