Ace::Sequence - Examine ACeDB Sequence Objects
# open database connection and get an Ace::Object sequence
use Ace::Sequence;
$db = Ace->connect(-host => 'stein.cshl.org',-port => 200005);
$obj = $db->fetch(Predicted_gene => 'ZK154.3');
# Wrap it in an Ace::Sequence object
$seq = Ace::Sequence->new($obj);
# Find all the exons
@exons = $seq->features('exon');
# Find all the exons predicted by various versions of "genefinder"
@exons = $seq->features('exon:genefinder.*');
# Iterate through the exons, printing their start, end and DNA
for my $exon (@exons) {
print join("\t",$exon->start,$exon->end,$exon->dna),"\n";
}
# Find the region 1000 bp upstream of the first exon
$sub = Ace::Sequence->new(-seq=>$exons[0],
-offset=>-1000,-length=>1000);
# Find all features in that area
@features = $sub->features;
# Print its DNA
print $sub->dna;
# Create a new Sequence object from the first 500 kb of chromosome 1
$seq = Ace::Sequence->new(-name=>'CHROMOSOME_I',-db=>$db,
-offset=>0,-length=>500_000);
# Get the GFF dump as a text string
$gff = $seq->gff;
# Limit dump to Predicted_genes
$gff_genes = $seq->gff(-features=>'Predicted_gene');
# Return a GFF object (using optional GFF.pm module from Sanger)
$gff_obj = $seq->GFF;
Ace::Sequence, and its allied classes Ace::Sequence::Feature and
Ace::Sequence::FeatureList, provide a convenient interface to the ACeDB
Sequence classes and the GFF sequence feature file format.
Using this class, you can define a region of the genome by using a landmark
(sequenced clone, link, superlink, predicted gene), an offset from that
landmark, and a distance. Offsets and distances can be positive or negative.
This will return an Ace::Sequence object. Once a region is defined, you
may retrieve its DNA sequence, or query the database for any features that may
be contained within this region. Features can be returned as objects (using
the Ace::Sequence::Feature class), as GFF text-only dumps, or in the
form of the GFF class defined by the Sanger Centre's GFF.pm module.
This class builds on top of Ace and Ace::Object. Please see their manual pages
before consulting this one.
$seq = Ace::Sequence->new($object);
$seq = Ace::Sequence->new(-source => $object,
-offset => $offset,
-length => $length,
-refseq => $reference_sequence);
$seq = Ace::Sequence->new(-name => $name,
-db => $db,
-offset => $offset,
-length => $length,
-refseq => $reference_sequence);
In order to create an Ace::Sequence you will need an active Ace
database accessor. Sequence regions are defined using a "source"
sequence, an offset, and a length. Optionally, you may also provide a
"reference sequence" to establish the coordinate system for all
inquiries. Sequences may be generated from existing Ace::Object
sequence objects, from other Ace::Sequence and
Ace::Sequence::Feature objects, or from a sequence name and a database
handle.
The class method named new() is the interface to these facilities. In its
simplest, one-argument form, you provide new() with a previously-created
Ace::Object that points to a Sequence or sequence-like object (the meaning of
"sequence-like" is explained in more detail below). The new() method will
return an Ace::Sequence object extending from the beginning of the object
through to its natural end.
In the named-parameter form of new(), the following arguments are
recognized:
- -source
- The sequence source. This must be an Ace::Object of
the "Sequence" class, or be a sequence-like object containing
the SMap tag (see below).
- -offset
- An offset from the beginning of the source sequence. The
retrieved Ace::Sequence will begin at this position. The offset can
be any positive or negative integer. Offsets are 0-based.
- -length
- The length of the sequence to return. Either a positive or
negative integer can be specified. If a negative length is given, the
returned sequence will be complemented relative to the source
sequence.
- -refseq
- The sequence to use to establish the coordinate system for
the returned sequence. Normally the source sequence is used to establish
the coordinate system, but this can be used to override that choice. You
can provide either an Ace::Object or just a sequence name for this
argument. The source and reference sequences must share a common ancestor,
but do not have to be directly related. An attempt to use a disjunct
reference sequence, such as one on a different chromosome, will fail.
- -name
- As an alternative to using an Ace::Object with the
-source argument, you may specify a source sequence using
-name and -db. The Ace::Sequence module will use the
provided database accessor to fetch a Sequence object with the specified
name. new() will return undef if no Sequence by this name is
known.
- -db
- This argument is required if the source sequence is
specified by name rather than by object reference.
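As a rough illustration of how a 0-based offset and a length relate to the 1-based start and end positions reported later in this document, the following sketch does the arithmetic by hand. The helper region_bounds() is hypothetical and is not part of Ace::Sequence; it assumes plain integer arithmetic and handles positive lengths only.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Ace::Sequence: convert a 0-based offset
# and a positive length into 1-based, inclusive start and end coordinates.
# Assumption: plain integer arithmetic, no special treatment of position 0.
sub region_bounds {
    my ($offset, $length) = @_;
    die "this sketch only handles positive lengths\n" unless $length > 0;
    my $start = $offset + 1;          # 0-based offset -> 1-based start
    my $end   = $offset + $length;    # inclusive end position
    return ($start, $end);
}

# The first 500 kb of a sequence: offset 0, length 500_000
my ($start, $end) = region_bounds(0, 500_000);
print "start=$start end=$end\n";      # start=1 end=500000
```

Negative lengths are left out on purpose, since the way the module handles a complemented region is not spelled out here.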
If new() is successful, it will create an Ace::Sequence object and
return it. Otherwise it will return undef and leave a descriptive message in
Ace->error(). Certain programming errors, such as a failure to
provide required arguments, cause a fatal error.
When retrieving information from an Ace::Sequence, the coordinate system
is based on the sequence segment selected at object creation time. That is,
the "+1" strand is the natural direction of the Ace::Sequence
object, and base pair 1 is its first base pair. This behavior can be
overridden by providing a reference sequence to the new() method, in
which case the orientation and position of the reference sequence establish
the coordinate system for the object.
In addition to the reference sequence, there are two other sequences used by
Ace::Sequence for internal bookkeeping. The "source" sequence
corresponds to the smallest ACeDB sequence object that completely encloses the
selected sequence segment. The "parent" sequence is the smallest
ACeDB sequence object that contains the "source". The parent is used
to derive the length and orientation of source sequences that are not directly
associated with DNA objects.
In many cases, the source sequence will be identical to the sequence initially
passed to the new() method. However, there are exceptions to this rule.
One common exception occurs when the offset and/or length cross the boundaries
of the passed-in sequence. In this case, the ACeDB database is searched for
the smallest sequence that contains both endpoints of the Ace::Sequence
object.
The other common exception occurs in Ace 4.8, where there is support for
"sequence-like" objects that contain the "SMap"
("Sequence Map") tag. The "SMap" tag provides genomic
location information for arbitrary objects, not just those descended from the
Sequence class. This allows ACeDB to perform genome map operations on objects
that are not directly related to sequences, such as genetic loci that have
been interpolated onto the physical map. When an "SMap"-containing
object is passed to the Ace::Sequence new() method, the module
will again choose the smallest ACeDB Sequence object that contains both
endpoints of the desired region.
If an Ace::Sequence object is used to create a new Ace::Sequence
object, then the original object's source is inherited.
Once an Ace::Sequence object is created, you can query it using the
following methods:
$name = $seq->asString;
Returns a human-readable identifier for the sequence in the form
Source/start-end, where "Source" is the name of the source
sequence, and "start" and "end" are the endpoints of the
sequence relative to the source (using 1-based indexing). This method is
called automatically when the Ace::Sequence is used in a string context.
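The Source/start-end format and the automatic call in string context can be pictured with Perl's overload pragma. The class My::Segment below is a hypothetical stand-in written only for this sketch; it is not the module's own code, and the sequence name and coordinates are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, used only to illustrate the idea.
package My::Segment;
use overload '""' => \&asString, fallback => 1;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Build the Source/start-end identifier described above.
sub asString {
    my $self = shift;
    return "$self->{source}/$self->{start}-$self->{end}";
}

package main;

my $segment = My::Segment->new(source => 'ZK154', start => 1, end => 500);
print "Segment: $segment\n";    # interpolation calls asString()
```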
$source = $seq->source_seq;
Return the source of the Ace::Sequence.
$parent = $seq->parent_seq;
Return the immediate ancestor of the sequence. The parent of the top-most
sequence (such as the CHROMOSOME link) is itself. This method is used
internally to ascertain the length of source sequences which are not
associated with a DNA object.
NOTE: this procedure is a trifle funky and cannot reliably be used to traverse
upwards to the top-most sequence. The reason for this is that it will return
an Ace::Sequence in some cases, and an Ace::Object in others.
Use get_parent() to traverse upwards through a uniform series of
Ace::Sequence objects.
$refseq = $seq->refseq;
Returns the reference sequence, if one is defined.
$seq->refseq($new_ref);
Set the reference sequence. The reference sequence must share the same ancestor
with $seq.
$start = $seq->start;
Start of this sequence, relative to the source sequence, using 1-based indexing.
$end = $seq->end;
End of this sequence, relative to the source sequence, using 1-based indexing.
$offset = $seq->offset;
Offset of the beginning of this sequence relative to the source sequence, using
0-based indexing. The offset may be negative if the beginning of the sequence
is to the left of the beginning of the source sequence.
$length = $seq->length;
The length of this sequence, in base pairs. The length may be negative if the
sequence's orientation is reversed relative to the source sequence. Use
abslength() to obtain the absolute value of the sequence length.
$length = $seq->abslength;
Return the absolute value of the length of the sequence.
$strand = $seq->strand;
Returns +1 for a sequence oriented in the natural direction of the genomic
reference sequence, or -1 otherwise.
Returns true if the segment is reversed relative to the canonical genomic
direction. This is the same as $seq->strand < 0.
$dna = $seq->dna;
Return the DNA corresponding to this sequence. If the sequence length is
negative, the reverse complement of the appropriate segment will be returned.
ACeDB allows Sequences to exist without an associated DNA object (which
typically happens during intermediate stages of a sequencing project). In such
a case, the returned sequence will contain the correct number of "-"
characters.
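For readers unfamiliar with the two behaviors just described, here is a small stand-alone sketch of reverse complementation and of a "-" placeholder string. Both helpers are hypothetical illustrations, not the module's internals, and the complement table only covers the four unambiguous bases.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse-complement a DNA string.
# Only a, c, g and t (either case) are translated; other symbols pass through.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;            # scalar context reverses the string
    $rc =~ tr/acgtACGT/tgcaTGCA/;     # swap each base for its complement
    return $rc;
}

# Hypothetical helper: placeholder for a segment that has no DNA object.
sub placeholder_dna {
    my ($length) = @_;
    return '-' x abs($length);
}

print reverse_complement('gattaca'), "\n";   # tgtaatc
print placeholder_dna(-5), "\n";             # -----
```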
$name = $seq->name;
Return the name of the source sequence as a string.
$parent = $seq->parent;
Return the immediate ancestor of this Ace::Sequence (i.e., the sequence
that contains this one). The return value is a new Ace::Sequence or
undef, if no parent sequence exists.
@children = $seq->get_children();
Returns all subsequences that exist as independent objects in the ACeDB
database. What exactly is returned is dependent on the data model. In older
ACeDB databases, the only subsequences are those under the catchall
Subsequence tag. In newer ACeDB databases, the objects returned correspond to
objects to the right of the S_Child subtag using a tag[2] syntax, and may
include Predicted_genes, Sequences, Links, or other objects. The return value
is a list of Ace::Sequence objects.
@features = $seq->features;
@features = $seq->features('exon','intron','Predicted_gene');
@features = $seq->features('exon:GeneFinder','Predicted_gene:hand.*');
features() returns an array of Ace::Sequence::Feature objects. If
called without arguments, features() returns all features that cross
the sequence region. You may also provide a filter list to select a set of
features by type and subtype. The format of the filter list is:
type:subtype
Where type is the class of the feature (the "feature" field of
the GFF format), and subtype is a description of how the feature was
derived (the "source" field of the GFF format). Either of these
fields can be absent, and either can be a regular expression. More advanced
filtering is not supported by this method, but is available through the
Sanger Centre's GFF module.
The order of the features in the returned list is not specified. To obtain
features sorted by position, use this idiom:
@features = sort { $a->start <=> $b->start } $seq->features;
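To make the type:subtype filter format concrete, the sketch below applies such filters to plain hash records and then sorts the survivors by start position. Everything in it is hypothetical: matches_filter() is not a module function, the records are made-up data, and the choice to treat each half as an anchored, case-insensitive regular expression is an assumption rather than documented behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: does a feature record match a "type:subtype" filter?
# Assumptions: each half is an anchored, case-insensitive regular expression,
# and an absent half matches anything.
sub matches_filter {
    my ($feature, $filter) = @_;
    my ($type, $subtype) = split /:/, $filter, 2;
    for my $pair ([$type, $feature->{type}], [$subtype, $feature->{subtype}]) {
        my ($pattern, $value) = @$pair;
        next unless defined $pattern && length $pattern;
        return 0 unless $value =~ /^(?:$pattern)$/i;
    }
    return 1;
}

# Made-up feature records for illustration only
my @records = (
    { type => 'exon',   subtype => 'GeneFinder', start => 300 },
    { type => 'intron', subtype => 'GeneFinder', start => 120 },
    { type => 'exon',   subtype => 'curated',    start => 50  },
);

my @exons = sort { $a->{start} <=> $b->{start} }
            grep { matches_filter($_, 'exon') } @records;
print join(',', map { $_->{start} } @exons), "\n";   # 50,300
```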
my $list = $seq->feature_list();
This method returns a summary list of the features that cross the sequence in
the form of an Ace::Sequence::FeatureList object. From the
Ace::Sequence::FeatureList object you can obtain the list of feature names and
the number of each type. The feature list is obtained from the ACeDB server
with a single short transaction, and therefore has much less overhead than
features().
See Ace::Sequence::FeatureList for more details.
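The kind of summary described here, a count for each feature type, can be pictured with a plain hash. The sketch below is a local illustration only: tally_types() is a hypothetical helper, the type names are made up, and it does not reproduce the server-side transaction the method actually uses.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times each feature type occurs.
sub tally_types {
    my (@types) = @_;
    my %count;
    $count{$_}++ for @types;
    return %count;
}

# Made-up list of feature types for illustration only
my %summary = tally_types(qw(exon exon intron Predicted_gene exon));
printf "%-15s %d\n", $_, $summary{$_} for sort keys %summary;
```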
This returns a list of Ace::Sequence::Transcript objects, which are
specializations of Ace::Sequence::Feature. See Ace::Sequence::Transcript for
details.
This returns a list of Ace::Sequence::Feature objects containing reconstructed
clones. This is a nasty hack, because ACeDB currently records clone ends, but
not the clones themselves, meaning that we will not always know both ends of
the clone. In this case the missing end has a synthetic position of
-99,999,999 or +99,999,999. Sorry.
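Code that consumes these clone features may want to skip the ones whose missing end was filled in with the synthetic position. A minimal sketch follows, assuming only that each feature object offers the start and end methods shown in the synopsis. My::MockFeature and has_missing_end() are hypothetical names introduced for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a feature object with start() and end() methods.
package My::MockFeature;
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub start { $_[0]{start} }
sub end   { $_[0]{end} }

package main;

# Synthetic coordinate used for an unknown clone end, as described above.
use constant MISSING_END => 99_999_999;

# Hypothetical helper: true if either end carries the synthetic position.
sub has_missing_end {
    my ($clone) = @_;
    return (abs($clone->start) == MISSING_END
         || abs($clone->end)   == MISSING_END) ? 1 : 0;
}

my @clones = (
    My::MockFeature->new(start => 1200,        end => 41_000),
    My::MockFeature->new(start => -99_999_999, end => 8_500),
);
my @complete = grep { !has_missing_end($_) } @clones;
print scalar(@complete), " clone(s) with both ends known\n";
```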
$gff = $seq->gff();
$gff = $seq->gff(-abs => 1,
-features => ['exon','intron:GeneFinder']);
This method returns a GFF file as a scalar. The following arguments are
optional:
- -abs
- Ordinarily the feature entries in the GFF file will be
returned in coordinates relative to the start of the Ace::Sequence
object. Position 1 will be the start of the sequence object, and the
"+" strand will be the sequence object's natural orientation.
However if a true value is provided to -abs, the coordinate system
used will be relative to the start of the source sequence, i.e. the native
ACeDB Sequence object (usually a cosmid sequence or a link).
If a reference sequence was provided when the Ace::Sequence was
created, it will be used by default to set the coordinate system. Relative
coordinates can be re-enabled by providing a false value to -abs.
Ordinarily the coordinate system manipulations automatically "do what
you want" and you will not need to adjust them. See also the
absolute() method described below.
- -features
- The -features argument filters the features
according to a list of types and subtypes. The format is identical to the
one described for the features() method. A single filter may be
provided as a scalar string. Multiple filters may be passed as an array
reference.
See also the GFF() method described next.
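Because gff() hands back the dump as one scalar, a caller who does not use the GFF module has to split it into records. The sketch below assumes the usual tab-separated GFF column order (seqname, source, feature, start, end, score, strand, frame, then an optional group field). parse_gff() is a hypothetical helper and the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: split GFF text into a list of hash records.
sub parse_gff {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($seqname, $source, $feature, $start, $end,
            $score, $strand, $frame, $group) = split /\t/, $line, 9;
        push @records, {
            seqname => $seqname, source => $source, feature => $feature,
            start   => $start,   end    => $end,    score   => $score,
            strand  => $strand,  frame  => $frame,  group   => $group,
        };
    }
    return @records;
}

# Made-up GFF lines for illustration only
my $sample = join("\n",
    "# sample dump",
    join("\t", 'ZK154', 'GeneFinder', 'exon',   100, 250, '.', '+', '.', 'Sequence "ZK154.3"'),
    join("\t", 'ZK154', 'curated',    'intron', 251, 400, '.', '+', '.'),
) . "\n";

my @parsed = parse_gff($sample);
print scalar(@parsed), " records parsed\n";
```

In this scheme the "source" column corresponds to the subtype half of a features() filter and the "feature" column to the type half.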
$gff_object = $seq->GFF;
$gff_object = $seq->GFF(-abs => 1,
-features => ['exon','intron:GeneFinder']);
The GFF() method takes the same arguments as gff() described
above, but it returns a GFF::GeneFeatureSet object from the GFF.pm
module. If the GFF module is not installed, this method will generate a fatal
error.
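Since a missing GFF module is fatal, a script that must run on machines without it can test for the module first and fall back to the text dump. This is a minimal sketch: gff_object_or_text() and My::MockSeq are hypothetical names, and the check assumes the optional module can be loaded with "require GFF".

```perl
use strict;
use warnings;

# Hypothetical stand-in for an object that offers gff() and GFF() methods.
package My::MockSeq;
sub new { return bless {}, shift }
sub gff { return 'text' }
sub GFF { return 'object' }

package main;

# Hypothetical helper: prefer the GFF object when the optional module loads,
# otherwise fall back to the plain-text dump.
sub gff_object_or_text {
    my ($seq, @args) = @_;
    if (eval { require GFF; 1 }) {
        return $seq->GFF(@args);
    }
    return $seq->gff(@args);
}

my $result = gff_object_or_text(My::MockSeq->new);
print "got $result\n";
```

Callers still need to check which kind of value came back, since one is a string and the other an object.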
$abs = $seq->absolute;
$abs = $seq->absolute(1);
This method controls whether the coordinates of features are returned in
absolute or relative coordinates. "Absolute" coordinates are
relative to the underlying source or reference sequence. "Relative"
coordinates are relative to the Ace::Sequence object. By default,
coordinates are relative unless new() was provided with a reference
sequence. This default can be examined and changed using absolute().
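The difference between the two coordinate systems comes down to a shift and, for reversed segments, a flip. The sketch below shows that arithmetic under stated assumptions: 1-based positions, a segment whose first base sits at a known position on the source, and a strand of +1 or -1. The helper to_absolute() is hypothetical and is not how the module itself computes coordinates.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 1-based position inside a segment onto the
# coordinate system of its source sequence.
#   $relative_pos  - position within the segment (1 = its first base)
#   $segment_start - source position of the segment's first base
#   $strand        - +1 if the segment runs with the source, -1 if against it
sub to_absolute {
    my ($relative_pos, $segment_start, $strand) = @_;
    return $strand >= 0
        ? $segment_start + $relative_pos - 1
        : $segment_start - $relative_pos + 1;
}

print to_absolute(1,  1001, +1), "\n";   # 1001
print to_absolute(10, 2000, -1), "\n";   # 1991
```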
$merge = $seq->automerge;
$seq->automerge(0);
This method controls whether groups of features will automatically be merged
together by the features() call. If true (the default), then the left
and right ends of clones will be merged into "clone" features,
introns, exons and CDS entries will be merged into Ace::Sequence::Transcript
objects, and similarity entries will be merged into
Ace::Sequence::GappedAlignment objects.
$db = $seq->db;
Returns the Ace database accessor associated with this sequence.
Ace, Ace::Object, Ace::Sequence::Feature, Ace::Sequence::FeatureList, GFF
Lincoln Stein, with extensive help from Jean Thierry-Mieg.
Many thanks to David Block for finding and fixing the nasty off-by-one errors.
Copyright (c) 1999, Lincoln D. Stein
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself. See DISCLAIMER.txt for disclaimers of
warranty.