Chemistry::OpenSMILES - OpenSMILES format reader and writer
use Chemistry::OpenSMILES::Parser;
my $parser = Chemistry::OpenSMILES::Parser->new;
my @moieties = $parser->parse( 'C#C.c1ccccc1' );
$\ = "\n";
for my $moiety (@moieties) {
    # $moiety is a Graph::Undirected object
    print scalar $moiety->vertices;
    print scalar $moiety->edges;
}
use Chemistry::OpenSMILES::Writer qw(write_SMILES);
print write_SMILES( \@moieties );
Chemistry::OpenSMILES provides support for SMILES chemical identifiers
conforming to the OpenSMILES v1.0 specification
(<http://opensmiles.org/opensmiles.html>).
Chemistry::OpenSMILES::Parser parses SMILES strings and returns them as
arrays of Graph::Undirected objects. Each atom is represented by a hash.
Chemistry::OpenSMILES::Writer performs the inverse operation. The generated
SMILES strings are by no means optimal.
Disconnected parts of a compound are represented as separate Graph::Undirected
objects. Atoms are represented as vertices, and bonds are represented as
edges.
Atoms
Atoms, or vertices of a molecular graph, are represented as hash references:
{
    "symbol" => "C",
    "isotope" => 13,
    "chirality" => "@@",
    "hcount" => 3,
    "charge" => 1,
    "class" => 0,
    "number" => 0,
}
Except for "symbol", "class" and "number", all
keys of the hash are optional. Per the OpenSMILES specification, the default
values for "hcount" and "class" are 0.
For chiral atoms, the order of their neighbours in the input is preserved in an
array stored under the "chirality_neighbours" key of the atom hash.
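As a minimal sketch of how such a hash might be consumed, the helper below
formats an atom as a bracketed label. The name atom_label is hypothetical and
not part of Chemistry::OpenSMILES; the code uses only core Perl, and treating a
missing "charge" as neutral is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Chemistry::OpenSMILES: builds a
# bracketed label from an atom hash, skipping optional keys that are absent.
sub atom_label {
    my ($atom) = @_;
    my $hcount = $atom->{hcount} // 0;   # default per the specification
    my $charge = $atom->{charge} // 0;   # assumption: absent means neutral

    my $label = '';
    $label .= $atom->{isotope}   if defined $atom->{isotope};
    $label .= $atom->{symbol};
    $label .= $atom->{chirality} if defined $atom->{chirality};
    $label .= 'H' . ( $hcount > 1 ? $hcount : '' ) if $hcount > 0;
    $label .= ( $charge > 0 ? '+' : '-' ) . ( abs($charge) > 1 ? abs($charge) : '' )
        if $charge != 0;
    return "[$label]";
}

my %atom = ( symbol => 'C', isotope => 13, chirality => '@@',
             hcount => 3, charge => 1, class => 0, number => 0 );
print atom_label( \%atom ), "\n";   # prints [13C@@H3+]
```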
Bonds
Bonds, or edges of a molecular graph, rely entirely on the internal
representation of Graph::Undirected. Bond orders other than single ("-", which
is also the default) are stored as values of the edge attribute
"bond". They correspond to the symbols used in the OpenSMILES
specification.
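The following sketch shows one way the "bond" attribute might be read back
through the Graph::Undirected interface. The graph is built by hand rather than
by the parser, so the refvertexed option and the bond_symbol helper are
assumptions of the example, not a description of the module's internals.

```perl
use strict;
use warnings;
use Graph::Undirected;

# Build a tiny graph by hand (ethyne, C#C) instead of calling the parser.
# refvertexed => 1 lets hash references serve as vertices.
my $graph = Graph::Undirected->new( refvertexed => 1 );
my $c1 = { symbol => 'C', class => 0, number => 0 };
my $c2 = { symbol => 'C', class => 0, number => 1 };
$graph->add_edge( $c1, $c2 );
$graph->set_edge_attribute( $c1, $c2, 'bond', '#' );

# Hypothetical helper: returns the stored bond symbol, or '-' when the
# edge carries no "bond" attribute (a single bond).
sub bond_symbol {
    my ( $g, $u, $v ) = @_;
    return $g->has_edge_attribute( $u, $v, 'bond' )
         ? $g->get_edge_attribute( $u, $v, 'bond' )
         : '-';
}

for my $edge ( $graph->edges ) {
    my ( $u, $v ) = @$edge;
    print join( ' ', $u->{symbol}, bond_symbol( $graph, $u, $v ), $v->{symbol} ), "\n";
}
```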
"parse" accepts the following options, given as key-value pairs in an
anonymous hash passed as its second parameter:
- "max_hydrogen_count_digits"
- In the OpenSMILES specification, the number of attached hydrogen
atoms for atoms in square brackets is limited to 9. IUPAC SMILES+ has
increased this number to 99. With the value of
"max_hydrogen_count_digits", the parser can be instructed to
allow a number of digits other than 1 for the attached hydrogen count.
- "raw"
- With "raw" set to anything evaluating to true,
the parser will convert neither implicit nor explicit hydrogen atoms
in square brackets to atom hashes of their own. Moreover, it will not
attempt to unify the representations of chirality. It should be noted,
though, that many subroutines of Chemistry::OpenSMILES expect non-raw
data structures, so processing raw output with them may produce distorted
results.
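A short sketch of passing these options, assuming the call convention described
above. The input strings are illustrative only, and a fresh parser is created
for each call as a precaution rather than as a documented requirement.

```perl
use strict;
use warnings;
use Chemistry::OpenSMILES::Parser;

# Options go in an anonymous hash given as the second argument of parse().
my @default = Chemistry::OpenSMILES::Parser->new->parse( '[CH4]' );
my @raw     = Chemistry::OpenSMILES::Parser->new->parse( '[CH4]', { raw => 1 } );

# Allow two-digit hydrogen counts, as in IUPAC SMILES+.
my @wide = Chemistry::OpenSMILES::Parser->new->parse(
    '[CH10]', { max_hydrogen_count_digits => 2 } );

# Compare how many vertices each mode produces for the same input.
printf "default: %d vertices, raw: %d vertices\n",
    scalar $default[0]->vertices, scalar $raw[0]->vertices;
```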
Element symbols in square brackets are not limited to the ones known to
chemistry. Currently any one- or two-letter symbol is allowed.
Deprecated charge notations ("--" and "++") are supported.
The OpenSMILES specification mandates a strict order of ring bonds and branches:
    branched_atom ::= atom ringbond* branch*
Chemistry::OpenSMILES::Parser supports both the mandated structure and the
inverted one, in which ring bonds follow branch descriptions.
Whitespace is not supported yet. SMILES descriptors must be stripped of it
before they are passed to Chemistry::OpenSMILES::Parser.
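One way to do such cleaning, sketched in core Perl. The helper name
strip_whitespace is hypothetical and not provided by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: removes every whitespace character so that the
# resulting string can be handed to the parser.
sub strip_whitespace {
    my ($smiles) = @_;
    $smiles =~ s/\s+//g;
    return $smiles;
}

my $clean = strip_whitespace( " C#C . c1ccccc1\n" );
print "$clean\n";   # prints C#C.c1ccccc1
```

If the input carries a name or comment after the SMILES string, that part
would need to be split off first, since this helper simply joins everything
that remains.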
The derivation of implicit hydrogen counts for aromatic atoms is not
unambiguously defined in the OpenSMILES specification. Thus implicit hydrogens
are derived only for aromatic carbon, which is treated as having a valence of 3.
Chiral atoms with three neighbours are interpreted as having a lone pair of
electrons as the fourth chiral neighbour. The lone pair is always understood
as being the second in the order of neighbour enumeration, except when the
atom with the lone pair starts a chain. In that case the lone pair is the first.
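The placement rule can be illustrated with a small core-Perl sketch. The
helper with_lone_pair and the 'lone pair' placeholder string are hypothetical;
this shows the ordering described above, not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: given the three neighbours of a chiral atom in
# input order, return the four-element order with a lone pair placeholder.
sub with_lone_pair {
    my ( $neighbours, $starts_chain ) = @_;
    die "expected exactly three neighbours\n" unless @$neighbours == 3;
    my @order = @$neighbours;
    # The lone pair goes first if the atom starts a chain, second otherwise.
    splice @order, ( $starts_chain ? 0 : 1 ), 0, 'lone pair';
    return @order;
}

print join( ', ', with_lone_pair( [ 'C', 'O', 'F' ], 0 ) ), "\n";
# prints C, lone pair, O, F
print join( ', ', with_lone_pair( [ 'C', 'O', 'F' ], 1 ) ), "\n";
# prints lone pair, C, O, F
```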
perl(1)
Andrius Merkys, <[email protected]>