NAME

xtract - NCBI Entrez Direct XML conversion and transformation tool

SYNOPSIS

xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-includes str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-element element] [-first element] [-last element] [-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-bin element] [-oct element] [-hex element] [-bit element] [-pad element] [-encode element] [-upper element] [-lower element] [-chain element] [-title element] [-mirror element] [-alnum element] [-basic element] [-plain element] [-simple element] [-author element] [-prose element] [-terms element] [-words element] [-pairs element] [-order element] [-reverse element] [-letters element] [-clauses element] [-year element] [-month element] [-date element] [-page element] [-auth element] [-initials element] [-jour element] [-trim element] [-wct element] [-doi element] [-translate element] [-classify element] [-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] [-histogram] [-e2index [extras]] [-indices element] [-article element] [-abstract element] [-paragraph element] [-stemmed element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element] [-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]

DESCRIPTION

xtract converts an XML document into a table of data values according to user-specified rules.

OPTIONS

Processing Flags

-strict
Remove HTML and MathML tags.
-mixed
Allow mixed content XML.
-self
Allow detection of empty self-closing tags.
-accent
Delete Unicode accents and diacritical marks.
-ascii
Convert Unicode to numeric HTML character entities.
-compress
Compress runs of spaces.
-stops
Retain stop words in selected phrases.

Data Source

-input filename
Read XML from file instead of standard input.
-transform filename
File of substitutions for -translate.
-aliases filename
Mappings file for -classify operation.

Exploration Argument Hierarchy

-pattern expr
-group expr
-block expr
-subset expr
Name of record within set. Use of different argument names allows command-line control of nested looping.
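
The nested looping can be pictured as one loop per exploration level. The Python sketch below is an illustration only, not xtract's implementation. It mimics what a command such as xtract -pattern PubmedArticle -element PMID -block Author -sep " " -element Initials,LastName might do. The sample XML and the function name explore are hypothetical, and joining fields with tabs and records with newlines is an assumption inferred from the -tab and -ret descriptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified input (not real PubMed XML).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1</PMID>
    <Author><LastName>Smith</LastName><Initials>JA</Initials></Author>
    <Author><LastName>Jones</LastName><Initials>B</Initials></Author>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>2</PMID>
    <Author><LastName>Lee</LastName><Initials>C</Initials></Author>
  </PubmedArticle>
</PubmedArticleSet>"""


def explore(xml_text):
    """Outer loop per -pattern record, inner loop per -block object."""
    rows = []
    root = ET.fromstring(xml_text)
    for record in root.iter("PubmedArticle"):            # -pattern PubmedArticle
        fields = [record.findtext("PMID", "")]           # -element PMID
        for author in record.iter("Author"):             # -block Author
            # -sep " " -element Initials,LastName
            members = [author.findtext("Initials", ""),
                       author.findtext("LastName", "")]
            fields.append(" ".join(members))
        rows.append("\t".join(fields))                   # assumed: tab between fields
    return "\n".join(rows)                               # assumed: newline between records


if __name__ == "__main__":
    print(explore(SAMPLE))
```

Each record yields one output line, and each Author visited by the inner loop adds one field to that line.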

Path Navigation

-path path
Explore by list of adjacent object names.

Exploration Constructs

Object
DateRevised
Parent/Child
Book/AuthorList
Path
MedlineCitation/Article/Journal/JournalIssue/PubDate
Heterogeneous
"PubmedArticleSet/*"
Exhaustive
"History/**"
Nested
"*/Taxon"

Conditional Execution

-if expr [constraint]
Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]
Skip if element matches.
-and condition
Preceding and following tests must both pass.
-or condition
Any passing test suffices.
-else
Execute if conditional test failed.
-position pos
first/last/outer/inner/even/odd/all.

String Constraints

-equals str
String must match exactly.
-contains str
Substring must be present.
-includes str
Substring must match at word boundaries.
-is-within str
String must be present.
-starts-with str
Substring must be at beginning.
-ends-with str
Substring must be at end.
-is-not str
String must not match.
-is-before str
First string < second string.
-is-after str
First string > second string.
-matches str
Matches without commas or semicolons.
-resembles str
Requires all words, but in any order.
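
A few of these constraints can be read as simple predicates. The Python sketch below is an approximation based only on the descriptions above and on the NOTES statement that string comparisons are case-insensitive. The function names are hypothetical, and modelling -includes with regular-expression word boundaries is an assumption about what "word boundaries" means here.

```python
import re


def equals(value, s):
    """-equals: whole string matches, ignoring case."""
    return value.lower() == s.lower()


def contains(value, s):
    """-contains: s occurs anywhere in value."""
    return s.lower() in value.lower()


def starts_with(value, s):
    """-starts-with: s occurs at the beginning of value."""
    return value.lower().startswith(s.lower())


def ends_with(value, s):
    """-ends-with: s occurs at the end of value."""
    return value.lower().endswith(s.lower())


def is_not(value, s):
    """-is-not: whole string does not match."""
    return value.lower() != s.lower()


def includes(value, s):
    """-includes: s occurs as whole words (assumed equivalent to regex \\b)."""
    pattern = r"\b" + re.escape(s) + r"\b"
    return re.search(pattern, value, re.IGNORECASE) is not None
```

Under these assumptions, "lipase" passes contains but fails includes against "Phospholipase A2", because the match does not start at a word boundary.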

Object Constraints

-is-equal-to expr
Object values must match.
-differs-from expr
Object values must differ.

Numeric Constraints

-gt N
Greater than.
-ge N
Greater than or equal to.
-lt N
Less than.
-le N
Less than or equal to.
-eq N
Equal to.
-ne N
Not equal to.

Format Customization

-ret str
Override line break between patterns.
-tab str
Replace tab character between fields.
-sep str
Separator between group members.
-pfx str
Prefix to print before group.
-sfx str
Suffix to print after group.
-rst
Reset -sep through -elg.
-clr
Clear queued tab separator.
-pfc str
Preface combines -clr and -pfx.
-deq str
Delete and replace queued tab separator.
-def str
Default placeholder for missing fields.
-lbl str
Insert arbitrary text.
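
One way to read -sep, -pfx, -sfx, and -def together is as parameters of a single join step. The Python sketch below is a guess at that reading, not xtract's actual logic. The function format_group is hypothetical, and whether the prefix and suffix are also printed around a -def placeholder is not stated here, so the sketch leaves them off.

```python
def format_group(members, sep, pfx="", sfx="", default=""):
    """Join the members of one group.

    sep     -> -sep (separator between group members)
    pfx/sfx -> -pfx / -sfx (printed before / after the group)
    default -> -def (placeholder when nothing matched; assumed to
               be printed without prefix or suffix)
    """
    if not members:
        return default
    return pfx + sep.join(members) + sfx


if __name__ == "__main__":
    print(format_group(["Smith", "Jones"], sep="; ", pfx="[", sfx="]"))
    print(format_group([], sep="; ", default="-"))
```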

XML Generation

-set tag
XML tag for entire set.
-rec tag
XML tag for each record.
-wrp tag
Wrap elements in XML object.
-enc tag
Encase instance in XML object.
-plg str
Prologue to print before instance.
-elg str
Epilogue to print after instance.
-pkg tag
Package subset in XML object.
-fwd str
Foreword to print before subset.
-awd str
Afterword to print after subset.

Element Selection

-element element
Print all items that match tag name.
-first element
Only print value of first item.
-last element
Only print value of last item.
-backward element
Print values in reverse order.
-NAME
Record value in named variable.
--STATS
Accumulate values into variable.

-element Constructs

Tag
Caption
Group
Initials,LastName
Parent/Child
MedlineCitation/PMID
Recursive
"**/Gene-commentary_accession"
Unrestricted
PubDate/*
Attribute
DescriptorName@MajorTopicYN
Range
MedlineDate[1:4]
Substring
"Title[phospholipase | rattlesnake]"
Object Count
"#Author"
Item Length
"%Title"
Element Depth
"^PMID"
Variable
"&NAME"

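The Range, Object Count, and Item Length constructs can be approximated in a few lines of Python. This is a sketch under assumptions, not a description of xtract internals: the record is hypothetical, the helper names are invented, and the Range bounds are assumed to be 1-based and inclusive, so that [1:4] yields the first four characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
RECORD = ET.fromstring(
    "<Article><Title>Snake venom</Title>"
    "<MedlineDate>1998 Dec-1999 Jan</MedlineDate>"
    "<Author>A</Author><Author>B</Author><Author>C</Author></Article>"
)


def char_range(text, first, last):
    """Range, e.g. MedlineDate[1:4] (assumed 1-based, inclusive)."""
    return text[first - 1:last]


def object_count(node, tag):
    """Object Count, e.g. "#Author": number of matching objects."""
    return sum(1 for _ in node.iter(tag))


def item_length(node, tag):
    """Item Length, e.g. "%Title": number of characters in the item."""
    return len(node.findtext(tag, ""))


if __name__ == "__main__":
    print(char_range(RECORD.findtext("MedlineDate"), 1, 4))
    print(object_count(RECORD, "Author"))
    print(item_length(RECORD, "Title"))
```
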
Special -element Operations

Parent Index
"+"
Object Name
"?"
Object Value
"~"
XML Subtree
"*"
Children
"$"
Attributes
"@"
ASN.1 Record
"."
JSON Record
"%"

Numeric Processing

-num element
Count.
-len element
Length.
-sum element
Sum.
-acc element
Accumulator.
-min element
Minimum.
-max element
Maximum.
-inc element
Increment.
-dec element
Decrement.
-sub element
Difference.
-avg element
Average.
-dev element
Deviation.
-med element
Median.
-mul element
Product.
-div element
Quotient.
-mod element
Remainder.
-bin element
Binary.
-oct element
Octal.
-hex element
Hexadecimal.
-bit element
Bit count.
-pad element
Zero-pad to eight digits.
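
Several of these operations correspond to familiar integer operations. The Python sketch below illustrates that correspondence only. The helper names are hypothetical, and details such as lowercase hexadecimal digits and the absence of a base prefix are assumptions, since this page does not specify the output format.

```python
def summarize(values):
    """-num, -sum, -min, -max over a list of integers."""
    return {
        "num": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }


def to_bases(n):
    """-bin, -oct, -hex: the same integer in base 2, 8, and 16.

    Assumed: no 0b/0o/0x prefix, lowercase hex digits.
    """
    return format(n, "b"), format(n, "o"), format(n, "x")


def bit_count(n):
    """-bit: number of 1 bits in the binary representation."""
    return bin(n).count("1")


def pad8(n):
    """-pad: zero-pad to eight digits."""
    return str(n).zfill(8)
```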

Character Processing

-encode element
XML-encode <, >, &, ", and ' characters.
-upper element
Convert text to uppercase.
-lower element
Convert text to lowercase.
-chain element
Change spaces to underscores.
-title element
Capitalize initial letters of words.
-mirror element
Reverse order of letters.
-alnum element
Convert non-alphanumeric characters to spaces.
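
The character-level operations are simple string transformations. The Python sketch below shows one plausible reading of -chain, -mirror, -alnum, and -encode. The function names are hypothetical, treating only ASCII letters and digits as alphanumeric is an assumption, and so is the choice of entity names (&quot; and &apos;).

```python
import re


def chain(text):
    """-chain: change spaces to underscores."""
    return text.replace(" ", "_")


def mirror(text):
    """-mirror: reverse the order of letters."""
    return text[::-1]


def alnum(text):
    """-alnum: non-alphanumeric characters to spaces (ASCII-only assumption)."""
    return re.sub(r"[^A-Za-z0-9]", " ", text)


def encode(text):
    """-encode: XML-encode <, >, &, ", and ' characters."""
    # "&" is handled first so the entities produced below are not re-encoded.
    for char, entity in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
                         ('"', "&quot;"), ("'", "&apos;")):
        text = text.replace(char, entity)
    return text
```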

String Processing

-basic element
Convert superscripts and subscripts.
-plain element
Remove embedded mixed-content markup tags.
-simple element
Normalize accented letters; spell Greek letters.
-author element
Multi-step author cleanup.
-prose element
Text conversion to ASCII.

Text Processing

-terms element
Partition text at spaces.
-words element
Split at punctuation marks.
-pairs element
Adjacent informative words.
-order element
Rearrange words in sorted order.
-reverse element
Reverse words in string.
-letters element
Separate individual letters.
-clauses element
Break at phrase separators.

Citation Functions

-year element
Extract first 4-digit year from string.
-month element
Match first month name and return a corresponding integer.
-date element
Format date as YYYY/MM/DD, as in -unit "PubDate" -date "*".
-page element
Get digits (and letters) of first page number.
-auth element
Change GenBank authors to Medline form.
-initials element
Parse initials from forename or given name.
-jour element
Clean up journal name punctuation.
-trim element
Remove extra spaces and leading zeros.
-wct element
Count number of -words in a string.
-doi element
Add https://doi.org/ prefix, URL encode.
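
The -year, -month, and -doi descriptions suggest straightforward text operations. The Python sketch below is an approximation and not xtract's code. The function names are hypothetical, matching months on their first three letters is an assumption, and the set of characters that xtract leaves unencoded in a DOI is not stated here, so the quoting rules are those of Python's urllib.

```python
import re
from urllib.parse import quote

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_RE = re.compile(r"\b(" + "|".join(MONTHS) + ")")


def first_year(text):
    """-year: first run of four digits, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group(0) if match else None


def first_month(text):
    """-month: integer (1-12) for the first month name found, else None.

    Assumption: a word starting with a three-letter month prefix counts.
    """
    match = MONTH_RE.search(text.lower())
    return MONTHS.index(match.group(1)) + 1 if match else None


def doi_url(doi):
    """-doi: add the https://doi.org/ prefix and URL-encode the rest."""
    return "https://doi.org/" + quote(doi, safe="/")
```

The prefix match is deliberately loose, so a word such as "maybe" would be read as May. A stricter matcher would compare against full month names as well.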

Value Transformation

-translate element
Substitute values with -transform table.
-classify element
Substring word or phrase matches to -aliases table.

Regular Expression

-replace
Substitute text using regular expressions.
-reg target
Target expression.
-exp pattern
Replacement pattern.
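
The -replace, -reg, and -exp trio amounts to a regular-expression substitution. The Python sketch below shows the idea with Python's re module. The function name is hypothetical, and xtract's regular-expression dialect and replacement syntax may differ from Python's, so the example sticks to a pattern that most dialects share.

```python
import re


def replace(value, reg, exp):
    """Substitute every match of the target expression (-reg)
    with the replacement pattern (-exp)."""
    return re.sub(reg, exp, value)


if __name__ == "__main__":
    # Collapse each run of whitespace into a single space.
    print(replace("heat   shock\tprotein", r"\s+", " "))
```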

Sequence Processing

-revcomp
Reverse complement nucleotide sequence.
-nucleic
Subrange determines forward or revcomp.
-fasta
Split sequence into blocks of 70 uppercase letters.
-ncbi2na
Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
-ncbi4na
Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
-molwt
Calculate molecular weight of peptide.
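
Reverse complementing and FASTA line splitting are standard operations, sketched below in Python for illustration. The function names are hypothetical, and the complement table covers only A, C, G, and T. Handling of IUPAC ambiguity codes is not shown.

```python
# Complement table for unambiguous bases only (assumption: no IUPAC codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """-revcomp: complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def fasta_blocks(seq, width=70):
    """-fasta: split the sequence into blocks of 70 uppercase letters."""
    seq = seq.upper()
    return [seq[i:i + width] for i in range(0, len(seq), width)]


if __name__ == "__main__":
    print(revcomp("ATGCC"))
    print("\n".join(fasta_blocks("acgt" * 40)))
```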

Sequence Coordinates

-0-based element
Zero-based.
-1-based element
One-based.
-ucsc-based element
Half-open.
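
The three conventions differ only in where counting starts and whether the end position is included. The Python sketch below expresses one interval in each convention. It illustrates the general conventions rather than xtract's behaviour: the function is hypothetical, and which convention xtract assumes for its input is not stated on this page.

```python
def from_one_based(start, stop):
    """Express a 1-based, inclusive interval in all three conventions."""
    return {
        "1-based": (start, stop),           # first base is 1, end included
        "0-based": (start - 1, stop - 1),   # first base is 0, end included
        "ucsc-based": (start - 1, stop),    # half-open: 0-based start, end excluded
    }


if __name__ == "__main__":
    for name, interval in from_one_based(101, 200).items():
        print(name, interval)
```

In the half-open form the interval length is simply stop minus start, which is why that convention is convenient for arithmetic.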

Command Generator

-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
Descriptor(s)
INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
Completeness
complete/partial
Feature(s)
CDS/mRNA/...[,...]
Qualifier(s)
INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]

Frequency Table

-histogram
Collects data for sort-uniq-count(1) on entire set of records.

Entrez Indexing

-e2index [extras]
Create Entrez index XML. extras (true or false; false by default) indicates whether to index extra fields.
-indices element
Index normalized words.
-article element
Title positional index.
-abstract element
Abstract positional index.
-paragraph element
Index text paragraphs.
-stemmed element
Apply Porter2 algorithm.

Output Organization

-head str
Print before everything else.
-tail str
Print after everything else.
-hd str
Print before each record.
-tl str
Print after each record.

Record Selection

-select condition
Select record subset by conditions.
-in filename
File of identifiers to use for selection.

Record Rearrangement

-sort[-fwd] element
Element to use as sort key.
-sort-rev element
Sort records in reverse order.

Reformatting

-format fmt
copy
Fast block copy (still applies processing flags).
compact
Compress runs of spaces.
flush
Suppress line indentation.
indent
Indent according to nesting depth.
expand
Place each attribute on a separate line.

Validation

-verify
Report XML data integrity problems.

Summary

-outline
Display outline of XML structure.
-synopsis
Display individual XML paths.
-contour [delimiter]
Display XML paths to leaf nodes (delimited by / by default).

Documentation

-help
Print usage information and some example argument combinations.
-examples
Complete usage examples, involving additional Entrez Direct tools.
-unix
Illustrate common Unix command arguments.
-version
Print version number.

NOTES

String constraints use case-insensitive comparisons.

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).

-words, -pairs, and -indices convert to lower case.

SEE ALSO

archive-pmc(1), archive-pubmed(1), custom-index(1), disambiguate-nucleotides(1), download-ncbi-data(1), ds2pme(1), esample(1), fetch-pmc(1), fetch-pubmed(1), find-in-gene(1), fuse-segments(1), gene2range(1), hgvs2spdi(1), index-extras(1), index-pubmed(1), pma2pme(1), rchive(1), snp2hgvs(1), snp2tbl(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xml2fsa(1), xml2tbl(1), xy-plot(1).