xtract - NCBI Entrez Direct XML conversion and transformation tool
xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress]
[-stops] [-input filename] [-transform filename] [-aliases filename]
[-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path]
[-if expr [constraint]] [-unless expr [constraint]] [-and condition]
[-or condition] [-else] [-position pos] [-equals str] [-contains str]
[-includes str] [-is-within str] [-starts-with str] [-ends-with str]
[-is-not str] [-is-before str] [-is-after str] [-matches str]
[-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N]
[-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str]
[-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str]
[-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag]
[-fwd str] [-awd str] [-element element] [-first element] [-last element]
[-backward element] [-NAME] [--STATS] [-num element] [-len element]
[-sum element] [-acc element] [-min element] [-max element] [-inc element]
[-dec element] [-sub element] [-avg element] [-dev element] [-med element]
[-mul element] [-div element] [-mod element] [-bin element] [-oct element]
[-hex element] [-bit element] [-pad element] [-encode element]
[-upper element] [-lower element] [-chain element] [-title element]
[-mirror element] [-alnum element] [-basic element] [-plain element]
[-simple element] [-author element] [-prose element] [-terms element]
[-words element] [-pairs element] [-order element] [-reverse element]
[-letters element] [-clauses element] [-year element] [-month element]
[-date element] [-page element] [-auth element] [-initials element]
[-jour element] [-trim element] [-wct element] [-doi element]
[-translate element] [-classify element]
[-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta]
[-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element]
[-ucsc-based element] [-insd arg ...] [-histogram] [-e2index [extras]]
[-indices element] [-article element] [-abstract element]
[-paragraph element] [-stemmed element] [-head str] [-tail str] [-hd str]
[-tl str] [-select condition] [-in filename] [-sort[-fwd] element]
[-sort-rev element] [-format fmt [-unicode style]] [-verify] [-outline]
[-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]
xtract converts an XML document into a table of data values according to
user-specified rules.
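To make "a table of data values" concrete, the following Python sketch imitates what a command along the lines of `xtract -pattern Rec -element Id Name` is described as doing: visit each record and print the requested values on one line. It is an illustration only, not xtract's implementation. The function name `xml_to_table`, the sample document, and its tag names are assumptions made for the example; the tab and line-break separators mirror what the -tab and -ret entries imply.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the tag names are invented for illustration.
SAMPLE = """<Set>
  <Rec><Id>1</Id><Name>alpha</Name></Rec>
  <Rec><Id>2</Id><Name>beta</Name></Rec>
</Set>"""


def xml_to_table(xml_text, pattern, elements, tab="\t", ret="\n"):
    """Return one row per `pattern` record, one column per matching element."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):
        fields = []
        for name in elements:
            # Collect every matching item inside this record, in document order.
            fields.extend((node.text or "").strip() for node in record.iter(name))
        rows.append(tab.join(fields))
    return ret.join(rows)


print(xml_to_table(SAMPLE, "Rec", ["Id", "Name"]))
```

Each `Rec` becomes one output line, with the `Id` and `Name` values separated by a tab.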
- -strict
- Remove HTML and MathML tags.
- -mixed
- Allow mixed content XML.
- -self
- Allow detection of empty self-closing tags.
- -accent
- Delete Unicode accents and diacritical marks.
- -ascii
- Convert Unicode to numeric HTML character entities.
- -compress
- Compress runs of spaces.
- -stops
- Retain stop words in selected phrases.
- -input filename
- Read XML from file instead of standard input.
- -transform filename
- File of substitutions for -translate.
- -aliases filename
- Mappings file for -classify operation.
- -pattern expr
- -group expr
- -block expr
- -subset expr
- Name of record within set. Use of different argument names
allows command-line control of nested looping.
- -path path
- Explore by list of adjacent object names.
- Object
- DateRevised
- Parent/Child
- Book/AuthorList
- Path
- MedlineCitation/Article/Journal/JournalIssue/PubDate
- Heterogeneous
- "PubmedArticleSet/*"
- Exhaustive
- "History/**"
- Nested
- "*/Taxon"
- -if expr [constraint]
- Element (or @attribute) must exist and satisfy any specified constraint.
- -unless expr [constraint]
- Skip if element matches.
- -and condition
- Preceding and following tests must both pass.
- -or condition
- Any passing test suffices.
- -else
- Execute if conditional test failed.
- -position pos
- first/last/outer/inner/even/odd/all.
- -equals str
- String must match exactly.
- -contains str
- Substring must be present.
- -includes str
- Substring must match at word boundaries.
- -is-within str
- String must be present.
- -starts-with str
- Substring must be at beginning.
- -ends-with str
- Substring must be at end.
- -is-not str
- String must not match.
- -is-before str
- First string < second string.
- -is-after str
- First string > second string.
- -matches str
- Matches without commas or semicolons.
- -resembles str
- Requires all words, but in any order.
- -is-equal-to expr
- Object values must match.
- -differs-from expr
- Object values must differ.
- -gt N
- Greater than.
- -ge N
- Greater than or equal to.
- -lt N
- Less than.
- -le N
- Less than or equal to.
- -eq N
- Equal to.
- -ne N
- Not equal to.
- -ret str
- Override line break between patterns.
- -tab str
- Replace tab character between fields.
- -sep str
- Separator between group members.
- -pfx str
- Prefix to print before group.
- -sfx str
- Suffix to print after group.
- -rst
- Reset -sep through -elg.
- -clr
- Clear queued tab separator.
- -pfc str
- Preface combines -clr and -pfx.
- -deq str
- Delete and replace queued tab separator.
- -def str
- Default placeholder for missing fields.
- -lbl str
- Insert arbitrary text.
- -set tag
- XML tag for entire set.
- -rec tag
- XML tag for each record.
- -wrp tag
- Wrap elements in XML object.
- -enc tag
- Encase instance in XML object.
- -plg str
- Prologue to print before instance.
- -elg str
- Epilogue to print after instance.
- -pkg tag
- Package subset in XML object.
- -fwd str
- Foreword to print before subset.
- -awd str
- Afterword to print after subset.
- -element element
- Print all items that match tag name.
- -first element
- Only print value of first item.
- -last element
- Only print value of last item.
- -backward element
- Print values in reverse order.
- -NAME
- Record value in named variable.
- --STATS
- Accumulate values into variable.
- Tag
- Caption
- Group
- Initials,LastName
- Parent/Child
- MedlineCitation/PMID
- Recursive
- "**/Gene-commentary_accession"
- Unrestricted
- PubDate/*
- Attribute
- DescriptorName@MajorTopicYN
- Range
- MedlineDate[1:4]
- Substring
- "Title[phospholipase | rattlesnake]"
- Object Count
- "#Author"
- Item Length
- "%Title"
- Element Depth
- "^PMID"
- Variable
- "&NAME"
- Parent Index
- "+"
- Object Name
- "?"
- Object Value
- "~"
- XML Subtree
- "*"
- Children
- "$"
- Attributes
- "@"
- ASN.1 Record
- "."
- JSON Record
- "%"
- -num element
- Count.
- -len element
- Length.
- -sum element
- Sum.
- -acc element
- Accumulator.
- -min element
- Minimum.
- -max element
- Maximum.
- -inc element
- Increment.
- -dec element
- Decrement.
- -sub element
- Difference.
- -avg element
- Average.
- -dev element
- Deviation.
- -med element
- Median.
- -mul element
- Product.
- -div element
- Quotient.
- -mod element
- Remainder.
- -bin element
- Binary.
- -oct element
- Octal.
- -hex element
- Hexadecimal.
- -bit element
- Bit count.
- -pad element
- Zero-pad to eight digits.
- -encode element
- XML-encode <, >, &, ", and ' characters.
- -upper element
- Convert text to uppercase.
- -lower element
- Convert text to lowercase.
- -chain element
- Change spaces to underscores.
- -title element
- Capitalize initial letters of words.
- -mirror element
- Reverse order of letters.
- -alnum element
- Non-alphanumeric characters to space.
- -basic element
- Convert superscripts and subscripts.
- -plain element
- Remove embedded mixed-content markup tags.
- -simple element
- Normalize accented letters; spell Greek letters.
- -author element
- Multi-step author cleanup.
- -prose element
- Text conversion to ASCII.
- -terms element
- Partition text at spaces.
- -words element
- Split at punctuation marks.
- -pairs element
- Adjacent informative words.
- -order element
- Rearrange words in sorted order.
- -reverse element
- Reverse words in string.
- -letters element
- Separate individual letters.
- -clauses element
- Break at phrase separators.
- -year element
- Extract first 4-digit year from string.
- -month element
- Match first month name and return a corresponding integer.
- -date element
- YYYY/MM/DD from -unit "PubDate" -date "*"
- -page element
- Get digits (and letters) of first page number.
- -auth element
- Change GenBank authors to Medline form.
- -initials element
- Parse initials from forename or given name.
- -jour element
- Clean up journal name punctuation.
- -trim element
- Remove extra spaces and leading zeros.
- -wct element
- Count number of -words in a string.
- -doi element
- Add https://doi.org/ prefix, URL encode.
- -translate element
- Substitute values with -transform table.
- -classify element
- Substring word or phrase matches to -aliases table.
- -replace
- Substitute text using regular expressions.
- -reg target
- Target expression.
- -exp replacement
- Replacement pattern.
- -revcomp
- Reverse complement nucleotide sequence.
- -nucleic
- Subrange determines forward or revcomp.
- -fasta
- Split sequence into blocks of 70 uppercase letters.
- -ncbi2na
- Expand ncbi2na to IUPAC. (May need to truncate result to
actual sequence length.)
- -ncbi4na
- Expand ncbi4na to IUPAC. (May need to truncate result to
actual sequence length.)
- -molwt
- Calculate molecular weight of peptide.
- -0-based element
- Zero-based.
- -1-based element
- One-based.
- -ucsc-based element
- Half-open.
- -insd arg ...
- Generate INSDSeq extraction commands. Print them if invoked
standalone; run them if invoked as part of a pipeline. Requires one or
more arguments, which may appear in the following order:
- Descriptor(s)
- INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
- Completeness
- complete/partial
- Feature(s)
- CDS/mRNA/...[,...]
- Qualifier(s)
- INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]
- -histogram
- Collects data for sort-uniq-count(1) on entire set
of records.
- -e2index [extras]
- Create Entrez index XML. extras (true or false; false by default)
indicates whether to index extra fields.
- -indices element
- Index normalized words.
- -article element
- Title positional index.
- -abstract element
- Abstract positional index.
- -paragraph element
- Index text paragraphs.
- -stemmed element
- Apply Porter2 algorithm.
- -head str
- Print before everything else.
- -tail str
- Print after everything else.
- -hd str
- Print before each record.
- -tl str
- Print after each record.
- -select condition
- Select record subset by conditions.
- -in filename
- File of identifiers to use for selection.
- -sort[-fwd] element
- Element to use as sort key.
- -sort-rev element
- Sort records in reverse order.
- -format fmt
- copy
- Fast block copy (still applies processing flags).
- compact
- Compress runs of spaces.
- flush
- Suppress line indentation.
- indent
- Indent according to nesting depth.
- expand
- Place each attribute on a separate line.
- -verify
- Report XML data integrity problems.
- -outline
- Display outline of XML structure.
- -synopsis
- Display individual XML paths.
- -contour [delimiter]
- Display XML paths to leaf nodes (delimited by / by default).
- -help
- Print usage information and some example argument
combinations.
- -examples
- Complete usage examples, involving additional Entrez Direct
tools.
- -unix
- Illustrate common Unix command arguments.
- -version
- Print version number.
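The sequence options listed above (-revcomp, -fasta, -ncbi2na) describe simple string transformations, and the note that -ncbi2na output "may need to truncate" follows from the encoding itself: NCBI2na packs four bases into each byte, so the expanded text is always a multiple of four letters long. The Python sketch below models these three operations under stated assumptions. It is not xtract's code; the function names are hypothetical, and taking raw bytes as the ncbi2na input is an assumption, since this page does not say what representation xtract expects.

```python
# Complement table for unambiguous bases only (IUPAC ambiguity codes are
# deliberately left out of this sketch).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def fasta_blocks(seq, width=70):
    """Split a sequence into blocks of `width` uppercase letters."""
    seq = seq.upper()
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def expand_ncbi2na(data, length=None):
    """Expand 2-bit packed bases (A=0, C=1, G=2, T=3), high-order bits first.

    Every byte yields four letters, so pass `length` to drop padding bases
    when the real sequence length is not a multiple of four.
    """
    letters = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            letters.append("ACGT"[(byte >> shift) & 3])
    seq = "".join(letters)
    return seq if length is None else seq[:length]


print(revcomp("AACG"))
print(fasta_blocks("acgt" * 20))
print(expand_ncbi2na(bytes([0x1B, 0x00]), length=5))
```

In the last call, two bytes expand to eight letters, and `length=5` trims the three padding bases.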
String constraints use case-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
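The two rules above matter in practice: string tests ignore case, and numeric tests compare integers, not text. The Python sketch below models a few of the constraints (-equals, -contains, -starts-with, -ends-with, -includes, -gt) to show the difference. The function names are hypothetical, the word-boundary rule used for -includes is one plausible reading of "match at word boundaries", and none of this is taken from xtract's source.

```python
import re


def equals(value, s):
    return value.lower() == s.lower()


def contains(value, s):
    return s.lower() in value.lower()


def starts_with(value, s):
    return value.lower().startswith(s.lower())


def ends_with(value, s):
    return value.lower().endswith(s.lower())


def includes(value, s):
    # Assumed reading: the match may not be flanked by word characters.
    pattern = r"(?<!\w)" + re.escape(s) + r"(?!\w)"
    return re.search(pattern, value, re.IGNORECASE) is not None


def gt(value, n):
    # Integer comparison: 12 is greater than 9, although "12" sorts before "9".
    return int(value) > int(n)


print(contains("Phospholipase A2", "LIPASE"))
print(includes("Phospholipase A2", "lipase"))
print(includes("acid lipase activity", "Lipase"))
print(gt("12", 9))
```

The second call differs from the first because "lipase" appears only as part of a longer word, which -contains accepts and -includes, as modeled here, does not.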
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
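As a rough model of the Object Count and Item Length selections mentioned above, the Python sketch below counts matching objects in a record and measures the text length of each matching item. The record, its tag names, and the function names are invented for illustration. Returning one length per item is an assumption about how Item Length behaves when several items match.

```python
import xml.etree.ElementTree as ET

# Hypothetical record used only for this illustration.
RECORD = ET.fromstring(
    "<Article><Title>Snake venom</Title>"
    "<Author>Smith</Author><Author>Jones</Author></Article>"
)


def object_count(record, tag):
    """Number of objects named `tag` in the record (compare "#Author")."""
    return len(list(record.iter(tag)))


def item_lengths(record, tag):
    """Text length of each item named `tag` (compare "%Title")."""
    return [len(node.text or "") for node in record.iter(tag)]


print(object_count(RECORD, "Author"))
print(item_lengths(RECORD, "Title"))
```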
archive-pmc(1),
archive-pubmed(1),
custom-index(1),
disambiguate-nucleotides(1),
download-ncbi-data(1),
ds2pme(1),
esample(1),
fetch-pmc(1),
fetch-pubmed(1),
find-in-gene(1),
fuse-segments(1),
gene2range(1),
hgvs2spdi(1),
index-extras(1),
index-pubmed(1),
pma2pme(1),
rchive(1),
snp2hgvs(1),
snp2tbl(1),
sort-uniq-count(1),
spdi2tbl(1),
tbl2prod(1),
transmute(1),
uniq-table(1),
xml2fsa(1),
xml2tbl(1),
xy-plot(1).