xtract

NAME

xtract - NCBI Entrez Direct XML conversion and transformation tool

SYNOPSIS

xtract [ -help] [ -strict] [ -mixed] [ -self] [ -accent] [ -ascii] [ -compress] [ -stops] [ -input filename] [ -transform filename] [ -aliases filename] [ -pattern expr] [ -group expr] [ -block expr] [ -subset expr] [ -path path] [ -if expr [constraint]] [ -unless expr [constraint]] [ -and condition] [ -or condition] [ -else] [ -position pos] [ -equals str] [ -contains str] [ -includes str] [ -is-within str] [ -starts-with str] [ -ends-with str] [ -is-not str] [ -is-before str] [ -is-after str] [ -matches str] [ -resembles str] [ -is-equal-to expr] [ -differs-from expr] [ -gt N] [ -ge N] [ -lt N] [ -le N] [ -eq N] [ -ne N] [ -ret str] [ -tab str] [ -sep str] [ -pfx str] [ -sfx str] [ -rst] [ -clr] [ -pfc str] [ -deq str] [ -def str] [ -lbl str] [ -set tag] [ -rec tag] [ -wrp tag] [ -enc tag] [ -plg str] [ -elg str] [ -pkg tag] [ -fwd str] [ -awd str] [ -element element] [ -first element] [ -last element] [ -backward element] [ -NAME] [ --STATS] [ -num element] [ -len element] [ -sum element] [ -acc element] [ -min element] [ -max element] [ -inc element] [ -dec element] [ -sub element] [ -avg element] [ -dev element] [ -med element] [ -mul element] [ -div element] [ -mod element] [ -bin element] [ -oct element] [ -hex element] [ -bit element] [ -pad element] [ -encode element] [ -upper element] [ -lower element] [ -chain element] [ -title element] [ -mirror element] [ -alnum element] [ -basic element] [ -plain element] [ -simple element] [ -author element] [ -prose element] [ -terms element] [ -words element] [ -pairs element] [ -order element] [ -reverse element] [ -letters element] [ -clauses element] [ -year element] [ -month element] [ -date element] [ -page element] [ -auth element] [ -initials element] [ -jour element] [ -trim element] [ -wct element] [ -doi element] [ -translate element] [ -classify element] [ -replace -reg target -exp replacement] [ -revcomp] [ -nucleic] [ -fasta] [ -ncbi2na] [ -ncbi4na] [ -molwt] [ -0-based element] [ -1-based element] [ -ucsc-based element] [ -insd arg ...] [ -histogram] [ -e2index [ extras]] [ -indices element] [ -article element] [ -abstract element] [ -paragraph element] [ -stemmed element] [ -head str] [ -tail str] [ -hd str] [ -tl str] [ -select condition] [ -in filename] [ -sort[-fwd] element] [ -sort-rev element] [ -format fmt [-unicode style]] [ -verify] [ -outline] [ -synopsis] [ -contour [delimiter]] [ -examples] [ -unix] [ -version]

Processing Flags

-strict: Remove HTML and MathML tags.

-mixed: Allow mixed content XML.

-self: Allow detection of empty self-closing tags.

-accent: Delete Unicode accents and diacritical marks.

-ascii: Convert Unicode to numeric HTML character entities.

-compress: Compress runs of spaces.

-stops: Retain stop words in selected phrases.

Data Source

-input filename: Read XML from file instead of standard input.

-transform filename: File of substitutions for -translate.

-aliases filename: Mappings file for -classify operation.
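
As a minimal sketch of -input, the following writes a small XML file and reads it by name rather than through standard input. The record and its PMID are invented for illustration, and the xtract call only runs where EDirect is installed.

```shell
# Write a tiny, made-up record to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
  </MedlineCitation>
</PubmedArticle>
EOF

# Read the XML from the named file instead of standard input.
if command -v xtract >/dev/null 2>&1; then
  xtract -input "$tmp" -pattern PubmedArticle -element MedlineCitation/PMID
fi
```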

Exploration Argument Hierarchy

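-pattern expr

-group expr

-block expr

-subset expr: Each of these four arguments names a record (or nested object) within the set. Using different argument names, from -pattern (outermost) down to -subset (innermost), allows command-line control of nested looping.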
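
The nested looping described above can be sketched with two levels: -pattern visits each record, and -block visits each Author inside it. The sample record (PMID and author names) is made up for illustration, and the command is skipped if xtract is not on the PATH.

```shell
# Invented sample record with two authors.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>RA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>JB</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # Outer loop: one row per PubmedArticle.
  # Inner loop: one field per Author, with initials and last name joined by a space.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -element MedlineCitation/PMID \
      -block Author -sep " " -element Initials,LastName
fi
```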

Path Navigation

-path path: Explore by list of adjacent object names.

Exploration Constructs

Object: DateRevised

Parent/Child: Book/AuthorList

Path: MedlineCitation/Article/Journal/JournalIssue/PubDate

Heterogeneous: "PubmedArticleSet/*"

Exhaustive: "History/**"

Nested: "*/Taxon"
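
As an illustration of the Nested construct, the sketch below visits Taxon objects that sit one level beneath the outer Taxon record. The XML is a heavily abridged, hand-written imitation of a taxonomy record, and the exact output depends on the installed xtract version, so none is shown.

```shell
# Abridged, hand-written taxonomy record: an outer Taxon with nested lineage Taxon objects.
xml='<TaxaSet>
  <Taxon>
    <TaxId>9606</TaxId>
    <ScientificName>Homo sapiens</ScientificName>
    <LineageEx>
      <Taxon><TaxId>9605</TaxId><ScientificName>Homo</ScientificName></Taxon>
      <Taxon><TaxId>9604</TaxId><ScientificName>Hominidae</ScientificName></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>'

if command -v xtract >/dev/null 2>&1; then
  # "*/Taxon" asks for Taxon objects nested under any intermediate parent.
  printf '%s\n' "$xml" |
    xtract -pattern Taxon -block "*/Taxon" -tab "\n" -element TaxId,ScientificName
fi
```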

Conditional Execution

-if expr [constraint]: Element (or @attribute) must exist and satisfy any specified constraint.

-unless expr [constraint]: Skip if element matches.

-and condition: Preceding and following tests must both pass.

-or condition: Any passing test suffices.

-else: Execute if conditional test failed.

-position pos: first/last/outer/inner/even/odd/all.
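
The conditional arguments can be combined roughly as follows. This is a sketch on two invented records, not output from a real query; the element names follow the PubMed examples used elsewhere in this page, and the commands run only if xtract is installed.

```shell
# Two made-up records that differ only in PMID and publication year.
xml='<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111</PMID>
      <Article><PubDate><Year>1998</Year></PubDate><Language>eng</Language></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222</PMID>
      <Article><PubDate><Year>2005</Year></PubDate><Language>eng</Language></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

if command -v xtract >/dev/null 2>&1; then
  # Both tests must pass: English language and year 2000 or later.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle \
      -if Language -equals eng -and Year -ge 2000 \
      -element MedlineCitation/PMID

  # Skip records whose year is before 2000.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -unless Year -lt 2000 \
      -element MedlineCitation/PMID

  # Print the PMID when the test passes, otherwise a fixed label.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -if Year -ge 2000 \
      -element MedlineCitation/PMID -else -lbl "before 2000"
fi
```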

String Constraints

-equals str: String must match exactly.

-contains str: Substring must be present.

-includes str: Substring must match at word boundaries.

-is-within str: String must be present.

-starts-with str: Substring must be at beginning.

-ends-with str: Substring must be at end.

-is-not str: String must not match.

-is-before str: First string < second string.

-is-after str: First string > second string.

-matches str: Matches without commas or semicolons.

-resembles str: Requires all words, but in any order.
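
A few of the string constraints, sketched against a single invented title. Each command prints the PMID only if its test passes; whether matching is case-sensitive is not stated above, so the constraint strings here use the same case as the data.

```shell
# Made-up record with a descriptive title.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>Phospholipase activity in rattlesnake venom</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # Substring anywhere in the title.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -if ArticleTitle -contains rattlesnake \
      -element MedlineCitation/PMID

  # Substring at the beginning of the title.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -if ArticleTitle -starts-with Phospholipase \
      -element MedlineCitation/PMID

  # Substring at the end of the title.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -if ArticleTitle -ends-with venom \
      -element MedlineCitation/PMID
fi
```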

Object Constraints

-is-equal-to expr: Object values must match.

-differs-from expr: Object values must differ.

Numeric Constraints

-gt N: Greater than.

-ge N: Greater than or equal to.

-lt N: Less than.

-le N: Less than or equal to.

-eq N: Equal to.

-ne N: Not equal to.
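
Numeric constraints are often paired with the "#" object count. The sketch below keeps records with fewer than six authors and prints the count next to the PMID; the record is invented, and the command is guarded so it only runs where xtract exists.

```shell
# Made-up record with two authors.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName></Author>
        <Author><LastName>Jones</LastName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # "#Author" is the number of Author objects in the current record.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -if "#Author" -lt 6 \
      -element MedlineCitation/PMID "#Author"
fi
```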

Format Customization

-ret str: Override line break between patterns.

-tab str: Replace tab character between fields.

-sep str: Separator between group members.

-pfx str: Prefix to print before group.

-sfx str: Suffix to print after group.

-rst: Reset -sep through -elg.

-clr: Clear queued tab separator.

-pfc str: Preface combines -clr and -pfx.

-deq str: Delete and replace queued tab separator.

-def str: Default placeholder for missing fields.

-lbl str: Insert arbitrary text.
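
A sketch of how several formatting arguments might be combined. The sample record deliberately lacks an ISSN so that -def has something to fill in; all values are invented, and exact spacing of the output may vary by xtract version.

```shell
# Made-up record with two authors and no ISSN element.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>RA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>JB</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # -def supplies a placeholder for the missing ISSN field.
  # Within each Author block, -pfx and -sfx bracket the group and -sep joins its members.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -def "-" -element MedlineCitation/PMID ISSN \
      -block Author -pfx "[" -sfx "]" -sep ", " -element LastName,Initials
fi
```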

XML Generation

-set tag: XML tag for entire set.

-rec tag: XML tag for each record.

-wrp tag: Wrap elements in XML object.

-enc tag: Encase instance in XML object.

-plg str: Prologue to print before instance.

-elg str: Epilogue to print after instance.

-pkg tag: Package subset in XML object.

-fwd str: Foreword to print before subset.

-awd str: Afterword to print after subset.
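
The XML generation arguments can be sketched as a way to emit a simplified record. The tag names Set, Rec, PMID, and Title are arbitrary choices for this example, the input record is invented, and placing -set and -rec ahead of -pattern is an assumption about argument order.

```shell
# Made-up input record.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>Phospholipase activity in rattlesnake venom</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # -set wraps the whole output, -rec wraps each record,
  # and -wrp wraps the value printed by the following -element.
  printf '%s\n' "$xml" |
    xtract -set Set -rec Rec -pattern PubmedArticle \
      -wrp PMID -element MedlineCitation/PMID \
      -wrp Title -element ArticleTitle
fi
```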

Element Selection

-element element: Print all items that match tag name.

-first element: Only print value of first item.

-last element: Only print value of last item.

-backward element: Print values in reverse order.

-NAME: Record value in named variable.

--STATS: Accumulate values into variable.
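
A brief sketch of element selection and variables. The first command prints only the first and last LastName; the second saves the PMID in a variable named PMID (any uppercase name would do) and repeats it on each author line. The record is invented.

```shell
# Made-up record with three authors.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <AuthorList>
        <Author><LastName>Smith</LastName><Initials>RA</Initials></Author>
        <Author><LastName>Jones</LastName><Initials>JB</Initials></Author>
        <Author><LastName>Lee</LastName><Initials>K</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # First and last author surnames only.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -first LastName -last LastName

  # Record the PMID once, then recall it with "&PMID" inside each Author block.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
      -block Author -element "&PMID" -sep " " -tab "\n" -element Initials,LastName
fi
```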

-element Constructs

Tag: Caption

Group: Initials,LastName

Parent/Child: MedlineCitation/PMID

Recursive: "**/Gene-commentary_accession"

Unrestricted: PubDate/*

Attribute: DescriptorName@MajorTopicYN

Range: MedlineDate[1:4]

Substring: "Title[phospholipase | rattlesnake]"

Object Count: "#Author"

Item Length: "%Title"

Element Depth: "^PMID"

Variable: "&NAME"
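
Several of the constructs listed above can appear in one -element call. The following sketch uses an invented record; the quoted arguments protect characters such as "#", "%", and "[" from the shell.

```shell
# Made-up record containing a MedlineDate, a MeSH descriptor, and two authors.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>Phospholipase activity in rattlesnake venom</ArticleTitle>
      <PubDate><MedlineDate>1998 Jan-Feb</MedlineDate></PubDate>
      <AuthorList>
        <Author><LastName>Smith</LastName></Author>
        <Author><LastName>Jones</LastName></Author>
      </AuthorList>
    </Article>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName MajorTopicYN="Y">Phospholipases</DescriptorName>
      </MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # Parent/Child, object count, item length, character range, and attribute value.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle \
      -element MedlineCitation/PMID "#Author" "%ArticleTitle" \
        "MedlineDate[1:4]" DescriptorName@MajorTopicYN
fi
```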

Special -element Operations

Parent Index: "+"

Object Name: "?"

Object Value: "~"

XML Subtree: "*"

Children: "$"

Attributes: "@"

ASN.1 Record: "."

JSON Record: "%"

Numeric Processing

-num element: Count.

-len element: Length.

-sum element: Sum.

-acc element: Accumulator.

-min element: Minimum.

-max element: Maximum.

-inc element: Increment.

-dec element: Decrement.

-sub element: Difference.

-avg element: Average.

-dev element: Deviation.

-med element: Median.

-mul element: Product.

-div element: Quotient.

-mod element: Remainder.

-bin element: Binary.

-oct element: Octal.

-hex element: Hexadecimal.

-bit element: Bit count.

-pad element: Zero-pad to eight digits.
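
The numeric operations take an element name in the same position as -element. A small sketch on invented data, where the Scores and Score tag names are hypothetical:

```shell
# Made-up record holding three numeric values.
xml='<Scores>
  <Score>3</Score>
  <Score>5</Score>
  <Score>10</Score>
</Scores>'

if command -v xtract >/dev/null 2>&1; then
  # Count, sum, minimum, maximum, and average of all Score values in the record.
  printf '%s\n' "$xml" |
    xtract -pattern Scores -num Score -sum Score -min Score -max Score -avg Score
fi
```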

Character Processing

-encode element: XML-encode <, >, &, ", and ' characters.

-upper element: Convert text to uppercase.

-lower element: Convert text to lowercase.

-chain element: Change spaces to underscores.

-title element: Capitalize initial letters of words.

-mirror element: Reverse order of letters.

-alnum element: Non-alphanumeric characters to space.
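
Character processing arguments are used like -element but transform the value before printing. A sketch with a hypothetical Rec/Name record:

```shell
# Made-up record with a single text value.
xml='<Rec><Name>phospholipase A2 inhibitor</Name></Rec>'

if command -v xtract >/dev/null 2>&1; then
  # Print the same value in uppercase, lowercase, title case,
  # and with spaces changed to underscores.
  printf '%s\n' "$xml" |
    xtract -pattern Rec -upper Name -lower Name -title Name -chain Name
fi
```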

String Processing

-basic element: Convert superscripts and subscripts.

-plain element: Remove embedded mixed-content markup tags.

-simple element: Normalize accented letters; spell Greek letters.

-author element: Multi-step author cleanup.

-prose element: Text conversion to ASCII.

Text Processing

-terms element: Partition text at spaces.

-words element: Split at punctuation marks.

-pairs element: Adjacent informative words.

-order element: Rearrange words in sorted order.

-reverse element: Reverse words in string.

-letters element: Separate individual letters.

-clauses element: Break at phrase separators.

Citation Functions

-year element: Extract first 4-digit year from string.

-month element: Match first month name and return a corresponding integer.

-date element: YYYY/MM/DD from -unit "PubDate" -date "*"

-page element: Get digits (and letters) of first page number.

-auth element: Change GenBank authors to Medline form.

-initials element: Parse initials from forename or given name.

-jour element: Clean up journal name punctuation.

-trim element: Remove extra spaces and leading zeros.

-wct element: Count number of -words in a string.

-doi element: Add https://doi.org/ prefix, URL encode.
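
As one example of the citation functions, -year can be pointed at every child of PubDate so that it finds a year whether the record uses Year or MedlineDate. The record below is invented.

```shell
# Made-up record whose date is stored as a MedlineDate string.
xml='<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <PubDate><MedlineDate>1998 Jan-Feb</MedlineDate></PubDate>
    </Article>
  </MedlineCitation>
</PubmedArticle>'

if command -v xtract >/dev/null 2>&1; then
  # "PubDate/*" selects all children of PubDate; -year extracts the first 4-digit year.
  printf '%s\n' "$xml" |
    xtract -pattern PubmedArticle -element MedlineCitation/PMID -year "PubDate/*"
fi
```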

Value Transformation

-translate element: Substitute values with -transform table.

-classify element: Substring word or phrase matches to -aliases table.
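
A sketch of -translate with a -transform table. It assumes the table is a two-column, tab-delimited file of original and replacement values and that -transform is given before -pattern; the Lang element and the table contents are hypothetical.

```shell
# Hypothetical substitution table: original value, tab, replacement.
map=$(mktemp)
printf 'eng\tEnglish\nfre\tFrench\n' > "$map"

# Made-up records carrying language codes.
xml='<Set>
  <Rec><Lang>eng</Lang></Rec>
  <Rec><Lang>fre</Lang></Rec>
</Set>'

if command -v xtract >/dev/null 2>&1; then
  # Replace each Lang value with its entry from the table.
  printf '%s\n' "$xml" |
    xtract -transform "$map" -pattern Rec -translate Lang
fi
```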

Regular Expression

-replace: Substitute text using regular expressions.

-reg target: Target expression.

-exp replacement: Replacement pattern.

Sequence Processing

-revcomp: Reverse complement nucleotide sequence.

-nucleic: Subrange determines forward or revcomp.

-fasta: Split sequence into blocks of 70 uppercase letters.

-ncbi2na: Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)

-ncbi4na: Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)

-molwt: Calculate molecular weight of peptide.

Sequence Coordinates

-0-based element: Zero-based.

-1-based element: One-based.

-ucsc-based element: Half-open.

Command Generator

-insd arg ...: Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order: