TregexPattern (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.trees.tregex.TregexPattern

All Implemented Interfaces: java.io.Serializable

Direct Known Subclasses: DescriptionPattern

public abstract class TregexPattern
extends java.lang.Object
implements java.io.Serializable

A TregexPattern is a regular expression-like pattern that is designed to match node configurations within a Tree where the nodes are labeled with symbols, rather than a character string. The Tregex language follows but slightly expands the tree pattern languages pioneered by tgrep and tgrep2. However, unlike these tree pattern matching systems, but like Unix grep, there is no pre-indexing of the data to be searched. Rather there is a linear scan through the trees where matches are sought. As a result, matching is slower, but a TregexPattern can be applied to an arbitrary set of trees at runtime in a processing pipeline without pre-indexing. TregexPattern instances can be matched against instances of the Tree class. The main(java.lang.String[]) method can be used to find matching nodes of a treebank from the command line.

Getting Started

Suppose we want to find all examples of subtrees where the label of the root of the subtree starts with MW and it has a child node with the label IN. That is, we want any subtree whose root is labeled MWV, MWN, etc. that has an IN child. The first thing to do is figure out what pattern to use. Since we want to match anything starting with MW, we use a regular expression pattern for the top node and then also check for the child. The pattern is: /^MW/ < IN. We then create a pattern, find matches in a given tree, and process those matches as follows:


   import edu.stanford.nlp.trees.Tree;
   import edu.stanford.nlp.trees.tregex.TregexMatcher;
   import edu.stanford.nlp.trees.tregex.TregexPattern;

   // Create a reusable pattern object
   TregexPattern patternMW = TregexPattern.compile("/^MW/ < IN");
   // Run the pattern on one particular tree (tree is a Tree obtained elsewhere)
   TregexMatcher matcher = patternMW.matcher(tree);
   // Iterate over all of the subtrees that matched
   while (matcher.findNextMatchingNode()) {
     Tree match = matcher.getMatch();
     // do what we want to do with the subtree
     match.pennPrint();
   }

Tregex pattern language

The currently supported node-node relations and their symbols are:

Symbol	Meaning
A << B	A dominates B
A >> B	A is dominated by B
A < B	A immediately dominates B
A > B	A is immediately dominated by B
A <<< B	A dominates B and B is a leaf
A $ B	A is a sister of B (and not equal to B)
A .. B	A precedes B
A . B	A immediately precedes B
A ,, B	A follows B
A , B	A immediately follows B
A <<, B	B is a leftmost descendant of A
A <<- B	B is a rightmost descendant of A
A >>, B	A is a leftmost descendant of B
A >>- B	A is a rightmost descendant of B
A <, B	B is the first child of A
A >, B	A is the first child of B
A <- B	B is the last child of A
A >- B	A is the last child of B
A <` B	B is the last child of A
A >` B	A is the last child of B
A <i B	B is the ith child of A (i > 0)
A >i B	A is the ith child of B (i > 0)
A <-i B	B is the ith-to-last child of A (i > 0)
A >-i B	A is the ith-to-last child of B (i > 0)
A <<<i B	B is the ith leaf of A
A <<<-i B	B is the ith-to-last leaf of A
A <: B	B is the only child of A
A >: B	A is the only child of B
A <<: B	A dominates B via an unbroken chain (length > 0) of unary local trees.
A >>: B	A is dominated by B via an unbroken chain (length > 0) of unary local trees.
A $++ B	A is a left sister of B (same as $.. for context-free trees)
A $-- B	A is a right sister of B (same as $,, for context-free trees)
A $+ B	A is the immediate left sister of B (same as $. for context-free trees)
A $- B	A is the immediate right sister of B (same as $, for context-free trees)
A $.. B	A is a sister of B and precedes B
A $,, B	A is a sister of B and follows B
A $. B	A is a sister of B and immediately precedes B
A $, B	A is a sister of B and immediately follows B
A <+(C) B	A dominates B via an unbroken chain of (zero or more) nodes matching description C
A >+(C) B	A is dominated by B via an unbroken chain of (zero or more) nodes matching description C
A .+(C) B	A precedes B via an unbroken chain of (zero or more) nodes matching description C
A ,+(C) B	A follows B via an unbroken chain of (zero or more) nodes matching description C
A <<# B	B is a head of phrase A
A >># B	A is a head of phrase B
A <# B	B is the immediate head of phrase A
A ># B	A is the immediate head of phrase B
A == B	A and B are the same node
A <= B	A and B are the same node or A is the parent of B
A : B	[this is a pattern-segmenting operator that places no constraints on the relationship between A and B]
A <... { B ; C ; ... }	A has exactly B, C, etc as its subtree, with no other children.

Label descriptions can be literal strings, which must match labels exactly, or regular expressions in regular expression bars: /regex/. Literal string matching proceeds as String equality. In order to prevent ambiguity with other Tregex symbols, ASCII symbols (ASCII range characters that are not letters or digits) are not allowed in literal strings, and literal strings cannot begin with ASCII digits. (That is, literals can be standard "identifiers" matching [a-zA-Z]([a-zA-Z0-9_-])* but may also include letters from other alphabets.) If you want to use other symbols, you can do so by using a regular expression instead of a literal string.
A disjunctive list of literal strings can be given separated by '|'. The special string '__' (two underscores) can be used to match any node. (WARNING!! Use of the '__' node description may seriously slow down search.) The special string '_ROOT_' matches only at the root of a tree. If a label description is preceded by '@', the label will match any node whose basicCategory matches the description. NB: A single '@' thus scopes over a disjunction specified by '|': @NP|VP means things with basic category NP or VP. The basicCategory is defined according to a Function mapping Strings to Strings, as provided by AbstractTreebankLanguagePack.getBasicCategoryFunction(). Note that Label description regular expressions are matched as find(), as in Perl/tgrep, not as matches(); you need to use ^ or $ to constrain matches to the ends of strings.
Chains of relations have a special non-associative semantics: In a chain of relations A op B op C ..., all relations are relative to the first node in the chain. For example, (S < VP < NP) means "an S over a VP and also over an NP". Nodes can be grouped using parentheses '(' and ')' as in S < (NP $++ VP) to match an S over an NP, where the NP has a VP as a right sister. So, if instead what you want is an S above a VP above an NP, you must write "S < (VP < NP)".
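
The find()-style matching of label regular expressions described above can be illustrated with plain java.util.regex. This is a minimal sketch, not Tregex code: the class LabelRegexDemo and its helper methods are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class LabelRegexDemo {
    // find() semantics: true if the regex occurs anywhere in the label.
    static boolean findsIn(String regex, String label) {
        return Pattern.compile(regex).matcher(label).find();
    }

    // Whole-string semantics, shown only for contrast.
    static boolean matchesWhole(String regex, String label) {
        return Pattern.compile(regex).matcher(label).matches();
    }

    public static void main(String[] args) {
        System.out.println(findsIn("NP", "WHNP-1"));      // true: NP occurs inside the label
        System.out.println(matchesWhole("NP", "WHNP-1")); // false: the whole label is not NP
        System.out.println(findsIn("^NP$", "WHNP-1"));    // false: anchors force an exact match
        System.out.println(findsIn("^NP", "NP-SBJ"));     // true: label starts with NP
    }
}
```

In Tregex terms, this is why /NP/ also matches labels such as WHNP-1, while /^NP$/ matches only the label NP.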

Notes on relations

Node B "follows" node A if B or one of its ancestors is a right sibling of A or one of its ancestors. Node B "immediately follows" node A if B follows A and there is no node C such that B follows C and C follows A.
Node A dominates B through an unbroken chain of unary local trees only if A is also unary. (A (B)) is a valid example that matches A <<: B
When specifying that nodes are dominated via an unbroken chain of nodes matching a description C, the description C cannot be a full Tregex expression, but only an expression specifying the name of the node. Negation of this description is allowed.
== has the same precedence as the other relations, so the expression A << B == A << C associates as (((A << B) == A) << C), not as ((A << B) == (A << C)). (Both expressions are equivalent, of course, but this is just an example.)

Boolean relational operators

Relations can be combined using the '&' and '|' operators, negated with the '!' operator, and made optional with the '?' operator. Thus (NP < NN | < NNS) will match an NP node dominating either an NN or an NNS. (NP > S & $++ VP) matches an NP that is both under an S and has a VP as a right sister. Expressions stop evaluating as soon as the result is known. For example, if the pattern is NP=a | NNP=b and the NP matches, then variable b will not be assigned even if there is an NNP in the tree. Relations can be grouped using brackets '[' and ']'. So the expression

NP [< NN | < NNS] & > S

matches an NP that (1) dominates either an NN or an NNS, and (2) is under an S. Without brackets, & takes precedence over |, and equivalent operators are left-associative. Also note that & is the default combining operator if the operator is omitted in a chain of relations, so that the two patterns are equivalent:

(S < VP < NP)
(S < VP & < NP)
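
To make the chain semantics concrete, the sketch below models the two readings on a toy tree. It does not use the Tregex engine: Node is a hypothetical minimal tree class, and flatChain and nested are hand-written stand-ins for (S < VP < NP) and (S < (VP < NP)) respectively.

```java
import java.util.Arrays;
import java.util.List;

class ChainSemanticsDemo {
    // Hypothetical minimal tree node; not the Stanford Tree class.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        boolean hasChild(String childLabel) {
            for (Node c : children) {
                if (c.label.equals(childLabel)) return true;
            }
            return false;
        }
    }

    // Models (S < VP < NP): both relations are tested against the S node.
    static boolean flatChain(Node s) {
        return s.label.equals("S") && s.hasChild("VP") && s.hasChild("NP");
    }

    // Models (S < (VP < NP)): the NP must be a child of the VP.
    static boolean nested(Node s) {
        if (!s.label.equals("S")) return false;
        for (Node c : s.children) {
            if (c.label.equals("VP") && c.hasChild("NP")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Node sisters = new Node("S", new Node("NP"), new Node("VP", new Node("VBD")));
        Node stacked = new Node("S", new Node("VP", new Node("NP")));
        System.out.println(flatChain(sisters) + " " + nested(sisters)); // true false
        System.out.println(flatChain(stacked) + " " + nested(stacked)); // false true
    }
}
```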

As another example, (VP < VV | < NP % NP) can be written explicitly as (VP [< VV | [< NP & % NP] ] )
Relations can be negated with the '!' operator, in which case the expression will match only if there is no node satisfying the relation. For example (NP !< NNP) matches only NPs not dominating an NNP. Label descriptions can also be negated with '!': (NP < !NNP|NNS) matches NPs dominating some node that is not an NNP or an NNS.
Relations can be made optional with the '?' operator. This way the expression will match even if the optional relation is not satisfied. This is useful when used together with node naming (see below).
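
The difference between negating a relation and negating a label description is easy to confuse, so here is a toy model of the two patterns (NP !< NNP) and (NP < !NNP|NNS). This is only a sketch: Node is a hypothetical minimal tree class and the two methods are hand-written equivalents, not calls into Tregex.

```java
import java.util.Arrays;
import java.util.List;

class NegationDemo {
    // Hypothetical minimal tree node; not the Stanford Tree class.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Models (NP !< NNP): true only if no child is labeled NNP.
    static boolean negatedRelation(Node np) {
        if (!np.label.equals("NP")) return false;
        for (Node c : np.children) {
            if (c.label.equals("NNP")) return false;
        }
        return true;
    }

    // Models (NP < !NNP|NNS): true if some child is neither NNP nor NNS.
    static boolean negatedLabel(Node np) {
        if (!np.label.equals("NP")) return false;
        for (Node c : np.children) {
            if (!c.label.equals("NNP") && !c.label.equals("NNS")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Node detPlusName = new Node("NP", new Node("DT"), new Node("NNP"));
        Node barePlural = new Node("NP", new Node("NNS"));
        System.out.println(negatedRelation(detPlusName) + " " + negatedLabel(detPlusName)); // false true
        System.out.println(negatedRelation(barePlural) + " " + negatedLabel(barePlural));   // true false
    }
}
```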

Basic Categories

In order to consider only the "basic category" of a tree label, i.e. to ignore functional tags or other annotations on the label, prefix that node's description with the @ symbol. For example: (@NP < @/NN.?/). This can only be used for individual nodes; if you want all nodes to use the basic category, it would be more efficient to use a TreeNormalizer to remove functional tags before passing the tree to the TregexPattern.

Segmenting patterns

The ":" operator allows you to segment a pattern into two pieces. This can simplify your pattern writing. For example, the pattern

S : NP

matches only those S nodes in trees that also have an NP node.

Naming nodes

Nodes can be given names (a.k.a. handles) using '='. A named node will be stored in a map that maps names to nodes so that if a match is found, the node corresponding to the named node can be extracted from the map. For example, (NP < NNP=name) will match an NP dominating an NNP, and after a match is found the map can be queried with the name to retrieve the matched node using TregexMatcher.getNode(String o) with (String) argument "name" (not "=name"). Note that you are not allowed to name a node that is under the scope of a negation operator (the semantics would be unclear, since you can't store a node that never gets matched). Trying to do so will cause a TregexParseException to be thrown. Named nodes can be put within the scope of an optionality operator. Named nodes that refer back to previous named nodes need not have a node description -- this is known as "backreferencing". In this case, the expression will match only when all instances of the same name get matched to the same tree node. For example: the pattern

(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)

matches only an NP dominating exactly the four node sequence NP , NP , -- the mother NP cannot have any other daughters. Multiple backreferences are allowed. If the node with no node description does not refer to a previously named node, there will be no error; the expression simply will not match anything. Another way to refer to previously named nodes is with the "link" symbol: '~'. A link is like a backreference, except that instead of having to be equal to the referred-to node, the current node only has to match the label of the referred-to node. A link cannot have a node description, i.e. the '~' symbol must immediately follow a relation symbol.
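
The contrast between a backreference and a link can be sketched as node identity versus label equality. The code below is an illustrative toy only: Node and both methods are hypothetical, and treating "match the label" as plain string equality is an assumption made for this sketch.

```java
class BackrefVsLinkDemo {
    // Hypothetical minimal node carrying only a label.
    static final class Node {
        final String label;

        Node(String label) {
            this.label = label;
        }
    }

    // Backreference (=name): the candidate must be the very same node object.
    static boolean backreferenceMatches(Node named, Node candidate) {
        return named == candidate;
    }

    // Link (~name): the candidate only needs the same label as the named node.
    // Assumption: label matching is modeled here as String equality.
    static boolean linkMatches(Node named, Node candidate) {
        return named.label.equals(candidate.label);
    }

    public static void main(String[] args) {
        Node firstComma = new Node(",");
        Node secondComma = new Node(",");
        System.out.println(backreferenceMatches(firstComma, firstComma));  // true
        System.out.println(backreferenceMatches(firstComma, secondComma)); // false
        System.out.println(linkMatches(firstComma, secondComma));          // true
    }
}
```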

Customizing headship and basic categories

The HeadFinder used to determine heads for the head relations <#, >#, <<#, and >>#, and also the Function mapping from labels to Basic Category tags can be chosen by using a TregexPatternCompiler.

Variable Groups

If you write a node description using a regular expression, you can assign its matching groups to variable names. If more than one node has a group assigned to the same variable name, then matching will only occur when all such groups capture the same string. This is useful for enforcing coindexation constraints. The syntax is

/ <regex-stuff> /#<group-number>%<variable-name>

For example, the pattern (designed for Penn Treebank trees)

@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))

will match only when the WH- node under the SBAR is coindexed with the trace node that gets the name empty.
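
The coindexation constraint can be approximated outside Tregex with ordinary capture groups. The following is a minimal sketch under stated assumptions: VariableGroupDemo, bind and coindexed are hypothetical helpers, and the map stands in for whatever bookkeeping Tregex does internally. The two regular expressions are the ones from the pattern above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableGroupDemo {
    // Binds capture group `group` of `regex` on `label` to `variable`.
    // Returns false if the regex does not match or the variable already
    // holds a different string.
    static boolean bind(Map<String, String> vars, String regex, int group,
                        String variable, String label) {
        Matcher m = Pattern.compile(regex).matcher(label);
        if (!m.find()) return false;
        String captured = m.group(group);
        String previous = vars.get(variable);
        if (previous == null) {
            vars.put(variable, captured);
            return true;
        }
        return previous.equals(captured);
    }

    // True when the WH label and the trace label carry the same index.
    static boolean coindexed(String whLabel, String traceLabel) {
        Map<String, String> vars = new HashMap<>();
        return bind(vars, "^WH.*-([0-9]+)$", 1, "index", whLabel)
            && bind(vars, "^\\*T\\*-([0-9]+)$", 1, "index", traceLabel);
    }

    public static void main(String[] args) {
        System.out.println(coindexed("WHNP-1", "*T*-1")); // true
        System.out.println(coindexed("WHNP-1", "*T*-2")); // false
    }
}
```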

Current known bugs/shortcomings:

Tregex does not support disjunctions at the root level. For example, the pattern A | B will not work.
Using multiple variable strings in one regex may not necessarily work. For example, suppose the first two regex patterns are /(.*)/#1%foo and /(.*)/#1%bar. You might then want to write a pattern that matches the concatenation of these patterns, /(.*)(.*)/#1%foo#2%bar, but that will not work.

Author: Galen Andrew, Roger Levy (rog@csli.stanford.edu), Anna Rafferty (filter mode), John Bauer (extensively tested and bugfixed)
See Also: Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`TregexPattern.TRegexTreeReaderFactory`

Method Summary

Modifier and Type	Method and Description
`static TregexPattern`	`compile(java.lang.String tregex)` Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction.
`static void`	`main(java.lang.String[] args)` Prints out all matches of a tree pattern on each tree in the path.
`TregexMatcher`	`matcher(Tree t)` Get a `TregexMatcher` for this pattern on this tree.
`TregexMatcher`	`matcher(Tree t, HeadFinder headFinder)` Get a `TregexMatcher` for this pattern on this tree.
`java.lang.String`	`pattern()`
`void`	`prettyPrint()` Print a multi-line representation of the pattern illustrating its syntax to System.out.
`void`	`prettyPrint(java.io.PrintStream ps)` Print a multi-line representation of the pattern illustrating its syntax.
`void`	`prettyPrint(java.io.PrintWriter pw)` Print a multi-line representation of the pattern illustrating its syntax.
`static TregexPattern`	`safeCompile(java.lang.String tregex, boolean verbose)` Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction.
`abstract java.lang.String`	`toString()`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Method Detail
  - matcher
```
public TregexMatcher matcher(Tree t)
```
    Get a TregexMatcher for this pattern on this tree.
    
    Parameters:
    
    t - a tree to match on
    
    Returns:
    
    a TregexMatcher
  - matcher
```
public TregexMatcher matcher(Tree t,
                             HeadFinder headFinder)
```
    Get a TregexMatcher for this pattern on this tree. Any Relations which use heads of trees should use the provided HeadFinder.
    
    Parameters:
    
    t - a tree to match on
    
    headFinder - a HeadFinder to use when matching
    
    Returns:
    
    a TregexMatcher
  - compile
```
public static TregexPattern compile(java.lang.String tregex)
```
    Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. If you want to use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object.
    
    Parameters:
    
    tregex - the pattern string
    
    Returns:
    
    a TregexPattern for the string.
    
    Throws:
    
    TregexParseException - if the string does not parse
  - safeCompile
```
public static TregexPattern safeCompile(java.lang.String tregex,
                                        boolean verbose)
```
    Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. If you want to use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object. Rather than throwing an exception when the string does not parse, simply returns null.
    
    Parameters:
    
    tregex - the pattern string
    
    verbose - whether to log errors when the string doesn't parse
    
    Returns:
    
    a TregexPattern for the string, or null if the string does not parse.
  - pattern
```
public java.lang.String pattern()
```
  - toString
```
public abstract java.lang.String toString()
```
    Overrides:
    
    toString in class java.lang.Object
    
    Returns:
    
    A single-line string representation of the pattern
  - prettyPrint
```
public void prettyPrint(java.io.PrintWriter pw)
```
    Print a multi-line representation of the pattern illustrating its syntax.
  - prettyPrint
```
public void prettyPrint(java.io.PrintStream ps)
```
    Print a multi-line representation of the pattern illustrating its syntax.
  - prettyPrint
```
public void prettyPrint()
```
    Print a multi-line representation of the pattern illustrating its syntax to System.out.
  - main
```
public static void main(java.lang.String[] args)
                 throws java.io.IOException
```
    Prints out all matches of a tree pattern on each tree in the path.
    Usage: java edu.stanford.nlp.trees.tregex.TregexPattern [[-TCwfosnu] [-filter] [-h <node-name>]]* pattern filepath
    Arguments:
    - pattern: the tree pattern, which optionally names some set of nodes (i.e., gives each the "handle" =name, for some arbitrary string "name")
    - filepath: the path to files with trees. If this is a directory, there will be recursive descent and the pattern will be run on all files beneath the specified directory.
    Options:
    - -C suppresses printing of matches, so only the number of matches is printed.
    - -w causes ONLY the whole of a tree that matches to be printed.
    - -W causes the whole of a tree that matches to be printed ALSO.
    - -f causes the filename to be printed.
    - -i <filename> causes the pattern to be matched to be read from <filename> rather than the command line. Don't specify a pattern when this option is used.
    - -o Specifies that each tree node can be reported only once as the root of a match (by default a node will be printed once for every way the pattern matches).
    - -s causes trees to be printed all on one line (by default they are pretty printed).
    - -n causes the number of the tree in which the match was found to be printed before every match.
    - -u causes only the label of each matching node to be printed, not complete subtrees.
    - -t causes only the yield (terminal words) of the selected node to be printed (or the yield of the whole tree, if the -w option is used).
    - -encoding <charset_encoding> specifies the character encoding of the trees.
    - -h <node-handle> If a -h option is given, the root tree node will not be printed. Instead, for each node-handle specified, the node matched and given that handle will be printed. Multiple nodes can be printed by using the -h option multiple times on a single command line.
    - -hf <headfinder-class-name> use the specified HeadFinder class to determine headship relations.
    - -hfArg <string> pass a string argument in to the HeadFinder class's constructor. -hfArg can be used multiple times to pass in multiple arguments.
    - -trf <TreeReaderFactory-class-name> use the specified TreeReaderFactory class to read trees from files.
    - -e <extension> Only attempt to read files with the given extension. If not provided, will attempt to read all files.
    - -v print every tree that contains no matches of the specified pattern, but print no matches to the pattern.
    - -x Instead of the matched subtree, print the matched subtree's identifying number as defined in tgrep2: a unique identifier for the subtree, in the form s:n, where s is an integer specifying the sentence number in the corpus (starting with 1), and n is an integer giving the order in which the node is encountered in a depth-first search, starting with 1 at the top node of the sentence tree.
    - -extract <tree-file> extracts the subtree s:n specified by code from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
    - -extractFile <code-file> <tree-file> extracts every subtree specified by the subtree codes in code-file, which must appear exactly one per line, from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
    - -filter causes this to act as a filter, reading tree input from stdin
    - -T causes all trees to be printed as processed (for debugging purposes). Otherwise only matching nodes are printed.
    - -macros <filename> file of macro substitutions to use. Each line of the file is tab-separated: the original, a tab, then the replacement.
    Throws:
    
    java.io.IOException
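
The s:n subtree identifiers described for the -x option can be pictured with a small numbering routine. This is a sketch only, assuming that "depth-first" means a preorder traversal (parent before children, children left to right); Node and SubtreeCodeDemo are hypothetical and unrelated to the actual Tregex implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubtreeCodeDemo {
    // Hypothetical minimal tree node; not the Stanford Tree class.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Records "s:n label" for node and its descendants in preorder.
    // Returns the next unused node number.
    static int number(Node node, int sentence, int next, List<String> out) {
        out.add(sentence + ":" + next + " " + node.label);
        int n = next + 1;
        for (Node c : node.children) {
            n = number(c, sentence, n, out);
        }
        return n;
    }

    // Numbers every node of one sentence tree, starting with 1 at the root.
    static List<String> codes(Node root, int sentence) {
        List<String> out = new ArrayList<>();
        number(root, sentence, 1, out);
        return out;
    }

    public static void main(String[] args) {
        Node tree = new Node("S",
            new Node("NP", new Node("NNP")),
            new Node("VP", new Node("VBD")));
        // Prints 1:1 S, 1:2 NP, 1:3 NNP, 1:4 VP, 1:5 VBD, one per line.
        for (String code : codes(tree, 1)) {
            System.out.println(code);
        }
    }
}
```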
