public abstract class TregexPattern
extends java.lang.Object
implements java.io.Serializable
A TregexPattern is a pattern for matching node configurations in trees, in the style of tgrep and tgrep2. However, unlike these tree pattern matching systems, but like Unix grep, there is no pre-indexing of the data to be searched. Rather, there is a linear scan through the trees where matches are sought. As a result, matching is slower than an indexed search, but a TregexPattern can be applied to an arbitrary set of trees at runtime in a processing pipeline without pre-indexing.
TregexPattern instances can be matched against instances of the Tree class. The main(java.lang.String[]) method can be used to find matching nodes of a treebank from the command line.
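As an example, take the pattern /^MW/ < IN, which matches any node whose label begins with MW and which immediately dominates an IN node.
We create a pattern object from it, find matches in a given tree, and process those matches as follows: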
// Create a reusable pattern object
TregexPattern patternMW = TregexPattern.compile("/^MW/ < IN");
// Run the pattern on one particular tree
TregexMatcher matcher = patternMW.matcher(tree);
// Iterate over all of the subtrees that matched
while (matcher.findNextMatchingNode()) {
    Tree match = matcher.getMatch();
    // do what we want to do with the subtree
    match.pennPrint();
}
Symbol | Meaning |
---|---|
A << B | A dominates B |
A >> B | A is dominated by B |
A < B | A immediately dominates B |
A > B | A is immediately dominated by B |
A <<< B | A dominates B and B is a leaf |
A $ B | A is a sister of B (and not equal to B) |
A .. B | A precedes B |
A . B | A immediately precedes B |
A ,, B | A follows B |
A , B | A immediately follows B |
A <<, B | B is a leftmost descendant of A |
A <<- B | B is a rightmost descendant of A |
A >>, B | A is a leftmost descendant of B |
A >>- B | A is a rightmost descendant of B |
A <, B | B is the first child of A |
A >, B | A is the first child of B |
A <- B | B is the last child of A |
A >- B | A is the last child of B |
A <` B | B is the last child of A |
A >` B | A is the last child of B |
A <i B | B is the ith child of A (i > 0) |
A >i B | A is the ith child of B (i > 0) |
A <-i B | B is the ith-to-last child of A (i > 0) |
A >-i B | A is the ith-to-last child of B (i > 0) |
A <<<i B | B is the ith leaf of A |
A <<<-i B | B is the ith-to-last leaf of A |
A <: B | B is the only child of A |
A >: B | A is the only child of B |
A <<: B | A dominates B via an unbroken chain (length > 0) of unary local trees. |
A >>: B | A is dominated by B via an unbroken chain (length > 0) of unary local trees. |
A $++ B | A is a left sister of B (same as $.. for context-free trees) |
A $-- B | A is a right sister of B (same as $,, for context-free trees) |
A $+ B | A is the immediate left sister of B (same as $. for context-free trees) |
A $- B | A is the immediate right sister of B (same as $, for context-free trees) |
A $.. B | A is a sister of B and precedes B |
A $,, B | A is a sister of B and follows B |
A $. B | A is a sister of B and immediately precedes B |
A $, B | A is a sister of B and immediately follows B |
A <+(C) B | A dominates B via an unbroken chain of (zero or more) nodes matching description C |
A >+(C) B | A is dominated by B via an unbroken chain of (zero or more) nodes matching description C |
A .+(C) B | A precedes B via an unbroken chain of (zero or more) nodes matching description C |
A ,+(C) B | A follows B via an unbroken chain of (zero or more) nodes matching description C |
A <<# B | B is a head of phrase A |
A >># B | A is a head of phrase B |
A <# B | B is the immediate head of phrase A |
A ># B | A is the immediate head of phrase B |
A == B | A and B are the same node |
A <= B | A and B are the same node or A is the parent of B |
A : B | [this is a pattern-segmenting operator that places no constraints on the relationship between A and B] |
A <... { B ; C ; ... } | A has exactly B, C, etc as its subtree, with no other children. |
The basic category of a label is computed by a function such as the one provided by AbstractTreebankLanguagePack.getBasicCategoryFunction().
Note that label description regular expressions are matched as find(), as in Perl/tgrep, not as matches(); you need to use ^ or $ to constrain matches to the ends of strings.
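The difference between find() and matches() semantics can be seen with java.util.regex alone. The following is a minimal sketch and not Tregex's own matching code; the class name, method names, and example labels are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: contrasts find()-style and matches()-style label matching.
final class FindVsMatchesDemo {

    // find()-style: succeeds if the regex matches anywhere inside the label.
    static boolean findStyle(String regex, String label) {
        return Pattern.compile(regex).matcher(label).find();
    }

    // matches()-style: succeeds only if the regex matches the whole label.
    static boolean matchesStyle(String regex, String label) {
        return Pattern.compile(regex).matcher(label).matches();
    }

    public static void main(String[] args) {
        System.out.println(findStyle("NP", "WHNP"));     // true: "NP" occurs inside "WHNP"
        System.out.println(matchesStyle("NP", "WHNP"));  // false: the whole label is not "NP"
        System.out.println(findStyle("^NP$", "WHNP"));   // false: anchors force a full-label match
        System.out.println(findStyle("^NP", "NP-SBJ"));  // true: label starts with "NP"
    }
}
```

Under find() semantics, an unanchored description such as /NP/ would therefore also hit labels like WHNP, which is why anchoring with ^ and $ matters.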
In a chain of relations, every relation is relative to the first node in the chain. For example, (S < VP < NP) means "an S over a VP and also over an NP". Nodes can be grouped using parentheses '(' and ')', as in S < (NP $++ VP) to match an S over an NP, where the NP has a VP as a right sister. So, if instead what you want is an S above a VP above an NP, you must write "S < (VP < NP)".
Node B "follows" node A if B or one of its ancestors is a right sibling of A or one of its ancestors. Node B "immediately follows" node A if B follows A and there is no node C such that B follows C and C follows A.
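The definition of "follows" can be checked directly on a toy tree. The sketch below is an illustration of the definition only, not Tregex's implementation; the Node class and the follows method are hypothetical, and nodes are compared by identity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: implements the "follows" definition on a minimal tree type.
final class FollowsDemo {

    static final class Node {
        final String label;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                k.parent = this;
                children.add(k);
            }
        }
    }

    // True if b, or one of b's ancestors, is a right sibling of a or of one of a's ancestors.
    static boolean follows(Node b, Node a) {
        for (Node x = b; x != null && x.parent != null; x = x.parent) {
            for (Node y = a; y != null && y.parent != null; y = y.parent) {
                if (x != y && x.parent == y.parent
                        && x.parent.children.indexOf(x) > y.parent.children.indexOf(y)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // (S (NP (DT) (NN)) (VP (VBD)))
        Node dt = new Node("DT");
        Node nn = new Node("NN");
        Node np = new Node("NP", dt, nn);
        Node vbd = new Node("VBD");
        Node vp = new Node("VP", vbd);
        Node s = new Node("S", np, vp);

        System.out.println(follows(vbd, dt)); // true: VP (ancestor of VBD) is a right sibling of NP (ancestor of DT)
        System.out.println(follows(dt, vbd)); // false
        System.out.println(follows(nn, np));  // false: NP dominates NN, so neither follows the other
    }
}
```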
Node A dominates B through an unbroken chain of unary local trees only if A is also unary. (A (B)) is a valid example that matches A <<: B.
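To make the unary-chain condition concrete, here is a small sketch on a toy tree type. It is an illustration of the stated rule, not Tregex's implementation; the Node class and the unaryDominates method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: checks domination via an unbroken chain of unary local trees.
final class UnaryChainDemo {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    // Walk down from a while each node has exactly one child; succeed if b is reached.
    // The chain has length > 0, so a node never unary-dominates itself.
    static boolean unaryDominates(Node a, Node b) {
        Node cur = a;
        while (cur.children.size() == 1) {
            cur = cur.children.get(0);
            if (cur == b) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node c = new Node("C");
        Node b = new Node("B", c);
        Node a = new Node("A", b);                  // (A (B (C)))
        System.out.println(unaryDominates(a, b));   // true
        System.out.println(unaryDominates(a, c));   // true: chain A -> B -> C is all unary

        Node b2 = new Node("B");
        Node d = new Node("D");
        Node a2 = new Node("A", b2, d);             // (A (B) (D))
        System.out.println(unaryDominates(a2, b2)); // false: A is not unary
    }
}
```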
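When a relation specifies an unbroken chain of nodes matching a description C (as in <+(C), >+(C), .+(C), and ,+(C)), the description C cannot be a full Tregex expression, but only an expression specifying the name of the node. Negation of this description is allowed.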
== has the same precedence as the other relations, so the expression A << B == A << C associates as (((A << B) == A) << C), not as ((A << B) == (A << C)). (Both expressions are equivalent, of course, but this is just an example.)
Relations can be combined using the '&' and '|' operators, negated with the '!' operator, and made optional with the '?' operator. Thus (NP < NN | < NNS) will match an NP node dominating either an NN or an NNS. (NP > S & $++ VP) matches an NP that is both under an S and has a VP as a right sister.
Expressions stop evaluating as soon as the result is known. For example, if the pattern is NP=a | NNP=b and the NP matches, then variable b will not be assigned even if there is an NNP in the tree.
Relations can be grouped using brackets '[' and ']'. So the expression
NP [< NN | < NNS] & > S
matches an NP that (1) dominates either an NN or an NNS, and (2) is under an S. Without brackets, & takes precedence over |, and equivalent operators are left-associative. Also note that & is the default combining operator if the operator is omitted in a chain of relations, so that the following two patterns are equivalent:
(S < VP < NP)
(S < VP & < NP)
As another example, (VP < VV | < NP % NP) can be written explicitly as (VP [< VV | [< NP & % NP] ] ).
Relations can be negated with the '!' operator. For example, (NP !< NNP) matches only NPs not dominating an NNP. Label descriptions can also be negated with '!': (NP < !NNP|NNS) matches NPs dominating some node that is not an NNP or an NNS.
To consider only the basic category of a node's label, prefix that node's description with the @ symbol. For example: (@NP < @/NN.?/). This can only be used for individual nodes; if you want all nodes to use the basic category, it would be more efficient to use a TreeNormalizer to remove functional tags before passing the tree to the TregexPattern.
The pattern S : NP matches only those S nodes in trees that also have an NP node.
Nodes can be given names using '='; a named node is stored in a map from names to matched nodes. For example, (NP < NNP=name) will match an NP dominating an NNP, and after a match is found, the map can be queried with the name to retrieve the matched node using TregexMatcher.getNode(String o) with (String) argument "name" (not "=name").
Note that you are not allowed to name a node that is under the scope of a negation operator (the semantics would be unclear, since you can't store a node that never gets matched to). Trying to do so will cause a TregexParseException to be thrown. Named nodes can be put within the scope of an optionality operator.
Named nodes that refer back to previous named nodes need not have a node
description -- this is known as "backreferencing". In this case, the expression
will match only when all instances of the same name get matched to the same tree node.
For example, the pattern
(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
matches only an NP dominating exactly the four-node sequence NP , NP , (the mother NP cannot have any other daughters). Multiple backreferences are allowed. If the node with no node description does not refer to a previously named node, there will be no error; the expression simply will not match anything.
Another way to refer to previously named nodes is with the "link" symbol: '~'.
A link is like a backreference, except that instead of having to be equal to the
referred node, the current node only has to match the label of the referred to node.
A link cannot have a node description, i.e. the '~' symbol must immediately follow a
relation symbol.
The HeadFinder used to determine heads for the head relations <#, >#, <<#, and >>#, and also the Function mapping from labels to Basic Category tags, can be chosen by using a TregexPatternCompiler.
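A capturing group in a regular expression node description can be bound to a variable name. When several groups are bound to the same variable name, the pattern matches only if they all capture the same string. The syntax is
/ <regex-stuff> /#<group-number>%<variable-name>
For example, the pattern (designed for Penn Treebank trees)
@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
will match only when the WH- node under the SBAR is coindexed with the trace node that gets the name empty.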
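The coindexation check behind the example pattern can be sketched with java.util.regex alone. This is an illustration of the idea (two capture groups that must hold the same string), not Tregex's implementation; the class name, the helper methods, and the sample labels are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: two labels are "coindexed" when their captured indices are equal.
final class CoindexDemo {

    // Returns the text captured by the given group, or null if the regex is not found in the label.
    static String capture(String regex, int group, String label) {
        Matcher m = Pattern.compile(regex).matcher(label);
        return m.find() ? m.group(group) : null;
    }

    // Uses the same two regular expressions as the example pattern, each binding group 1.
    static boolean coindexed(String whLabel, String traceLabel) {
        String whIndex = capture("^WH.*-([0-9]+)$", 1, whLabel);
        String traceIndex = capture("^\\*T\\*-([0-9]+)$", 1, traceLabel);
        return whIndex != null && whIndex.equals(traceIndex);
    }

    public static void main(String[] args) {
        System.out.println(coindexed("WHNP-1", "*T*-1")); // true: both capture "1"
        System.out.println(coindexed("WHNP-1", "*T*-2")); // false: "1" vs "2"
        System.out.println(coindexed("WHNP", "*T*-1"));   // false: WH label carries no index
    }
}
```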
The pattern A | B will not work.
Suppose you have two node descriptions, /(.*)/#1%foo and /(.*)/#1%bar. You might then want to write a pattern that matches the concatenation of these patterns, /(.*)(.*)/#1%foo#2%bar, but that will not work.
Modifier and Type | Class and Description |
---|---|
static class | TregexPattern.TRegexTreeReaderFactory |
Modifier and Type | Method and Description |
---|---|
static TregexPattern | compile(java.lang.String tregex) Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. |
static void | main(java.lang.String[] args) Prints out all matches of a tree pattern on each tree in the path. |
TregexMatcher | matcher(Tree t) Get a TregexMatcher for this pattern on this tree. |
TregexMatcher | matcher(Tree t, HeadFinder headFinder) Get a TregexMatcher for this pattern on this tree. |
java.lang.String | pattern() |
void | prettyPrint() Print a multi-line representation of the pattern illustrating its syntax to System.out. |
void | prettyPrint(java.io.PrintStream ps) Print a multi-line representation of the pattern illustrating its syntax. |
void | prettyPrint(java.io.PrintWriter pw) Print a multi-line representation of the pattern illustrating its syntax. |
static TregexPattern | safeCompile(java.lang.String tregex, boolean verbose) Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. |
abstract java.lang.String | toString() |
public TregexMatcher matcher(Tree t)
Get a TregexMatcher for this pattern on this tree.
Parameters:
t - a tree to match on
public TregexMatcher matcher(Tree t, HeadFinder headFinder)
Get a TregexMatcher for this pattern on this tree. Any relations which use heads of trees should use the provided HeadFinder.
Parameters:
t - a tree to match on
headFinder - a HeadFinder to use when matching
public static TregexPattern compile(java.lang.String tregex)
Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. To use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object.
Parameters:
tregex - the pattern string
Throws:
TregexParseException - if the string does not parse
public static TregexPattern safeCompile(java.lang.String tregex, boolean verbose)
Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. To use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object. Rather than throwing an exception when the string does not parse, simply returns null.
Parameters:
tregex - the pattern string
verbose - whether to log errors when the string doesn't parse
public java.lang.String pattern()
public abstract java.lang.String toString()
Overrides:
toString in class java.lang.Object
public void prettyPrint(java.io.PrintWriter pw)
public void prettyPrint(java.io.PrintStream ps)
public void prettyPrint()
public static void main(java.lang.String[] args) throws java.io.IOException
Prints out all matches of a tree pattern on each tree in the path. Usage:
java edu.stanford.nlp.trees.tregex.TregexPattern [[-TCwfosnu] [-filter] [-h <node-name>]]* pattern filepath
Arguments:
pattern: the tree pattern, which optionally names some set of nodes (i.e., gives each a "handle") with =name (for some arbitrary string "name")
filepath: the path to files with trees. If this is a directory, there will be recursive descent and the pattern will be run on all files beneath the specified directory.
-C suppresses printing of matches, so only the number of matches is printed.
-w causes ONLY the whole of a tree that matches to be printed.
-W causes the whole of a tree that matches to be printed ALSO.
-f causes the filename to be printed.
-i <filename> causes the pattern to be matched to be read from <filename> rather than the command line. Don't specify a pattern when this option is used.
-o specifies that each tree node can be reported only once as the root of a match (by default a node will be printed once for every way the pattern matches).
-s causes trees to be printed all on one line (by default they are pretty printed).
-n causes the number of the tree in which the match was found to be printed before every match.
-u causes only the label of each matching node to be printed, not complete subtrees.
-t causes only the yield (terminal words) of the selected node to be printed (or the yield of the whole tree, if the -w option is used).
-encoding <charset_encoding> allows specification of the character encoding of trees.
-h <node-handle> If a -h option is given, the root tree node will not be printed. Instead, for each node-handle specified, the node matched and given that handle will be printed. Multiple nodes can be printed by using the -h option multiple times on a single command line.
-hf <headfinder-class-name> uses the specified HeadFinder class to determine headship relations.
-hfArg <string> passes a string argument in to the HeadFinder class's constructor. -hfArg can be used multiple times to pass in multiple arguments.
-trf <TreeReaderFactory-class-name> uses the specified TreeReaderFactory class to read trees from files.
-e <extension> Only attempt to read files with the given extension. If not provided, will attempt to read all files.
-v prints every tree that contains no matches of the specified pattern, but prints no matches to the pattern.
-x Instead of the matched subtree, prints the matched subtree's identifying number as defined in tgrep2: a unique identifier for the subtree in the form s:n, where s is an integer specifying the sentence number in the corpus (starting with 1), and n is an integer giving the order in which the node is encountered in a depth-first search starting with 1 at the top node in the sentence tree.
-extract <tree-file> extracts the subtree s:n specified by code from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
-extractFile <code-file> <tree-file> extracts every subtree specified by the subtree codes in code-file, which must appear exactly one per line, from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
-filter causes this to act as a filter, reading tree input from stdin.
-T causes all trees to be printed as processed (for debugging purposes). Otherwise only matching nodes are printed.
-macros <filename> names a file with macro substitutions to use: a file of tab-separated lines, each of the form original-tab-replacement.
Throws:
java.io.IOException
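The s:n identifiers described for the -x option can be sketched on a toy tree. The code below assumes a pre-order (parent before children, left to right) depth-first traversal in which every node, including leaves, is counted; this is an illustration of the description above and not the code that TregexPattern itself runs. The Node class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: computes s:n style identifiers by pre-order numbering.
final class SubtreeIdDemo {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    // Returns the 1-based pre-order position of target under root, or -1 if it is absent.
    static int preorderIndex(Node root, Node target) {
        return visit(root, target, new int[] {0});
    }

    private static int visit(Node cur, Node target, int[] counter) {
        counter[0]++;                       // number this node before its children
        if (cur == target) {
            return counter[0];
        }
        for (Node child : cur.children) {
            int found = visit(child, target, counter);
            if (found > 0) {
                return found;
            }
        }
        return -1;
    }

    // Builds the identifier "s:n" from a 1-based sentence number and the node's position.
    static String subtreeId(int sentenceNumber, Node root, Node target) {
        int n = preorderIndex(root, target);
        if (n < 0) {
            throw new IllegalArgumentException("target is not in this tree");
        }
        return sentenceNumber + ":" + n;
    }

    public static void main(String[] args) {
        // (S (NP (DT) (NN)) (VP (VBD)))
        Node vp = new Node("VP", new Node("VBD"));
        Node s = new Node("S", new Node("NP", new Node("DT"), new Node("NN")), vp);
        System.out.println(subtreeId(3, s, s));  // 3:1
        System.out.println(subtreeId(3, s, vp)); // 3:5
    }
}
```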