TokensRegexAnnotator

About

TokensRegexAnnotator is a customizable annotator for the StanfordCoreNLP pipeline. It is part of TokensRegex, a framework for defining patterns over text and mapping the matched text to semantic objects represented as Java objects.

With the TokensRegexAnnotator, you can customize annotations based on regular expressions over sequences of tokens. It uses TokensRegex rules to define which patterns to match and what to annotate.

If you only want to use TokensRegex to recognize named entities using regular expressions, use the TokensRegexNERAnnotator instead.

Usage

To add a TokensRegexAnnotator to the pipeline:

Create a rules file (see SequenceMatchRules for the format of the rules file).
Example: color.rules.txt
These rules tag color words with ner="COLOR" and set normalized to the hex RGB string of the color.

Configure the TokensRegexAnnotator

customAnnotatorClass.[name]=edu.stanford.nlp.pipeline.TokensRegexAnnotator
[name].rules = [path to rules file]

Example:
Configuration: color.properties

customAnnotatorClass.color=edu.stanford.nlp.pipeline.TokensRegexAnnotator
color.rules = color.rules.txt

Add the annotator to the pipeline by including its name in the -annotators list
Example:
Input: color.input.txt
Command:
java -cp stanford-corenlp-2012-05-22.jar:stanford-corenlp-2012-05-22-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,color -properties color.properties -file color.input.txt
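
The settings from color.properties and the -annotators option above can also be assembled in Java. The following is a minimal sketch that uses only java.util.Properties, so it runs without CoreNLP on the classpath. The class name TokensRegexConfigSketch and the helper build are hypothetical names introduced for illustration; the property keys and values are the ones shown above.

```java
import java.util.Properties;

// Sketch only: reproduces the settings from color.properties, plus the
// -annotators list from the command line, as a java.util.Properties object.
// TokensRegexConfigSketch and build() are hypothetical names for illustration.
class TokensRegexConfigSketch {

    // Annotator class name, taken from the configuration shown above.
    static final String ANNOTATOR_CLASS =
            "edu.stanford.nlp.pipeline.TokensRegexAnnotator";

    // name: the custom annotator name (e.g. "color")
    // rulesPath: path to the TokensRegex rules file (e.g. "color.rules.txt")
    static Properties build(String name, String rulesPath) {
        Properties props = new Properties();
        // customAnnotatorClass.[name] registers the annotator under [name].
        props.setProperty("customAnnotatorClass." + name, ANNOTATOR_CLASS);
        // [name].rules points the annotator at its rules file.
        props.setProperty(name + ".rules", rulesPath);
        // Upstream annotators first, then the custom one, as in the command above.
        props.setProperty("annotators", "tokenize,ssplit," + name);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("color", "color.rules.txt");
        // Print the resulting settings (iteration order is not guaranteed).
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

With the CoreNLP jars on the classpath, a Properties object built this way could be passed to the StanfordCoreNLP constructor instead of supplying -properties and -annotators on the command line.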

See TokensRegex Extraction Pipeline and TokensRegex Rules for more information on how rules are specified and expressions are matched using TokensRegex.