Stanford RegexNER

RegexNER is a pattern-based (i.e., rule-based) interface for doing Named Entity Recognition (NER). It's an easy-to-use interface over TokensRegex. Here's a little example of what you can do with RegexNER. Let's start with a small file with information about Julia Gillard from Wikipedia. If it is run through Stanford CoreNLP with this command:

java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,ner' -file JuliaGillard.txt

then the output is this file (rendered using XSLT) [xml] ... or was in July 2013.

The output isn't bad, but we might want to improve it. For example, we might want to label university degrees as an entity. We can do that in a simple rule-based manner with RegexNER.

The simplest rule file has two tab-separated fields on a line. The first field has text to match and the second field has the entity category to assign. (Note that you must have a tab character between the text and the category. Other spaces will not do.) So we might wish to label the names of university degrees with a DEGREE entity label. Then our first RegexNER file might be the following (with a tab before each "DEGREE"):

Bachelor of Arts	DEGREE
Bachelor of Laws	DEGREE

We can use this file with this command, adding RegexNER to the list of annotators:

java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,ner,regexner' -file JuliaGillard.txt -regexner.mapping jg-regexner.txt

In a little more detail now, the first field is not just matched as a string, but as a sequence of one or more space-separated patterns. That is, CoreNLP divides text into tokens, and each whitespace-separated pattern in the first field has to match against successive tokens in the text. Each pattern is a (standard Java) regular expression. If the regular expressions match a sequence of tokens, the tokens will be relabeled as the category in the second column. Provided that you avoid certain special characters, a pattern can just be a plain string, and so you can use RegexNER as a gazetteer. However, you can also do somewhat fancier things once you know that you can match regular expressions. For the example above, you might also want to match a variety of degrees. Rather than writing many strings, you can compress them into one regular expression like this:

Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE
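
The token-by-token matching described above can be illustrated with a small, self-contained sketch. This is not the CoreNLP implementation: the class and method names (TokenPatternSketch, compile, findMatches) are hypothetical, the text is assumed to be already tokenized, and each regular expression is assumed to have to match a whole token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified illustration of RegexNER-style matching; NOT the CoreNLP implementation. */
class TokenPatternSketch {

    /** Compiles each whitespace-separated piece of the first rule field into its own regex. */
    static List<Pattern> compile(String firstField) {
        List<Pattern> patterns = new ArrayList<>();
        for (String piece : firstField.trim().split("\\s+")) {
            patterns.add(Pattern.compile(piece));
        }
        return patterns;
    }

    /** Returns the start index of every token span that the whole pattern sequence matches. */
    static List<Integer> findMatches(List<String> tokens, List<Pattern> patterns) {
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start + patterns.size() <= tokens.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < patterns.size(); i++) {
                // Assumption: a regex must cover the entire token, so matches() is used, not find().
                if (!patterns.get(i).matcher(tokens.get(start + i)).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("She", "holds", "a", "Bachelor", "of", "Laws", "degree", ".");
        List<Pattern> rule = compile("Bachelor of (Arts|Laws|Science|Engineering|Divinity)");
        System.out.println(findMatches(tokens, rule)); // prints [3]
    }
}
```

In this sketch the three patterns line up with the tokens "Bachelor", "of" and "Laws", so the span starting at token index 3 would be relabeled as DEGREE.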

Then, if you look at the original output, you will see there are a couple of mistakes. It misrecognizes Lalor as a PERSON, when it is a LOCATION (an electoral seat). And it sometimes fails to tag Labor as an ORGANIZATION when it appears not followed by Party. To fix the first error, you need one more concept: RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well. This gives us this RegexNER file, which you can also download.

Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE
Lalor	LOCATION	PERSON
Labor	ORGANIZATION

Then the updated CoreNLP output can be found here [xml].

Of course, these last two rules for relabeling tokens are rather dangerous. If you ran with this RegexNER file on the Wikipedia page for Kieran Lalor, then it would mess up the output badly, changing his surname to a LOCATION. Similarly, the rule for Labor would be a bad idea in an article that discusses Labor unions.
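
The overwrite behavior described above (only O is always replaceable, and other labels only if the rule's third column lists them) can be sketched as follows. This is an illustration under stated assumptions, not CoreNLP code: OverwriteSketch, parseOverwritable and mayRelabel are hypothetical names, and the third column is assumed to be a plain comma-separated list.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified illustration of the overwrite rule; NOT the CoreNLP implementation. */
class OverwriteSketch {

    /** Parses the optional third column: a comma-separated list of overwritable entity types. */
    static Set<String> parseOverwritable(String thirdColumn) {
        Set<String> types = new HashSet<>();
        if (thirdColumn != null && !thirdColumn.trim().isEmpty()) {
            for (String type : thirdColumn.split(",")) {
                types.add(type.trim());
            }
        }
        return types;
    }

    /** A rule may relabel a token that is unlabeled (O) or whose label is listed as overwritable. */
    static boolean mayRelabel(String currentLabel, Set<String> overwritable) {
        return "O".equals(currentLabel) || overwritable.contains(currentLabel);
    }

    public static void main(String[] args) {
        // A rule whose third column is PERSON may replace an existing PERSON label.
        System.out.println(mayRelabel("PERSON", parseOverwritable("PERSON"))); // prints true
        // A rule with no third column may still label a token currently tagged O.
        System.out.println(mayRelabel("O", parseOverwritable("")));            // prints true
        // Without permission, an existing entity label is left alone.
        System.out.println(mayRelabel("LOCATION", parseOverwritable("")));     // prints false
    }
}
```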

The fourth column can be used to give rules a priority. If multiple rules match, the result is undefined unless you give the rules a priority. Here is a (sort of silly) extension of the last file, where we have rules with priorities. The priorities go in the fourth column, so note very carefully that when the third column is empty there are two tabs between the entity label and the priority. Rules with no explicitly given priority have priority 1.0.

Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE		2.0
Bachelor of Arts	EASY_DEGREE
Bachelor of Laws	HARD_DEGREE		3.0

This RegexNER file can be downloaded here and then the updated CoreNLP output can be found here [xml].
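
To make the four-column layout concrete, here is a minimal sketch of reading rule lines and choosing between competing rules. It is not the CoreNLP parser: RuleLineSketch and pickWinner are hypothetical names, and the sketch assumes that a numerically higher priority wins, which the text above does not state explicitly.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified illustration of a four-column rule line; NOT the CoreNLP parser. */
class RuleLineSketch {
    final String pattern;       // column 1: space-separated token regexes
    final String label;         // column 2: entity category to assign
    final String overwritable;  // column 3: comma-separated overwritable types, may be empty
    final double priority;      // column 4: defaults to 1.0 when absent

    RuleLineSketch(String line) {
        // Limit -1 keeps empty fields, so "LABEL<TAB><TAB>2.0" yields an empty third column.
        String[] fields = line.split("\t", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected pattern<TAB>label: " + line);
        }
        pattern = fields[0];
        label = fields[1];
        overwritable = fields.length > 2 ? fields[2] : "";
        priority = (fields.length > 3 && !fields[3].isEmpty())
            ? Double.parseDouble(fields[3]) : 1.0;
    }

    /** ASSUMPTION: the highest priority wins; on a tie this sketch keeps the earliest rule. */
    static RuleLineSketch pickWinner(List<RuleLineSketch> matchingRules) {
        RuleLineSketch best = null;
        for (RuleLineSketch rule : matchingRules) {
            if (best == null || rule.priority > best.priority) {
                best = rule;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        RuleLineSketch generic =
            new RuleLineSketch("Bachelor of (Arts|Laws|Science|Engineering|Divinity)\tDEGREE\t\t2.0");
        RuleLineSketch arts = new RuleLineSketch("Bachelor of Arts\tEASY_DEGREE");
        RuleLineSketch laws = new RuleLineSketch("Bachelor of Laws\tHARD_DEGREE\t\t3.0");

        // Rules matching "Bachelor of Arts": priorities 2.0 and the default 1.0.
        System.out.println(pickWinner(Arrays.asList(generic, arts)).label); // prints DEGREE
        // Rules matching "Bachelor of Laws": priorities 2.0 and 3.0.
        System.out.println(pickWinner(Arrays.asList(generic, laws)).label); // prints HARD_DEGREE
    }
}
```

Under that assumption, "Bachelor of Arts" would keep the generic DEGREE label, because the EASY_DEGREE rule only has the default priority of 1.0, while "Bachelor of Laws" would become HARD_DEGREE.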

If you want a lot of control in checking for words in the context, or checking parts-of-speech, then you want to start looking in more detail at the full TokensRegex language. It is a more powerful (but more complex) framework for writing rules for token labeling, and RegexNER is a simple but often sufficient front-end to it. Among other things, the full TokensRegex lets you have a whole library of rule files.

You do not have to use RegexNER as a post-processor after statistical NER. For instance, we could run just the above RegexNER file like this:

java -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,regexner' -file JuliaGillard.txt -regexner.mapping jg-regexner.txt

Then the only recognized entities are the ones in our pattern file. Nevertheless, in general, writing rules that cover all cases is a difficult enterprise! This is one reason why statistical classifiers have become dominant, because they are good at integrating various sources of evidence. So, often, a tool like RegexNER is most useful as an overlay that corrects or augments the output of a statistical NLP system like Stanford NER.

This example showed usage from the command line. You can also easily use RegexNER in code. The RegexNER rules can be in a regular file, in a resource that is on your CLASSPATH, or even specified by a URL. To load RegexNER and tell it where the rules file is, pass an appropriate Properties object when creating Stanford CoreNLP:

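import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
// The mapping may be a file path, a CLASSPATH resource, or a URL.
props.setProperty("regexner.mapping", "org/foo/resources/jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);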