A tokenizer divides text into a sequence of tokens, which roughly correspond to "words". We provide a class suitable for tokenization of English, called PTBTokenizer. It was initially designed to largely mimic Penn Treebank 3 (PTB) tokenization, hence its name, though over time the tokenizer has gained quite a few options and a fair amount of Unicode compatibility. In general it will work well over Unicode text that does not require word segmentation (as is needed for writing systems that do not put spaces between words) or more exotic language-particular rules (such as writing systems that use : or ? as a character inside words). In 2017 it was upgraded to support non-Basic Multilingual Plane Unicode, in particular emoji. 😍
We also have corresponding tokenizers, FrenchTokenizer and SpanishTokenizer, for French and Spanish. We use the Stanford Word Segmenter for languages like Chinese and Arabic. An ancillary tool, DocumentPreprocessor, uses this tokenization to provide the ability to split text into sentences.
PTBTokenizer mainly targets formal English writing rather than SMS-speak.
PTBTokenizer is an efficient, fast, deterministic tokenizer. (For the more technically inclined, it is implemented as a finite automaton, produced by JFlex.) While deterministic, it uses some quite good heuristics, so it can usually decide when single quotes are parts of words, when periods do and don't imply sentence boundaries, etc. Sentence splitting is a deterministic consequence of tokenization: a sentence ends when a sentence-ending character (., !, or ?) is found that is not grouped with other characters into a token (such as for an abbreviation or number), though a sentence may still include a few tokens that can follow a sentence-ending character as part of the same sentence (such as quotes and brackets).
PTBTokenizer has been developed by Christopher Manning, Tim Grow, Teg Grenager, Jenny Finkel, and John Bauer.
The Stanford Tokenizer is not distributed separately but is included in several of our software downloads, including the Stanford Parser, Stanford Part-of-Speech Tagger, Stanford Named Entity Recognizer, and Stanford CoreNLP. Choose a tool, download it, and you're ready to go. See these software packages for details on software licenses.
The tokenizer requires Java (now, Java 8). As well as API access, the program includes an easy-to-use command-line interface, PTBTokenizer. For the examples below, we assume you have set up your CLASSPATH to find PTBTokenizer, for example with a command like the following (the details depend on your operating system and shell):

export CLASSPATH=stanford-parser.jar

You can also specify this on each command line by adding -cp stanford-parser.jar after java.
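For example, the basic tokenization command shown in the next section could equivalently be run with the classpath given inline (assuming stanford-parser.jar is in the current directory):

$ java -cp stanford-parser.jar edu.stanford.nlp.process.PTBTokenizer sample.txt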
The basic operation is to convert a plain text file into a sequence of tokens, which are printed out one per line. Here is an example (on Unix):
$ cat >sample.txt "Oh, no," she's saying, "our $400 blender can't handle something this hard!" $ java edu.stanford.nlp.process.PTBTokenizer sample.txt `` Oh , no , '' she 's saying , `` our $ 400 blender ca n't handle something this hard ! '' PTBTokenizer tokenized 23 tokens at 370.97 tokens per second.
Here, we gave a filename argument which contained the text. PTBTokenizer can also read from a gzip-compressed file or a URL, or it can run as a filter, reading from stdin. There are a bunch of other things it can do, using command-line flags. For instance, -preserveLines keeps the input's line breaks and -untok reverses tokenization, so the following round trip reproduces the original file:
$ java edu.stanford.nlp.process.PTBTokenizer -preserveLines < sample.txt | java edu.stanford.nlp.process.PTBTokenizer -untok > roundtrip.txt
$ diff sample.txt roundtrip.txt
$
The output of PTBTokenizer can be post-processed to divide a text into sentences. One way to do this from the command line is by calling edu.stanford.nlp.process.DocumentPreprocessor. The other is to use the sentence splitter in CoreNLP. For example:
$ cat >sample.txt
Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through
three rounds of last month's U.S. Open. H.J. Heinz Company said it
completed the sale of its Ore-Ida frozen-food business catering to the
service industry to McCain Foods Ltd. for about $500 million. It's the
first group action of its kind in Britain and one of only a handful of
lawsuits against tobacco companies outside the U.S. A Paris lawyer last
year sued France's Seita SA on behalf of two cancer-stricken smokers.
Japan Tobacco Inc. faces a suit from five smokers who accuse the
government-owned company of hooking them on an addictive product.
$
$ java edu.stanford.nlp.process.DocumentPreprocessor sample.txt
Another ex-Golden Stater , Paul Stankowski from Oxnard , is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month 's U.S. Open .
H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $ 500 million .
It 's the first group action of its kind in Britain and one of only a handful of lawsuits against tobacco companies outside the U.S. .
A Paris lawyer last year sued France 's Seita SA on behalf of two cancer-stricken smokers .
Japan Tobacco Inc. faces a suit from five smokers who accuse the government-owned company of hooking them on an addictive product .
Read in 5 sentences.
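The CoreNLP alternative mentioned above uses the tokenize and ssplit annotators. As a sketch of one way to invoke it (this assumes you have downloaded Stanford CoreNLP, not just the parser jar, and are running from its distribution directory so that its jars are on the classpath):

$ java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file sample.txt -outputFormat text

With these flags, the tokenized, sentence-split annotation is written to sample.txt.out.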
There are various ways to call the code, but here's a simple example to get started with, showing how to use either PTBTokenizer directly or DocumentPreprocessor.
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizerDemo {

  public static void main(String[] args) throws IOException {
    for (String arg : args) {
      // option #1: By sentence.
      DocumentPreprocessor dp = new DocumentPreprocessor(arg);
      for (List<HasWord> sentence : dp) {
        System.out.println(sentence);
      }
      // option #2: By token
      PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(
          new FileReader(arg), new CoreLabelTokenFactory(), "");
      while (ptbt.hasNext()) {
        CoreLabel label = ptbt.next();
        System.out.println(label);
      }
    }
  }

}
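As a usage sketch, assuming stanford-parser.jar is the jar you downloaded (matching the CLASSPATH setup above), the demo might be compiled and run like this (on Windows, use ; instead of : in the classpath):

$ javac -cp stanford-parser.jar TokenizerDemo.java
$ java -cp .:stanford-parser.jar TokenizerDemo sample.txt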
There are a number of options that affect how tokenization is performed. These can be specified on the command line, with the flag -options (or -tokenizerOptions in tools like the Stanford Parser), or in the constructor to PTBTokenizer or the factory methods in PTBTokenizerFactory. Here are the current options. They are specified as a single string, with options separated by commas, and values given in option=value syntax, for instance "americanize=false,unicodeQuotes=true,unicodeEllipsis=true".
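As a minimal sketch of passing such an option string programmatically (the class name OptionsDemo and the sample sentence are our own illustration), the string goes in the same constructor argument that the TokenizerDemo above leaves empty:

import java.io.IOException;
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class OptionsDemo {

  public static void main(String[] args) throws IOException {
    // Keep British spellings and emit Unicode curly quotes and ellipsis characters.
    String options = "americanize=false,unicodeQuotes=true,unicodeEllipsis=true";
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
        new StringReader("She said \"it's 'colour', not 'color'...\""),
        new CoreLabelTokenFactory(), options);
    while (tokenizer.hasNext()) {
      System.out.println(tokenizer.next().word());
    }
  }

}

On the command line, the same string can be given to the -options flag, for example: java edu.stanford.nlp.process.PTBTokenizer -options "americanize=false,unicodeQuotes=true,unicodeEllipsis=true" sample.txt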
For questions, see our support page.
PTBTokenizer is a fast compiled finite automaton. This has some disadvantages, limiting the extent to which behavior can be changed at runtime, but it means that it is very fast. Here are some statistics measured on a MacBook Pro (15 inch, 2016) with a 2.7 GHz Intel Core i7 processor (4 cores, 256 KB L2 cache per core, 8 MB L3 cache) running Java 9, using an SSD for the statistics involving disk, and using Stanford NLP v3.9.1. The documents used were NYT newswire from LDC English Gigaword 5.
PTBTokenizer Configuration | Tokens/second | Ave. time per Gigaword document
---|---|---
Tokenizing document Strings in memory | 4.51 million | 0.18 ms
Tokenizing from disk to disk | 3.15 million | 0.25 ms
For comparison, we tried to directly time the speed of the SpaCy tokenizer v2.0.11 under Python v3.5.4. (Note: this is SpaCy v2, not v1. We believe the figures in their speed benchmarks still report numbers from SpaCy v1, which was apparently much faster than v2.) Here are the timings we got:
SpaCy Configuration | Tokens/second | Ave. time per Gigaword document
---|---|---
Tokenizing document Strings in memory | 180 thousand | 4.7 ms
Tokenizing from disk to disk | 125 thousand | 6.5 ms
Indeed, we find that, using the stanfordcorenlp Python wrapper, you can tokenize with CoreNLP in Python in about 70% of the time that SpaCy v2 takes, even though a lot of the speed difference necessarily goes away while marshalling data into JSON, sending it via HTTP, and then reassembling it from JSON.