Software/Phrasal
Phrasal Quick Start Guide
This guide explains how to set up and train a full Statistical Machine Translation (SMT) system using Phrasal. It offers step-by-step instructions to download, install, configure, and run the Phrasal decoder and its related support tools. The guide assumes some familiarity with Unix-like command-line interpreters (shells), such as bash on Linux and Mac OS X, or Cygwin on Microsoft Windows. The commands shown in this tutorial are written for bash, but they are relatively easy to adapt to other shells.
Step 1: Download and install Phrasal and its support tools
SRILM
Phrasal requires one or more language models, which can be trained using SRILM. SRILM can be downloaded here, and installed as follows:
export SRILM=$PWD/srilm
mkdir $SRILM
tar -xvzf srilm.tgz -C $SRILM
make World MAKE_PIC=1 -C $SRILM
If the last command fails, this often means that SRILM was unable to determine the machine type automatically. In that case, define the MACHINE_TYPE variable on the command line. The directory $SRILM/common indicates which MACHINE_TYPE values are supported. For instance, the existence of a file Makefile.machine.i686-m64 indicates that i686-m64 is a supported architecture, in which case you may be able to compile SRILM as follows:
make World MAKE_PIC=1 MACHINE_TYPE=i686-m64 -C $SRILM
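To see which MACHINE_TYPE values a given SRILM checkout supports, the Makefile.machine.* file names in $SRILM/common can be listed directly. The following is a minimal sketch; the function name list_machine_types is hypothetical and not part of SRILM.

```shell
# Hypothetical helper (not part of SRILM): print one supported MACHINE_TYPE
# per line, derived from the Makefile.machine.* files in the given directory.
list_machine_types() {
    for f in "$1"/Makefile.machine.*; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        basename "$f" | sed 's/^Makefile\.machine\.//'
    done
}
```

It would typically be called as list_machine_types "$SRILM/common", and one of the printed values can then be passed to make as MACHINE_TYPE.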
Some of the most frequently asked questions about SRILM are answered in this FAQ, and other questions specific to SRILM may be posted on the srilm-user mailing list (but please make sure you read the FAQ first!). See below for discussion of using other language models.
Phrasal
The Phrasal software package is available here for download. To install Phrasal, simply run:
export PHRASAL=$PWD/phrasal.Beta3
wget http://www-nlp.stanford.edu/software/phrasal/phrasal.Beta3.tar.gz
tar -xzvf phrasal.Beta3.tar.gz
To use the package, you also need to install Stanford's CoreNLP tools available here. To install CoreNLP, run:
export CORENLP=$PWD/stanford-corenlp-2012-01-08
wget http://nlp.stanford.edu/software/stanford-corenlp-2012-01-08.tgz
tar -xzvf stanford-corenlp-2012-01-08.tgz
To be able to build the package, as well as to be able to use TER and TERp, run the command scripts/first-build.sh. This command automatically downloads TER and TERp, two packages that are freely available on Matthew Snover's website, but that cannot be redistributed by other parties. It then builds the entire source tree on your system.
export CORENLP=path/to/corenlp/dir
./scripts/first-build.sh
After the first build, you can build the system just using ant. However, be sure to always set the environment variable CORENLP to point to where you installed CoreNLP on your system in order to satisfy dependencies within Phrasal on CoreNLP.
export CORENLP=path/to/corenlp/dir
ant
For the next steps in this quick start guide, you'll need to add the Phrasal and CoreNLP jars to your classpath. If you are running a Bourne shell derivative, you can do this by sourcing the file scripts/setupenv.sh. If you are running a csh derivative, source the file scripts/setupenv.csh instead. Make sure that both environment variables PHRASAL and CORENLP are set before running either of these two commands.
# Bourne shell
. $PHRASAL/scripts/setupenv.sh
# csh shell
source $PHRASAL/scripts/setupenv.csh
Phrasal requires that all data being translated is encoded in UTF-8 and that the decoder runs in an environment with a UTF-8 locale.
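Because the setup scripts depend on PHRASAL and CORENLP, it can help to verify that both are set before sourcing them. The sketch below is one way to do this in a Bourne-style shell; require_env is a hypothetical helper name, not something shipped with Phrasal.

```shell
# Hypothetical helper: succeed only if every named environment variable
# is set to a non-empty value; report each missing one on stderr.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use is require_env PHRASAL CORENLP && . "$PHRASAL/scripts/setupenv.sh", which sources the setup script only when both variables are defined.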
Double check that you are using a UTF-8 locale with the locale command:
$ locale
# example output
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Any locale with a "UTF-8" suffix will work. At Stanford, we typically run the decoder with the locale "en_US.UTF-8". To use this locale, set the LC_ALL environment variable as follows:
# Bourne shell
export LC_ALL=en_US.UTF-8
# csh
setenv LC_ALL en_US.UTF-8
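The locale check can also be scripted. The following sketch tests whether a locale name carries a UTF-8 suffix; is_utf8_locale is a hypothetical helper, and accepting the lowercase spelling "utf8" is an assumption based on how some systems report locale names.

```shell
# Hypothetical helper: succeed if the given locale name ends in a UTF-8
# codeset suffix. The ".utf8" spelling is accepted as an assumption.
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.utf8) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_utf8_locale "${LC_ALL:-${LC_CTYPE:-$LANG}}" || echo "warning: not a UTF-8 locale" >&2 would print a warning before running the decoder in an unsuitable environment.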
Step 2: Training
This section explains how to train a full SMT system from a sentence-aligned bilingual corpus (bitext). Such data is available for several European languages, and you may also acquire such data from LDC. For the purpose of this tutorial, we use a relatively small training set, but Phrasal has also been used successfully in large-scale experiments, including NIST and GALE evaluations.
Data and supporting scripts used in our tutorial were made available for the WMT08 Shared Task. The following commands download, tokenize, and downcase a French-English bitext.
To download and extract the data, run the commands below. (If you don't have wget, for example on Mac OS X, you can try using curl -O instead.)
mkdir $PHRASAL/work
cd $PHRASAL/work
wget http://www.statmt.org/wmt08/training-parallel.tar
wget http://www.statmt.org/wmt08/devsets.tgz
wget http://homepages.inf.ed.ac.uk/jschroe1/how-to/scripts.tgz
tar -xvf training-parallel.tar training/news-commentary08.fr-en.en.gz training/news-commentary08.fr-en.fr.gz
tar -xzvf devsets.tgz
tar -xzvf scripts.tgz
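The choice between wget and curl -O can be made automatically. This is a minimal sketch under the assumption that one of the two tools is on the PATH; pick_downloader is a hypothetical helper name.

```shell
# Hypothetical helper: print the download command to use, preferring wget
# and falling back to "curl -O"; fail if neither tool is installed.
pick_downloader() {
    if command -v wget >/dev/null 2>&1; then
        echo "wget"
    elif command -v curl >/dev/null 2>&1; then
        echo "curl -O"
    else
        echo "error: neither wget nor curl found" >&2
        return 1
    fi
}
```

It could be used as fetch=$(pick_downloader) && $fetch http://www.statmt.org/wmt08/devsets.tgz, leaving $fetch unquoted so that "curl -O" is split into a command and its option.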
To tokenize the data, run:
mkdir -p data/training
mkdir data/dev
gzip -cd training/news-commentary08.fr-en.fr.gz | scripts/tokenizer.perl -l fr | tr 'A-Z' 'a-z' > data/training/news-commentary.tok.fr
gzip -cd training/news-commentary08.fr-en.en.gz | scripts/tokenizer.perl -l en | tr 'A-Z' 'a-z' > data/training/news-commentary.tok.en
scripts/tokenizer.perl -l fr < dev/nc-dev2007.fr | tr 'A-Z' 'a-z' > data/dev/nc-dev2007.tok.fr
scripts/tokenizer.perl -l en < dev/nc-dev2007.en | tr 'A-Z' 'a-z' > data/dev/nc-dev2007.tok.en
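Since the bitext must stay sentence-aligned, both sides of each tokenized file pair should have the same number of lines. The sketch below performs that check; same_line_count is a hypothetical helper and not part of the WMT08 scripts or Phrasal.

```shell
# Hypothetical helper: succeed only if the two files have the same number
# of lines, as expected for the two sides of a sentence-aligned bitext.
same_line_count() {
    # tr strips the padding that some wc implementations add.
    n1=$(wc -l < "$1" | tr -d ' ')
    n2=$(wc -l < "$2" | tr -d ' ')
    if [ "$n1" -ne "$n2" ]; then
        echo "line count mismatch: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
}
```

For the tutorial data, this could be run as same_line_count data/training/news-commentary.tok.fr data/training/news-commentary.tok.en, and likewise for the files in data/dev.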
The first training step in our tutorial is to build a language model (LM) with SRILM:
cd $PHRASAL/work
mkdir lm
$SRILM/bin/*/ngram-count -order 3 -interpolate -kndiscount -unk -text data/training/news-commentary.tok.en -lm lm/news.lm.gz
In this case, we generated work/lm/news.lm.gz as a trigram LM smoothed with modified Kneser-Ney smoothing (-interpolate -kndiscount). Note that this LM was trained using only the target side of the training data (work/data/training/news-commentary.tok.en), but it is generally recommended to add large amounts of monolingual data of the target language (English in our case). ngram-count supports more options, which are described on this man page.
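A quick way to confirm that the language model was written correctly is to look at its header. ARPA-format files begin with a \data\ section listing the number of n-grams of each order. The sketch below prints those lines from a gzip-compressed model; arpa_ngram_counts is a hypothetical helper name.

```shell
# Hypothetical helper: print the "ngram N=count" header lines of a
# gzip-compressed ARPA language model, stopping at the first n-gram section.
arpa_ngram_counts() {
    gzip -cd "$1" | awk '/^\\1-grams:/ { exit } /^ngram / { print }'
}
```

Running arpa_ngram_counts lm/news.lm.gz on the trigram model built above should print three lines, one each for orders 1, 2, and 3.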
Now let's align the bitext data. To do this, first save the following to the alignment configuration file aligner.conf:
# aligner.conf
##########################################
# Training: Defines the training regimen #
##########################################
forwardModels HMM
reverseModels HMM
mode JOINT
iters 2

###############################################
# Execution: Controls output and program flow #
###############################################
execDir models
create
saveParams true
numThreads 4
msPerLine 10000
alignTraining
leaveTrainingOnDisk
safeConcurrency true

#################
# Language/Data #
#################
foreignSuffix fr
englishSuffix en
lowercase

# Choose the training sources, which can either be directories or files that list files/directories
# Note that training on the test set does not peek at the correct answers (no cheating)
trainSources data/training
sentences MAX

# The test sources must have hand alignments for all sentence pairs
testSources
maxTestSentences MAX
offsetTestSentences 0

##############
# Evaluation #
##############
competitiveThresholding
writeGIZA
Then use the following command to run the Berkeley aligner:
../scripts/align-words 4g aligner.conf
Depending on your machine and the size of your bitext, this command may take hours or days to complete. On the tutorial data, it should take no more than one hour (typically much less on modern computers). You may want to adjust the amount of RAM allocated to the aligner based on the amount of training data you have available and the amount of RAM on your machine. This is done by replacing 4g (which stands for 4 gigabytes) with a more appropriate value, e.g., 16g.
The next step extracts all phrase pairs of the training data that are consistent with the word alignment computed in the previous step. To reduce running time and memory usage, phrase extraction in Phrasal is typically done by filtering phrase tables against specific tuning or test sets (i.e., all the source-language data provided to the SMT decoder). To extract phrase pairs applicable to our tuning set (work/data/dev/nc-dev2007.tok.fr), run:
../scripts/extract-phrases 4g models models -fFilterCorpus data/dev/nc-dev2007.tok.fr
The first command-line argument (4g) again specifies the amount of memory. The second argument (models) names the directory that holds the word alignments, and the third argument (also models here) names the directory in which the phrase table and reordering model will be stored. Note that ../scripts/extract-phrases uses phrase extraction parameters that generally worked well in our experiments (on Chinese-to-English and Arabic-to-English), but that may be ineffective for some other language pairs. To see a list of extraction parameters you may be able to adjust for other language pairs, run:
../scripts/extract-phrases -help
The last step is to run minimum error rate training (MERT). First, save the following decoding model configuration file to disk. This will tell the system where to find the language model, phrase table, and reordering table just created above. It also provides a set of initial decoding weights for the first iteration of MERT training.
# filename: phrasal.conf

# translation table
[ttable-file]
models/phrases-tm.gz

# language model
[lmodel-file]
lm/news.lm.gz

# number of translation options for each phrase in f
[ttable-limit]
20

[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.HierarchicalReorderingFeaturizer(models/phrases-om.gz,msd2-bidirectional-fe,LexR,hierarchical,hierarchical,bin)

# reordering weights
[weight-d]
1 1 1 1 1 1 1

# language model weight
[weight-l]
1

# translation model weights
[weight-t]
0.3 0.2 0.3 0.2 0

# word penalty weight
[weight-w]
0

# maximum gap between covered spans
[distortion-limit]
6

# detect processors present, and use them all
[localprocs]
0
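Before starting MERT, it may be worth confirming that the model files named in phrasal.conf actually exist and are non-empty. The following sketch does this for an arbitrary list of paths; require_files is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if every listed file exists and is
# non-empty; report each problem file on stderr.
require_files() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

With the configuration above, the call would be require_files models/phrases-tm.gz models/phrases-om.gz lm/news.lm.gz, run from the work directory.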
Now, let's tune the decoding model using MERT:
../scripts/phrasal-mert.pl 4g data/dev/nc-dev2007.tok.fr data/dev/nc-dev2007.tok.en bleu phrasal.conf
The above command runs MERT using four gigabytes of memory, using data/dev/nc-dev2007.tok.fr as input, data/dev/nc-dev2007.tok.en as the single reference translation, and optimizing parameters with respect to the BLEU metric. phrasal.conf is a Phrasal configuration file, which specifies decoder search parameters (e.g., stack size, distortion limit) and initial conditions for MERT. This configuration file contains comments to help users assign values that may be more suitable for their specific needs. Note that one can specify more than one reference translation by separating reference file names with commas (e.g., ref0,ref1,ref2,ref3).
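When several reference translations are available, the comma-separated argument can be built from a list of file names. This is a small sketch; join_refs is a hypothetical helper, and the file names ref0 through ref3 are the placeholder names used above rather than files from the tutorial data.

```shell
# Hypothetical helper: join all arguments with commas, producing the
# reference-list format described above (e.g., ref0,ref1,ref2,ref3).
join_refs() {
    # The subshell keeps the IFS change from leaking into the caller.
    (IFS=,; printf '%s\n' "$*")
}
```

The result can be passed in place of the single reference file, for example ../scripts/phrasal-mert.pl 4g data/dev/nc-dev2007.tok.fr "$(join_refs ref0 ref1 ref2 ref3)" bleu phrasal.conf.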
Step 3: Testing
This section explains how to test the SMT system just created in Step 2. As before, the data and supporting scripts used in our tutorial are taken from the WMT08 Shared Task. Testing the system involves first tokenizing the test data so that it matches the representation used to train and tune the system. The system is then used to translate the source data, and the resulting translations are evaluated using the BLEU score.
Use the following commands to tokenize the source and target test data.
mkdir data/devtest
scripts/tokenizer.perl -l fr < devtest/test2007.fr | tr 'A-Z' 'a-z' > data/devtest/test2007.tok.fr
scripts/tokenizer.perl -l en < devtest/test2007.en | tr 'A-Z' 'a-z' > data/devtest/test2007.tok.en
Next, extract a phrase table and reordering model for the phrases on the source side of the test data. The following extract-phrases command uses the same bitext alignments as in Step 2, where phrases were extracted for the tuning set, but this time it only extracts phrases that are applicable to the source sentences in data/devtest/test2007.tok.fr. The resulting phrase table and reordering model are placed in the directory testmodels.
mkdir testmodels
../scripts/extract-phrases 4g models testmodels -fFilterCorpus data/devtest/test2007.tok.fr
To create a decoder configuration file that uses the new phrase table and reordering model, first make a copy of the trained decoding model we just learned with phrasal-mert.pl.
cp phrasal-mert/phrasal.final.ini phrasal-mert/phrasal.test.ini
Edit phrasal-mert/phrasal.test.ini so that the [ttable-file] section and the HierarchicalReorderingFeaturizer in the [additional-featurizers] section point to testmodels/phrases-tm.gz and testmodels/phrases-om.gz, respectively.
# filename: phrasal-mert/phrasal.test.ini
[ttable-file]
testmodels/phrases-tm.gz
...
[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.HierarchicalReorderingFeaturizer(testmodels/phrases-om.gz,msd2-bidirectional-fe,LexR,hierarchical,hierarchical,bin)
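As an alternative to editing the file by hand, the two paths can be rewritten with sed. The sketch below assumes that the tuned configuration still contains the paths exactly as they were written in phrasal.conf (models/phrases-tm.gz on a line of its own, and models/phrases-om.gz as the first featurizer argument); this has not been confirmed for the output of phrasal-mert.pl. retarget_models is a hypothetical helper name.

```shell
# Hypothetical helper: copy a decoder configuration from $1 to $2, pointing
# the phrase table and reordering model at the testmodels directory.
# Assumes the original paths appear verbatim as in phrasal.conf.
retarget_models() {
    sed -e 's#^models/phrases-tm\.gz$#testmodels/phrases-tm.gz#' \
        -e 's#(models/phrases-om\.gz,#(testmodels/phrases-om.gz,#' \
        "$1" > "$2"
}
```

It would be called as retarget_models phrasal-mert/phrasal.final.ini phrasal-mert/phrasal.test.ini. The patterns do not match paths that already start with testmodels/, so running it on an already converted file leaves that file's contents unchanged.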
Translate the test source data using the new decoder configuration file.
../scripts/decode 4g phrasal-mert/phrasal.test.ini < data/devtest/test2007.tok.fr > test2007.trans.en
The quality of the translated sentences can be evaluated using the BLEU score.
../scripts/bleu data/devtest/test2007.tok.en < test2007.trans.en
Additional Information on Using Phrasal
Language Models
Phrasal works with ARPA format language models. Almost all software for building language models can produce models in ARPA backoff format, which is the de facto standard. The most commonly used package for building language models is SRILM, and that is what we use in the Quick Start Guide. It is also what we have typically used at Stanford with Phrasal. However, it is not free software. You might also be interested in other packages for building language models that are licensed under the GPL or LGPL:
- RandLM (GPL)
- IRST LM Toolkit (LGPL)
FAQ
- How do I use phrases with gaps?
You need to do two things:
1) Run extract-phrases with the extra arguments -withGaps true, which extracts both contiguous and discontinuous phrases. To prevent target-discontinuous phrases from being extracted, add the extra arguments -noTargetGaps true. Note that phrase extraction with discontinuous phrases requires much more memory than with contiguous phrases alone. The experiments of [Galley and Manning, 2010] required 16GB of RAM.
2) Add a gaps field to the decoder configuration file, for example:
[gaps]
15 7
This configures the decoder so that the maximum source side phrase span is 15 words, and the maximum target side phrase span is 7 words. As an alternative, adding --gaps 15 7 to the Phrasal command line achieves the same effect. To provide these arguments to Phrasal in MERT, instead add these extra arguments to phrasal-mert.pl: --phrasal_flags=\"--gaps 15 7\". Note that these two thresholds only affect discontinuous phrases, and do not prevent the use of contiguous phrases whose lengths happen to exceed these values.
The optimal maximum gap spans are data and language dependent. In the experiments of [Galley and Manning, 2010], the above two values represent what worked best for Chinese-to-English. However, disabling target gaps worked almost as well and made the decoder run 2 to 3 times faster. To prevent the use of phrases with target gaps, either omit the second value, or set it to 0 (in this case, target-discontinuous phrases are not even loaded into memory during decoder initialization). If target-discontinuous phrases are ignored during decoding, it is best to extract phrases using -noTargetGaps true.
For questions specific to discontinuous phrases in Stanford Phrasal, please email mgalley@acm.org.