Software/Phrasal

Revision as of 23:38, 27 February 2012 by Danielcer (Talk | contribs)

Phrasal Quick Start Guide

This guide explains how to set up and train a full Statistical Machine Translation (SMT) system using Phrasal. It offers step-by-step instructions to download, install, configure, and run the Phrasal decoder and its related support tools. The guide assumes some familiarity with Unix-like command-line interpreters (shells), such as bash for Linux and Mac OS X, and Cygwin for Microsoft Windows. The commands shown in this tutorial are written for bash, but they are relatively easy to adapt to other shells.

Step 1: Download and install Phrasal and its support tools

SRILM

Phrasal requires one or more language models, which can be trained using SRILM. SRILM can be downloaded here, and installed as follows:

export SRILM=$PWD/srilm
mkdir $SRILM
tar -xvzf srilm.tgz -C $SRILM
make World MAKE_PIC=1 -C $SRILM

If the last command fails, it often means that SRILM was unable to determine the machine type automatically. If so, define the MACHINE_TYPE variable on the command line. The directory $SRILM/common indicates which MACHINE_TYPE values are supported. For instance, the existence of a file Makefile.machine.i686-m64 indicates that i686-m64 is a supported architecture, in which case you may be able to compile SRILM as follows:

make World MAKE_PIC=1 MACHINE_TYPE=i686-m64 -C $SRILM
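
The supported values can also be listed directly from the file names. The following is a minimal sketch, assuming only the Makefile.machine.* naming convention described above; list_machine_types is a hypothetical helper name and is not part of SRILM.

```shell
# Hypothetical helper (not part of SRILM): print one candidate
# MACHINE_TYPE value per line, derived from the Makefile.machine.*
# files found under <srilm_dir>/common.
list_machine_types() {
  for f in "$1"/common/Makefile.machine.*; do
    [ -e "$f" ] || continue                    # the glob matched nothing
    printf '%s\n' "${f##*/Makefile.machine.}"  # strip directory and prefix
  done
}

# Usage:
#   list_machine_types "$SRILM"
```

Any value it prints can then be passed as MACHINE_TYPE on the make command line.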

Some of the most frequently asked questions about SRILM are answered in this FAQ, and other questions specific to SRILM may be posted on the srilm-user mailing list (but please make sure you read the FAQ first!). See below for discussion of using other language models.

Phrasal

The Phrasal software package is available here for download. To install Phrasal, simply run:

export PHRASAL=$PWD/phrasal.Beta3
wget http://www-nlp.stanford.edu/software/phrasal/phrasal.Beta3.tar.gz
tar -xzvf phrasal.Beta3.tar.gz

To use the package, you also need to install Stanford's CoreNLP tools available here. To install CoreNLP, run:

export CORENLP=$PWD/stanford-corenlp-2012-01-08
wget http://nlp.stanford.edu/software/stanford-corenlp-2012-01-08.tgz
tar -xzvf stanford-corenlp-2012-01-08.tgz

To build the package and to use TER and TERp, run the script scripts/first-build.sh from the Phrasal directory. This script automatically downloads TER and TERp, two packages that are freely available on Matthew Snover's website but cannot be redistributed by other parties. It then builds the entire source tree on your system.

export CORENLP=path/to/corenlp/dir
./scripts/first-build.sh

After the first build, you can rebuild the system using just ant. However, always set the environment variable CORENLP to the directory where you installed CoreNLP, so that Phrasal's dependencies on CoreNLP are satisfied.

export CORENLP=path/to/corenlp/dir
ant

For the next steps in this quick start guide, you'll need to add the Phrasal and CoreNLP jars to your classpath. If you are running a Bourne shell derivative, you can do this by sourcing the file scripts/setupenv.sh. If you are running a csh derivative, you can do this by sourcing the file scripts/setupenv.csh. Make sure that both environment variables, PHRASAL and CORENLP, are set before running either of these two commands.

# Bourne shell
. $PHRASAL/scripts/setupenv.sh

# csh shell
source $PHRASAL/scripts/setupenv.csh
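
Because the setup scripts depend on both variables, it can help to verify them before sourcing. This is a minimal Bourne-shell sketch; require_dir_var is a hypothetical helper name introduced here for illustration and does not ship with Phrasal.

```shell
# Hypothetical helper: fail with a message unless the named variable
# is non-empty and points to an existing directory.
#   $1 = variable name (used only in messages), $2 = its value
require_dir_var() {
  if [ -z "$2" ]; then
    echo "$1 is not set" >&2
    return 1
  fi
  if [ ! -d "$2" ]; then
    echo "$1 does not point to a directory: $2" >&2
    return 1
  fi
}

# Usage:
#   require_dir_var PHRASAL "$PHRASAL" &&
#   require_dir_var CORENLP "$CORENLP" &&
#   . "$PHRASAL/scripts/setupenv.sh"
```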

Phrasal requires that all data being translated be encoded in UTF-8 and that the decoder run in an environment with a UTF-8 locale.

Double check that you are using a UTF-8 locale with the locale command:

$ locale 

# example output 
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Any locale with a "UTF-8" suffix will work. At Stanford, we typically run the decoder with the locale "en_US.UTF-8". To use this locale, set the LC_ALL environment variable as follows:

# Bourne shell
export LC_ALL=en_US.UTF-8

# csh
setenv LC_ALL en_US.UTF-8
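
The locale check can also be scripted. The sketch below assumes the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG) and treats any name ending in a UTF-8 codeset as acceptable; is_utf8_locale is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given locale name ends in a
# UTF-8 codeset suffix (for example en_US.UTF-8 or en_US.utf8).
is_utf8_locale() {
  case "$1" in
    *.UTF-8|*.utf-8|*.UTF8|*.utf8) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (the effective character locale is LC_ALL, else LC_CTYPE, else LANG):
#   is_utf8_locale "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}" ||
#     echo "warning: not running in a UTF-8 locale" >&2
```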
 


Step 2: Training

This section explains how to train a full SMT system from a sentence-aligned bilingual corpus (bitext). Such data is available for several European languages, and you may also acquire such data from LDC. For the purposes of this tutorial, we train on a relatively small data set, but Phrasal has been used successfully in large-scale experiments, including the NIST and GALE evaluations.

Data and supporting scripts used in our tutorial were made available for the WMT08 Shared Task. The following commands download, tokenize, and downcase a French-English bitext.

To download and extract the data, run the commands below. (If you don't have wget, for example on Mac OS X, you can try using curl -O instead.)

mkdir $PHRASAL/work 
cd $PHRASAL/work
wget http://www.statmt.org/wmt08/training-parallel.tar
wget http://www.statmt.org/wmt08/devsets.tgz
wget http://homepages.inf.ed.ac.uk/jschroe1/how-to/scripts.tgz
tar -xvf training-parallel.tar  training/news-commentary08.fr-en.en.gz training/news-commentary08.fr-en.fr.gz
tar -xzvf devsets.tgz
tar -xzvf scripts.tgz
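
If the same commands need to run on machines with either tool, the choice between wget and curl -O can be made automatically. This is a rough sketch; pick_downloader and fetch are hypothetical helper names, and the only tools assumed are the two mentioned above.

```shell
# Hypothetical helper: print the download command to use,
# preferring wget and falling back to "curl -O".
pick_downloader() {
  if command -v wget >/dev/null 2>&1; then
    printf '%s\n' "wget"
  elif command -v curl >/dev/null 2>&1; then
    printf '%s\n' "curl -O"
  else
    echo "neither wget nor curl found" >&2
    return 1
  fi
}

# Hypothetical helper: download one URL into the current directory.
fetch() {
  dl=$(pick_downloader) || return 1
  $dl "$1"   # deliberately unquoted so that "curl -O" splits into two words
}

# Usage:
#   fetch http://www.statmt.org/wmt08/devsets.tgz
```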

To tokenize the data, run:

mkdir -p data/training
mkdir data/dev
gzip -cd training/news-commentary08.fr-en.fr.gz | scripts/tokenizer.perl -l fr | tr 'A-Z' 'a-z' > data/training/news-commentary.tok.fr
gzip -cd training/news-commentary08.fr-en.en.gz | scripts/tokenizer.perl -l en | tr 'A-Z' 'a-z' > data/training/news-commentary.tok.en
scripts/tokenizer.perl -l fr < dev/nc-dev2007.fr | tr 'A-Z' 'a-z' > data/dev/nc-dev2007.tok.fr
scripts/tokenizer.perl -l en < dev/nc-dev2007.en | tr 'A-Z' 'a-z' > data/dev/nc-dev2007.tok.en
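
Since training relies on a sentence-aligned bitext, both sides of each tokenized pair should contain the same number of lines. The following sketch performs that sanity check; same_line_count is a hypothetical helper name, not one of the downloaded scripts.

```shell
# Hypothetical helper: succeed only if both files have the same number
# of lines. Returns 2 if either file cannot be read.
same_line_count() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  n1=$(wc -l < "$1" | tr -d ' ')   # strip padding that some wc versions add
  n2=$(wc -l < "$2" | tr -d ' ')
  [ "$n1" -eq "$n2" ]
}

# Usage:
#   same_line_count data/training/news-commentary.tok.fr \
#                   data/training/news-commentary.tok.en ||
#     echo "bitext sides differ in length" >&2
```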

The first training step in our tutorial is to build a language model (LM) with SRILM:

cd $PHRASAL/work
mkdir lm
$SRILM/bin/*/ngram-count -order 3 -interpolate -kndiscount -unk -text data/training/news-commentary.tok.en -lm lm/news.lm.gz

In this case, we generated work/lm/news.lm.gz, a trigram LM smoothed with modified Kneser-Ney smoothing (-interpolate -kndiscount). Note that this LM was trained using only the target side of the training data (work/data/training/news-commentary.tok.en); it is generally recommended to add large amounts of monolingual data in the target language (English in our case). ngram-count supports more options, which are described on this man page.
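
To confirm that the resulting file is the trigram model expected, the n-gram order can be read from the ARPA header, which lists one "ngram N=count" line per order after the \data\ marker. This is a small sketch assuming a gzip-compressed ARPA file; arpa_max_order is a hypothetical helper name.

```shell
# Hypothetical helper: print the highest n-gram order declared in the
# \data\ header of a gzip-compressed ARPA language model.
arpa_max_order() {
  gzip -cd "$1" | awk '
    /^\\data\\/ { in_header = 1; next }            # start of the header
    in_header && /^ngram [0-9]+=/ {                # e.g. "ngram 3=12345"
      split($2, kv, "=")
      if (kv[1] + 0 > max) max = kv[1] + 0
      next
    }
    in_header && /^\\[0-9]+-grams:/ { exit }       # header is over
    END { print max + 0 }'
}

# Usage:
#   arpa_max_order lm/news.lm.gz
```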

Now let's align the bitext data. To do this, first save the following to the alignment configuration file aligner.conf:

# aligner.conf
##########################################
# Training: Defines the training regimen #
##########################################

forwardModels HMM
reverseModels HMM
mode JOINT
iters 2

###############################################
# Execution: Controls output and program flow #
###############################################

execDir models
create
saveParams true
numThreads 4
msPerLine 10000
alignTraining
leaveTrainingOnDisk
safeConcurrency true

#################
# Language/Data #
#################

foreignSuffix fr
englishSuffix en
lowercase

# Choose the training sources, which can either be directories or files that list files/directories
# Note that training on the test set does not peek at the correct answers (no cheating)
trainSources data/training
sentences MAX

# The test sources must have hand alignments for all sentence pairs
testSources
maxTestSentences MAX
offsetTestSentences 0

##############
# Evaluation #
##############

competitiveThresholding
writeGIZA
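
Before running the aligner, it can be useful to check that every file in the trainSources directory has a counterpart on the other side. The sketch below assumes that training files are paired by a shared base name plus the foreignSuffix and englishSuffix values, as with news-commentary.tok.fr and news-commentary.tok.en; unpaired_sources is a hypothetical helper name.

```shell
# Hypothetical helper: print every <dir>/*.<foreign suffix> file that
# has no matching <same base>.<english suffix> file.
#   $1 = training directory, $2 = foreign suffix, $3 = english suffix
unpaired_sources() {
  for f in "$1"/*."$2"; do
    [ -e "$f" ] || continue          # the glob matched nothing
    base=${f%."$2"}                  # drop the foreign suffix
    [ -e "$base.$3" ] || printf '%s\n' "$f"
  done
}

# Usage:
#   unpaired_sources data/training fr en
```

No output means that every foreign-side file found has an English-side partner.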

Then use the following command to run the Berkeley aligner:

../scripts/align-words 4g aligner.conf

Depending on your machine and the amount of training data, this command may take hours or days to complete. On the tutorial data, it shouldn't take more than one hour (typically much less with modern computers). You may want to adjust the amount of RAM allocated to the aligner based on the amount of training data you have available and the amount of RAM on your machine. This is done by replacing 4g (which stands for 4 gigabytes) with a more appropriate value, e.g., 16g.

The next step extracts all phrase pairs of the training data that are consistent with the word alignment computed in the previous step. To reduce running time and memory usage, phrase extraction in Phrasal is typically done by filtering phrase tables against specific tuning or test sets (i.e., all the source-language data provided to the SMT decoder). To extract phrase pairs applicable to our tuning set (work/data/dev/nc-dev2007.tok.fr), run:

../scripts/extract-phrases 4g models models -fFilterCorpus data/dev/nc-dev2007.tok.fr

The first command-line argument (4g) again specifies the amount of memory. The second argument (models) names the directory containing the word alignments computed in the previous step, and the third argument (also models here) names the directory in which the phrase table and reordering model will be stored. Note that ../scripts/extract-phrases uses phrase extraction parameters that generally worked well in our experiments (on Chinese-to-English and Arabic-to-English), but that may be ineffective for some other language pairs. To see a list of extraction parameters you may be able to adjust for other language pairs, run:

../scripts/extract-phrases -help

The last step is to run minimum error rate training (MERT). First, save the following decoder configuration to a file named phrasal.conf. It tells the system where to find the language model, phrase table, and reordering table created above, and provides a set of initial decoding weights for the first iteration of MERT.

# filename: phrasal.conf

# translation table
[ttable-file]
models/phrases-tm.gz

# language model
[lmodel-file]
lm/news.lm.gz

# number of translation options for each phrase in f
[ttable-limit]
20

[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.HierarchicalReorderingFeaturizer(models/phrases-om.gz,msd2-bidirectional-fe,LexR,hierarchical,hierarchical,bin)

# reordering weights
[weight-d]
1
1
1
1
1
1
1

# language model weight
[weight-l]
1

# translation model weights
[weight-t]
0.3
0.2
0.3
0.2
0

# word penalty weight
[weight-w]
0

# maximum gap between covered spans
[distortion-limit]
6

# detect processors present, and use them all
[localprocs]
0
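
When editing this file, it can be handy to print the values of a single section, for example to confirm the initial weights. The sketch below assumes only the layout shown above: a bracketed section name on its own line, followed by one value per line, with # starting a comment. conf_section is a hypothetical helper name and is not part of Phrasal.

```shell
# Hypothetical helper: print the values listed under one section of a
# configuration file laid out like phrasal.conf above.
#   $1 = configuration file, $2 = section name without brackets
conf_section() {
  awk -v want="[$2]" '
    /^[[:space:]]*#/ || /^[[:space:]]*$/ { next }   # skip comments and blank lines
    /^\[.*\]$/ { in_section = ($0 == want); next }  # a section header
    in_section { print }                            # a value in the wanted section
  ' "$1"
}

# Usage:
#   conf_section phrasal.conf weight-t
```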

Now, let's tune the decoding model using MERT:

../scripts/phrasal-mert.pl 4g data/dev/nc-dev2007.tok.fr data/dev/nc-dev2007.tok.en bleu phrasal.conf

The above command runs MERT with four gigabytes of memory, using data/dev/nc-dev2007.tok.fr as input and data/dev/nc-dev2007.tok.en as the single reference translation, and optimizes parameters with respect to the BLEU metric. phrasal.conf is a Phrasal configuration file, which specifies decoder search parameters (e.g., stack size, distortion limit) and initial conditions for MERT. This configuration file contains comments to help users assign values that may be more suitable for their specific needs. Note that one can specify more than one reference translation by separating reference file names with commas (e.g., ref0,ref1,ref2,ref3).
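
With several reference files, the comma-separated argument can be assembled from a list of file names. This is a minimal sketch; join_refs is a hypothetical helper name, and the file names in the usage comment are the placeholders from the example above.

```shell
# Hypothetical helper: join all arguments with commas, with no spaces.
join_refs() {
  out=
  for r in "$@"; do
    out=${out:+$out,}$r   # add a comma only between items
  done
  printf '%s\n' "$out"
}

# Usage:
#   ../scripts/phrasal-mert.pl 4g data/dev/nc-dev2007.tok.fr \
#     "$(join_refs ref0 ref1 ref2 ref3)" bleu phrasal.conf
```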

Step 3: Testing

This section explains how to test the SMT system just created in Step 2. As before, the data and supporting scripts used in our tutorial are taken from the WMT08 Shared Task. Testing the system involves first tokenizing the test data so that it matches the representation used to train and tune the system. The system is then used to translate the source data, and the resulting translations are evaluated using the BLEU score.

Use the following commands to tokenize the source and target test data.

mkdir data/devtest
scripts/tokenizer.perl -l fr < devtest/test2007.fr | tr 'A-Z' 'a-z' > data/devtest/test2007.tok.fr
scripts/tokenizer.perl -l en < devtest/test2007.en | tr 'A-Z' 'a-z' > data/devtest/test2007.tok.en

Next, extract a phrase table and reordering model for the phrases on the source side of the test data. The following extract-phrases command uses the same bitext alignments that were used above to extract phrases for the tuning set, but this time it extracts only phrases that are applicable to the source phrases in data/devtest/test2007.tok.fr. The resulting phrase table and reordering model are placed in the directory testmodels.

mkdir testmodels
../scripts/extract-phrases 4g models testmodels -fFilterCorpus data/devtest/test2007.tok.fr

To create a decoder configuration file that uses the new phrase table and reordering model, first make a copy of the trained decoding model we just learned with phrasal-mert.pl.

cp phrasal-mert/phrasal.final.ini phrasal-mert/phrasal.test.ini

Edit phrasal-mert/phrasal.test.ini so that the [ttable-file] section and the HierarchicalReorderingFeaturizer in the [additional-featurizers] section point to testmodels/phrases-tm.gz and testmodels/phrases-om.gz, respectively.

# filename: phrasal-mert/phrasal.test.ini

[ttable-file]
testmodels/phrases-tm.gz

...

[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.HierarchicalReorderingFeaturizer(testmodels/phrases-om.gz,msd2-bidirectional-fe,LexR,hierarchical,hierarchical,bin)
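
The same edit can be scripted with sed. The sketch below assumes that the two model paths appear in the .ini file exactly as they were written in phrasal.conf (models/phrases-tm.gz on its own line and models/phrases-om.gz as the first featurizer argument); if MERT wrote them differently, adjust the patterns. retarget_models is a hypothetical helper name.

```shell
# Hypothetical helper: print a copy of a decoder configuration in which
# the phrase table and reordering model paths point to testmodels/.
# Assumes the paths are spelled exactly as in phrasal.conf.
retarget_models() {
  sed -e 's#^models/phrases-tm\.gz$#testmodels/phrases-tm.gz#' \
      -e 's#(models/phrases-om\.gz,#(testmodels/phrases-om.gz,#' "$1"
}

# Usage:
#   retarget_models phrasal-mert/phrasal.final.ini > phrasal-mert/phrasal.test.ini
```

Anchoring the patterns on the start of the line and on the opening parenthesis keeps a second run from rewriting paths that already point to testmodels.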

Translate the test source data using the new decoder configuration file.

../scripts/decode 4g phrasal-mert/phrasal.test.ini < data/devtest/test2007.tok.fr > test2007.trans.en

The quality of the translated sentences can be evaluated using the BLEU score.

../scripts/bleu data/devtest/test2007.tok.en < test2007.trans.en

Additional Information on Using Phrasal

Language Models

Phrasal works with ARPA format language models. ARPA backoff format is the de facto standard, and almost all software for building language models can produce it. The most commonly used package for building language models is SRILM, which is what we use in the Quick Start Guide and what we have typically used at Stanford with Phrasal. However, it is not free software. You might also be interested in other packages for building language models, several of which are licensed under the GPL.

FAQ

  • How do I use phrases with gaps?

You need to do two things:

1) Run extract-phrases with the extra arguments -withGaps true, which extracts both contiguous and discontinuous phrases. To prevent target-discontinuous phrases from being extracted, add the extra arguments -noTargetGaps true. Note that phrase extraction with discontinuous phrases requires much more memory than with contiguous phrases alone. The experiments of [Galley and Manning, 2010] required 16GB of RAM.

2) Add a gaps field to the decoder configuration file, for example:

[gaps]
15
7

This configures the decoder so that the maximum source side phrase span is 15 words, and the maximum target side phrase span is 7 words. As an alternative, adding --gaps 15 7 to the Phrasal command line achieves the same effect. To provide these arguments to Phrasal in MERT, instead add these extra arguments to phrasal-mert.pl: --phrasal_flags=\"--gaps 15 7\". Note that these two thresholds only affect discontinuous phrases, and do not prevent the use of contiguous phrases whose lengths happen to exceed these values.

The optimal maximum gap spans are data and language dependent. In the experiments of [Galley and Manning, 2010], the above two values represent what worked best for Chinese-to-English. However, disabling target gaps worked almost as well and made the decoder run 2 to 3 times faster. To prevent the use of phrases with target gaps, either omit the second value, or set it to 0 (in this case, target-discontinuous phrases are not even loaded into memory during decoder initialization). If target-discontinuous phrases are ignored during decoding, it is best to extract phrases using -noTargetGaps true.
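
As an illustration of the source-gaps-only setup described above, the following sketch prints a [gaps] section whose second value defaults to 0, which disables target gaps. gaps_section is a hypothetical helper name, and the value 15 is simply the source-side span from the earlier example.

```shell
# Hypothetical helper: print a [gaps] section for the decoder
# configuration file.
#   $1 = maximum source-side span, $2 = maximum target-side span
#        (defaults to 0, which disables target-discontinuous phrases)
gaps_section() {
  printf '[gaps]\n%s\n%s\n' "$1" "${2:-0}"
}

# Usage (pair this with phrases extracted using -noTargetGaps true):
#   gaps_section 15 >> phrasal.conf
```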

For questions specific to discontinuous phrases in Stanford Phrasal, please email mgalley@acm.org.
