


An advanced usage of NiuTrans - Version 1.3.1 Beta

Config Files
NiuTrans has many features which might be helpful for advanced users. All of these features can be activated by modifying the config files provided within the package. The following is a description of each configuration file used in the system.

NiuTrans.phrase.user.config
"NiuTrans.phrase.user.config" records the decoding settings. Users can modify it and use the advanced features of NiuTrans. Basically "NiuTrans.phrase.user.config" contains the following information.

###########################################
### NiuTrans decoder configuration file ###
###          phrase-based system        ###
###              2011-07-01             ###
###########################################

#>>> runtime resource tables

# language model
param="Ngram-LanguageModel-File"     value="../sample-data/lm.trie.data"

# target-side vocabulary
param="Target-Vocab-File"            value="../sample-data/lm.vocab"

# MaxEnt-based lexicalized reordering model
param="ME-Reordering-Table"          value="../training/me.reordering.table"

# MSD lexicalized reordering model
param="MSD-Reordering-Model"         value="../training/msd.reordering.table"

# phrase translation model
param="Phrase-Table"                 value="../training/phrase.translation.table"

#>>> runtime parameters

# number of MERT iterations
param="nround"                       value="10"

# order of n-gram language model
param="ngram"                        value="3"

# use punctuation pruning (1) or not (0)
param="usepuncpruning"               value="1"

# use cube-pruning (1) or not (0)
param="usecubepruning"               value="1"

# use maxent reordering model (1) or not (0)
param="use-me-reorder"               value="1"

# use msd reordering model (1) or not (0)
param="use-msd-reorder"              value="1"

# number of threads
param="nthread"                      value="4"

# how many translations are dumped
param="nbest"                        value="20"

# output OOVs and word-deletions in the translation result
param="outputnull"                   value="0"

# beam size (or beam width)
param="beamsize"                     value="20"

# number of references of dev. set
param="nref"                         value="1"

#>>> model parameters

# features defined in the log-linear model
#  0: n-gram language model
#  1: number of target-words
#  2: Pr(e|f). f->e translation probability.
#  3: Lex(e|f). f->e lexical weight
#  4: Pr(f|e). e->f translation probability.
#  5: Lex(f|e). e->f lexical weight
#  6: number of phrases
#  7: number of bi-lex links (not fired in current version)
#  8: number of NULL-translation (i.e. word deletion)
#  9: MaxEnt-based lexicalized reordering model
# 10: <UNDEFINED>
# 11: MSD reordering model: Previous & Monotonic
# 12: MSD reordering model: Previous & Swap
# 13: MSD reordering model: Previous & Discontinuous
# 14: MSD reordering model: Following & Monotonic
# 15: MSD reordering model: Following & Swap
# 16: MSD reordering model: Following & Discontinuous

# feature weights
param="weights" \
value="1.000 0.500 0.200 0.200 0.200 0.200 0.500 0.500 -0.100 1.000 0.000 0.100 0.100 0.100 0.100 0.100 0.100"

# bound the feature weight in MERT
# e.g. the first number "-3:7" means that the first feature weight ranges in [-3, 7]
param="ranges" \
value="-3:7 -3:3 0:3 0:0.4 0:3 0:0.4 -3:3 -3:3 -3:0 -3:3 0:0 0:3 0:0.3 0:0.3 0:3 0:0.3 0:0.3"

# fix a dimension (1) or not (0)
param="fixedfs"  value="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

NiuTrans.phrase.train.model.config
"\config\NiuTrans.phrase.train.model.config" records the settings for training the translation table and the reordering models. It contains the following information.

###########################################
### NiuTrans  phrase train model config ###
###########################################

# temp file path
param="Lexical-Table"                value="lex"
param="Extract-Phrase-Pairs"         value="extractPhrasePairs"

# phrase table parameters
param="Max-Source-Phrase-Size"       value="3"                          # number greater than 0
param="Max-Target-Phrase-Size"       value="5"                          # number greater than 0
param="Phrase-Cut-Off"               value="0"                          # number not less than 0

# phrase translation model
param="Phrase-Table"                 value="phrase.translation.table"

# maxent lexicalized reordering model
param="ME-max-src-phrase-len"        value="3"                          # > 0 or = -1 (unlimited)
param="ME-max-tar-phrase-len"        value="5"                          # > 0 or = -1 (unlimited)
param="ME-null-algn-word-num"        value="1"                          # >= 0 or = -1 (unlimited)
param="ME-use-src-parse-pruning"     value="0"                          # "0" or "1"
param="ME-src-parse-path"            value="/path/to/src-parse/"        # source parses (one parse per line)
param="ME-max-sample-num"            value="5000000"                    # number greater than 0 or "-1" (unlimited)
param="ME-Reordering-Table"          value="me.reordering.table"

# msd lexicalized reordering model
param="MSD-model-type"               value="1"                          # "1", "2" or "3"
param="MSD-filter-method"            value="tran-table"                 # "tran-table" or "msd-sum-1"
param="MSD-max-phrase-len"           value="7"                          # number greater than 0
param="MSD-Reordering-Model"         value="msd.reordering.table"
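
Both config files use the same simple `param="…" value="…"` line format with "#" comments. For reference, reading such a file can be sketched as follows (a minimal illustration of the format shown above, not NiuTrans's own parser):

```python
import re

# Matches NiuTrans-style configuration lines: param="name"  value="setting"
PARAM_RE = re.compile(r'param="([^"]+)"\s+value="([^"]*)"')

def parse_config(text):
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]   # strip trailing comments
        m = PARAM_RE.search(line)
        if m:
            params[m.group(1)] = m.group(2)
    return params

sample = '''
# order of n-gram language model
param="ngram"                        value="3"
param="Phrase-Cut-Off"               value="0"   # number not less than 0
'''
print(parse_config(sample))   # → {'ngram': '3', 'Phrase-Cut-Off': '0'}
```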

Useful Features and Tips

The following information is provided for your reference:
How to Generate n-Best Lists
How to Enlarge the Beam Width
What Pruning Methods are Adopted
How to Speed-up the Decoder
How Many Reference Translations Can Be Involved
How to Use Higher Order N-gram Language Models
How to Control Phrase Table Size
How to Scale ME-based Reordering Model to Larger Corpus
How to Scale MSD Reordering Model to Larger Corpus
How to Add Self-developed Features
How to Plug External Translations into the Decoder

  • How to Generate n-Best Lists
    This can be done by setting the parameter "nbest" defined in "NiuTrans.phrase.user.config". E.g. if you want to generate a list of 50-best translations, you can modify "NiuTrans.phrase.user.config" as follows:

    # how many translations are dumped
    param="nbest"                     value="50"
    
  • How to Enlarge the Beam Width
    In the NiuTrans system the beam width is controlled by the parameter "beamsize" defined in "NiuTrans.phrase.user.config". E.g. if you wish to choose a beam of width 100, you can modify "NiuTrans.phrase.user.config" as follows:

    # beam size (or beam width)
    param="beamsize"                     value="100"
    
  • What Pruning Methods are Adopted
    The current version supports two pruning methods: punctuation pruning and cube pruning. The first method divides the input sentence into several segments according to punctuation marks (such as commas). Decoding is then performed on each segment individually. Finally, the translation is generated by gluing together the translations of these segments. The second method can be regarded as an instance of heuristic search. Here we re-implement the method described in (Chiang, 2007).
        To activate the two pruning techniques, users can fire the triggers "usepuncpruning" and "usecubepruning" defined in "NiuTrans.phrase.user.config". Each of them can also be set individually.

    # use punctuation pruning (1) or not (0)
    param="usepuncpruning"               value="1"
    
    # use cube-pruning (1) or not (0)
    param="usecubepruning"               value="1"
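
    The idea behind punctuation pruning can be illustrated with a toy sketch. Here `translate_segment` is a hypothetical stand-in for the real decoder; the point is only the split-decode-glue flow described above:

```python
import re

def punc_prune_decode(sentence, translate_segment):
    """Split at segment-level punctuation, decode each segment
    independently, then glue the partial translations together."""
    # Split while keeping the punctuation tokens themselves.
    parts = re.split(r'\s*([,;:])\s*', sentence)
    out = []
    for part in parts:
        if part in {",", ";", ":"}:
            out.append(part)                  # punctuation is copied through
        elif part:
            out.append(translate_segment(part))
    return " ".join(out)

# A trivial stand-in "decoder" that just uppercases a segment.
print(punc_prune_decode("a b , c d", str.upper))   # → "A B , C D"
```

    Because each segment is decoded independently, the search space per decoding run is much smaller, which is where the speed-up comes from.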
    

  • How to Speed-up the Decoder
    A straightforward solution is pruning. As described above, punctuation pruning and/or cube pruning can be employed to speed up the system. By default both of them are activated in our system (on Chinese-English translation tasks, they generally lead to a 10-fold speed improvement). Also, multi-thread running-mode can make the system faster if more than one CPU/core is available. To run the system on multiple threads, users can use the parameter "nthread" defined in "NiuTrans.phrase.user.config". E.g. if you want to run the decoder with 6 threads, you can set "nthread" like this:

    # number of threads
    param="nthread"                      value="6"
    

    To further speed up the system, another obvious solution is to filter the translation table and the reordering models using the input sentences. This feature will be supported in a later version of the system.

  • How Many Reference Translations Can Be Involved
    The NiuTrans system does not impose any upper limit on the number of reference translations used in either weight tuning or evaluation. E.g. if you want to use 3 references for weight tuning, you can format your tuning data file as follows (note that "#" indicates a comment here, and SHOULD NOT appear in users' files).

    澳洲 重新 开放 驻 马尼拉 大使馆               # sentence-1
                                                  # a blank line
    australia reopens embassy in manila           # the 1st reference translation
    australia reopened manila embassy             # the 2nd reference translation
    australia reopens its embassy to manila       # the 3rd reference translation
    澳洲 是 与 北韩 有邦交 的 少数 国家 之 .      # sentence-2
    ...

    Then set the "-nref" accordingly. For weight tuning (Note: "-nref 3"),

    $> perl NiuTrans-mert-model.pl \
            -dev ../sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
            -c ../work/NiuTrans.phrase.user.config \
            -nref 3 \
            -r 3 \
            -l ../work/mert-model.log

    For evaluation (Note: "-nref 3"),

    ...
    $> perl NiuTrans-generate-xml-for-mteval.pl -1f 1best.out -tf test-ref.txt -rnum 3
    ...
    
  • How to Use Higher Order N-gram Language Models
    You first need to choose the order of the n-gram language model. E.g. if you prefer a 5-gram language model, you can type the following command to train the LM (NOTE: "-n 5").

    $> ../bin/NiuTrans.LMTrainer -t sample-submission-version/LM-training-set/e.lm.txt -n 5 \
       -v lm.vocab -m lm.trie.data
    

    Then set the config file accordingly:

    $> cd scripts/
    $> perl NiuTrans-generate-mert-config.pl -tmdir ../work/model/ -lmdir ../work/lm/ \
            -ngram 5 -o ../work/NiuTrans.phrase.user.config

  • How to Control Phrase Table Size
    To avoid an extremely large phrase table, "\config\NiuTrans.phrase.train.model.config" defines two parameters, "Max-Source-Phrase-Size" and "Max-Target-Phrase-Size", which control the maximum number of words on the source-side and target-side of a phrase-pair, respectively. Both parameters greatly impact the number of resulting phrase-pairs. Note that, although extracting larger phrases can increase the coverage of the phrase table, it does not always improve BLEU, due to data sparseness.
        Another way to reduce the size of the phrase table is to discard low-frequency phrases. This can be done using the parameter "Phrase-Cut-Off" defined in "\config\NiuTrans.phrase.train.model.config". When "Phrase-Cut-Off" is set to n, all phrases appearing n times or fewer are discarded by the NiuTrans system.
        E.g. the example below shows how to obtain a phrase table of a reasonable size. In this setting, the maximum numbers of source words and target words are set to 3 and 5, respectively. Moreover, all phrases with frequency 1 are filtered out.

    param="Max-Source-Phrase-Size"       value="3"
    param="Max-Target-Phrase-Size"       value="5"
    param="Phrase-Cut-Off"               value="1"
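
    The cut-off semantics ("appearing n times or fewer are discarded") can be sketched as follows. This is a hypothetical illustration of the behavior described above, not NiuTrans code:

```python
from collections import Counter

def apply_cutoff(phrase_pairs, cutoff):
    """Discard phrase pairs whose frequency is <= cutoff,
    mirroring the "Phrase-Cut-Off" semantics described above."""
    counts = Counter(phrase_pairs)
    return {pair: n for pair, n in counts.items() if n > cutoff}

pairs = [("一定", "must")] * 3 + [("一定", "be")] * 1
print(apply_cutoff(pairs, 1))   # → {('一定', 'must'): 3}
```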
    
  • How to Scale ME-based Reordering Model to Larger Corpus
    We follow the work of (Xiong et al., 2006) to design the ME-based lexicalized reordering model. In general, the size of the (ME-based) reordering model increases greatly as more training data is involved. Thus several parameters are defined to control the size of the resulting model. They can be found in the configuration file "\config\NiuTrans.phrase.train.model.config", and start with symbol "ME-".
        1. "ME-max-src-phrase-len" and "ME-max-tar-phrase-len" restrict the maximum number of words appearing in the source-side and target-side phrases. Obviously, a smaller "ME-max-src-phrase-len" (or "ME-max-tar-phrase-len") means a smaller model.
        2. "ME-null-algn-word-num" limits the number of unaligned target words that appear between two adjacent blocks.
        3. "ME-use-src-parse-pruning" is a trigger that indicates using source-side parses to constrain training sample extraction. In our in-house experiments, using source-side parses as constraints greatly reduced the size of the resulting model without a significant loss in BLEU score.
        4. "ME-src-parse-path" specifies the file of source parses (one parse per line). It is meaningful only when "ME-use-src-parse-pruning" is turned on.
        5. "ME-max-sample-num" limits the maximum number of extracted samples. Because the ME trainer "maxent(.exe)" cannot work on a very large number of training samples, capping the number of extracted samples is a reasonable way to avoid unacceptable training time and memory cost. By default, "ME-max-sample-num" is set to 5000000 in the NiuTrans system. This setting means that only the first 5,000,000 samples affect model training, so an overly large training corpus does not actually result in a larger model.
        To train ME-based reordering model on a larger data set, it is recommended to set the above parameters as follows (for Chinese-to-English translation tasks). Note that this setting requires users to provide the source-side parse trees for pruning.

    param="ME-max-src-phrase-len"        value="3"
    param="ME-max-tar-phrase-len"        value="5"
    param="ME-null-algn-word-num"        value="1"
    param="ME-use-src-parse-pruning"     value="1"                      # if you have source parses
    param="ME-src-parse-path"            value="/path/to/src-parse/"
    param="ME-max-sample-num"            value="-1"                     # depends on how large your corpus is
                                                                        # can be set to a positive number as needed
  • How to Scale MSD Reordering Model to Larger Corpus
    It is worth pointing out that the NiuTrans system has three models for calculating the M, S and D probabilities. Users can choose one of them using the parameter "MSD-model-type". When "MSD-model-type" is set to "1", MSD reordering is modeled on the word level, as in the Moses system. In addition to this basic model, the phrase-based MSD model and the hierarchical phrase-based MSD model (Galley et al., 2008) are also implemented. They can be used when "MSD-model-type" is set to "2" or "3".
        When trained on a large corpus, the MSD model might be very large. The situation is even more severe when model 3 is involved. To alleviate this problem, users can use the parameter "MSD-filter-method", which filters the MSD model using the phrase translation table (any entry that is not covered by the phrase table is excluded).
        Also, users can use the parameter "MSD-max-phrase-len" to limit the maximum number of words in a source or target phrase. This parameter can effectively limit the size of the generated MSD model.
        The example below restricts the MSD model to an acceptable size.

    param="MSD-model-type"               value="1"                             # "1", "2" or "3"
    param="MSD-filter-method"            value="tran-table"                    # "tran-table" or "msd-sum-1"
    param="MSD-max-phrase-len"           value="7"                             # number greater than 0
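
    The "tran-table" filtering method keeps only those MSD entries whose phrase pair also appears in the phrase translation table. A minimal sketch of that idea, assuming the `src ||| tgt ||| …` field layout shown elsewhere on this page (the real tools may store the models differently):

```python
def filter_msd_by_phrase_table(msd_lines, phrase_table_lines):
    """Keep only MSD entries whose (src, tgt) pair is covered by the
    phrase translation table ("tran-table" filtering, as described above)."""
    covered = set()
    for line in phrase_table_lines:
        src, tgt = line.split(" ||| ")[:2]
        covered.add((src, tgt))
    kept = []
    for line in msd_lines:
        src, tgt = line.split(" ||| ")[:2]
        if (src, tgt) in covered:
            kept.append(line)
    return kept

phrase_table = ["一定 ||| must ||| -2.35 -2.90 -1.60 -2.12 1 0"]
msd = ["一定 ||| must ||| 0.6 0.2 0.2 0.5 0.3 0.2",
       "一定 ||| maybe ||| 0.1 0.1 0.8 0.3 0.3 0.4"]
print(filter_msd_by_phrase_table(msd, phrase_table))  # keeps only the "must" entry
```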
    
  • How to Add Self-developed Features
    The NiuTrans system allows users to add self-developed features to the phrase translation table. In the default setting, each entry in the translation table is associated with 6 features. E.g. below is a sample table ("phrase.translation.table"), where each entry is coupled with a 6-dimensional feature vector.

    ...
    一定 ||| must ||| -2.35374 -2.90407 -1.60161 -2.12482 1 0
    一定 ||| a certain ||| -2.83659 -1.07536 -4.97444 -1.90004 1 0
    一定 ||| be ||| -4.0444 -5.74325 -2.32375 -4.46486 1 0
    一定 ||| be sure ||| -4.21145 -1.3278 -5.75147 -3.32514 1 0
    一定 ||| ' ll ||| -5.10527 -5.32301 -8.64566 -4.80402 1 0
    ...

    To add new features into the table, users can append them to these feature vectors. E.g. suppose that we wish to add a feature that indicates whether the phrase pair is low-frequency in the training data (appears only once) or not (appears two times or more). We can update the above table, as follows:

    ...
    一定 ||| must ||| -2.35374 -2.90407 -1.60161 -2.12482 1 0 0
    一定 ||| a certain ||| -2.83659 -1.07536 -4.97444 -1.90004 1 0 0
    一定 ||| be ||| -4.0444 -5.74325 -2.32375 -4.46486 1 0 0
    一定 ||| be sure ||| -4.21145 -1.3278 -5.75147 -3.32514 1 0 0
    一定 ||| ' ll ||| -5.10527 -5.32301 -8.64566 -4.80402 1 0 1
    ...

    We then modify the config file "NiuTrans.phrase.user.config" to activate the newly-introduced feature.

    param="freefeature"                   value="1"
    param="tablefeatnum"                  value="7"
    

    where "freefeature" is a trigger that activates the use of additional features, and "tablefeatnum" sets the total number of features defined in the table.
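
    Mechanically, adding such a feature just means appending one more column to every entry. A sketch of the low-frequency example above, where `counts` and `is_low_frequency` are hypothetical helpers standing in for statistics collected during training:

```python
def append_feature(table_lines, feature_fn):
    """Append one extra feature value to each phrase-table entry.
    feature_fn maps (src, tgt) to the new feature's value."""
    out = []
    for line in table_lines:
        fields = line.split(" ||| ")
        src, tgt = fields[0], fields[1]
        out.append(line + " " + str(feature_fn(src, tgt)))
    return out

# Hypothetical phrase-pair counts collected during training.
counts = {("一定", "must"): 37, ("一定", "' ll"): 1}

def is_low_frequency(src, tgt):
    # Fires (1) for pairs seen at most once, as in the example above.
    return 1 if counts.get((src, tgt), 0) <= 1 else 0

table = ["一定 ||| must ||| -2.35374 -2.90407 -1.60161 -2.12482 1 0",
         "一定 ||| ' ll ||| -5.10527 -5.32301 -8.64566 -4.80402 1 0"]
for line in append_feature(table, is_low_frequency):
    print(line)
```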

  • How to Plug External Translations into the Decoder
    The NiuTrans system also defines some special markups to support this feature. E.g. below is a sample sentence to be decoded.

    彼得泰勒 是 一名 英国 资深 金融 分析师 .
    (Peter Taylor is a senior financial analyst in the UK .)

    If you have prior knowledge about how to translate "彼得泰勒" and "英国", you can add your own translations using the special markups:

    彼得泰勒 是 一名 英国 资深 金融 分析师 . |||| {0 ||| 0 ||| Peter Taylor ||| $ne ||| 彼得泰勒} \
    {3 ||| 3 ||| UK ||| $ne ||| 英国}

    where "||||" is a separator, and "{0 ||| 0 ||| Peter Taylor ||| $ne ||| 彼得泰勒}" and "{3 ||| 3 ||| UK ||| $ne ||| 英国}" are two user-defined translations. Each consists of 5 terms. The first two numbers indicate the span to be translated; the third term is the translation specified by the user; the fourth term indicates the type of translation; and the last term repeats the corresponding source word sequence. Note that "\" is used only to ease the display here. Please remove "\" in your file and use "彼得泰勒 是 一名 英国 资深 金融 分析师 . |||| {0 ||| 0 ||| Peter Taylor ||| $ne ||| 彼得泰勒}{3 ||| 3 ||| UK ||| $ne ||| 英国}" directly.
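
    Generating this markup programmatically is straightforward. A sketch (a hypothetical helper, following the five-term format described above, with 0-based token spans):

```python
def annotate(tokens, constraints):
    """Append user-defined translation constraints to an input sentence.
    constraints: list of (start, end, translation, type) over token spans."""
    blocks = []
    for start, end, translation, ctype in constraints:
        span_text = " ".join(tokens[start:end + 1])
        blocks.append("{%d ||| %d ||| %s ||| %s ||| %s}"
                      % (start, end, translation, ctype, span_text))
    return " ".join(tokens) + " |||| " + "".join(blocks)

tokens = "彼得泰勒 是 一名 英国 资深 金融 分析师 .".split()
line = annotate(tokens, [(0, 0, "Peter Taylor", "$ne"),
                         (3, 3, "UK", "$ne")])
print(line)
```

    Running this reproduces the annotated input line shown above.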