


A Step-by-step manual of NiuTrans.Syntax - Version 1.3.1 Beta

1. Data Preparation

  • The NiuTrans system is a "data-driven" MT system, which requires data for training and/or tuning. Users need to prepare the following data files before running the system.
    a). Training data: bilingual sentence-pairs and word alignments.
    b). Tuning data: source sentences with one or more reference translations.
    c). Test data: new sentences to be translated.
    d). Evaluation data: reference translations of test sentences.
    In the NiuTrans package, sample files are provided for experimenting with the system and studying the format requirements. They are located in "NiuTrans/sample-data/sample-submission-version".

    sample-submission-version/
      -- TM-training-set/                   # word-aligned bilingual corpus (100,000 sentence-pairs)
           -- chinese.txt                   # source sentences
           -- english.txt                   # target sentences (case-removed)
           -- Alignment.txt                 # word alignments of the sentence-pairs
           -- chinese.tree.txt              # parse trees of source sentences
           -- english.tree.txt              # parse trees of target sentences
      -- LM-training-set/
           -- e.lm.txt                      # monolingual corpus for training language model (100K target sentences)
      -- Dev-set/
           -- Niu.dev.txt                   # development dataset for weight tuning (400 sentences)
           -- Niu.dev.tree.txt              # development dataset with tree annotation (on source sentences)
      -- Test-set/
           -- Niu.test.txt                  # test dataset (1K sentences)
           -- Niu.test.tree.txt             # test dataset with tree annotation
      -- Reference-for-evaluation/
           -- Niu.test.reference            # references of the test sentences (1K sentences)
      -- description-of-the-sample-data     # a description of the sample data
  • Format: please unpack "NiuTrans/sample-data/sample.tar.gz" and refer to "description-of-the-sample-data" for more information about the data formats.

  • In the following, the above data files are used to illustrate how to run the NiuTrans system (e.g. how to train MT models, tune feature weights, and decode test sentences).
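  • Tip: the bilingual files and the alignment file are paired line by line, so a quick sanity check before training is that they all contain the same number of lines. A minimal sketch with tiny stand-in files (point the paths at the real TM-training-set files in practice; the file contents here are placeholders, only the line counts matter):

```shell
# Stand-in files; replace with the real TM-training-set paths in practice.
printf 'src 1\nsrc 2\n' > chinese.txt
printf 'tgt 1\ntgt 2\n' > english.txt
printf '0-0\n0-0 1-1\n' > Alignment.txt

# The training data is paired line by line, so all three files must
# have the same number of lines.
a=$(wc -l < chinese.txt | tr -d ' ')
b=$(wc -l < english.txt | tr -d ' ')
c=$(wc -l < Alignment.txt | tr -d ' ')
if [ "$a" -eq "$b" ] && [ "$b" -eq "$c" ]; then
    echo "OK: $a parallel lines"
else
    echo "MISMATCH: $a / $b / $c" >&2
fi
```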

2. Obtaining Syntactic Transfer Rules

  • Instructions (Perl is required; Windows users also need Cygwin).

    string-to-tree

    $> cd NiuTrans/sample-data/
    $> tar xzf sample.tar.gz
    $> cd ../
    $> mkdir work/model.syntax.s2t/ -p
    $> cd scripts/
    $> perl NiuTrans-syntax-train-model.pl \
            -model s2t \
            -src   ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
            -tgt   ../sample-data/sample-submission-version/TM-training-set/english.txt \
            -aln   ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
            -ttree ../sample-data/sample-submission-version/TM-training-set/english.tree.txt \
            -out   ../work/model.syntax.s2t/syntax.string2tree.rule
    

    tree-to-string

    $> cd NiuTrans/sample-data/
    $> tar xzf sample.tar.gz
    $> cd ../
    $> mkdir work/model.syntax.t2s/ -p
    $> cd scripts/
    $> perl NiuTrans-syntax-train-model.pl \
            -model t2s \
            -src   ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
            -stree ../sample-data/sample-submission-version/TM-training-set/chinese.tree.txt \
            -tgt   ../sample-data/sample-submission-version/TM-training-set/english.txt \
            -aln   ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
            -out   ../work/model.syntax.t2s/syntax.tree2string.rule
    

    tree-to-tree

    $> cd NiuTrans/sample-data/
    $> tar xzf sample.tar.gz
    $> cd ../
    $> mkdir work/model.syntax.t2t/ -p
    $> cd scripts/
    $> perl NiuTrans-syntax-train-model.pl \
            -model t2t \
            -src   ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
            -stree ../sample-data/sample-submission-version/TM-training-set/chinese.tree.txt \
            -tgt   ../sample-data/sample-submission-version/TM-training-set/english.txt \
            -ttree ../sample-data/sample-submission-version/TM-training-set/english.tree.txt \
            -aln   ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
            -out   ../work/model.syntax.t2t/syntax.tree2tree.rule
    
    "-model" specifies SMT translation model, the model decides what type of rules can be generated, its value can be "s2t", "t2s" or "t2t".
    "-src", "-tgt" and "-aln" specify the source sentences, the target sentences and the alignments between them (one sentence per line).
    "-stree" specifies the parse trees of source sentences.
    "-ttree" specifies the parse trees of target sentences.

  • Output

    string-to-tree

    Output: three files are generated and placed in "NiuTrans/work/model.syntax.s2t/":

    - syntax.string2tree.rule                    # syntax rule table
    - syntax.string2tree.rule.bina               # binarized rule table for the decoder
    - syntax.string2tree.rule.unbina             # unbinarized rule table for the decoder
    

    tree-to-string

    Output: three files are generated and placed in "NiuTrans/work/model.syntax.t2s/":

    - syntax.tree2string.rule                    # syntax rule table
    - syntax.tree2string.rule.bina               # binarized rule table for the decoder
    - syntax.tree2string.rule.unbina             # unbinarized rule table for the decoder
    

    tree-to-tree

    Output: three files are generated and placed in "NiuTrans/work/model.syntax.t2t/":

    - syntax.tree2tree.rule                      # syntax rule table
    - syntax.tree2tree.rule.bina                 # binarized rule table for the decoder
    - syntax.tree2tree.rule.unbina               # unbinarized rule table for the decoder
    

  • Note: Please enter the "NiuTrans/scripts/" directory before running the script "NiuTrans-syntax-train-model.pl".

3. Training an n-gram Language Model

  • Instructions

    $> cd ../
    $> mkdir work/lm/
    $> cd scripts/
    $> perl NiuTrans-training-ngram-LM.pl \
            -corpus ../sample-data/sample-submission-version/LM-training-set/e.lm.txt \
            -ngram  3 \
            -vocab  ../work/lm/lm.vocab \
            -lmbin  ../work/lm/lm.trie.data
    
    "-ngram" specifies the order of n-gram LM. E.g. "-ngram 3" indicates a 3-gram language model.
    "-vocab" specifies where the target-side vocabulary is generated.
    "-lmbin" specifies where the language model file is generated.

  • Output: two files are generated and placed in "NiuTrans/work/lm/":

    - lm.vocab                            # target-side vocabulary
    - lm.trie.data                        # binary-encoded language model
    

4. Generating Configuration File

  • Instructions

    string-to-tree

    $> cd NiuTrans/scripts/
    $> mkdir ../work/config/ -p
    $> perl NiuTrans-syntax-generate-mert-config.pl \
            -model      s2t \
            -syntaxrule ../work/model.syntax.s2t/syntax.string2tree.rule.bina \
            -lmdir      ../work/lm/ \
            -nref       1 \
            -ngram      3 \
            -out        ../work/config/NiuTrans.syntax.s2t.user.config
    

    tree-to-string

    $> cd NiuTrans/scripts/
    $> mkdir ../work/config/ -p
    $> perl NiuTrans-syntax-generate-mert-config.pl \
            -model      t2s \
            -syntaxrule ../work/model.syntax.t2s/syntax.tree2string.rule.bina \
            -lmdir      ../work/lm/ \
            -nref       1 \
            -ngram      3 \
            -out        ../work/config/NiuTrans.syntax.t2s.user.config
    

    tree-to-tree

    $> cd NiuTrans/scripts/
    $> mkdir ../work/config/ -p
    $> perl NiuTrans-syntax-generate-mert-config.pl \
            -model      t2t \
            -syntaxrule ../work/model.syntax.t2t/syntax.tree2tree.rule.bina \
            -lmdir      ../work/lm/ \
            -nref       1 \
            -ngram      3 \
            -out        ../work/config/NiuTrans.syntax.t2t.user.config
    
    "-model" specifies what type of rules can be used to mert, its value can be "s2t", "t2s" or "t2t".
    "-syntaxrule" specifies the path to the syntactic rule table.
    "-lmdir" specifies the directory that holds the n-gram language model and the target-side vocabulary.

    "-nref" specifies how many reference translations per source-sentence are provided.
    "-ngram" specifies the order of n-gram language model.
    "-out" specifies the output (i.e. a config file).

  • Output

    string-to-tree

    Output: a configuration file is generated and placed in "NiuTrans/work/config". Users can modify this generated config file as needed.

    - NiuTrans.syntax.s2t.user.config           # configuration file for MERT and decoding
    

    tree-to-string

    Output: a configuration file is generated and placed in "NiuTrans/work/config".

    - NiuTrans.syntax.t2s.user.config           # configuration file for MERT and decoding
    

    tree-to-tree

    Output: a configuration file is generated and placed in "NiuTrans/work/config".

    - NiuTrans.syntax.t2t.user.config           # configuration file for MERT and decoding
    

5. Weight Tuning

  • Instructions (perl is required).

    string-to-tree

    $> cd NiuTrans/scripts/
    $> perl NiuTrans-syntax-mert-model.pl \
            -model  s2t \
            -config ../work/config/NiuTrans.syntax.s2t.user.config \
            -dev    ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
            -nref   1 \
            -round  2 \
            -log    ../work/syntax-s2t-mert-model.log
    

    tree-to-string

    $> cd NiuTrans/scripts/
    $> perl NiuTrans-syntax-mert-model.pl \
            -model  t2s \
            -config ../work/config/NiuTrans.syntax.t2s.user.config \
            -dev    ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
            -nref   1 \
            -round  2 \
            -log    ../work/syntax-t2s-mert-model.log
    

    tree-to-tree

    $> cd NiuTrans/scripts/
    $> perl NiuTrans-syntax-mert-model.pl \
            -model  t2t \
            -config ../work/config/NiuTrans.syntax.t2t.user.config \
            -dev    ../sample-data/sample-submission-version/Dev-set/Niu.dev.tree.txt \
            -nref   1 \
            -round  2 \
            -log    ../work/syntax-t2t-mert-model.log
    
    "-model" specifies what type of rules can be used to mert, its value can be "s2t", "t2s" or "t2t".
    "-config" specifies the configuration file generated in the previous steps.
    "-dev" specifies the development dataset (or tuning set) for weight tuning.
    "-nref" specifies how many reference translations per source-sentence are provided.
    "-round" specifies how many rounds the MERT performs (by default, 1 round = 10 MERT iterations).
    "-log" specifies the log file generated by MERT.

  • Output: After MERT finishes, the optimized feature weights are automatically recorded in the "-config" file (last line). The config file can then be used to decode new sentences.
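  • Tip: because the tuned weights land on the last line of the config file, they are easy to inspect with "tail". A sketch using a hypothetical stand-in config (the format of the real weight line may differ):

```shell
# Hypothetical stand-in for ../work/config/NiuTrans.syntax.s2t.user.config;
# the real file is updated in place by NiuTrans-syntax-mert-model.pl.
printf 'some-setting\nanother-setting\nweights 0.12 0.34 0.56\n' > user.config

# The optimized feature weights are recorded on the last line.
tail -n 1 user.config
```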

6. Decoding Test Sentences

  • Instructions (perl is required).

    string-to-tree

    $> cd NiuTrans/scripts/
    $> mkdir ../work/syntax.trans.result/ -p
    $> perl NiuTrans-syntax-decoder-model.pl \
            -model  s2t \
            -config ../work/config/NiuTrans.syntax.s2t.user.config \
            -test   ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
            -output ../work/syntax.trans.result/Niu.test.syntax.s2t.translated.en.txt
    

    tree-to-string

    $> cd NiuTrans/scripts/
    $> mkdir ../work/syntax.trans.result/ -p
    $> perl NiuTrans-syntax-decoder-model.pl \
            -model  t2s \
            -config ../work/config/NiuTrans.syntax.t2s.user.config \
            -test   ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
            -output ../work/syntax.trans.result/Niu.test.syntax.t2s.translated.en.txt
    

    tree-to-tree

    $> cd NiuTrans/scripts/
    $> mkdir ../work/syntax.trans.result/ -p
    $> perl NiuTrans-syntax-decoder-model.pl \
            -model  t2t \
            -config ../work/config/NiuTrans.syntax.t2t.user.config \
            -test   ../sample-data/sample-submission-version/Test-set/Niu.test.tree.txt \
            -output ../work/syntax.trans.result/Niu.test.syntax.t2t.translated.en.txt
    
    "-model" specifies what type of rules can be used for decoder, its value can be "s2t", "t2s" or "t2t".
    "-config" specifies the configuration file.
    "-test" specifies the test dataset (one sentence per line).
    "-output" specifies the translation result file (the result is dumped to "stdout" if this option is not specified).

  • Output

    string-to-tree

    Output: a new file is generated in "NiuTrans/work/syntax.trans.result":

    - Niu.test.syntax.s2t.translated.en.txt                # 1-best translation of the test sentences
    

    tree-to-string

    Output: a new file is generated in "NiuTrans/work/syntax.trans.result":

    - Niu.test.syntax.t2s.translated.en.txt                # 1-best translation of the test sentences
    

    tree-to-tree

    Output: a new file is generated in "NiuTrans/work/syntax.trans.result":

    - Niu.test.syntax.t2t.translated.en.txt                # 1-best translation of the test sentences
    

7. Evaluation

  • Instructions (perl is required).

    string-to-tree

    $> perl NiuTrans-generate-xml-for-mteval.pl \
            -1f   ../work/syntax.trans.result/Niu.test.syntax.s2t.translated.en.txt \
            -tf   ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
            -rnum 1
    $> perl mteval-v13a.pl \
            -r    ref.xml \
            -s    src.xml \
            -t    tst.xml
    

    tree-to-string

    $> perl NiuTrans-generate-xml-for-mteval.pl \
            -1f   ../work/syntax.trans.result/Niu.test.syntax.t2s.translated.en.txt \
            -tf   ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
            -rnum 1
    $> perl mteval-v13a.pl \
            -r    ref.xml \
            -s    src.xml \
            -t    tst.xml
    

    tree-to-tree

    $> perl NiuTrans-generate-xml-for-mteval.pl \
            -1f   ../work/syntax.trans.result/Niu.test.syntax.t2t.translated.en.txt \
            -tf   ../sample-data/sample-submission-version/Reference-for-evaluation/Niu.test.reference \
            -rnum 1
    $> perl mteval-v13a.pl \
            -r    ref.xml \
            -s    src.xml \
            -t    tst.xml
    
    "-1f" specifies the file of the 1-best translations of the test dataset.
    "-tf" specifies the file of the source sentences and their reference translations of the test dataset.
    "-rnum" specifies how many reference translations per test sentence are provided.
    "-r" specifies the file of the reference translations.
    "-s" specifies the file of source sentences.
    "-t" specifies the file of (1-best) translations generated by the MT system.

  • Output: The IBM-version BLEU score is displayed. If everything goes well, you will obtain a score of about 0.2212 for the sample data set.

  • Note: the script "mteval-v13a.pl" relies on the Perl package XML::Parser. If XML::Parser is not installed on your system, install it with "cpan XML::Parser", or build it from a source tarball as follows.

    $> su root
    $> tar xzf XML-Parser-2.41.tar.gz
    $> cd XML-Parser-2.41/
    $> perl Makefile.PL
    $> make install
    


Advanced Usage
In addition to the brief manual shown above, a more detailed description of the various settings is available in the full NiuTrans documentation. In general, the BLEU score can be further improved by using those advanced features. Hope it helps!