
Suspicious casing while reproducing the conll14 results #6

Open

shiman opened this issue Feb 18, 2019 · 3 comments
shiman commented Feb 18, 2019

Hi,

I want to reproduce the same (or at least very similar) M2 scores on the official conll14 test set. Following the README, I set up the environment successfully and got results with the following command:

python2 models/run_gecsmt.py \
    -f models/moses.dense-cclm.mert.avg.ini \
    -w reproduce/ \
    -i conll14st-test/noalt/official-2014.combined.m2 \
    --m2 \
    -o reproduce/conll.out \
    --moses $PWD/build/mosesdecoder \
    --lazy $PWD/build/lazy \
    --scripts $PWD/train/scripts

The output file should be almost (if not exactly) the same as your submission, and so should the M2 scores. However, I only got the following scores:

Precision : 0.5977
Recall : 0.2794
F_0.5 : 0.4868

whereas the reported F0.5 is 0.4893, which is what I was expecting.

I vimdiffed my output against yours and found that mine contained a few casing mistakes while yours did not. For example, in the middle of sentence 333, my output was:

... doctors to disclose information To Patients Relatives.It challenges The Confidentiality and privacy principles.Currently , under the Health Insurance Portability and ...

The capitalized tokens look suspicious: their first letters are uppercased even though they are not capitalized in the original input, and your output does not capitalize them either.

I dug a little into models/run_gecsmt.py and suspect that something goes wrong during the recasing phase, more specifically at line 78:

run_cmd("cat {pfx}.out.tok" \
" | {scripts}/impose_case.perl {pfx}.in {pfx}.out.tok.aln" \
" | {moses}/scripts/tokenizer/deescape-special-chars.perl" \
" | {scripts}/impose_tok.perl {pfx}.in > {pfx}.out" \
.format(pfx=prefix, scripts=args.scripts, moses=args.moses))

It looks like we are recasing the (tokenized) output using the raw, untokenized input and the alignment file. I suspect this is incorrect because the alignment file is based on the tokenized files, so we should probably do something like this instead:

{scripts}/impose_case.perl {pfx}.in.tok {pfx}.out.tok.aln
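
To make my reasoning concrete, here is a toy example with made-up tokens (only a sketch of how I understand the case transfer, not of what impose_case.perl actually does): because tokenization changes the token boundaries, an alignment index that is valid for the tokenized input points at a different word in the raw input.

# Toy illustration with made-up tokens: the same index selects different
# words in the tokenized input and in the raw, whitespace-split input.
raw_in = "information to patients' relatives.It challenges the privacy principles.".split()
tok_in = "information to patients &apos; relatives . it challenges the privacy principles .".split()

aligned_idx = 6              # position of "it" in the tokenized input
print(tok_in[aligned_idx])   # "it"
print(raw_in[aligned_idx])   # "privacy" -- a completely different word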

I did try the modified impose_case.perl call. While I got the correct casing for the example above, now all sentence-initial letters are lowercased as well.

This got me totally confused. How can I get the expected results and scores? What seems to be the problem? Could you shed some light?

For your reference, I also attached my output and logs here.

run.log
conll.out.txt


emjotde commented Feb 18, 2019

Hm, I seem to remember that uppercasing the first letter of each sentence was part of the pipeline. @snukky is currently travelling, but will probably be able to take a look soon.

In the meantime, try applying this script to your output:
https://github.com/marian-nmt/moses-scripts/blob/master/scripts/recaser/detruecase.perl
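
For example, something along these lines (just a rough sketch; the paths are assumptions based on the command earlier in this issue, and the same script is also shipped with mosesdecoder under scripts/recaser/):

import subprocess

# Pipe the produced output through detruecase.perl, which reads stdin and
# writes stdout. Paths below are placeholders matching the earlier command.
moses = "build/mosesdecoder"
with open("reproduce/conll.out") as fin, open("reproduce/conll.out.detc", "w") as fout:
    subprocess.check_call(
        ["perl", moses + "/scripts/recaser/detruecase.perl"],
        stdin=fin, stdout=fout)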


shiman commented Feb 19, 2019

Thanks for the prompt response.

I tried restoring the casing by aligning the output with the tokenized input and adding detruecasing to the pipeline:

    # restore casing and tokenization
    run_cmd("cat {pfx}.out.tok" \
            " | {scripts}/impose_case.perl {pfx}.in.tok {pfx}.out.tok.aln" \
            " | {moses}/scripts/tokenizer/deescape-special-chars.perl" \
            " | {scripts}/impose_tok.perl {pfx}.in" \
            " | {moses}/scripts/recaser/detruecase.perl" \
            " > {pfx}.out"
        .format(pfx=prefix, scripts=args.scripts, moses=args.moses))

but the scores are even worse:

Precision : 0.5876
Recall : 0.2800
F_0.5 : 0.4818

Comparing the results against yours, the differences are still mostly about casing. While the original run_gecsmt.py script looks suspicious (because of the incorrect uppercasing), restoring the casing by aligning with the tokenized input (as I proposed in my previous post) does not seem right either, because the tokenized input ({pfx}.in.tok) is fully lowercased. For example, the first line of the test set:

Keeping the Secret of Genetic Testing

was completely lowercased into:

keeping the secret of genetic testing

in the {pfx}.in.tok file. So it is impossible to recover the original case by aligning with it.
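
A quick sanity check along these lines (the path is just a placeholder for the actual {pfx}.in.tok file in my work directory) can show how much casing survives in the tokenized input:

# Count lines of the tokenized input that still contain an uppercase letter.
# The path is a placeholder; if this prints 0, the file is fully lowercased.
with open("reproduce/prefix.in.tok") as f:
    upper_lines = sum(1 for line in f if any(c.isupper() for c in line))
print(upper_lines)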

Thanks for your help anyway. Looking forward to getting some hints from @snukky.


snukky commented Feb 20, 2019

I seem to remember that uppercasing the first letter of each sentence was part of the pipeline.

That was done with the custom Perl script, not the Moses scripts, as we used lazy with an LM for truecasing.
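
Roughly, that step uppercases the first letter of each output sentence; a minimal sketch of the idea (not the actual script, and the file names are placeholders):

import re

# Sketch only: uppercase the first alphabetic character of every line.
# This approximates the sentence-initial uppercasing step, not the actual
# Perl script used in the pipeline.
with open("reproduce/conll.out") as fin, open("reproduce/conll.out.cased", "w") as fout:
    for line in fin:
        fout.write(re.sub(r"[a-zA-Z]", lambda m: m.group(0).upper(), line, count=1))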

@shiman I'm not sure where the differences come from, but the script run_gecsmt.py was added later and was not used to generate the outputs, so there may be some inconsistencies. Nonetheless, the detruecasing there seems to be the same as in the original training pipeline (https://github.com/grammatical/baselines-emnlp2016/blob/master/train/run_cross.perl#L675), which the provided outputs come from.

I'll check again when I get back home.
