\chapter{Data Structures for Hierarchical Phrase-Based Translation Grammars}
\chaptermark{Data Structures for Hiero Grammars}
\label{chap:hfile}
% TODOFINAL grep map, reduce and italicize
% TODOFINAL bottom margin too big, make it smaller
% TODOFINAL for viva: reread hbase book doc on hfile
% TODOFINAL grep for model (in this chapter) and clarify if needed
% TODOFINAL grep for backoff back-off
% TODOFINAL grep for gram. use $n$-gram consistently
% TODOFINAL grep for filter filtering usage
% TODOFINAL grep for future and add it to future work in conclusion
% TODOFINAL check with Bill the 2000 words per minute ??????
% TODOFINAL check abbreviation space after dot (i.e. e.g. etc.)
% TODOFINAL grep for corpus (parallel corpus) and replace with parallel text
% TODOFINAL grep for MapReduce feature
% TODOFINAL grep for collection specific and add hyphen
% TODOFINAL (done ?) subsubsections to be numbered
% TODOFINAL grep for word alignment and add or not hyphen
% TODOFINAL grep for Hfile wrongly capitalized
% TODOFINAL try to remove all mentions of hbase
% TODOFINAL check for duplicates in bibliography
% TODOFINAL introduce MapReduce in background chapter
% TODOFINAL check title casing
% TODOFINAL check usage of word ``model''
% TODOFINAL check usage of past tense and present tense
We have seen in \autoref{chap:smt} that the main components of an SMT
system are the translation model and the language model. The translation
model is trained on parallel text while the language model is typically
trained on the target side of the parallel text and possibly additional
monolingual target text. It has been shown that increasing the amount
of parallel
text~\citep{pino-iglesias-degispert-blackwood-brunning-byrne:2010:WMT}
and increasing the amount of monolingual
text~\citep{brants-popat-xu-och-dean:2007:EMNLP-CoNLL}
is beneficial for translation quality.
The availability of monolingual and parallel text, together with distributed
computing techniques, has allowed SMT
researchers to build ever larger language models and
translation models.
However, decoders need to retrieve information, such as $n$-gram conditional
probabilities or translation grammar rules, efficiently from these models in
order to translate an input sentence or a set of input sentences in a reasonable
amount of time. For online systems such as the commercial systems
mentioned in \autoref{chap:intro}, translation even needs to be carried out in % TODOFINAL for online systems: maybe refer to intro (Google Translate, etc.)
\emph{real time}, e.g.\ at a rate of 2000 words translated per minute.
Therefore SMT models need to be stored in a data structure
that supports efficient querying and is scalable. We introduce a solution
that addresses these two requirements and has the advantage of reusing
an existing framework, which eases implementation.
We store translation models as a set of key-value pairs in an
HFile\footnote{\url{http://hbase.apache.org}}. We apply this
strategy in order to retrieve relevant rules
from a hierarchical phrase-based grammar at the test set level. We compare our
approach to alternative strategies and show that our approach
offers competitive performance in terms
of speed, memory and simplicity~\citep{pino-waite-byrne:2012:PBML}. We
have made software written
for this work
available online.\footnote{\url{https://github.com/jmp84/ruleXtract} (branch PBML-2012)}
\section{Introduction}
\label{sec:hfileIntro}
Current machine translation research is characterised by increasing amounts
of available data. For example, \autoref{fig:wmtdata} shows that
for the WMT machine translation
workshop~\citep{bojar-buck-callisonburch-federmann-haddow-koehn-monz-post-soricut-specia:2013:WMT}
French-English constrained track translation shared task, the English side of parallel
data has increased from 13.8M tokens in 2006 to 1012.7M tokens in 2013, and that
available English monolingual data has increased from 27.5M tokens to 6852.7M
tokens over the same period.
%
\begin{figure}
\begin{center}
\includegraphics{figures/wmt/wmtdata.eps}
\caption{Number of English tokens (in millions) in parallel and monolingual
data available for the WMT French-English constrained track translation
shared task for the years 2006 to 2013.}
\label{fig:wmtdata}
\end{center}
\end{figure}
%
Along with growing amounts of data, the use of more powerful computers
and distributed computing models such as
% rather than to paper
MapReduce~\citep{dean-ghemawat:2008:ACM,lin-dyer:2010:book} has enabled machine % TODOFINAL refer to background
translation researchers to build larger statistical machine translation models.
%Indeed, MapReduce was originally designed to process large datasets for parallelizable
%algorithms.
%
% notes on brants et al 2007 paper
% uses cutoffs
% p(w_i | w_{i - k + 1}^{i - 1}) = rho(w_{i - k + 1}^i) if w_{i - k + 1}^i is found
% = lambda(w_{i - k + 1}^{i - 1}) p(w_{i - k + 2}^i) o/w
% sb not a probability not a problem: it's just a feature scaled appropriately by mert
% vocabulary generation
% word to integer id conversion
% cutoff: words occurring less than threshold mapped to UNK
% vocabulary generation done with mapreduce
% mapreduce program is simply the canonical word count program
% generation of n-grams
% same as word count but with ngrams
% partition function that computes hash on first words of the n-gram
% use hash based on first 2 words
% needs to store unigram counts on all shards, as well as the corpus size N
% language model generation
% this is done online from what i understand
% given an n-gram, a particular shard is contacted
% there is further sharding from ngrams and their count
% but the last two words of the n-grams and also unigrams need to be stored again
% decoder architecture: store 1000-10000 ngrams then send them
% to the servers and compute sb probas on the fly
% search procedure
% current graph. then advance one word in source language,
% expand active hypotheses, queue up necessary ngrams,
% ask for scores, score expanded hypotheses and prune
%
MapReduce has thus been applied to various steps of the
translation
pipeline:
word alignment~\citep{dyer-cordova-mont-lin:2008:WMT},
translation model building~\citep{dyer-cordova-mont-lin:2008:WMT}
and
language
modelling~\citep{brants-popat-xu-och-dean:2007:EMNLP-CoNLL}.
We review these applications of MapReduce in more detail
in \autoref{sec:relatedwork}. The challenge is to find
effective modelling techniques that can be implemented
in the MapReduce framework.
However, once SMT models such as language models and translation
grammars are built, either with MapReduce
or some other method, the models must be made usable
by translation decoders, which only need a fraction of the information
contained in those models to be able to translate an input source sentence or a
set of input source sentences. For example, in translation from French to
English with a phrase-based
model (see \autoref{sec:phraseBasedTranslation}),
given an input sentence \emph{Salut toi}, we only need to know the
translation probabilities the translation model assigns to translations of
the words
\emph{Salut} and \emph{toi} and to translations of the phrase \emph{Salut toi}.
Similarly, given a test set, a decoder
only needs to retrieve the rules whose source side matches part of one of the source
sentences in the test set to be able to generate hypotheses.
With large models, simply retrieving relevant rules together with their
translation
probabilities becomes a challenge.
In the system
described by \citet{iglesias-degispert-banga-byrne:2009:NAACL},
given a training parallel corpus, rules are extracted and target
phrases and hierarchical phrases with counts are stored on disk.
Then, given an input test set, rules are extracted from the parallel
corpus a second time. Only rules relevant to the test set
are retained: source sides of phrase-based rules have to match
an $n$-gram from the test set and consecutive terminals
in source sides of hierarchical rules also have to match
an $n$-gram from the test set. As a consequence, some hierarchical
rules are never used in decoding. % TODOFINAL give an example
Source sides and rules, together with their counts, are kept
in memory in hash map data structures; target sides with counts,
precomputed in the first step, are loaded from disk so that
source-to-target and target-to-source probabilities can be computed.
This
method becomes progressively slower with larger amounts of data
because it requires extracting rules from the training parallel corpus
and recomputing translation probabilities
each time we translate a new test set or a new
sentence. We
would like to improve on this design for more rapid experimentation. We also would like
to use a computing infrastructure that requires minimal maintenance. For example,
HBase\footnote{\url{http://hbase.apache.org}}, the open source non-relational database
implementation of
BigTable~\citep{chang-dean-ghemawat-hsieh-wallach-burrows-chandra-fikes-gruber:2008:ACM}, has been applied to the use of distributed language
models~\citep{yu:2008:mastersthesis}.
Rule filtering, i.e. the retrieval of rules relevant to the translation
of a test set or a sentence, is an essential step in our pipeline that can be a
bottleneck. Our goal is to reduce its processing time from several hours to a
few minutes.
In this chapter, we address the problem of retrieving relevant rules with their
translation probabilities. We report on investigations into
storing the model in the HFile data
structure. To our knowledge, this is the first
detailed proposed implementation of translation model storage and
filtering using HFile data structures. We find that this approach
offers a good compromise
between speed, performance and ease of implementation. The infrastructure necessary
for filtering is lightweight and requires the use of only one machine. We will
apply this approach to test set rule filtering prior
to decoding. We will discuss alternative
strategies as well as their strengths and weaknesses in terms of speed and
memory usage. In \autoref{sec:relatedwork}, we will review approaches that
have been used for model building and model filtering.
The HFile data structure that is used to
store models will be presented in \autoref{sec:hfile}.
In \autoref{sec:rulextractMapReduce}, we detail our custom implementation
of rule extraction and translation model estimation using
MapReduce. Our method and
alternative strategies for rule filtering are compared empirically in
\autoref{sec:rulextract}. We conclude in \autoref{sec:conclusion}.
\section{Related Work}
\label{sec:relatedwork}
In this section, we review applications of MapReduce to
the translation pipeline as well as techniques for storing
and retrieving from SMT models.
\subsection{Applications of MapReduce to SMT}
\label{sec:applicationsMapReduceSMT}
\citet{brants-popat-xu-och-dean:2007:EMNLP-CoNLL}
introduce a new smoothing method called
\emph{Stupid Backoff} (see \autoref{sec:stupidBackoffSmoothing}).
The Stupid Backoff smoothing scheme is recalled in \autoref{eq:stupidBackoffSchemeReminder}:
%
\begin{equation}
p_{\text{stupid backoff}}(w_i \mid w_{i - n + 1}^{i - 1}) =
\begin{cases}
\frac{c(w_{i - n + 1}^i)}{c(w_{i - n + 1}^{i - 1})} & \text{if } c(w_{i - n + 1}^i) > 0 \\
\alpha \, p_{\text{stupid backoff}}(w_i \mid w_{i - n + 2}^{i - 1}) & \text{if } c(w_{i - n + 1}^i) = 0
\end{cases}
\label{eq:stupidBackoffSchemeReminder}
\end{equation}
%
Compared with the traditional backoff scheme, the Stupid Backoff
scheme applies no discounting and simply uses the relative frequency
as the non-backed-off score; the scaling parameter $\alpha$ applied to
the backed-off score is independent of
the $n$-gram history.
Therefore this scheme does not define a conditional probability
distribution % TODOFINAL add the word conditional in background
over a word given its history.
The Stupid Backoff language model building and application fit the
MapReduce framework well. The input to language model building
is a large monolingual corpus. The first step is to build
a vocabulary. This is done with the canonical example of word
counting with MapReduce. Counts % (see TODOFINAL refer to background)
are needed in order to remove words occurring fewer times than a threshold
from the vocabulary. The second step is to obtain $n$-grams and their
count. This is done again with MapReduce, and the \emph{map} and \emph{reduce}
functions are analogous to the ones defined for word counting but this time
unigrams are replaced by $n$-grams.
For $n$-gram counting, the partition, or sharding, function hashes
on the first two words of each $n$-gram. In addition, unigram counts
and the total size of the corpus are made available in each partition,
i.e.\ to each reducer. This allows relative frequencies to be computed.
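As an illustration of this step, the sketch below expresses $n$-gram counting as
a Hadoop MapReduce job whose \emph{map} function emits each $n$-gram with a
count of one and whose \emph{reduce} function sums the counts; the input format,
tokenisation and $n$-gram order are assumptions, the custom partition function
hashing on the first two words is omitted, and this is not the implementation of
\citet{brants-popat-xu-och-dean:2007:EMNLP-CoNLL}.
\begin{verbatim}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal sketch of n-gram counting with Hadoop MapReduce. The n-gram
// order and whitespace tokenisation are assumptions for illustration.
public class NgramCount {
  static final int N = 3; // assumed n-gram order

  public static class NgramMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] words = line.toString().trim().split("\\s+");
      for (int i = 0; i + N <= words.length; i++) {
        String ngram = String.join(" ", Arrays.copyOfRange(words, i, i + N));
        context.write(new Text(ngram), ONE); // emit (n-gram, 1)
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text ngram, Iterable<LongWritable> counts,
        Context context) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) {
        sum += c.get();
      }
      context.write(ngram, new LongWritable(sum)); // emit (n-gram, count)
    }
  }
}
\end{verbatim}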
\citet{brants-popat-xu-och-dean:2007:EMNLP-CoNLL} also
demonstrate
that for large amounts of monolingual data, i.e. above 10 billion
tokens, Stupid Backoff smoothing and Kneser-Ney smoothing perform comparably.
In addition, only
Stupid Backoff smoothing can be scaled to datasets
with more than 31 billion tokens. The scalability of Kneser-Ney
smoothing has been improved in recent work~\citep{heafield-pouzyrevsky-clark-koehn:2013:ACL}.\footnote{see also \url{http://kheafield.com/code/kenlm/estimation/}, Scalability section}
\citet{dyer-cordova-mont-lin:2008:WMT}
observe that translation model estimation has become prohibitively slow
on a single core and that existing \emph{ad hoc} parallelisation
algorithms may be more fragile than using an existing framework such
as the Hadoop implementation of
MapReduce.\footnote{\url{https://hadoop.apache.org/}}
They provide solutions to word alignment model
estimation and translation rule extraction and estimation using MapReduce
and demonstrate the scalability of their method.
The convenience
of the MapReduce framework for parallelisation has led to the building
of end-to-end MapReduce toolkits for entire phrase-based~\citep{gao-vogel:2010:PBML} and
hierarchical phrase-based~\citep{venugopal-zollmann:2009:PBML}
translation models.
\subsection{SMT Model Storage and Retrieval Solutions}
We now review techniques appearing in the literature that have been used to
store SMT models and to retrieve the information needed in translation from
these models. SMT models are usually discrete probabilistic models and can
therefore be represented as a set of key-value pairs. To obtain relevant
information from a model stored in a certain data structure, a set of keys called a
\emph{query set} is formed; each key in this query set is then sought in that
data structure. Strategies include:
%
\begin{itemize}
\item storing the model as a simple data structure in memory
\item storing the model in a text file
\item storing the model in more complicated data structures such as
tries~\citep{fredkin:1960:ACM} (in memory or disk)
\item storing fractions of the entire model
\item storing data as opposed to a precomputed model
\item storing models in a distributed fashion
\end{itemize}
%
Each of these
strategies is discussed below.
In some cases, it may be possible to fit a model into RAM. In
this case the model can be stored as an in-memory associative array, such as a hash
table. In-memory storage allows for faster query retrieval than
on-disk storage; however, only smaller models will fit in memory.
In-memory storage has been used to store model
parameters between iterations of expectation-maximisation for word
alignment~\citep{dyer-cordova-mont-lin:2008:WMT,lin-dyer:2010:book}.
% TODOFINAL somehow cite work by matei zaharia
For larger models, the set of key-value pairs can be stored as a table in a
single text file on local disk. Values for keys in the query set can be retrieved
by scanning through the entire file. For each key in the file, its membership is
tested in the query set. This is the approach adopted in the \emph{Joshua 5.0}
decoder~\citep{post-ganitkevitch-orland-weese-cao-callisonburch:2013:WMT}\footnote{Inferred from the decoder training
scripts available at \url{http://joshua-decoder.org/5.0/index.html}.}, which
uses regular expressions or $n$-grams to test
membership (see \autoref{sec:joshuaFiltering}).
\citet{venugopal-zollmann:2009:PBML} use MapReduce to scan a file concurrently:
the \emph{map} function tests if the vocabulary of a rule matches the
vocabulary of a test set.
The model can also be stored using a trie associative
array~\citep{fredkin:1960:ACM}. A trie is a type of tree where each node
represents a shared prefix of a set of keys represented by the child nodes. Each
node only stores the prefix it represents. The keys are therefore compactly
encoded in the structure of the trie itself. Querying the trie is a
$\mathcal{O}(\log(n))$ operation, where $n$ is the number of keys in the
dataset. The trie may also be small enough to fit in physical memory to further
reduce querying time. \citet{zens-ney:2007:HLTNAACL} use tries
to store a phrase-based grammar. All the source phrases are represented
in a trie stored on disk. Only relevant parts of the trie are loaded into
memory when a source phrase is sought.
\citet{ganitkevitch-cao-weese-post-callisonburch:2012:WMT} extend this
approach to store hierarchical phrase-based grammars. Both
source sides and target sides are stored in a packed representation of a trie.
Packed tries have also been applied for storing language
models~\citep{pauls-klein:2011:HLTACL,heafield:2011:WMT}.
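As an illustration of the basic data structure, the sketch below implements a
minimal in-memory, uncompressed trie over token sequences such as source
phrases; the packed, disk-backed tries cited above are considerably more
involved.
\begin{verbatim}
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory trie over token sequences: keys (e.g. source phrases)
// are encoded in the tree structure itself and each node stores only one
// token and, optionally, a value.
public class PhraseTrie<V> {
  private final Map<String, PhraseTrie<V>> children = new HashMap<>();
  private V value; // null if no key ends at this node

  public void put(String[] key, V v) {
    PhraseTrie<V> node = this;
    for (String token : key) {
      node = node.children.computeIfAbsent(token, t -> new PhraseTrie<>());
    }
    node.value = v;
  }

  public V get(String[] key) {
    PhraseTrie<V> node = this;
    for (String token : key) {
      node = node.children.get(token);
      if (node == null) {
        return null; // prefix not present in the trie
      }
    }
    return node.value;
  }
}
\end{verbatim}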
It is also possible to create a much smaller approximate version of the model.
\citet{talbot-osborne:2007:ACL} represent a set of $n$-grams as a Bloom
filter~\citep{bloom:1970:ACM}.
They first use a standard Bloom filter to define a binary feature that
indicates whether an $n$-gram was seen in a monolingual corpus. They
also use a Bloom filter to encode $n$-grams together with quantised counts
in order to define a multinomial feature, that is a feature
with a finite set of possible values---in this case the quantised values. Both these data structures can
substantially reduce the disk space and memory usage with respect to lossless
representations of language models. However, they allow
false positives for $n$-gram
membership queries and overestimates of $n$-gram quantised counts. The authors
demonstrate that the features they introduce are useful in translation, despite
this lack of exactness.
In a related publication~\citep{talbot-osborne:2007:EMNLP-CoNLL}, the authors
demonstrate how these techniques can be used as a replacement for a full
representation of a smoothed language model. \citet{talbot-brants:2008:ACL} use a Bloomier
filter~\citep{chazelle-kilian-rubinfeld-tal:2004:ACM-SIAM}
to represent $n$-grams and necessary statistics such as probabilities and backoff
weights. Unlike Bloom filters that only support Boolean characteristic
functions on a set, Bloomier filters support arbitrary functions, in this case
a mapping between an $n$-gram and its statistics. False positives can still occur
but for true positives, correct statistics are retrieved.
% TODOFINAL explain this better
%talbot-brants acl
%represent ngrams, proba, backoff with hash
%randomized language model
%kind of error: false positive but when ngram found, params are correct
%based on bloomier filter
%bloomier filter allows for arbitrary function as opposed to binary function
\citet{guthrie-hepple:2010:EMNLP} propose an extension to previous work
on randomised language models~\citep{talbot-osborne:2007:EMNLP-CoNLL} which prevents
the random corruption of model parameters but does not stop the random
assignment of parameters to unseen $n$-grams.
\citet{levenberg-osborne:2009:EMNLP} extend randomised language models to
stream-based language models. Another way of building a smaller approximate
version of a model is to retain items with high frequency counts from a stream
of data~\citep{manku-motwani:2002:VLDB}. This technique has been applied to
language modelling~\citep{goyal-daumeiii-venkatasubramanian:2009:NAACL} and
translation rule extraction~\citep{przywara-bojar:2011:PBML}.
%talbot acl
%use bloom filter
%use standard bloom filter to test whether ngram was seen
%use log-frequency bloom filter to add frequency information
%ngram counts are quantised
%log frequency bloom filter: (ngram,1) (ngram,2) ... (ngram, quantizedcount(ngram)) inserted into the BF
%at retrieval time, we get back a count equal or greater than the actual quantized count
%trick to reduce error: c(w1-wn) < min(c(w1-wn-1), c(w2,wn))
%bloom filter language model test: use binary feature
%log-frequency bf lm: use the quantized count
%experiments: those features are useful in translation
%talbot emnlp
%this time not just binary features, but the whole lm estimated
%still use a log-frequency bf
%new: expected value of count
%E[c(x)|qc(x)=j] = (b^{j-1} + b^j - 1) / 2
%ccl: replacement for usual lm, use less space and memory
Instead of doing some precomputation on a dataset, it is possible to compute
the sufficient
statistics at query time using a suffix array~\citep{manber-myers:1990:SIAM}, so
that the model can be estimated only when needed. A suffix array is a sequence of
pointers to each suffix in a training corpus. The sequence is sorted with
respect to the lexicographic order of the referenced suffixes. Suffix arrays
have been used for computing statistics for language
models~\citep{zhang-vogel:2006:techreport}, phrase-based
systems~\citep{callisonburch-bannard-schroeder:2005:ACL,zhang-vogel:2005:EAMT},
and hierarchical phrase-based systems~\citep{lopez:2007:EMNLP-CoNLL}.
\citet{callisonburch-bannard-schroeder:2005:ACL} store both the source side
and the target side of a parallel corpus in two suffix arrays. They also maintain
an index between a position in the source or target side of the corpus and the
sentence number. During decoding, all occurrences of a source phrase
are located in the suffix array representing the source side of the corpus.
This produces a source marginal count. For each of these occurrences, the corresponding
sentence pair and word alignment are retrieved and rules are extracted. This produces
a rule count, which is normalised with the source marginal count to produce a
source-to-target probability. \citet{lopez:2007:EMNLP-CoNLL} extends this approach
to address the grammar extraction and estimation of hierarchical rules.
Note that the use of suffix arrays for translation model estimation only
supports the computation of source-to-target probabilities.
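To make the mechanism concrete, the sketch below builds a suffix array over the
source side of a corpus and locates all occurrences of a source phrase by binary
search, the number of occurrences being the source marginal count $c(\bm{f})$.
It uses a naive construction for brevity and is not the implementation used in
the work cited above.
\begin{verbatim}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified suffix array over a tokenised source corpus (naive
// construction, for illustration only).
public class SourceSuffixArray {
  private final String[] corpus;     // concatenated source-side tokens
  private final Integer[] suffixes;  // suffix start positions, sorted

  public SourceSuffixArray(String[] corpus) {
    this.corpus = corpus;
    this.suffixes = new Integer[corpus.length];
    for (int i = 0; i < corpus.length; i++) {
      suffixes[i] = i;
    }
    Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
  }

  // Lexicographic comparison of the suffixes starting at positions a and b.
  private int compareSuffixes(int a, int b) {
    while (a < corpus.length && b < corpus.length) {
      int c = corpus[a].compareTo(corpus[b]);
      if (c != 0) {
        return c;
      }
      a++;
      b++;
    }
    return (corpus.length - a) - (corpus.length - b); // shorter suffix first
  }

  // Compares the suffix starting at pos against the phrase, prefix-wise.
  private int comparePrefix(int pos, String[] phrase) {
    for (int i = 0; i < phrase.length; i++) {
      if (pos + i >= corpus.length) {
        return -1; // suffix ran out: it sorts before the phrase
      }
      int c = corpus[pos + i].compareTo(phrase[i]);
      if (c != 0) {
        return c;
      }
    }
    return 0; // the suffix starts with the phrase
  }

  // All corpus positions where the phrase occurs; the size of the result
  // is the source marginal count c(f).
  public List<Integer> locate(String[] phrase) {
    int lo = 0;
    int hi = suffixes.length;
    while (lo < hi) { // binary search for the first matching suffix
      int mid = (lo + hi) / 2;
      if (comparePrefix(suffixes[mid], phrase) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    List<Integer> occurrences = new ArrayList<>();
    for (int i = lo; i < suffixes.length
        && comparePrefix(suffixes[i], phrase) == 0; i++) {
      occurrences.add(suffixes[i]);
    }
    return occurrences;
  }
}
\end{verbatim}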
% TODOFINAL explain suffix array better
%suffixarray callisonburch
%suffix array for source corpus
%suffix array for target corpus
%index between sentence number and position in corpus (src and target)
%word alignment
%compute s2t probability
%locate all occurrences of f: gives c(f)
%for each occurrence of f, retrieve sentence pair and alignment
%run extraction: gives c(f, e)
%compute c(f, e) / c(f)
Finally, some approaches store language models in a distributed fashion.
We saw in \autoref{sec:applicationsMapReduceSMT} how a Stupid Backoff language
model can be built. After relative frequencies have been computed, another
partition function that hashes on the last two words of an $n$-gram is applied
so that backoff operations can be done within the same partition.
At decoding time, the decoder architecture is modified in order
to request a batch of $n$-grams rather than a single $n$-gram.
Stupid Backoff language model building can be scaled
to a corpus of 2 trillion tokens, and the distributed application
of the language model can be made real time.
\citet{zhang-hildebrand-vogel:2006:EMNLP} propose a distributed large language
model backed by suffix arrays. HBase has also been used to build a distributed
language model infrastructure~\citep{yu:2008:mastersthesis}. The method we propose to
use is closely related to the latter but we use a more lightweight
infrastructure than HBase. In addition, it is also possible to apply our method
to language model querying~\citep{pino-waite-byrne:2012:PBML}, which
demonstrates the flexibility of the infrastructure.
Note that there are many alternative solutions to HBase/HFile for storage of a
large number of key-value pairs, such as Berkeley
DB\footnote{\url{https://oss.oracle.com/berkeley-db.html}} or
Cassandra\footnote{\url{http://cassandra.apache.org}}.
The purpose of this chapter
is not to compare these alternatives but rather to compare existing solutions
employed in the machine translation community to one of these solutions.
\section{HFile Description}
\label{sec:hfile}
We now describe the data structure we use to store models and we review features
relevant to the design of our system. To store a model represented as key-value
pairs, we use the HFile file format,\footnote{\url{http://hbase.apache.org}} which is
an open source reimplementation of the SSTable file
format~\citep{chang-dean-ghemawat-hsieh-wallach-burrows-chandra-fikes-gruber:2008:ACM}.
The HFile format is a lookup table with key and value columns. Keys and values
are arbitrary byte strings of any length. The table is sorted
lexicographically by the key byte string for efficient record retrieval by key.
\subsection{HFile Internal Structure}
% TODOFINAL read SSTable description
% TODOFINAL check whether blocks are plural or singular
\paragraph{High Level Structure} As can be seen in \autoref{fig:hfile},
the HFile internal structure is divided into blocks. There are various types of
block, depending on the kind of information a block contains.
In order to
distinguish block types, the first 8 bytes of a
block indicate its type. The blocks are grouped into
three parts. The first part (top) is organised into
data blocks interleaved with leaf data index blocks and Bloom blocks.
The second part (middle) consists of intermediate data index blocks.
The third part (bottom) consists of metadata: root data index blocks, a file
information block and a Bloom filter index block.
\paragraph{Index Blocks} The root data index blocks map keys to their
relevant intermediate data index block, indicated by
an offset. In turn, an intermediate data index block
maps keys to their relevant leaf data index block, again indicated by an offset.
Finally, a leaf data index block maps keys to their relevant data block,
still indicated by an offset. The same mechanism is employed for the Bloom filter.
The Bloom filter index block maps keys to their relevant Bloom block.
\paragraph{Block Size Configuration} The block size is
configurable, with a default size of 64KB. Note that HFile blocks are not to be
confused with Hadoop Distributed File System (HDFS) blocks whose default size is
64MB.
\paragraph{Block Compression} The HFile format allows for
the blocks to be compressed. The choice of compression codec is selected when
the file is created. We choose the GZip compression
codec~\citep{Deutsch:1996:ZCD:RFC1950} for all our % TODOFINAL check that the gzip cite is correct
experiments.
%Block compression is also used in other related
%software~\citep{pauls-klein:2011:HLTACL}. % TODOFINAL reread paper and elaborate
\begin{figure}
\begin{center}
\begin{tabular}{|l|}
\hline
\rowcolor[gray]{0.85}
Data Block \\
\hline
\rowcolor[gray]{0.85}
... \\
\hline
\rowcolor[gray]{0.95}
Leaf Data Index Block / Bloom Block \\
\hline
\rowcolor[gray]{0.85}
... \\
\hline
\rowcolor[gray]{0.85}
Data Block \\
\hline
\rowcolor[gray]{0.85}
... \\
\hline
\rowcolor[gray]{0.95}
Leaf Data Index Block / Bloom Block \\
\hline
\rowcolor[gray]{0.85}
... \\
\hline
\rowcolor[gray]{0.85}
Data Block \\
\hline
\hline
\rowcolor[gray]{0.9}
Intermediate Data Index Blocks \\
\hline
\hline
\rowcolor[gray]{0.7}
Root Data Index Blocks \\
\hline
\rowcolor[gray]{0.7}
File Info Block \\
\hline
\rowcolor[gray]{0.7}
Bloom Filter Index Block \\
\hline
\end{tabular}
\caption[HFile internal structure]{HFile internal structure.\footnotemark \ The
structure is divided into three parts. In the first part, data blocks
are interspersed with leaf data index blocks and Bloom blocks. The second
part contains intermediate data index blocks. The third part consists
of metadata: root data index blocks, file format information and a Bloom
filter index block.}
\label{fig:hfile}
\end{center}
\end{figure}
\footnotetext{After \url{http://hbase.apache.org/book/book.html} (simplified)}
\subsection{Record Retrieval}
\label{sec:recordRetrieval}
When the HFile is opened for reading, the root data index blocks are loaded into memory.
To retrieve a value from the HFile given a key, the appropriate intermediate
index block is located by a binary search through the root data index. Binary
searches are conducted on the intermediate and leaf index blocks to identify the
data block that may contain the key. The identified data block is then loaded from disk
into memory and the key-value record is retrieved by scanning the data block
sequentially.
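The following schematic sketch summarises this procedure. It abstracts away the
on-disk block encoding, compression and caching, and is not the HBase
implementation; the \texttt{loadBlock} method stands for reading a block from
disk.
\begin{verbatim}
import java.util.List;

// Schematic sketch of HFile record retrieval (not the HBase implementation):
// each index level maps the first key of a child block to that block's file
// offset; a lookup binary-searches each level and scans one data block.
public abstract class HFileLookupSketch {

  interface Block {}

  static class IndexBlock implements Block {
    List<byte[]> firstKeys;   // first key covered by each child block
    List<Long> childOffsets;  // file offset of each child block
  }

  static class DataBlock implements Block {
    List<byte[]> keys;
    List<byte[]> values;
  }

  private final IndexBlock rootIndex; // kept in memory once the file is open

  protected HFileLookupSketch(IndexBlock rootIndex) {
    this.rootIndex = rootIndex;
  }

  // Reads the block stored at the given file offset (and caches it).
  protected abstract Block loadBlock(long offset);

  public byte[] get(byte[] key) {
    IndexBlock intermediate = (IndexBlock) loadBlock(child(rootIndex, key));
    IndexBlock leaf = (IndexBlock) loadBlock(child(intermediate, key));
    DataBlock data = (DataBlock) loadBlock(child(leaf, key));
    // Sequential scan of the single data block that may contain the key.
    for (int i = 0; i < data.keys.size(); i++) {
      if (java.util.Arrays.equals(data.keys.get(i), key)) {
        return data.values.get(i);
      }
    }
    return null; // key not present in the HFile
  }

  // Binary search for the last child whose first key is <= the sought key.
  private static long child(IndexBlock index, byte[] key) {
    int lo = 0;
    int hi = index.firstKeys.size() - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) / 2;
      if (compare(index.firstKeys.get(mid), key) <= 0) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return index.childOffsets.get(lo);
  }

  // Lexicographic comparison of byte strings, as used for HFile keys.
  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
      if (c != 0) {
        return c;
      }
    }
    return Integer.compare(a.length, b.length);
  }
}
\end{verbatim}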
\subsection{Bloom Filter Optimisation for Query Retrieval}
\label{sec:bloomFilterOptimization}
It is possible to query for a key that is not contained in the HFile. This very
frequently happens in translation because of language data sparsity: in our case,
because keys are $n$-grams of terminals and nonterminals, the number of possible
keys is exponential in the size of the vocabulary, and many of these
will not have been observed in any rule extracted from training data.
Querying
the existence of a key is expensive as it involves all the operations
described in \autoref{sec:recordRetrieval}: loading and binary
searching the root data index,
loading and binary searching an intermediate data index block,
loading and binary searching
a leaf data index block and finally loading and scanning a data block.
For fast existence check queries, the HFile format allows
the inclusion of an optional Bloom filter~\citep{bloom:1970:ACM}. A Bloom filter
provides a probabilistic, memory efficient representation of the key set with a
$\mathcal{O}(1)$ membership test operation. The Bloom filter may provide a false
positive, % TODOFINAL expand on Bloom filters somewhere either background or here
but never a false negative, for existence of a key in the HFile.
Therefore, it is safe to use in our setting: a false positive only means that
a key absent from the HFile is needlessly looked up, whereas a key that is
present in the HFile will never be skipped.
For a large
HFile, the Bloom filter may also be very large. Therefore the Bloom filter is
also organised into blocks called Bloom blocks. Each block contains a smaller
Bloom filter that covers a range of keys in the HFile. Similar to the root data
index, a Bloom index is constructed. To check for the
existence of a key, a binary search is conducted on the Bloom index, the
relevant Bloom block is loaded, and the membership test performed.
Unlike work on Bloom filter language
models~\citep{talbot-osborne:2007:ACL,talbot-osborne:2007:EMNLP-CoNLL}, this
filter only tests the existence of a key and does not return any statistics from
the value. If a membership test is positive, a usual key search will
still be carried out. During the execution of a query, two keys may
reference the same index or Bloom blocks. To prevent these blocks from being
repeatedly loaded from disk, the blocks are cached after reading.
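To make the mechanism concrete, the following minimal sketch shows a Bloom
filter used as a guard in front of HFile lookups; the hash functions are
simplistic placeholders and this is not the HBase Bloom filter implementation.
\begin{verbatim}
import java.util.Arrays;
import java.util.BitSet;

// Minimal Bloom filter sketch used as a guard in front of HFile lookups:
// a negative answer is definitive, a positive answer still requires the
// full key search. The hash functions are simplistic placeholders.
public class BloomGuard {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  public BloomGuard(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  // i-th hash of the key, by simple double hashing.
  private int hash(byte[] key, int i) {
    int h1 = Arrays.hashCode(key);
    int h2 = 31 * h1 + 17;
    return Math.floorMod(h1 + i * h2, numBits);
  }

  public void add(byte[] key) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(hash(key, i));
    }
  }

  public boolean mightContain(byte[] key) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(hash(key, i))) {
        return false; // key is definitely not in the HFile
      }
    }
    return true; // possibly present: fall back to the full HFile lookup
  }
}
\end{verbatim}
A query key is first tested with \texttt{mightContain}; the full lookup of
\autoref{sec:recordRetrieval} is carried out only if the test is positive.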
\subsection{Local Disk Optimisation}
The HFile format is designed to be used with HDFS, a distributed file system
based on the Google File System~\citep{ghemawat-gobioff-leung:2003:OSP}. Large
files are split into HDFS blocks that are stored on many nodes in a cluster.
However, the HFile format can also be used completely independently of HDFS. If
its size is smaller than local disk space, the entire HFile can be stored on the local
disk of one machine and accessed through the machine's local file system. We
report in \autoref{sec:rulextract} that using local disk
is faster than using HDFS.
\subsection{Query Sorting Optimisation}
\label{sec:querySortingOptimization}
Prior to HFile lookup, we sort keys in the query set lexicographically. If two
keys in the set of queries are contained in the same block, then the block is
only loaded once. In addition, the computer hardware and operating system allow
further automatic improvements to the query execution. Examples of these
automatic improvements include reduced disk seek time,
caching data from
disk by the operating system,\footnote{The Linux Documentation Project, The File System, \url{http://tldp.org}}
or CPU caching data from main memory~\citep{patterson-hennessy:2005:COA}.
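A minimal sketch of this optimisation is given below; \texttt{hfileGet} stands
for the record retrieval of \autoref{sec:recordRetrieval} and the use of string
keys is an assumption made for illustration.
\begin{verbatim}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Sketch of query sorting prior to HFile lookup: sorting the query keys
// lexicographically means that keys falling into the same data block are
// looked up consecutively, so each block is loaded from disk at most once.
public class SortedQueryLookup {
  // hfileGet stands for the record retrieval described in the text.
  public static Map<String, String> lookupAll(
      List<String> queries, Function<String, String> hfileGet) {
    List<String> sorted = new ArrayList<>(queries);
    sorted.sort(String::compareTo); // lexicographic order, as in the HFile
    Map<String, String> results = new TreeMap<>();
    for (String key : sorted) {
      String value = hfileGet.apply(key); // consecutive same-block lookups
      if (value != null) {
        results.put(key, value);
      }
    }
    return results;
  }
}
\end{verbatim}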
\section{Hierarchical Rule Extraction with MapReduce}
\label{sec:rulextractMapReduce}
In the previous section, we have described the HFile format. We will use
this format to store a translation grammar. In this
section, we describe how we generate this grammar and produce the HFile
that represents the grammar.
We describe a MapReduce implementation of hierarchical grammar
extraction and estimation. The implementation is an
extension of \emph{Method 3} described by
\citet{dyer-cordova-mont-lin:2008:WMT}, which was originally developed
to estimate translation probabilities for phrase-based models.
MapReduce~\citep{dean-ghemawat:2008:ACM} is a framework designed for processing
large amounts of data in a distributed fashion on a computer cluster. In this
framework, the programmer needs to implement a function called \emph{map} and
a function called \emph{reduce}. The \emph{map} function is defined in
\autoref{eq:mapFunction}:
%
\begin{equation}
\text{map} : (k_1, v_1) \longmapsto [(k_2, v_2)]
\label{eq:mapFunction}
\end{equation}
%
where $(k_1, v_1)$ is a key-value pair and $[(k_2, v_2)]$ is a list of
key-value pairs. We use the $[ \, ]$ notation to denote a list. Keys and values are arbitrary byte sequences. As the key-value
pairs $(k_2, v_2)$ are computed by the \emph{map} function, the MapReduce framework groups all values $v_2$ by key $k_2$ in order
to provide an input $(k_2, [v_2])$ to the \emph{reduce} function, defined in
\autoref{eq:reduceFunction}:
%
\begin{equation}
\text{reduce} : (k_2, [v_2]) \longmapsto [(k_3, v_3)]
\label{eq:reduceFunction}
\end{equation}
%
The input to the \emph{reduce} function is a key and a list of values. The
\emph{reduce} function generates a list of key-value pairs.
Using this notation, % TODOFINAL refere to background
we can describe the grammar extraction and estimation
as a series of MapReduce jobs, summarised in
\autoref{fig:ruleXtractionPipeline}. This architecture is similar
to the one described in the original
publication~\citep{dyer-cordova-mont-lin:2008:WMT}. For
source-to-target and target-to-source probability computation,
we use Method 3 described in that publication, which is
reported to be faster than other methods. Keeping each
feature computation as a separate MapReduce job makes
the integration of a new feature more convenient.
%
\begin{figure}
\tikzstyle{MapReduceJob} = [rectangle, draw, fill=blue!20, rounded corners,
align = left, text width = 3.5cm]
\tikzstyle{JobInputOutput} = [draw, ellipse,fill=red!20,
align = left, text width = 2.9cm]
\tikzstyle{line} = [draw, very thick, color=black!50, -latex']
%\begin{center}
\begin{tikzpicture}[node distance = 2.2cm]
\begin{normalsize}
% Place nodes
\node [JobInputOutput, text width = 6cm] (corpus) {
key : metadata \\
value : word-aligned sentence pair
};
\node [MapReduceJob, below of=corpus, text width = 7cm] (extract) {
\textbf{Rule Extraction} (see \autoref{sec:hierruleextract}) \\
\emph{map} : extract rules \\
\emph{reduce} : none};
\node [JobInputOutput, below of=extract] (rules) {
key : rule \\
value : metadata};
\node [MapReduceJob, below left = 1cm and -1.5cm of rules, text width = 6.5cm] (s2t) {
\textbf{Source-to-target Probability} \\
\emph{map} : output (\emph{src}, (\emph{trg}, 1)) \\
\emph{reduce} : sum \emph{trg} counts, normalise};
\node [MapReduceJob, below right = 1cm and -1.5cm of rules, text width = 6.5cm] (t2s) {
\textbf{Target-to-source Probability} \\
\emph{map} : output (\emph{trg}, (\emph{src}, 1)) \\
\emph{reduce} : sum \emph{src} counts, normalise};
% \node [MapReduceJob, below = 1.2cm of rules] (other) {\textbf{Other MapReduce Feature}};
\node [JobInputOutput, below = 0.7cm of s2t, text width = 4.4cm] (rulesPlusS2t) {
key : rule \\
value : source-to-target pr};
\node [JobInputOutput, below = 0.7cm of t2s, text width = 4.4cm] (rulesPlusT2s) {
key : rule \\
value : target-to-source pr};
% \node [JobInputOutput, below = 1.7cm of other] (rulesPlusOther) {$k_1$: rule \\
% $k_2$: other feature};
\node [MapReduceJob, below = 5.7cm of rules, text width = 8.5cm] (merge) {
\textbf{Merge} \\
\emph{map} : output (\emph{src}, (\emph{trg}, feature)) \\
\emph{reduce} : output (\emph{src}, list of \emph{trg} and features)};
\node [JobInputOutput, below = 1cm of merge, text width = 5cm] (ruleAllFeatures) {
key : \emph{src} \\
value : list of \emph{trg} and features};
% Draw edges
\path [line] (corpus) -- (extract);
\path [line] (extract) -- (rules);
\path [line] (rules) -- (s2t);
\path [line] (rules) -- (t2s);
% \path [line] (rules) -- (other);
\path [line] (s2t) -- (rulesPlusS2t);
\path [line] (t2s) -- (rulesPlusT2s);
% \path [line] (other) -- (rulesPlusOther);
% \path [line] (rulesPlusOther) -- (merge);
\path [line] (rulesPlusS2t) -- (merge);
\path [line] (rulesPlusT2s) -- (merge);
\path [line] (merge) -- (ruleAllFeatures);
\end{normalsize}
\end{tikzpicture}
%\end{center}
\caption{Translation grammar extraction and estimation pipeline as a series of MapReduce jobs. Ellipses
represent (intermediate) input/output of MapReduce jobs. Rectangles represent
MapReduce jobs. Source, target and probability are abbreviated into
src, trg and pr. Rules are first extracted, then the source-to-target and
target-to-source translation models are estimated. Finally, the features,
i.e.\ the translation models, are merged and the final output has rule sources
as keys and lists of targets with features as values.}
\label{fig:ruleXtractionPipeline}
\end{figure}
\subsection{Rule Extraction}
The first step in grammar extraction and estimation is rule extraction. For
this step, $v_1$ represents a sentence pair together with a word alignment and $k_1$
contains metadata about the parallel corpus such as what collection the sentence
pair comes from. This kind of information is useful for domain adaptation as
demonstrated in \autoref{sec:domainAdaptationGrammar}. The \emph{map}
function simply extracts hierarchical rules for
that sentence pair and outputs the rules. $k_2$ is a rule and $v_2$ contains
the metadata from $k_1$. For rule extraction, there is no reduce step.
\autoref{eq:mapIllustrationRuleXtraction} illustrates the \emph{map} function
for rule extraction:
%
\begin{equation}
\text{map} : (\text{newswire}, (\bm{f}, \bm{e}, \bm{a})) \longmapsto [(r_1, \text{newswire}), (r_2, \text{newswire}), \ldots]
\label{eq:mapIllustrationRuleXtraction}
\end{equation}
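A minimal Java sketch of this map-only step is shown below; the serialisation of
sentence pairs and rules as text is an assumption and the extraction logic
itself is omitted, so this is an illustration rather than the \emph{ruleXtract}
implementation.
\begin{verbatim}
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the rule extraction map step (a map-only job): for each
// word-aligned sentence pair, every extracted rule is emitted with the
// collection metadata as its value. Input/output serialisation as Text
// is an assumption; the extraction logic itself is omitted.
public class RuleExtractionMapper extends Mapper<Text, Text, Text, Text> {

  // Placeholder for hierarchical rule extraction from one word-aligned
  // sentence pair (e.g. "f ||| e ||| alignment"); omitted for brevity.
  private List<String> extractRules(String alignedSentencePair) {
    return Collections.emptyList();
  }

  @Override
  protected void map(Text metadata, Text alignedSentencePair, Context context)
      throws IOException, InterruptedException {
    for (String rule : extractRules(alignedSentencePair.toString())) {
      // k2 = rule, v2 = metadata inherited from k1 (e.g. "newswire")
      context.write(new Text(rule), metadata);
    }
  }
}
\end{verbatim}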
\subsection{Corpus-Level Feature Computation} \label{sec:corpusLevelFeatureComputation}
Rule extraction is followed by several MapReduce jobs that are run simultaneously
in order to
compute \emph{corpus-level} features. Corpus-level features
are features attached
to each rule whose computation requires looking at the entire set of
extracted rules. In this description, we only consider source-to-target and
target-to-source probabilities, but many other corpus-level features are possible,
for example features that relate to the word alignment from which rules were
extracted. We use the metadata stored in the output of the
extraction job in order to compute collection-specific probabilities.
Collection-specific probabilities are probabilities computed over a specific subset
of the training corpus. The metadata indicates whether a rule was extracted from
that specific subset.
Let us
consider first the collection-specific source-to-target probability computation.
The input to the \emph{map} function
is the output of the extraction job, that is a rule together with metadata.
The \emph{map} function checks that the rule comes from the collection we are
interested in and if so, outputs the source of the rule as a key $k_2$ and the
target of the rule together with a count of one as a value $v_2$. Targets (and
counts) corresponding to the same source are grouped together by the MapReduce
framework. The input to the \emph{reduce} function is a source ($k_2$) and
a list of targets with counts ($[v_2]$). The \emph{reduce} function aggregates
the counts for the distinct targets and then normalises those counts with the
total count corresponding to the source $k_2$. The output of the \emph{reduce}
function is a list of pairs (rule, probability).
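As an illustration, a sketch of such a \emph{reduce} function is given below.
For simplicity it buffers the distinct targets and their counts in memory, and
the tab-separated serialisation of the values is an assumption; it is not the
\emph{ruleXtract} implementation.
\begin{verbatim}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the source-to-target reduce step: the values grouped under a
// source side are assumed to be serialised as "target<TAB>count"; counts
// are aggregated per distinct target and normalised by the total count
// for the source. Buffering in memory is a simplification.
public class SourceToTargetReducer
    extends Reducer<Text, Text, Text, DoubleWritable> {

  @Override
  protected void reduce(Text source, Iterable<Text> targetsWithCounts,
      Context context) throws IOException, InterruptedException {
    Map<String, Long> countPerTarget = new HashMap<>();
    long total = 0;
    for (Text value : targetsWithCounts) {
      String[] fields = value.toString().split("\t");
      String target = fields[0];
      long count = Long.parseLong(fields[1]);
      countPerTarget.merge(target, count, Long::sum);
      total += count;
    }
    for (Map.Entry<String, Long> entry : countPerTarget.entrySet()) {
      double probability = (double) entry.getValue() / total;
      // emit (rule, p(target | source))
      context.write(new Text(source + " ||| " + entry.getKey()),
          new DoubleWritable(probability));
    }
  }
}
\end{verbatim}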
The target-to-source probability computation is very similar. Apart from
inverting the roles played by source and target, the \emph{map} and \emph{reduce}
functions also need to change the rule format for
rules that invert the order of nonterminals in the source
and the target, for example \RT[$X$][$X_1 a X_2$][$X_2 b X_1$].
%\emph{hierarchical reordering rules}. % TODOFINAL mention this in background
%A hierarchical reordering rule is a rule with two nonterminals that inverts the
%order of the nonterminals in the source and in the target. For example,
%\RT[$X$][$X_1 a X_2$][$X_2 b X_1$] is a hierarchical reordering rule.
For this
type of rule, the \emph{map} function has to invert the order of nonterminals
in both the source and target so that probabilities are computed correctly. For
example, the \emph{map} function will output the target of the rule
written as $X_1 b X_2$ and the source of the rule
written as $X_2 a X_1$. The \emph{reduce} function
will restore the original notation.
We have described the computation of
collection-specific translation probabilities.
The source-to-target and target-to-source probabilities over the
entire corpus are computed in an identical manner with the
collection simply being the entire corpus.
\autoref{fig:ruleXtractionPipeline} illustrates only two corpus-level
features, however, in practice, more features are computed: source-to-target
and target-to-source probability for each collection and for the entire parallel
text, as well as lexical features (see \autoref{sec:features}).
\subsection{Feature Merging}
Once all MapReduce features have been computed, a MapReduce job that merges
the features is called. The input key $k_1$ to the \emph{map} function is a
rule and the input value $v_1$ is a feature. The \emph{map} function outputs
the source of the rule as a key $k_2$ and the target of the rule together with
the feature as value $v_2$. Targets and features corresponding to the same
source are grouped together by the MapReduce framework. The input to the
\emph{reduce} function is a source ($k_2$) and a list of targets with features
($[v_2]$). The \emph{reduce} function simply merges the features for identical
targets. The output key $k_3$ is the source $k_2$ and the output value $v_3$ is
a list of targets with all computed features.
The output of this merging job
is converted to an HFile format. Because the keys in the HFile format need to
be sorted, we also want the output of the merging job to be sorted by key. One
way to achieve this is to specify only one reducer for the job but this solution
is very slow because only one core will be used for the reduce step. A better
way is to use a \emph{TotalOrderPartitioner}. We first run rule extraction on a
small sample of the data, for which a single reducer in the merge step is
feasible. The output of this small-scale extraction is given as
input to an \emph{InputSampler}, which generates a partition file describing
how the key space should be split across reducers. The
\emph{TotalOrderPartitioner} configured with this partition file guarantees that
the keys are globally sorted across the output of all reducers. % TODOFINAL clear as mud, expand
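As a sketch of the mechanism, the configuration below shows how a Hadoop job can
be set up with a \emph{TotalOrderPartitioner} whose partition file is produced
by an \emph{InputSampler}; the sampler settings are illustrative values and this
is not the exact \emph{ruleXtract} configuration.
\begin{verbatim}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Sketch of configuring globally sorted output for the merge job. The
// sampling frequency and sample size are illustrative values; the job
// input is assumed to be the small-scale extraction output.
public class SortedMergeJobConfig {
  public static void configure(Job job, Path partitionFile) throws Exception {
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(
        job.getConfiguration(), partitionFile);
    // Sample keys from the job input and write the partition file that
    // assigns a contiguous key range to each reducer.
    InputSampler.writePartitionFile(
        job, new InputSampler.RandomSampler<Text, Text>(0.01, 10000));
  }
}
\end{verbatim}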
\section{Hierarchical Rule Filtering for Translation}
\label{sec:rulextract}
In the previous section, we have described how to obtain an HFile
that represents a hierarchical translation grammar.
We will now describe how this data structure can be used to retrieve rules for
a given test set efficiently. We describe our system called
\emph{ruleXtract}, and compare it to other methods through time and memory
measurements.
\subsection{Task Description}
Given a test set, defined as a set of sentences to be translated, and a
hierarchical phrase-based translation grammar, we would
like to retrieve all the relevant rules from the model. A phrase-based rule
in the HFile is
relevant if its source is a subsequence of a sentence in the test set. A
hierarchical rule in the HFile is relevant if
its source, where nonterminals are instantiated into
terminals, is
a subsequence of a sentence in the test set. For example, with a test set
containing one sentence \emph{Salut toi}, the phrase-based rules with sources
\emph{Salut}, \emph{toi}, \emph{Salut toi} are relevant and the hierarchical
rules with sources \emph{Salut X} and \emph{X toi} are relevant, because
instantiating \emph{Salut X} into \emph{Salut toi} produces a subsequence
of the test sentence and similarly for \emph{X toi}.
\subsection{HFile for Hierarchical Phrase-Based Grammars}
\label{sec:hfileForHiero}
% TODOFINAL explain better the source pattern instantiation business with an example.
% TODOFINAL maybe refer to background for pattern definition
Given a test set and an HFile storing a hierarchical phrase-based grammar, we
first generate queries from the test set, then retrieve relevant rules along
with their corpus-level features from the HFile. To generate queries, we have a set
of allowed \emph{source patterns} and instantiate these patterns against the
test set. Rule patterns were defined in \autoref{sec:hiergrammar}.
A source pattern is simply the source side of a rule pattern.
Here, we briefly redefine source patterns.
A source pattern is a regular expression that matches
the source side of a rule. For example, the
pattern $\Sigma^+X$, with $\Sigma$ the source vocabulary,
represents a rule source side containing a sequence of
terminals followed by the nonterminal X. If the input sentence is
\emph{Salut à toi}, the pattern will be instantiated as \emph{Salut X},
\emph{Salut à X} and \emph{à X}. We impose the following constraints on source pattern
instantiation, where the first four relate to constraints used in extraction and the
last one relates to a decoding constraint:
%
\begin{itemize}
\item $s_{\text{max}}$: maximum number of terminals for phrase-based rules.
\item $s_{\text{max elements}}$: maximum number of terminals and nonterminals.
\item $s_{\text{max terminals}}$: maximum number of consecutive terminals for
hierarchical rules.
\item $s_{\text{max NT}}$: maximum nonterminal span in a hierarchical
rule.
\item $s_{\text{max span}}$: maximum span for the rule. This constraint
is not to be confused with the previous $s_{\text{max NT}}$ constraint.
While $s_{\text{max NT}}$ represents the maximum span of a nonterminal
on the right hand side of a rule, $s_{\text{max span}}$ represents the maximum span
of an entire rule, or alternatively the maximum span of the nonterminal on the left
hand side of the rule. Because the latter constraint is enforced by the decoder, we
also enforce it in source pattern instantiation so that rules that will
never be used by the decoder are not generated.
\end{itemize}
%
The source pattern instances are then sorted for more efficient HFile lookup
(see \autoref{sec:querySortingOptimization}). Each
query is then looked up in the HFile and if
present, an HFile record is retrieved.
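As an illustration, the sketch below instantiates the single pattern
$\Sigma^+X$ against a test sentence; only two of the constraints above are
shown and the constraint values are illustrative defaults rather than the
settings used in our experiments.
\begin{verbatim}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of instantiating the source pattern "terminals followed by X"
// against a test sentence. Only two of the constraints are shown and the
// values are illustrative defaults, not the experimental settings.
public class SourcePatternInstances {
  static final int S_MAX_TERMINALS = 5; // max consecutive terminals
  static final int S_MAX_SPAN = 10;     // max span of the whole rule source

  public static Set<String> instantiate(String[] sentence) {
    Set<String> queries = new LinkedHashSet<>();
    for (int start = 0; start < sentence.length; start++) {
      // end < sentence.length: X must cover at least one remaining word.
      for (int end = start + 1; end < sentence.length; end++) {
        int numTerminals = end - start;
        if (numTerminals > S_MAX_TERMINALS) {
          break;
        }
        if (numTerminals + 1 > S_MAX_SPAN) { // minimum span of the instance
          break;
        }
        List<String> terminals = new ArrayList<>();
        for (int i = start; i < end; i++) {
          terminals.add(sentence[i]);
        }
        queries.add(String.join(" ", terminals) + " X");
      }
    }
    return queries;
  }

  public static void main(String[] args) {
    // Prints: [Salut X, Salut à X, à X]
    System.out.println(instantiate(new String[] {"Salut", "à", "toi"}));
  }
}
\end{verbatim}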
% TODOFINAL include this somewhere
%We typically run this retrieval step on one machine only.
We now compare our approach to similar approaches whose aim is
to obtain rules which are relevant to a test set.
\subsection{Suffix Arrays for Hierarchical Phrase-Based Grammars}
One alternative strategy for building a test-set-specific grammar is to use
suffix arrays, introduced in \autoref{sec:relatedwork}. We use the \emph{cdec}
software~\citep{dyer-lopez-ganitkevitch-weese-ture-blunsom-setiawan-eidelman-resnik:2010:ACL}
implementation of suffix arrays for
hierarchical phrase-based rule extraction. The \emph{cdec} implementation is based on
earlier work~\citep{lopez:2007:EMNLP-CoNLL} which extends suffix-array-based
rule extraction from phrase-based systems
to hierarchical phrase-based systems (see again \autoref{sec:relatedwork}).
Given a test set, a set of source pattern instances is generated similarly to
what is done for \emph{ruleXtract}. These source pattern instances are then
sought in a suffix array compiled from the source side of a parallel corpus.
Rules are extracted using the word alignment, and source-to-target
probabilities are then computed as described in \autoref{sec:relatedwork}.
\subsection{Text File Representation of Hierarchical Phrase-Based Grammars}
\label{sec:joshuaFiltering}
We now describe the \emph{Joshua}
decoder~\citep{weese-ganitkevitch-callisonburch-post-lopez:2011:WMT}
implementation for storing and retrieving from a translation
model.
The first implementation variant, which we call \emph{Joshua}, stores the
translation model in a text file. Given a test set, each word in the test set
vocabulary is mapped to the list of sentences in which it appears. Then, each
rule in the translation model is compiled to a regular expression, and each
sentence that contains at least a vocabulary word of the rule is matched against
this regular expression. If at least one match is successful, the rule is
retained.
A faster version is provided that matches consecutive terminals in the
source side of a rule to the set of $n$-grams extracted from the test set. We
call this version \emph{Joshua Fast}. A parallel version also exists that splits
the grammar file into chunks and processes each chunk as a separate process on
a cluster running Sun Grid Engine~\citep{gentzsch:2001:CCG}. We call this
version \emph{Joshua Parallel}. The parallel version using the faster matching
algorithm is called \emph{Joshua Fast Parallel}.
\subsection{Experimental Design}
In this section, we describe our experimental setup in terms of data and various
configurations for grammar extraction, grammar filtering and
time and memory measurements.
%
\paragraph{Parallel Text} We use a small parallel corpus of 750,950 word-aligned sentence
pairs and a larger corpus of 9,221,421 word-aligned sentence pairs from the
NIST'12 Chinese-English evaluation, in order to investigate how systems perform % TODOFINAL footnote or cite for nist