Loganathan Ramasamy edited this page Apr 7, 2015 · 58 revisions

Datasets


Dataset name|Source
----|----
korektor-czech-130202|The current Korektor model
syn2005|Czech National Corpus (CNC) - http://hdl.handle.net/11858/00-097C-0000-0023-119E-8
syn2010|Czech National Corpus (CNC) - http://hdl.handle.net/11858/00-097C-0000-0023-119F-6

Evaluation


precision = TP / (TP + FP)

recall = TP / (TP + FN)

F1-score = 2 * (precision * recall) / (precision + recall)
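
The three formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of Korektor, and returning 0.0 when a denominator is zero is an assumed convention that the page does not specify.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    denom = precision + recall
    # Assumption: report 0.0 instead of dividing by zero.
    return 2 * precision * recall / denom if denom else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions in [0, 1] from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up counts, purely for illustration.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.75 0.818
```

As a quick sanity check, `round(f1_score(94.7, 90.8), 1)` gives 92.7, which matches the F1-score reported for kor-cz-130202 at 1-edit in the error detection results.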

Error detection

Measure|Description
----|----
TP|Number of words with spelling errors that the spell checker detected correctly
FP|Number of words identified as spelling errors that are not actually spelling errors
TN|Number of correct words that the spell checker did not flag as having spelling errors
FN|Number of words with spelling errors that the spell checker did not flag as having spelling errors
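
A minimal sketch of how these four detection counts could be tallied per word. It assumes two aligned boolean lists, one from the gold annotation and one from the spell checker; the function name and input format are hypothetical and do not describe the actual evaluation script.

```python
from typing import Dict, Sequence


def detection_counts(has_error: Sequence[bool], flagged: Sequence[bool]) -> Dict[str, int]:
    """Tally TP/FP/TN/FN for error detection.

    has_error[i] is True if word i is misspelled according to the gold data;
    flagged[i] is True if the spell checker flagged word i.
    """
    if len(has_error) != len(flagged):
        raise ValueError("gold and predicted sequences must be aligned")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for gold, pred in zip(has_error, flagged):
        if gold and pred:
            counts["TP"] += 1   # misspelled word, flagged
        elif pred:
            counts["FP"] += 1   # correct word, flagged anyway
        elif gold:
            counts["FN"] += 1   # misspelled word, missed
        else:
            counts["TN"] += 1   # correct word, left alone
    return counts


# Made-up five-word example.
gold = [True, True, False, False, True]
pred = [True, False, True, False, True]
print(detection_counts(gold, pred))  # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```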

Error correction

Measure|Description
----|----
TP|Number of words with spelling errors for which the spell checker gave the correct suggestion
FP|Number of words (with or without spelling errors) for which the spell checker made suggestions that were either unnecessary (the original word was correct) or incorrect (the original word was misspelled but the suggestion was wrong)
TN|Number of correct words that the spell checker did not flag and for which it made no suggestions
FN|Number of words with spelling errors that the spell checker either did not flag or flagged without providing any suggestions
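
The correction counts can be sketched in the same way. The input format below, a list of (original word, gold word, ranked suggestions) triples, is hypothetical. The `top_k` parameter is also an assumption: the page reports top-1, top-2 and top-3 results but does not say how they were scored, so this sketch counts a suggestion as correct when the gold word appears among the first `top_k` suggestions.

```python
from typing import Dict, Iterable, List, Tuple

Word = Tuple[str, str, List[str]]  # (original, gold, ranked suggestions)


def correction_counts(words: Iterable[Word], top_k: int = 1) -> Dict[str, int]:
    """Tally TP/FP/TN/FN for error correction under an assumed top-k rule."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for original, gold, suggestions in words:
        has_error = original != gold
        candidates = suggestions[:top_k]
        if not candidates:
            # No suggestion: a miss for misspelled words, correct silence otherwise.
            counts["FN" if has_error else "TN"] += 1
        elif has_error and gold in candidates:
            counts["TP"] += 1   # misspelled word, correct suggestion offered
        else:
            counts["FP"] += 1   # unnecessary or wrong suggestion
    return counts


# Made-up English examples, purely for illustration.
sample = [
    ("hose", "house", ["house", "horse"]),
    ("teh", "the", ["ten", "the"]),
    ("cat", "cat", []),
    ("dgo", "dog", []),
    ("sun", "sun", ["son"]),
]
print(correction_counts(sample, top_k=1))  # {'TP': 1, 'FP': 2, 'TN': 1, 'FN': 1}
print(correction_counts(sample, top_k=2))  # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```

Under this rule, raising `top_k` can only move misspelled words from FP to TP, so precision and recall never decrease as more suggestions are considered.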

Results


Error detection based on varying edit distances

Dataset|Max edit distance|Precision|Recall|F1-score
----|----|----|----|----
kor-cz-130202|1-edit|94.7|90.8|92.7
syn2005|1-edit|95.7|90.8|93.2
syn2010|1-edit|94.7|89.9|92.2
kor-cz-130202|2-edit|94.1|95.4|94.8
syn2005|2-edit|95.0|95.9|95.4
syn2010|2-edit|94.1|95.0|94.5
kor-cz-130202|3-edit|94.1|95.4|94.8
syn2005|3-edit|95.0|95.9|95.4
syn2010|3-edit|94.1|95.0|94.5
kor-cz-130202|4-edit|94.1|95.4|94.8
syn2005|4-edit|95.0|95.9|95.4
syn2010|4-edit|94.1|95.0|94.5
kor-cz-130202|5-edit|94.1|95.4|94.8
syn2005|5-edit|95.0|95.9|95.4
syn2010|5-edit|94.1|95.0|94.5

Note that the results are identical for maximum edit distances 2, 3, 4 and 5. This may be because the edit distance parameter has little influence on error detection once it exceeds 1.

Error correction results for varying edit distances

Dataset|Top-1 precision|Top-1 recall|Top-1 F1-score|Top-2 precision|Top-2 recall|Top-2 F1-score|Top-3 precision|Top-3 recall|Top-3 F1-score
----|----|----|----|----|----|----|----|----|----
kor-cz-130202-1-ed|85.2|89.9|87.5|90.9|90.5|90.7|93.3|90.7|92.0
syn2005-1-ed|87.9|90.1|89.0|92.3|90.5|91.4|93.7|90.7|92.2
syn2010-1-ed|86.0|89.0|87.5|91.8|89.6|90.7|92.3|89.7|91.0
kor-cz-130202-2-ed|84.2|94.9|89.2|91.0|95.3|93.1|93.2|95.4|94.3
syn2005-2-ed|86.8|95.5|91.0|91.8|95.7|93.7|93.2|95.8|94.5
syn2010-2-ed|85.0|94.4|89.5|91.4|94.8|93.1|92.3|94.9|93.5
kor-cz-130202-3-ed|84.2|94.9|89.2|91.0|95.3|93.1|93.2|95.4|94.3
syn2005-3-ed|86.8|95.5|91.0|91.4|95.7|93.5|92.7|95.8|94.2
syn2010-3-ed|85.0|94.4|89.5|90.9|94.8|92.8|91.8|94.8|93.3
kor-cz-130202-4-ed|84.2|94.9|89.2|91.0|95.3|93.1|93.2|95.4|94.3
syn2005-4-ed|86.8|95.5|91.0|91.4|95.7|93.5|92.7|95.8|94.2
syn2010-4-ed|85.0|94.4|89.5|90.9|94.8|92.8|91.8|94.8|93.3
kor-cz-130202-5-ed|84.2|94.9|89.2|91.0|95.3|93.1|93.2|95.4|94.3
syn2005-5-ed|86.8|95.5|91.0|91.4|95.7|93.5|92.7|95.8|94.2
syn2010-5-ed|85.0|94.4|89.5|90.9|94.8|92.8|91.8|94.8|93.3
