The Arabic Error Type Annotation tool (ARETA) aims to annotate Arabic error types following the Arabic Learner Corpus (ALC) tagset annotation. ARETA is described in Automatic Error Type Annotation for Arabic (Belkebir and Habash, 2021).
You will need Python 3.7 and above (64-bit).
- Install CamelTools.
- Install requirements.
pip install -r requirements.txt
Usage: annotate_err_type_ar.py [OPTIONS] --sys_path system --ref_path reference
where
system - the system output
reference - the reference file
OPTIONS
--show_edit_paths - whether to show the shortest edit paths or not; defaults to false.
--output_path - output file directory; defaults to standard output.
Example:
python annotate_err_type_ar.py --sys_path sample/sys_sample.txt --ref_path sample/ref_sample.txt
The output lists triplets of system, reference and error types. For the complete list of error types, see table of error types below.
Example:
System sentence: إن أمتحان الاستاذة صعبة
Reference sentence: إن إمتحان الأستاذ صعب
Annotation result:
System | Reference | Error type |
---|---|---|
إن | إن | UC |
أمتحان | إمتحان | OH |
الاستاذة | الأستاذ | OH+XG |
صعبة | صعب | XG |
Note: It is important to note that every sentence in the system/reference outputs must start with an s
. Refer to sample/sys_sample.txt and sample/ref_sample.txt for examples on how these files should look like.
Usage: annotate_eval_ar.py [OPTIONS] system source_reference
where
system - the system output
source_reference - source sentences with gold token edits (.m2 file)
OPTIONS
Example:
python annotate_eval_ar.py sample/CLMB-1 sample/QALB-Test2014.m2
This generates:
annot_input_ref.tsv
file in theoutput
folder that contains the error types annotation between the input and the reference.annot_input_sys.tsv
file in theoutput
folder that contains the error types annotation between the input and the system.subclasses_results_CLMB-1.tsv
file in theresults
folder. This file contains the results of the evaluation of the system's output against the reference.
Usage: utilities/generate-m2-source.py m2-file
where
m2-file - the m2 file
Example:
python utilities/generate-m2-source.py sample/QALB-Test2014.m2 > sample/QALB-Test2014.source.sent
Usage: utilities/generate-m2-reference.py m2-file
where
m2-file - the m2 file
Example:
python utilities/generate-m2-reference.py sample/QALB-Test2014.m2 > sample/QALB-Test2014.reference.cor
This re_alignment tool realigns files from Ossama's basic aligner by shifting the word in null -> word pair to the word before or after according to minimum edit distance.
Usage: utilities/adjust_align_tool.py file_to_adjust_align
where
file_to_adjust_align - File to be realigned (should follow Ossama's basic alignement file format)
Example:
python utilities/adjust_align_tool.py sample/align.basic > sample/align.adjust
In the configuration file config.json
, the user should specify the mode of the morphological analyser. The default value is analyser
in which all the analyses are considered. The second option is
mle
and in this case, we need to specify the parameter mle_top
which represents the maximum number of analyses to be considered. The uc
parameter takes the values 0 or 1. 0 to indicate that we do not consider the unchanged error types, 1 otherwise.
{
"mode": "analyser",
"mle_top": ""
}
Class | Sub-class | Description | Arabic Example | Buckwalter Transliteration |
Orthographic | OH | Hamza error | اكثر← أكثر | Akvr → >kvr |
OT | Confusion in Ha and Ta Mutadarrifatin | مشاركه ← مشاركة | m$Arkh → m$Arkp | |
OA | Confusuion in Alif and Ya Mutadarrifatin | علي ← على | Ely → ElY | |
OW | Confusion in Alif Fariqa | وكانو ← وكانوا | wkAnw→ wkAnwA | |
ON | Confusion Between Nun and Tanwin | ثوبن ← ثوبٌ | vwbn → vwbN | |
OS | Shortening the long vowels | أوقت ← أوقات | >wqt → >wqAt | |
OG | Lengthening the short vowels | نقيمو ← نقيم | nqymw → nqym | |
OC | Wrong order of word characters | تبرينا ← تربينا | tbrynA → trbynA | |
OR | Replacement in word character(s) | مصلنا ← وصلنا | mSlnA → wSlnA | |
OD | Additional character(s) | يعدوم ← يدوم | yEdwm → ydwm | |
OM | Missing character(s) | سالين ← سائلين | sAlyn → sA}lyn | |
OO | Other orthographic errors | - | - | |
Morphological | MI | Word inflection | معروف ← عارف | mErwf → EArf |
MT | Verb tense | تفرحني ← أفرحتني | tfrHny → >frHtny | |
MO | Other morphological errors | - | - | |
Syntax | XC | Case | رائع ← رائعاً | rA}E → rA}EAF |
XF | Definiteness | السن ← سن | Alsn → sn | |
XG | Gender | الغربي ← الغربية | Algrby → Algrbyp | |
XN | Number | فكرتي ← أفكاري | fkrty → >fkAry | |
XT | Unnecessary word | على ←Null | ElY →Null | |
XM | Missing word | Null ← على | Null → ElY | |
XO | Other syntactic errors | - | - | |
Semantic | SW | Word selection error | من ← عن | mn → En |
SF | Fasl wa wasl (confusion in conjunction use/non-use) | سبحان ← فسبحان | sbHAn → fsbHAn | |
SO | Other semantic errors | - | - | |
Punctuation | PC | Punctuation confusion | المتوسط. ← المتوسط، | AlmtwsT. → AlmtwsT، |
PT | Unnecessary punctuation | العام, ← العام | AlEAm, → AlEAm | |
PM | Missing punctuation | العظيم ← العظيم، | AlEZym → AlEZym، | |
PO | Other errors in punctuation | - | - | |
Merge | MG | Words are merged | ذهبتالبارحة ← ذهبت البارحة | *hbtAlbArHp → *hbt AlbArHp |
Split | SP | Words are split | المحا دثات ← المحادثات | AlmHA dvAt → AlmHAdvAt |
If you find ARETA useful in your research, please cite Automatic Error Type Annotation for Arabic (Belkebir and Habash, 2021) (PDF) (BIB).
This tool is available under the MIT license. See the LICENSE file for more info.