This is a Python script that takes a text file containing (tokenized and normalized) Arabic text and outputs statistics about the tokens and, more importantly, about the diacritization of the Arabic words in the text, by running get_file_diac_stats() from the diac_handler.py module and get_file_token_stats() from the token_handler.py module.
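For orientation, here is a minimal sketch of how such a driver could call the two helpers; the exact signatures of get_file_token_stats() and get_file_diac_stats(), and the default output directory name, are assumptions rather than the modules' documented API.

```python
# Sketch only: the helper signatures and the output_dir default are assumed.
import sys
from pathlib import Path

from token_handler import get_file_token_stats
from diac_handler import get_file_diac_stats


def run(input_path, output_dir="diac_stats_output"):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)                # create the output directory if missing
    token_stats = get_file_token_stats(input_path, out)   # token/word/number/punct/symbol counts
    diac_stats = get_file_diac_stats(input_path, out)     # diacritization statistics
    return token_stats, diac_stats


if __name__ == "__main__":
    run(*sys.argv[1:3])
```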
Here is the full list of the data generated:
- number of tokens
- number of Arabic words
- and a file with all of them on separate lines
- number of numbers
- and a file with all of them on separate lines
- number of punctuation marks
- and a file with all of them on separate lines
- number of symbols
- and a file with all of them on separate lines
- number of other tokens (not Arabic, numeric, punctuation, or symbols)
- number of Arabic words with any number of diacritical marks
- and a file with all of them on separate lines
- number of Arabic words with full diacritization
- and a file with all of them on separate lines
- number of diacritical marks
- number of occurrences of each diacritical mark present
- number of diac groups (diac combinations on one letter; see the sketch after this list)
- number of diac groups at each letter index
(i.e. the number of diac groups that occurred at letter index zero of
a word, and the same for the first letter, second, third, etc.)
- and a file with all of them on separate lines
- number of diac groups that occurred on the first letter of a word
- and a file with all of them on separate lines
- number of diac groups that occurred on the last letter of a word
- and a file with all of them on separate lines
- number of occurrences of each diac group
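To make the "diac group" notion concrete, here is an illustrative sketch of splitting a word into per-letter diacritic groups; the set of marks and the grouping logic are assumptions for illustration, not the actual code in diac_handler.py.

```python
# Illustrative only: mark set and indexing convention assumed from the description above.
DIACS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")  # tanween, fatha, damma, kasra, shadda, sukoon


def diac_groups(word):
    """Return (letter_index, diac_group) pairs.

    Index 0 holds any marks that appear before the first letter,
    index 1 the marks on the first letter, and so on.
    """
    groups = []
    index = 0        # 0 = before the first letter
    current = ""
    for ch in word:
        if ch in DIACS:
            current += ch                  # extend the current group
        else:
            if current:
                groups.append((index, current))
            current = ""
            index += 1                     # a new letter starts a new position
    if current:
        groups.append((index, current))
    return groups


# e.g. diac_groups("كَتَبَ") -> [(1, "َ"), (2, "َ"), (3, "َ")]
```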
Our definition of full diacritization is that each letter must have a fatha, damma, kasra, sukoon, or any kind of tanween on it, except for the madd letters (ا, و, ي), which may have no diacritics. However, for a word to be fully diacritized it must have at least one letter that carries some of the previously mentioned diacritical marks.
Note that the zeroth-index diac group is the diac group before the first letter of the word.
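As a hedged illustration of the rule above (not the project's own implementation), a full-diacritization check could look roughly like this; treating shadda as a skippable mark that does not count as "full" on its own is an assumption.

```python
# Sketch of the full-diacritization rule described above; assumptions are noted inline.
MADD = set("اوي")                                    # madd letters may carry no diacritics
FULL_MARKS = set("\u064B\u064C\u064D"                # fathatan, dammatan, kasratan (tanween)
                 "\u064E\u064F\u0650\u0652")         # fatha, damma, kasra, sukoon
ALL_MARKS = FULL_MARKS | {"\u0651"}                  # assumption: shadda is skipped, not counted as "full"


def is_fully_diacritized(word):
    has_marked_letter = False
    letter_positions = [i for i, ch in enumerate(word) if ch not in ALL_MARKS]
    for n, i in enumerate(letter_positions):
        end = letter_positions[n + 1] if n + 1 < len(letter_positions) else len(word)
        marks = set(word[i + 1:end])                 # diac group attached to this letter
        if marks & FULL_MARKS:
            has_marked_letter = True
        elif word[i] not in MADD:                    # a non-madd letter missing its mark
            return False
    return has_marked_letter                         # at least one letter must carry a mark
```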
The script can be used by running the following line in the terminal:

```
python diac_stats.py input.txt [output_dir]
```

The script takes one required argument, which is the path to the text file with the Arabic text to compute the statistics for; the second is an optional argument giving the path to the directory that the script will create (if it does not already exist) and write all the output data into.
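One possible way to handle these arguments, matching the usage line above; the default output directory name is an assumption.

```python
# Sketch of the command-line interface described above; the default
# output directory name is an assumption.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(
    description="Compute token and diacritization statistics for Arabic text.")
parser.add_argument("input", help="tokenized, letter-normalized Arabic text file")
parser.add_argument("output_dir", nargs="?", default="diac_stats_output",
                    help="directory to create (if needed) and write all output data into")
args = parser.parse_args()

Path(args.output_dir).mkdir(parents=True, exist_ok=True)  # create output_dir if not already created
```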
Input
The input file input.txt must be tokenized using the simple word tokenizer and letter-normalized; the script tokenize_normalize.py can be used to perform both tasks. Running the diac_stats.py script on untokenized data will most likely yield incorrect statistics.
Output
The data described above falls into two categories: the statistics and the files listing the words of each category. The statistics are printed in the terminal and are also written to a TSV file in the output_dir, alongside the files compiling the different words in each category, as outlined in the Description section above.
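As a hedged example of consuming the statistics afterwards; the file name stats.tsv and its two-column (name, value) layout are assumptions about the output, not documented behaviour.

```python
# Assumption: the statistics TSV is called "stats.tsv" and has two
# tab-separated columns, a statistic name and its value.
import csv
from pathlib import Path

stats = {}
with open(Path("output_dir") / "stats.tsv", encoding="utf-8") as f:
    for name, value in csv.reader(f, delimiter="\t"):
        stats[name] = value

print(stats.get("number of tokens"))
```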