Greek Crime Analyzer is a project that consists of:
- Crawling Greek crime articles and storing them in MongoDB
- Classifying them into a crime type with an SVM classifier [custom-trained model, adaptable]
- Analyzing them with machine learning (NLP) and custom Elasticsearch analyzers to add extra JSON fields to the final analysis
- A UI that lists all available articles together with their text analysis, pie charts, and a map of crime locations
| Dependency | Version |
|---|---|
| Python | 3.8 |
| Elasticsearch | 7.10.1 |
| Dash | 1.18.1 |
| Scrapy | 2.4.1 |
| spaCy | 2.3.5 |
| Django | 3.1.4 |
Install the frozen requirements from the main folder and the requirements from the dash subfolder.
The database model follows a NoSQL schema and can be found in api/models/article_model
Diagram of the schema:
For the crawling layer, Scrapy is used. The spider can be found in crawling/spiders/newsbomb_spider.py
The spider's tasks are:
- Performs text mining on the article's URL to find some basic information, e.g. the article's scope (Greece / Global)
- Uses many start URLs to crawl for many different crime types
- Extracts information from each URL (title, date, body, tags, author, link, type, scope)
- Uses the pipelines in
crawling/crawling/pipelines.py
to further clean the downloaded data
- In the pipelines, the spider also performs the crime analysis for each article and creates a new record in MongoDB (a minimal pipeline sketch follows this list)
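A minimal sketch of what such a pipeline could look like, assuming a local MongoDB and illustrative database/collection names (the real logic lives in crawling/crawling/pipelines.py and also includes the crime analysis):

```python
# Hypothetical pipeline sketch: clean an article item and store it in MongoDB.
# Class, database, and collection names are illustrative assumptions.
import pymongo


class ArticleCleaningPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["crime_db"]["articles"]  # assumed names

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # basic cleaning: strip surrounding whitespace from string fields
        cleaned = {k: (v.strip() if isinstance(v, str) else v) for k, v in item.items()}
        # the crime analysis would be applied here before inserting (omitted in this sketch)
        self.collection.insert_one(cleaned)
        return item
```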
After the crawling is over, we synchronize MongoDB with the Elasticsearch database using python manage.py search_index --rebuild
An example of one Elasticsearch article record (after the crime-analysis process) can be seen below:
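As a hypothetical sketch only, such a record would contain roughly the fields described in this README; the values below are invented and the actual field names and structure may differ:

```python
# Illustrative only: field names follow this README's description,
# values are invented and do not come from the real dataset.
example_article = {
    "title": "...",
    "date": "2021-01-15",
    "body": "...",
    "tags": ["..."],
    "author": "...",
    "link": "https://example.com/article",
    "type": "murder",        # crime category from the classification layer
    "scope": "Greece",       # Greece / Global
    # fields added by the crime-analysis layer
    "victim_gender": "female",
    "crime_status": "...",
    "acts": ["..."],
    "age": "...",
    "location": "...",
}
```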
This layer is used to classify an article and categorize it into one of the following categories: murder, drugs, theft, sexual crime, terrorism. This layer is enabled, in combination with the analysis layer, whenever a new article is crawled. The classifier is custom-trained on Greek crime articles for which the article's author had provided "tags". Tags that implied a crime type were text-mined to create a custom annotated dataset. The classifier, together with a UI, is also available as a standalone project: https://github.com/SimonaMnv/ArachneClassifier
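A minimal sketch of how such an SVM text classifier could be trained with scikit-learn; the feature extraction, sample data, and pipeline below are illustrative assumptions, not the ArachneClassifier code itself:

```python
# Hypothetical training sketch: TF-IDF features + linear SVM over labelled Greek crime articles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# texts/labels would come from the tag-mined annotated dataset described above
texts = ["...article body...", "...another article body..."]
labels = ["murder", "drugs"]  # murder, drugs, theft, sexual crime, terrorism

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
classifier.fit(texts, labels)

print(classifier.predict(["...a newly crawled article..."]))
```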
For the analysis of the text, two methods were used:
- NLP (spaCy, NLTK, etc.)
- Elasticsearch analyzers
Elasticsearch analyzers:
in elasticsearchapp/documents.py
the schema of Elasticsearch is defined. The same fields as in Mongo are applied, but the "title" and "body" fields are enriched with analyzers. More specifically, a custom Greek analyzer has been created to lowercase the text, remove stopwords (with extra stopwords added from spaCy), and apply a stemmer.
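A minimal sketch of what such an analyzer definition could look like with elasticsearch-dsl; the analyzer name, filters, and stopword source below are assumptions, and the real definition lives in elasticsearchapp/documents.py:

```python
# Hypothetical custom Greek analyzer: lowercase, Greek stopwords, Greek stemmer.
from elasticsearch_dsl import analyzer, token_filter

greek_stop = token_filter("greek_stop", type="stop", stopwords="_greek_")
greek_stemmer = token_filter("greek_stemmer", type="stemmer", language="greek")

greek_analyzer = analyzer(
    "custom_greek_analyzer",
    tokenizer="standard",
    filter=["lowercase", greek_stop, greek_stemmer],
)
```

Fields such as "title" and "body" would then be declared with analyzer=greek_analyzer in the document definition.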
We communicate with Elasticsearch through queries. All of the project's Elasticsearch queries are located in elasticsearchapp/query_results
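As an illustration, a query against the analyzed articles could look like the following with elasticsearch-dsl; the index name and field are assumptions, not the project's actual queries:

```python
# Hypothetical query sketch: fetch articles of a given crime type.
from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=["localhost:9200"])  # assumed local Elasticsearch

search = Search(index="articles").query("match", type="murder")  # assumed index/field names
for hit in search[:10]:
    print(hit.title)
```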
Analyzing process:
- For the victim's gender, a custom strategy has been deployed. Text mining is used for the very basic words that imply a specific gender, and a dependency parser (https://explosion.ai/demos/displacy) is used to identify the gender if the text mining fails. The dependency parser is applied on a "summarized" set of sentences extracted from the title + body.
Elasticsearch has built-in TF-IDF scoring (https://sci2lab.github.io/ml_tutorial/tfidf/), which is used to collect the top verbs (based on a threshold) in each crime category. For example, from all ~4K murder cases in the database, we find the top 50 verbs that are repeated most frequently.
After that, we also collect all the local verbs from the given article and, finally, we locate the sentences in which those verbs match and extract them. In this way, we create a "summary" of the article that includes only the important verbs. This helps us keep only the most important information for extracting the victim's gender, since it is most likely to appear in those sentences, as verbs imply an act.
To further narrow down the sentences, rule-based matching is used (https://spacy.io/usage/rule-based-matching) to match the grammar rules in
ML/POS/patter.py
that suggest an act was done by someone, or that an act was done to someone. Finally, with the resulting sentences in hand, we classify the gender (a rule-based matching sketch follows this list)
- For the "crime status", simple text mining is used
- For the "acts", "age", "date", a custom-trained NER model is used. The model can be located in
ML/NER/custom_model
and is trained on ~50 articles by using a NER annotator tool (https://github.com/ManivannanMurugavel/spacy-ner-annotator). Further annotation is required for accuracy improvement - For the "location", SpaCy's greek NER is used
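A minimal sketch of the kind of rule-based matching described above, using spaCy's Matcher with the 2.x API pinned in the table at the top; the pattern and example sentence are illustrative assumptions, not the actual rules in ML/POS/patter.py:

```python
# Hypothetical rule-based matching sketch: find verb (+ optional determiner) + noun
# constructions that suggest an act done by / to someone. Pattern is illustrative only.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("el_core_news_sm")  # assumes the Greek spaCy model is installed
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "VERB"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("ACT_ON_SOMEONE", None, pattern)  # spaCy 2.x add() signature

doc = nlp("Ο δράστης μαχαίρωσε τη γυναίκα.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```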
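Similarly, a minimal sketch of loading and applying the custom NER model; the entity labels in the comment are assumptions, and the sketch assumes ML/NER/custom_model is a serialized spaCy model directory:

```python
# Hypothetical sketch: apply the custom-trained NER model to an article body.
import spacy

ner_model = spacy.load("ML/NER/custom_model")  # path from this README

doc = ner_model("...article body...")
for ent in doc.ents:
    print(ent.label_, ent.text)  # e.g. act / age / date spans
```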
For the UI, Plotly's Dash (which runs on Flask) is used. The crime dashboard in dash/crime_dash.py
uses Elasticsearch API calls to retrieve the analyzed data.
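A minimal sketch of how such a Dash app could wire an Elasticsearch aggregation into a pie chart; the index name, field names, and layout are assumptions for illustration, not the actual crime_dash.py code:

```python
# Hypothetical Dash sketch: pie chart of article counts per crime type,
# fetched from Elasticsearch. Index/field names are illustrative.
import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.graph_objects as go
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # assumed local Elasticsearch
resp = es.search(index="articles", body={
    "size": 0,
    "aggs": {"per_type": {"terms": {"field": "type"}}},  # assumes a keyword-mapped field
})
buckets = resp["aggregations"]["per_type"]["buckets"]

fig = go.Figure(go.Pie(labels=[b["key"] for b in buckets],
                       values=[b["doc_count"] for b in buckets]))

app = dash.Dash(__name__)
app.layout = html.Div([html.H1("Crimes per type"), dcc.Graph(figure=fig)])

if __name__ == "__main__":
    app.run_server(debug=True)
```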
Click to see a preview: https://user-images.githubusercontent.com/59322298/114617279-9f8e1c80-9cb0-11eb-9edf-71c4829cb41a.mp4