Text analyzer for the Media Analytics Tool project.
## Requirements
- Python 3
- gcc and g++
- Python libraries listed in `requirements.txt`
## Installation

```bash
git clone [email protected]:lvtffuk/mat-analyzer.git
cd mat-analyzer
pip install -r requirements.txt
python ./
```
## Development
A new analyzer is created by defining a module in `src/analyzers`. The class must extend `BaseAnalyzer` and define at least the `get_name` and `_analyze` methods.
```python
# test.py
from src.analyzers.base import BaseAnalyzer


class Test(BaseAnalyzer):
    def get_name(self) -> str:
        return "Test"

    def _analyze(self) -> None:
        # Do some stuff
        pass
```
Analyzers are loaded as modules by the name defined in the `ANALYZER` environment variable.
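Name-based loading of this kind is usually done with `importlib`. The sketch below is illustrative only: `load_class` is a hypothetical helper, not part of this codebase, and the project's real loader may work differently.

```python
import importlib


def load_class(module_path: str, class_name: str):
    """Import a module by its dotted path and return the named class.

    Hypothetical sketch: the real loader presumably maps the ANALYZER
    value to a module under src.analyzers, but its exact logic is not
    shown in this README.
    """
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

For example, `load_class("src.analyzers.test", "Test")` would return the `Test` analyzer defined above.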
## Settings
The settings are set with environment variables.
| Variable | Description | Required | Default value |
|---|---|---|---|
| `ANALYZER` | The analyzer to use. The app can be run with different analyzers on the same input data. | ✔️ | |
| `INPUT_FILE` | The path to the input CSV file with the texts. | ✔️ | |
| `OUTPUT_DIR` | The directory where the output is stored. | ✔️ | |
| `DATA_KEY` | The column of the input CSV file that contains the texts. | ✔️ | |
| `DOC_ID_KEY` | The column of the input CSV file that contains the document ID. | ✔️ | |
| `CSV_SEPARATOR` | The separator of the input CSV files. | ❌ | `;` |
| `CONFIG_FILE` | The path to the analyzer's YAML config file. | ❌ | `None` |
| `LANGUAGE` | The language for the UDPipe analysis. | ❌ | `cs` |
| `CLEAR` | Indicates whether the output directory should be cleared before the run; all downloads start again. | ❌ | `0` |
| `STOP_WORDS` | The path to a header-less CSV file containing stop words. | ❌ | `None` |
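As a rough illustration of how the settings above behave, the snippet below reads them with the defaults from the table. It is not the project's actual configuration code; the demo `setdefault` line only exists so the snippet runs standalone.

```python
import os

# Demo value so the snippet runs standalone; in real use ANALYZER must
# be provided by the caller, since it has no default.
os.environ.setdefault("ANALYZER", "BTM")

analyzer = os.environ["ANALYZER"]                 # required, no default
separator = os.environ.get("CSV_SEPARATOR", ";")  # optional, default ";"
language = os.environ.get("LANGUAGE", "cs")       # optional, default "cs"
clear = os.environ.get("CLEAR", "0") == "1"       # optional, default off
```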
## Input data
The input CSV file must contain at least two columns: one with the document IDs and one with the texts to analyze. The first line must be a header.

```csv
"doc_id";"text";"additional_field"
"1";"Some text";"foo"
"2";"Some other text";"bar"
```
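The sample above can be parsed with the standard library's `csv` module and the default `;` separator (`CSV_SEPARATOR`); this is just an illustration of the expected shape, not the project's loader:

```python
import csv
import io

# The sample input from the README, with ";" as the separator.
sample = (
    '"doc_id";"text";"additional_field"\n'
    '"1";"Some text";"foo"\n'
    '"2";"Some other text";"bar"\n'
)

# DictReader uses the header line as keys, so each row maps
# column name -> value.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
```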
### BTM
Creates a Biterm Topic Model from the lemmatized texts. The model is stored as `btm-model.pkl` in the output directory.

Config:

```yaml
iterations: 20
seed: 12321
T: 8
M: 20
alpha: 6.25
beta: 0.01
```
### LDA
Creates an LDA model from the lemmatized texts. The model is stored as `lda.model`, `lda.model.expElogbeta.npy`, `lda.model.id2word` and `lda.model.state` in the output directory. For other analyses, the corpus is stored in the output directory as `lda-corpus.json`.

Config:

```yaml
num_topics: 10
```
### LSI
Creates an LSI model from the lemmatized texts. The model is stored as `lsi.model` and `lsi.model.projection` in the output directory.
### VT
Creates an XML file for Voyant Tools. The file is stored as `voyant-tools.xml` in the output directory.

Config:

```yaml
author_key: 'author'
published_key: 'published'
```

`author_key` is the column in the input file representing the author; `published_key` is the column representing the publication time of the text. Both fields are required.
The file can then be uploaded to Voyant Tools, but first the XPath of each field must be specified:

| Field | XPath |
|---|---|
| Documents | `//items` |
| Content | `//item/content` |
| Author | `//item/author` |
| Publication Date | `//item/published` |
For more information check the docs.
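The XPath table implies a document shape roughly like the one built below with the standard library's `ElementTree`. This is an assumption drawn from the XPaths only; the real `voyant-tools.xml` may differ in detail.

```python
import xml.etree.ElementTree as ET

# Root element matching the "Documents" XPath //items.
items = ET.Element("items")

# One <item> per document, with children matching the other XPaths.
item = ET.SubElement(items, "item")
ET.SubElement(item, "content").text = "Some text"
ET.SubElement(item, "author").text = "Jane Doe"
ET.SubElement(item, "published").text = "2021-01-01"

xml_string = ET.tostring(items, encoding="unicode")
```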
### Word2Vec
Creates a Word2Vec model from the lemmatized texts. The model is stored as the `word2vec.model` file in the output directory.

Config:

```yaml
vector_size: 100
window: 5
min_count: 1
workers: 4
```
## Output
The output directory contains the models mentioned above and additional files.

| File | Description |
|---|---|
| `corpus` | The directory containing the NER corpus (it will be deleted in the future). |
| `udpipe-data.csv` | All of the lemmatized token data. |
| `udpipe.csv` | The original texts with lemmatized words. |
| `udpipe.md5` | The MD5 checksum of the input file. If the checksum matches, the lemmatization is not run again. |
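A checksum like the one stored in `udpipe.md5` can be computed with the standard library, which is how a rerun can detect that the input file is unchanged and skip lemmatization. The helper below is an illustrative sketch, not the project's code:

```python
import hashlib
import pathlib


def file_md5(path: str) -> str:
    """Return the hex MD5 digest of a file's contents.

    Illustrative: comparing this digest against the stored udpipe.md5
    tells whether the input file has changed since the last run.
    """
    return hashlib.md5(pathlib.Path(path).read_bytes()).hexdigest()
```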
## Docker
The image is stored in the GitHub Packages registry and the app can be run in a Docker environment.

```shell
docker pull ghcr.io/lvtffuk/mat-analyzer:latest
docker run \
--name=mat-analyzer \
-e 'ANALYZER=BTM|LSI|Word2Vec|VT' \
-e 'INPUT_FILE=./input/tweets.csv' \
-e 'OUTPUT_DIR=./output' \
-e 'DATA_KEY=tweet' \
-e 'DOC_ID_KEY=tweet_id' \
-v '/absolute/path/to/output/dir:/usr/src/app/output' \
-v '/absolute/path/to/input/dir:/usr/src/app/input' \
ghcr.io/lvtffuk/mat-analyzer:latest
```

Note that `DATA_KEY` must name the text column (`tweet`) and `DOC_ID_KEY` the document ID column (`tweet_id`), per the settings table above. The volumes must be set for accessing the input and output data.
This work was supported by the European Regional Development Fund-Project "Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World" (No. CZ.02.1.01/0.0/0.0/16_019/0000734).