Text analyzer for the Media Analytics Tool project.
## Requirements
- Python 3
- gcc and g++
- Python libraries listed in `requirements.txt`
## Installation

```bash
git clone [email protected]:lvtffuk/mat-analyzer.git
cd mat-analyzer
pip install -r requirements.txt
python ./
```
## Development
A new analyzer is created by defining a module in `src/analyzers`. The class must extend `BaseAnalyzer` and define at least the `get_name` and `_analyze` methods.
```python
# test.py
from src.analyzers.base import BaseAnalyzer


class Test(BaseAnalyzer):
    def get_name(self) -> str:
        return "Test"

    def _analyze(self) -> None:
        # Do some stuff
        pass
```
Analyzers are loaded as modules by the name defined in the `ANALYZER` environment variable.
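Name-based loading of this kind is usually done with `importlib`. The sketch below is illustrative only: `load_class` is a hypothetical helper, not part of this codebase, and the project's real loader may work differently.

```python
import importlib


def load_class(module_path: str, class_name: str):
    """Import a module by its dotted path and return the named class.

    Hypothetical sketch: the real loader presumably maps the ANALYZER
    value to a module under src.analyzers, but its exact logic is not
    shown in this README.
    """
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

For example, `load_class("src.analyzers.test", "Test")` would return the `Test` analyzer defined above.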
## Settings
The settings are set with environment variables.
| Variable | Description | Required | Default value |
|---|---|---|---|
| `ANALYZER` | The analyzer to use. The app can be run with different analyzers on the same input data. | ✔️ | |
| `INPUT_FILE` | The path to the input CSV file with the texts. | ✔️ | |
| `OUTPUT_DIR` | The directory where the output is stored. | ✔️ | |
| `DATA_KEY` | The column of the input CSV file that contains the texts. | ✔️ | |
| `DOC_ID_KEY` | The column of the input CSV file that contains the document ID. | ✔️ | |
| `CSV_SEPARATOR` | The separator of the input CSV files. | ❌ | `;` |
| `CONFIG_FILE` | The path to the analyzer's YAML config file. | ❌ | `None` |
| `LANGUAGE` | The language for the UDPipe analysis. | ❌ | `cs` |
| `CLEAR` | Indicates whether the output directory should be cleared before the run; all downloads start again. | ❌ | `0` |
| `STOP_WORDS` | The path to a header-less CSV file containing stop words. | ❌ | `None` |
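As a rough illustration of how the settings above behave, the snippet below reads them with the defaults from the table. It is not the project's actual configuration code; the demo `setdefault` line only exists so the snippet runs standalone.

```python
import os

# Demo value so the snippet runs standalone; in real use ANALYZER must
# be provided by the caller, since it has no default.
os.environ.setdefault("ANALYZER", "BTM")

analyzer = os.environ["ANALYZER"]                 # required, no default
separator = os.environ.get("CSV_SEPARATOR", ";")  # optional, default ";"
language = os.environ.get("LANGUAGE", "cs")       # optional, default "cs"
clear = os.environ.get("CLEAR", "0") == "1"       # optional, default off
```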
## Input data
The input CSV file must contain at least two columns: one with the document IDs and one with the texts to analyze. The first line must be a header.

```csv
"doc_id";"text";"additional_field"
"1";"Some text";"foo"
"2";"Some other text";"bar"
```
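The sample above can be parsed with the standard library's `csv` module and the default `;` separator (`CSV_SEPARATOR`); this is just an illustration of the expected shape, not the project's loader:

```python
import csv
import io

# The sample input from the README, with ";" as the separator.
sample = (
    '"doc_id";"text";"additional_field"\n'
    '"1";"Some text";"foo"\n'
    '"2";"Some other text";"bar"\n'
)

# DictReader uses the header line as keys, so each row maps
# column name -> value.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
```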
### BTM
Creates a Biterm Topic Model from the lemmatized texts. The model is stored as `btm-model.pkl` in the output directory.

Config:

```yaml
iterations: 20
seed: 12321
T: 8
M: 20
alpha: 6.25
beta: 0.01
```
### LDA
Creates an LDA model from the lemmatized texts. The model is stored as `lda.model`, `lda.model.expElogbeta.npy`, `lda.model.id2word` and `lda.model.state` in the output directory. For other analyses, the corpus is stored in the output directory as `lda-corpus.json`.

Config:

```yaml
num_topics: 10
```
### LSI
Creates an LSI model from the lemmatized texts. The model is stored as `lsi.model` and `lsi.model.projection` in the output directory.
### VT
Creates an XML file for Voyant Tools. The file is stored as `voyant-tools.xml` in the output directory.

Config:

```yaml
author_key: 'author'
published_key: 'published'
```

`author_key` is the column in the input file representing the author; `published_key` is the column representing the publication time of the text. Both fields are required.
The file can then be uploaded to Voyant Tools, but first the XPath of each field must be specified:

| Field | XPath |
|---|---|
| Documents | `//items` |
| Content | `//item/content` |
| Author | `//item/author` |
| Publication Date | `//item/published` |
For more information check the docs.
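The XPath table implies a document shape roughly like the one built below with the standard library's `ElementTree`. This is an assumption drawn from the XPaths only; the real `voyant-tools.xml` may differ in detail.

```python
import xml.etree.ElementTree as ET

# Root element matching the "Documents" XPath //items.
items = ET.Element("items")

# One <item> per document, with children matching the other XPaths.
item = ET.SubElement(items, "item")
ET.SubElement(item, "content").text = "Some text"
ET.SubElement(item, "author").text = "Jane Doe"
ET.SubElement(item, "published").text = "2021-01-01"

xml_string = ET.tostring(items, encoding="unicode")
```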
### Word2Vec
Creates a Word2Vec model from the lemmatized texts. The model is stored as the `word2vec.model` file in the output directory.

Config:

```yaml
vector_size: 100
window: 5
min_count: 1
workers: 4
```
## Output
The output directory contains the models mentioned above and additional files.

| File | Description |
|---|---|
| `corpus` | The directory containing the NER corpus (it will be deleted in the future). |
| `udpipe-data.csv` | All of the lemmatized token data. |
| `udpipe.csv` | The original texts with lemmatized words. |
| `udpipe.md5` | The MD5 checksum of the input file. If the checksum matches, the lemmatization is not run again. |
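A checksum like the one stored in `udpipe.md5` can be computed with the standard library, which is how a rerun can detect that the input file is unchanged and skip lemmatization. The helper below is an illustrative sketch, not the project's code:

```python
import hashlib
import pathlib


def file_md5(path: str) -> str:
    """Return the hex MD5 digest of a file's contents.

    Illustrative: comparing this digest against the stored udpipe.md5
    tells whether the input file has changed since the last run.
    """
    return hashlib.md5(pathlib.Path(path).read_bytes()).hexdigest()
```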
## Docker
The image is stored in the GitHub Packages registry and the app can be run in a Docker environment.

```shell
docker pull ghcr.io/lvtffuk/mat-analyzer:latest
docker run \
--name=mat-analyzer \
-e 'ANALYZER=BTM|LSI|Word2Vec|VT' \
-e 'INPUT_FILE=./input/tweets.csv' \
-e 'OUTPUT_DIR=./output' \
-e 'DATA_KEY=tweet' \
-e 'DOC_ID_KEY=tweet_id' \
-v '/absolute/path/to/output/dir:/usr/src/app/output' \
-v '/absolute/path/to/input/dir:/usr/src/app/input' \
ghcr.io/lvtffuk/mat-analyzer:latest
```

Note that `DATA_KEY` must name the text column (`tweet`) and `DOC_ID_KEY` the document ID column (`tweet_id`), per the settings table above. The volumes must be set for accessing the input and output data.
This work was supported by the European Regional Development Fund-Project "Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World" (No. CZ.02.1.01/0.0/0.0/16_019/0000734).