Science: paper
Pipeline:
- Collect the list of GitHub repositories to process.
- Fetch repositories and save them as UASTModel (a.k.a. UAST model).
- Calculate document frequencies.
- Produce BOW (bag-of-words) models from UAST models.
- Join BOW models into a single BOW model.
- Convert the BOW model to Vowpal Wabbit format.
- Convert the Vowpal Wabbit dataset to BigARTM batches.
- Train the topic model using BigARTM.
- Convert the result to TopicModel.
There are several options. You can use the GitHub API or execute a query in BigQuery. The easiest way is to download source{d}'s dataset of the whole world's open source projects (not released yet).
In the end, you should have a text file, say, repos.txt, with one repository URL per line:
https://github.com/tensorflow/tensorflow
https://github.com/pytorch/pytorch
...
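For instance, here is a minimal sketch that builds repos.txt from the GitHub search API, assuming you have curl and jq installed; the star threshold and page size are arbitrary illustrations, not recommendations:
# Fetch the 100 most-starred repositories with more than 1000 stars
# and keep only their URLs (illustrative selection criteria).
curl -s "https://api.github.com/search/repositories?q=stars:>1000&sort=stars&per_page=100" \
  | jq -r '.items[].html_url' > repos.txt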
The first thing you need is to install enry, source{d}'s source code classifier. The following command should produce the enry executable in the current directory. From here on, we assume that you stay in this directory.
ast2vec enry
Let's run the cloning pipeline:
ast2vec repos2uast -p 16 -t 4 --organize-files 2 -o uasts repos.txt
This will run 16 processes; each clones a repository, converts its files to Abstract Syntax Trees using Babelfish in 4 threads, and finally writes the result to the uasts directory.
Choosing good values for -p and -t is an art. The general rule is to inspect the system load and determine the current bottleneck:
Cause | Symptoms | Action |
---|---|---|
cloning IO underload | low CPU usage | increase -p |
network bandwidth limit | CPU usage stays the same regardless of -p | increase -t |
Babelfish bandwidth limit | high CPU usage | decrease -t |
no free memory | out-of-memory errors, swapping, lags | decrease -p and/or -t |
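A quick way to take these measurements while the pipeline runs, assuming the sysstat package is available for iostat:
# overall CPU usage and per-process view
top
# disk utilization, refreshed every 5 seconds
iostat -x 5
# memory and swap usage
free -h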
In some cases the Babelfish server responses take too much time and you get timeout errors. Try increasing --timeout; if that does not help, decrease -t and even -p, as in the sketch below.
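For example, a gentler retry might look like this; the timeout value is an arbitrary illustration (check ast2vec repos2uast --help for the accepted unit):
ast2vec repos2uast -p 8 -t 2 --timeout 600 --organize-files 2 -o uasts repos.txt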
In the end, you will have .asdf files inside two levels of directories in uasts.
If you resume the pipeline, make sure to pass --disable-overwrite so that the same work is not done twice.
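A resumed run with the same parameters as before would then be:
ast2vec repos2uast -p 16 -t 4 --organize-files 2 --disable-overwrite -o uasts repos.txt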
ast2vec uasts2df -p 4 uasts docfreq.asdf
We run 4 workers and save the result to docfreq.asdf.
ast2vec uast2bow --df docfreq.asdf -v 100000 -p 4 uasts bows
Again, 4 workers. We set the number of distinct tokens to 100k here. The bigger the vocabulary size, the better the model, but the higher the memory usage and the bigger the bag-of-words models. It is sane to increase -v up to 2-3 million.
The results will be in the bows directory.
ast2vec join-bow -p 4 --bow bows joined_bow.asdf
4 workers merge the individual bags-of-words together into joined_bow.asdf.
ast2vec bow2vw --bow joined_bow.asdf -o vw_dataset.txt
We transform the merged BOW model stored in ASDF binary format into the plain-text "Vowpal Wabbit" format. We go through this intermediate format because BigARTM's Python API is much slower at the direct conversion.
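In Vowpal Wabbit's text format, each document occupies a single line: a document tag glued to a pipe, followed by token:count pairs. A purely hypothetical line for one repository could look like this (the actual tag and token naming depend on the BOW model):
tensorflow/tensorflow| loop:42 if:37 tensor:15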
You will need a working bigartm command-line application. The following command should install bigartm into the current working directory, provided that you have all the dependencies present in the system.
ast2vec bigartm
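To check that the build succeeded, you can print the usage summary:
./bigartm --help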
The actual conversion; -p 0 means zero training passes, so bigartm only writes the batches and the dictionary:
./bigartm -c vw_dataset.txt -p 0 --save-batches artm_batches --save-dictionary artm_batches/artm.dict
Stage 1 performs the main optimization:
./bigartm --use-batches artm_batches --use-dictionary artm_batches/artm.dict -t 256 -p 20 --threads 4 --rand-seed 777 --regularizer "1000 Decorrelation" --save-model stage1.bigartm
Stage 2 optimizes for sparsity:
./bigartm --use-batches artm_batches --use-dictionary artm_batches/artm.dict --load-model stage1.bigartm -p 10 --threads 4 --rand-seed 777 --regularizer "1000 Decorrelation" "0.5 SparsePhi" "0.5 SparseTheta" --save-model stage2.bigartm
We set the number of topics to 256 and the number of workers to 4. -p sets the number of iterations (passes).
Choosing the stages and the regularizers is an art. Please refer to the BigARTM papers.
First we convert the model to the text format:
./bigartm --use-batches artm_batches --use-dictionary artm_batches/artm.dict --load-model stage2.bigartm -p 0 --write-model-readable readable_stage2.txt
Second we convert the text format to ASDF:
ast2vec bigartm2asdf readable_stage2.txt topic_model.asdf