Implementation of DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval
pyspark
nltk
BeautifulSoup
keras
krovetzstemmer
gensim
- Trained gensim word2vec model: the model is named word2vec.100. Download it from Google Drive and put it, along with its auxiliary files, in the data/ directory.
- Inverse Document Frequency (IDF) file: data/term2idf.json, which is essentially a dictionary storing the mapping word -> idf.
- Query file: download it from TREC, unzip it, and put it in data/08.million-query-topics.
- LETOR-MQ2008 file: ./MQ2008/ is the folder downloaded from Microsoft.
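The IDF file described above is a flat JSON object mapping each word to its IDF value. As a minimal sketch of that format, the hypothetical helper build_term2idf below computes such a mapping from tokenized documents and round-trips it through JSON. The smoothed logarithmic formula is an assumption; how the repository's term2idf.json was actually built is not stated here.

```python
import json
import math
from collections import Counter


def build_term2idf(docs):
    """Return {word: idf} for a list of tokenized documents.

    Assumption: idf = log((N + 1) / (df + 1)) + 1. The formula behind the
    repository's term2idf.json may differ.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {
        word: math.log((n_docs + 1) / (df + 1)) + 1.0
        for word, df in doc_freq.items()
    }


# Toy corpus, for illustration only.
docs = [["neural", "ranking"], ["neural", "retrieval"], ["text", "tiling"]]
term2idf = build_term2idf(docs)

# Serialize and parse back, mirroring how a file such as
# data/term2idf.json would be written and read.
loaded = json.loads(json.dumps(term2idf))
```

Words that occur in fewer documents receive a larger value, so in this toy corpus "text" is weighted above "neural".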
python preprocess.py
spark-submit --master [your-spark-cluster] --py-files trecweb_parser.py extract_file.py /path/to/corpus /path/to/clean-file
Warning: Python 3 users may need to fix a bug in NLTK by following this post.
Update: As far as we know, NLTK 3.3.0 has fixed this bug.
spark-submit --master [your-spark-cluster] texttiling.py /path/to/clean-file /path/to/segmented-file
spark-submit --master [your-spark-cluster] text2img.py /path/to/segmented-file /path/to/images
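To illustrate the idea behind the text-to-image step, the sketch below builds a query-term by segment matrix of term counts from an already segmented document. It is not the repository's text2img.py: the function name term_segment_matrix, the whitespace tokenization, and the use of raw counts as the only cell value are assumptions made for illustration.

```python
def term_segment_matrix(query_terms, segments):
    """Rows are query terms, columns are segments, cells are raw term counts."""
    matrix = []
    for term in query_terms:
        row = [
            segment.lower().split().count(term.lower())
            for segment in segments
        ]
        matrix.append(row)
    return matrix


# Hypothetical segmented document: one string per topical segment.
segments = [
    "neural ranking models score documents",
    "text tiling splits documents into segments",
    "neural models read the segments in order",
]
query = ["neural", "segments"]

matrix = term_segment_matrix(query, segments)
for term, row in zip(query, matrix):
    print(term, row)
```

Each row shows where one query term occurs across the document, which is the kind of term distribution a ranking model can then read as an image.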
python rank.py /path/to/images epochs
For example, to run for 5 epochs on the images in ./img:
python rank.py ./img 5
If you are using this repo, please cite the following paper:
@inproceedings{deeptilebars2018,
title={DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval},
author={Tang, Zhiwen and Yang, Grace Hui},
booktitle={AAAI 2019},
year={2019}
}