DeepTileBars-release

Implementation of DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval

Dependencies

pyspark
nltk
BeautifulSoup
keras
krovetzstemmer
gensim

Running the model

0 Data Preparation

  • Trained gensim word2vec model: the model is named word2vec.100. Download it from Google Drive and place it, together with its auxiliary files, in the data/ directory.

  • Inverse Document Frequency (IDF) file: data/term2idf.json, which is essentially a dictionary storing the mapping word -> idf.

  • Query file: download it from TREC, unzip it, and put it in data/08.million-query-topics

  • LETOR-MQ2008 file: ./MQ2008/ is the folder downloaded from Microsoft.
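Before moving on, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch, not part of this repository: the helper names `missing_inputs` and `load_idf` are hypothetical, and it assumes the paths are relative to the repository root and that term2idf.json is a plain JSON object.

```python
import json
from pathlib import Path

# Paths taken from the list above; treating them as relative to the
# repository root is an assumption.
EXPECTED_PATHS = [
    "data/word2vec.100",
    "data/term2idf.json",
    "data/08.million-query-topics",
    "MQ2008",
]


def missing_inputs(root="."):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


def load_idf(path="data/term2idf.json"):
    """Load the word -> idf mapping stored as a JSON object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        idf = load_idf()
        print(f"Loaded IDF values for {len(idf)} terms")
```

Running the script from the repository root prints either the missing paths or the number of terms in the IDF dictionary.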

1 Preprocessing

python preprocess.py

2 Extracting and cleaning documents

spark-submit --master [your-spark-cluster] --py-files trecweb_parser.py extract_file.py /path/to/corpus /path/to/clean-file

3 TextTiling

Warning: Python 3 users may need to fix a bug in NLTK; follow this post.

Update: as far as we know, NLTK 3.3.0 has fixed this bug.

spark-submit --master [your-spark-cluster] texttiling.py /path/to/clean-file /path/to/segmented-file

4 Coloring

spark-submit --master [your-spark-cluster] text2img.py /path/to/segmented-file /path/to/images
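To give a rough idea of what a "colored" document looks like, here is a small sketch of a TileBars-style grid: one row per query term, one column per text segment, and each cell holding the term's frequency in that segment, optionally weighted by IDF. This is only an illustration and is not the logic of text2img.py; the function name `tile_matrix`, the pre-tokenized input format, and the choice of features are all assumptions.

```python
def tile_matrix(query_terms, segments, idf=None):
    """Build a query-term x segment grid of (optionally IDF-weighted) term counts.

    query_terms: list of query tokens (one row each).
    segments: list of token lists, one per text segment (one column each).
    idf: optional dict mapping word -> idf; terms missing from it get weight 0.
    """
    matrix = []
    for term in query_terms:
        weight = 1.0 if idf is None else idf.get(term, 0.0)
        # Count how often the query term occurs in each segment.
        row = [segment.count(term) * weight for segment in segments]
        matrix.append(row)
    return matrix


# Hypothetical, already tokenized segments of one document.
segments = [
    ["neural", "ranking", "model"],
    ["term", "distribution", "ranking", "ranking"],
    ["unrelated", "text"],
]
grid = tile_matrix(["ranking", "term"], segments)
# grid == [[1.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
```

Each row of the grid shows where a query term is concentrated across the document, which is the kind of term-distribution picture that can then be given to a model as an image-like input.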

5 Run the model

python rank.py /path/to/images epochs

e.g.

python rank.py ./img 5

Citation

If you are using this repo, please cite the following paper:

@inproceedings{deeptilebars2018,
    title={DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval},
    author={Tang, Zhiwen and Yang, Grace Hui},
    booktitle={AAAI 2019},
    year={2019}
}
