DeepTileBars-release

Implementation of DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval

Dependencies

pyspark
nltk
BeautifulSoup
keras
krovetzstemmer
gensim

Running the model

0 Data Preparation

Trained gensim word2vec model: the model is named word2vec.100. Download it from Google Drive and put it, along with its auxiliary files, in the data/ directory.
Inverse Document Frequency (IDF) file: data/term2idf.json, a JSON dictionary that maps each word to its IDF value.
Query file: download it from TREC, unzip it, and put it in data/08.million-query-topics.
LETOR-MQ2008 file: ./MQ2008/ is the folder downloaded from Microsoft.
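
The layout of data/term2idf.json can be illustrated with a small sketch. This is not the script used to produce the released file: the IDF formula (log(N / df)), the helper names build_idf, save_idf and load_idf, and the toy corpus are all assumptions made for illustration. Only the word -> idf dictionary shape comes from the description above.

```python
import json
import math
from collections import Counter


def build_idf(documents):
    """Map each term to log(N / df); this formula is an assumption, not the authors'."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def save_idf(term2idf, path):
    """Write the word -> idf dictionary as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(term2idf, f)


def load_idf(path):
    """Read a word -> idf dictionary back from JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Toy corpus, already tokenized; purely illustrative.
toy_docs = [
    ["neural", "ranking", "model"],
    ["neural", "information", "retrieval"],
    ["term", "distribution"],
]
toy_idf = build_idf(toy_docs)
```

Calling save_idf(toy_idf, "data/term2idf.json") would write a file in the same word -> idf shape; whether the rest of the pipeline expects the same smoothing or tokenization is not stated here.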

1 Preprocessing

python preprocess.py

2 Extracting and cleaning documents

spark-submit --master [your-spark-cluster] --py-files trecweb_parser.py extract_file.py /path/to/corpus /path/to/clean-file

3 TextTiling

Warning: Python 3 users may need to fix a bug in NLTK by following this post.

Update: as far as we know, NLTK 3.3.0 has fixed this bug.

spark-submit --master [your-spark-cluster] texttiling.py /path/to/clean-file /path/to/segmented-file

4 Coloring

spark-submit --master [your-spark-cluster] text2img.py /path/to/segmented-file /path/to/images
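
What text2img.py computes is not described in this README. As a rough illustration of the TileBars idea in the project title, the sketch below assumes an "image" is a grid with one row per query term and one column per text segment, where each cell holds a term count. The function name term_segment_matrix and the toy data are hypothetical, and the real script may use different or additional features.

```python
def term_segment_matrix(query_terms, segments):
    """Return one row per query term and one column per segment.

    Cell (i, j) is the number of times query term i occurs in segment j.
    Raw counts are an assumption; the actual script may weight cells differently.
    """
    return [[segment.count(term) for segment in segments] for term in query_terms]


# Toy query and segments (each segment is a list of tokens); purely illustrative.
toy_query = ["neural", "retrieval"]
toy_segments = [
    ["neural", "models", "for", "retrieval"],
    ["neural", "neural", "networks"],
    ["evaluation", "on", "letor"],
]
toy_matrix = term_segment_matrix(toy_query, toy_segments)
# toy_matrix == [[1, 2, 0], [1, 0, 0]]
```

Reading across a row shows where in the document a query term is concentrated, which is the kind of term-distribution pattern a ranking model could learn from.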

5 Run the model

python rank.py /path/to/images epochs

For example, with the images in ./img and epochs set to 5:

python rank.py ./img 5

Citation

If you are using this repo, please cite the following paper:

@inproceedings{deeptilebars2018,
    title={DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval},
    author={Tang, Zhiwen and Yang, Grace Hui},
    booktitle={AAAI 2019},
    year={2019}
}
