Word Clusters based Document Embedding (WcDe)

This repository provides an implementation for generating Word Clusters based Document Embedding (WcDe). For now, its purpose is to let the user experiment with the methodology, so the demo is configured to work only with the BBC datasets and the 100-dimensional pre-trained GloVe embedding. The demo generates WcDe document representations, clusters them, and evaluates the performance using the Normalized Mutual Information (NMI) score.

Files and Methods

demo.py - This file runs the demo. It contains the methods and parameters that are specific to this demo.

__main__ - The main body of demo.py that sets the demo parameters, generates document vectors, clusters document vectors and evaluates the performance using Normalized Mutual Information between the clusters and the true class of documents. The parameters that can be set for the demo are -

Variable	Default Value	Type	Comment
dataset_path	"/path/to/bbc"	str	Path to `bbc` or `bbcsport` directories
embedding_file	"/path/to/glove.6B.100d.txt"	str	Word embedding to be used
word_vector_size	100	int	Size of each word vector
clustering_algorithm	"ahc"	str	The clustering technique. Acceptable values are - "ahc", "kmeans".
linkage	"ward"	str	Merge Criteria for Hierarchical Clustering. Acceptable values are - "ward", "complete", "average", "single".
n_clusters	None	int or None	Number of clusters for Flat clustering
distance_threshold	8	float	Distance Threshold for Hierarchical Clustering
weighting_scheme	"cfidf"	str	The weighting scheme to be used to calculate score of word cluster in the document. Acceptable values are
length_normalize	True	bool	Whether to length normalize the WcDe document vector or not
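
The clustering parameters above share their names with scikit-learn's clustering estimators. The following is a minimal sketch of how they could be mapped onto those estimators, assuming the demo relies on scikit-learn (not confirmed by this README). The helper name `build_clusterer` is hypothetical and is not part of this repository.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans


def build_clusterer(clustering_algorithm="ahc", linkage="ward",
                    n_clusters=None, distance_threshold=8):
    """Hypothetical helper: map the demo parameters onto a scikit-learn estimator."""
    if clustering_algorithm == "ahc":
        # scikit-learn requires exactly one of n_clusters / distance_threshold
        # to be None, so an explicit n_clusters disables the threshold here.
        if n_clusters is not None:
            distance_threshold = None
        return AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage,
                                       distance_threshold=distance_threshold)
    if clustering_algorithm == "kmeans":
        if n_clusters is None:
            raise ValueError("kmeans needs an explicit n_clusters")
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    raise ValueError(f"unknown clustering_algorithm: {clustering_algorithm!r}")


# Toy data: two tight groups that are far apart from each other.
points = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
labels = build_clusterer().fit_predict(points).tolist()
n_found = len(set(labels))  # the threshold, not n_clusters, decides this
```

With `n_clusters=None`, the number of clusters is not fixed in advance: merging stops once the linkage distance exceeds `distance_threshold`.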

read_bbc_dataset() - Reads either of the BBC datasets (BBC or BBCSport). The raw text files can be downloaded from http://mlg.ucd.ie/datasets/bbc.html. Unzipping the downloaded archive yields the raw data as a directory, bbc or bbcsport. This method takes the path to one of these directories and parses its contents to get the texts and their corresponding classes. To make the experiment deterministic, the documents are read in alphabetical order of class name and, within a class, in alphabetical order of file name.
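
The deterministic reading order described above can be sketched as follows. This is an illustration only, not the repository's code: it assumes one sub-directory per class containing one text file per document, and the function name `read_labelled_directory` is hypothetical.

```python
import os
import tempfile


def read_labelled_directory(dataset_path):
    """Illustrative reader: one sub-directory per class, one file per document."""
    texts, classes = [], []
    # Sort class directories so the order does not depend on the file system.
    class_names = sorted(
        name for name in os.listdir(dataset_path)
        if os.path.isdir(os.path.join(dataset_path, name))
    )
    for class_name in class_names:
        class_dir = os.path.join(dataset_path, class_name)
        # Sort file names within each class for the same reason.
        for file_name in sorted(os.listdir(class_dir)):
            file_path = os.path.join(class_dir, file_name)
            if not os.path.isfile(file_path):
                continue
            # errors="replace" is an assumption to tolerate stray non-UTF-8 bytes.
            with open(file_path, encoding="utf-8", errors="replace") as handle:
                texts.append(handle.read())
            classes.append(class_name)
    return texts, classes


# Usage on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as root:
    for class_name, file_name, body in [
        ("tech", "002.txt", "second tech story"),
        ("tech", "001.txt", "first tech story"),
        ("business", "001.txt", "a business story"),
    ]:
        os.makedirs(os.path.join(root, class_name), exist_ok=True)
        path = os.path.join(root, class_name, file_name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
    texts, classes = read_labelled_directory(root)

print(classes)  # ['business', 'tech', 'tech']
```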
read_glove_embeddings() - Reads GloVe pre-trained word embeddings and returns a list of words and an array containing word vectors corresponding to the words. To make the experiment deterministic, the words are sorted in alphabetical order.
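
As a rough sketch of this step: the GloVe text format stores one word per line, followed by its space-separated vector components. The function below is hypothetical (`read_word_vectors` is not the repository's method), and it returns plain Python lists rather than an array to stay dependency-free.

```python
import os
import tempfile


def read_word_vectors(embedding_file, word_vector_size):
    """Illustrative parser for a GloVe-style text file."""
    entries = []
    with open(embedding_file, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            # Skipping lines of the wrong length is an assumption of this sketch.
            if len(parts) != word_vector_size + 1:
                continue
            entries.append((parts[0], [float(value) for value in parts[1:]]))
    # Sort alphabetically by word so the output order is deterministic.
    entries.sort(key=lambda entry: entry[0])
    words = [word for word, _ in entries]
    vectors = [vector for _, vector in entries]
    return words, vectors


# Usage with a tiny 3-dimensional file written to a temporary directory.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "toy_vectors.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("the 0.1 0.2 0.3\n")
        handle.write("apple 1.0 2.0 3.0\n")
        handle.write("broken 1.0\n")
        handle.write("zebra -1.0 0.0 1.0\n")
    words, vectors = read_word_vectors(path, word_vector_size=3)

print(words)  # ['apple', 'the', 'zebra']
```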

WcDe.py - This file contains the methods that implement the WcDe methodology.
1. cluster_word_vectors() - Clusters the word vectors.
2. get_document_vectors() - Generates the WcDe document vectors.
helpers.py - This file contains additional helper methods.
1. tokenize() - Tokenizes a piece of text.
2. flatten() - Flattens nested lists.
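
To give a feel for what a word-cluster-based document vector looks like, here is a minimal sketch under stated assumptions. It reads "cfidf" as cluster frequency multiplied by an inverse document frequency of the form log(N / df); that reading, the function name `sketch_document_vectors`, and the toy word-to-cluster mapping are all assumptions made for illustration, so the actual formula in WcDe.py may differ.

```python
import math
from collections import Counter


def sketch_document_vectors(tokenized_docs, word_to_cluster, n_clusters,
                            length_normalize=True):
    """Illustrative only: one dimension per word cluster, cf * idf weights."""
    n_docs = len(tokenized_docs)
    # Cluster frequency: how many tokens of each document fall in each cluster.
    counts = []
    for tokens in tokenized_docs:
        counter = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
        counts.append([counter.get(c, 0) for c in range(n_clusters)])
    # Document frequency: number of documents in which each cluster occurs.
    doc_freq = [sum(1 for row in counts if row[c] > 0) for c in range(n_clusters)]
    # Assumed idf form: log(N / df); unseen clusters get weight 0.
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    vectors = []
    for row in counts:
        vector = [cf * weight for cf, weight in zip(row, idf)]
        if length_normalize:
            norm = math.sqrt(sum(v * v for v in vector))
            if norm > 0:
                vector = [v / norm for v in vector]
        vectors.append(vector)
    return vectors


# Toy mapping: cluster 0 = sport words, 1 = finance words, 2 = function words.
word_to_cluster = {"goal": 0, "match": 0, "bank": 1, "profit": 1, "the": 2}
docs = [["the", "goal", "match", "goal"], ["the", "bank", "profit"], ["the", "match"]]
doc_vectors = sketch_document_vectors(docs, word_to_cluster, n_clusters=3)
```

In this toy example the cluster containing "the" occurs in every document, so its weight is zero, and each normalized vector points along the single remaining cluster its document uses.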

Getting Started

To run the demo, follow these steps.

Prerequisites

Python >= 3.6
virtualenv for creating the virtual environment

For information on how to install virtualenv, please refer to "Installing packages using pip and virtual environments" in the Python Packaging User Guide.

Installation

Clone the repo

git clone https://github.com/sunandabansal/WcDe
cd WcDe

Set up the virtual environment
Create a virtual environment
```
virtualenv env
```
Activate the virtual environment
```
source env/bin/activate
```
Install the packages
```
pip3 install -r requirements.txt
```

Usage Instructions

In demo.py, set the following values in the main body.

Variable	Comment
dataset_path	Path to bbc or bbcsport directories
embedding_file	Path to GloVe 100-dimensional pre-trained word embedding
word_vector_size	Size of each word vector
```
dataset_path      = "your/path/to/bbc"
embedding_file    = "your/path/to/glove.6B.100d.txt"
word_vector_size  = 100
```
Run
```
python3 demo.py
```

Expected Output

Using the GloVe 100d word embedding and the default clustering configuration given below -

Variable	Value	Comment
clustering_algorithm	"ahc"	The clustering technique
linkage	"ward"	Merge Criteria for Hierarchical Clustering
n_clusters	None	Number of clusters for Flat clustering
distance_threshold	8	Distance Threshold for Hierarchical Clustering

The output for the BBC dataset -

Reading dataset.
Tokenizing documents.
Getting word vectors.
Clustering word vectors.
Generating document vectors.
Clustering document vectors.
Performance (NMI): 0.7956132539056411

The output for the BBC Sport dataset -

Reading dataset.
Tokenizing documents.
Getting word vectors.
Clustering word vectors.
Generating document vectors.
Clustering document vectors.
Performance (NMI): 0.8004845766270388
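
For readers unfamiliar with the reported metric, the sketch below computes Normalized Mutual Information between true classes and cluster labels in plain Python. It normalizes by the arithmetic mean of the two entropies; whether the demo uses this exact normalization is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(true_classes, cluster_labels):
    """NMI = I(classes; clusters) / mean(H(classes), H(clusters))."""
    n = len(true_classes)
    class_counts = Counter(true_classes)
    cluster_counts = Counter(cluster_labels)
    joint_counts = Counter(zip(true_classes, cluster_labels))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information summed over the (class, cluster) pairs that occur.
    mutual_info = sum(
        (c / n) * math.log(c * n / (class_counts[a] * cluster_counts[b]))
        for (a, b), c in joint_counts.items()
    )
    denominator = (entropy(class_counts) + entropy(cluster_counts)) / 2
    # Both partitions trivial: treat as perfect agreement (a convention).
    return mutual_info / denominator if denominator > 0 else 1.0


# Clusters that match the classes up to renaming score (approximately) 1.
perfect = normalized_mutual_information(["a", "a", "b", "b"], [1, 1, 0, 0])
# Clusters that are independent of the classes score 0.
independent = normalized_mutual_information(["a", "a", "b", "b"], [0, 1, 0, 1])
```

The score does not depend on how cluster labels are numbered, which is why it suits comparing a clustering against true classes.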

Deactivating environment

For deactivating the virtual environment, run -

deactivate

License

Distributed under the Creative Commons Attribution 4.0 International License.
See LICENSE for more information.
