clarifying data format and memory usage in docs
kunaldahiya committed May 5, 2021
1 parent dcbf94a commit 81dedfb
Showing 2 changed files with 12 additions and 3 deletions.
14 changes: 11 additions & 3 deletions README.md
@@ -45,7 +45,13 @@ DeepXML supports multiple feature architectures such as Bag-of-embedding/Astec,
```txt
* Download the (zipped file) BoW features from XML repository.
* Extract the zipped file into data directory.
- * The following files should be available in <work_dir>/data/<dataset>
+ * The following files should be available in <work_dir>/data/<dataset> for new datasets (ignore the next step)
+ - trn_X_Xf.txt
+ - trn_X_Y.txt
+ - tst_X_Xf.txt
+ - tst_X_Y.txt
+ - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
+ * The following files should be available in <work_dir>/data/<dataset> if the dataset is in old format (please refer to next step to convert the data to new format)
- train.txt
- test.txt
- fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
@@ -89,8 +95,8 @@ An ensemble can be trained as follows. A json file is used to specify architectu
* framework
- DeepXML: Divides the XML problems in 4 modules as proposed in the paper.
- - DeepXML-OVA: Train the method in 1-vs-all fashion [4][5], i.e., loss is computed for each label in each iteration.
- - DeepXML-ANNS: Train the method using a label shortlist. Support is available for a fixed graph or periodic training of the ANNS graph.
+ - DeepXML-OVA: Train the architecture in 1-vs-all fashion [4][5], i.e., loss is computed for each label in each iteration.
+ - DeepXML-ANNS: Train the architecture using a label shortlist. Support is available for a fixed graph or periodic training of the ANNS graph.
* dataset
- Name of the dataset.
@@ -117,6 +123,8 @@ An ensemble can be trained as follows. A json file is used to specify architectu
* Other file formats such as npy, npz, pickle are also supported.
* Initializing with token embeddings (computed from FastText) leads to a noticeable accuracy gain in Astec. Please ensure that the token embedding file is available in the data directory if 'init=token_embeddings'; otherwise it will throw an error.
* Config files are made available in deepxml/configs/<framework>/<method> for datasets in XC repository. You can use them when trying out Astec/DeepXML on new datasets.
+ * We conducted our experiments on a 24-core Intel Xeon 2.6 GHz machine with 440GB RAM and a single Nvidia P40 GPU. 128GB memory should suffice for most datasets.
+ * Astec makes use of the CPU (mainly for nmslib) as well as the GPU.
```
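
The file lists in the hunk above can be turned into a quick pre-flight check. The following is only a minimal sketch, assuming the file names quoted in the README text; `detect_data_format` and `has_embeddings` are hypothetical helpers written for illustration and are not part of the DeepXML code base.

```python
from pathlib import Path

# File names copied from the README text in the diff above.
NEW_FORMAT_FILES = ("trn_X_Xf.txt", "trn_X_Y.txt", "tst_X_Xf.txt", "tst_X_Y.txt")
OLD_FORMAT_FILES = ("train.txt", "test.txt")
EMBEDDING_FILES = (
    "fasttextB_embeddings_300d.npy",
    "fasttextB_embeddings_512d.npy",
)


def detect_data_format(data_dir):
    """Return 'new', 'old', or None for a <work_dir>/data/<dataset> directory.

    Hypothetical helper: it only checks that the expected file names exist,
    not that their contents are valid.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return None
    present = {p.name for p in data_dir.iterdir() if p.is_file()}
    # Prefer the new format when both sets of files are present.
    if all(name in present for name in NEW_FORMAT_FILES):
        return "new"
    if all(name in present for name in OLD_FORMAT_FILES):
        return "old"
    return None


def has_embeddings(data_dir):
    """Return True if either of the listed embedding files is present."""
    data_dir = Path(data_dir)
    return any((data_dir / name).is_file() for name in EMBEDDING_FILES)
```

A result of `"old"` would indicate that the conversion step mentioned in the README is still needed, and a `False` from `has_embeddings` would flag the missing embedding file before a run with 'init=token_embeddings' fails.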

## Cite as
1 change: 1 addition & 0 deletions deepxml/configs/DeepXML/LF-AmazonTitles-1.3M.json
@@ -11,6 +11,7 @@
"surrogate_method": 1,
"embedding_dims": 300,
"top_k": 350,
"save_top_k": 100,
"beta": 0.60,
"save_predictions": true,
"trn_label_fname": "trn_X_Y.txt",