Morphology #6

Open · wants to merge 201 commits into base: master
201 commits
f96996a
experimenting with cnn only
yoonkim Jun 17, 2015
dc4e5e6
experimenting with cnn only
yoonkim Jun 17, 2015
140b1df
experimenting with cnn only
yoonkim Jun 17, 2015
eff662b
experimenting with cnn only
yoonkim Jun 17, 2015
1767114
char-cnn only
yoonkim Jun 19, 2015
20e0ae2
char-cnn only
yoonkim Jun 19, 2015
94034af
concat word/char vecs
yoonkim Jun 22, 2015
1fef8e0
concat word/char vecs
yoonkim Jun 22, 2015
d91a79c
outer prod
yoonkim Jun 22, 2015
94ae5d0
outer prod
yoonkim Jun 22, 2015
448d8d5
outer
yoonkim Jun 22, 2015
68124fc
outer
yoonkim Jun 22, 2015
0598c1b
tensorprod
yoonkim Jun 25, 2015
70e5158
tensorprod
yoonkim Jun 25, 2015
23dfa78
tensorprod
yoonkim Jun 25, 2015
1a09027
tensorprod
yoonkim Jun 25, 2015
47e8b9d
updated tensorprod
yoonkim Jun 27, 2015
efe690b
updated tensorprod
yoonkim Jun 27, 2015
1e21d4b
unk words
yoonkim Jun 27, 2015
4847c76
unk words
yoonkim Jun 27, 2015
ba3c305
train/val/test unk
yoonkim Jun 27, 2015
ee0e85f
train/val/test unk
yoonkim Jun 27, 2015
96a4884
train2
yoonkim Jun 27, 2015
0009202
train2
yoonkim Jun 27, 2015
51097ea
refactored into main
yoonkim Jun 28, 2015
aa5de75
refactored into main
yoonkim Jun 28, 2015
9b2e7b1
refactored code
yoonkim Jun 28, 2015
47f3eec
refactored code
yoonkim Jun 28, 2015
5e13ea6
batch normalization
yoonkim Jun 29, 2015
40bb48b
batch normalization
yoonkim Jun 29, 2015
523286f
more refactoring
yoonkim Jun 29, 2015
009feea
more refactoring
yoonkim Jun 29, 2015
436123c
batch norm
yoonkim Jul 1, 2015
dc0cc02
batch norm
yoonkim Jul 1, 2015
c4a2e5e
refactored code
yoonkim Jul 2, 2015
3338e6f
refactored code
yoonkim Jul 2, 2015
1697448
refactored code
yoonkim Jul 2, 2015
4720c2d
refactored code
yoonkim Jul 2, 2015
9297e3a
refactored code
yoonkim Jul 2, 2015
aa9fbe8
refactored code
yoonkim Jul 2, 2015
084ec7d
refactored code
yoonkim Jul 2, 2015
6b27ab4
refactored code
yoonkim Jul 2, 2015
84fc5a3
refactoring
yoonkim Jul 4, 2015
56c877a
refactoring
yoonkim Jul 4, 2015
1de408e
made tensor product faster
yoonkim Jul 4, 2015
dca1607
made tensor product faster
yoonkim Jul 4, 2015
fc3454d
check max chargrams
yoonkim Jul 4, 2015
fb50f6d
check max chargrams
yoonkim Jul 4, 2015
2c702cc
attention layer
yoonkim Jul 5, 2015
8f7bfb3
attention layer
yoonkim Jul 5, 2015
2500e0c
model introspection working now
yoonkim Jul 5, 2015
c083b2d
model introspection working now
yoonkim Jul 5, 2015
fdfed7e
dropout;
yoonkim Jul 5, 2015
149114e
dropout;
yoonkim Jul 5, 2015
6b3bbf3
batch one-hot lookuptable
yoonkim Jul 6, 2015
b50a724
batch one-hot lookuptable
yoonkim Jul 6, 2015
0040f39
attention layer fixed
yoonkim Jul 6, 2015
327702b
attention layer fixed
yoonkim Jul 6, 2015
255ed1f
element-by-element attention layer
yoonkim Jul 9, 2015
1c02299
element-by-element attention layer
yoonkim Jul 9, 2015
48e03ea
element-by-element attention layer
yoonkim Jul 9, 2015
e2cbc05
element-by-element attention layer
yoonkim Jul 9, 2015
080d294
tiger corpus
yoonkim Jul 10, 2015
0ce46e5
tiger corpus
yoonkim Jul 10, 2015
2a33f0c
attention cnn
yoonkim Jul 11, 2015
27bb98f
attention cnn
yoonkim Jul 11, 2015
ad289cd
zero vector at start
yoonkim Jul 11, 2015
34e10b8
zero vector at start
yoonkim Jul 11, 2015
db1e699
add retraining functionality
yoonkim Jul 14, 2015
0405897
add retraining functionality
yoonkim Jul 14, 2015
9009a75
eos token issue
yoonkim Jul 15, 2015
413cd81
eos token issue
yoonkim Jul 15, 2015
557d843
cleaned up
yoonkim Jul 20, 2015
871d3da
cleaned up
yoonkim Jul 20, 2015
44ead13
refactoring
yoonkim Jul 23, 2015
ad4703a
refactoring
yoonkim Jul 23, 2015
16cd495
char-char model
yoonkim Jul 24, 2015
d893708
char-char model
yoonkim Jul 24, 2015
d3597d5
lr discount
yoonkim Jul 24, 2015
eaf5e89
lr discount
yoonkim Jul 24, 2015
d2039c2
highway connections
yoonkim Jul 24, 2015
8678940
highway connections
yoonkim Jul 24, 2015
932cedd
pos transformations
yoonkim Jul 25, 2015
be8fd74
pos transformations
yoonkim Jul 25, 2015
a594411
refactored highway layers
yoonkim Jul 27, 2015
4f55c2a
refactored highway layers
yoonkim Jul 27, 2015
69930be
refactored highway layers
yoonkim Jul 27, 2015
7f0d794
refactored highway layers
yoonkim Jul 27, 2015
389ba97
learning rate decay
Aug 4, 2015
5b3c1d3
learning rate decay
Aug 4, 2015
7c2625b
faster evaluation
Aug 7, 2015
f518492
faster evaluation
Aug 7, 2015
8a7e0ee
evaluation
Aug 7, 2015
d35f521
evaluation
Aug 7, 2015
c462a74
evaluate
Aug 7, 2015
b4ef32d
evaluate
Aug 7, 2015
45a653f
eval
Aug 8, 2015
6c2dc0f
eval
Aug 8, 2015
42eba67
default params
Aug 8, 2015
ba0df4c
default params
Aug 8, 2015
768e6f3
cleaned up code
Aug 9, 2015
b1e040a
cleaned up code
Aug 9, 2015
ca89216
clean up code
Aug 9, 2015
6e16bbd
clean up code
Aug 9, 2015
65c7e30
fixed main
Aug 9, 2015
622fb9f
fixed main
Aug 9, 2015
2a204ae
fix eval
Aug 9, 2015
96b7eda
fix eval
Aug 9, 2015
2c27cf3
eval
Aug 9, 2015
5369d45
eval
Aug 9, 2015
bf10864
data
Aug 10, 2015
55c1750
data
Aug 10, 2015
8a9407b
data
Aug 10, 2015
2a1aa92
data
Aug 10, 2015
dead1bd
refactoring
Aug 10, 2015
56806a1
refactoring
Aug 10, 2015
93ad6cf
code clean-up
Aug 11, 2015
65b2bde
code clean-up
Aug 11, 2015
3b601a7
eos change for non-ptb
Aug 11, 2015
8815b98
eos change for non-ptb
Aug 11, 2015
d6c1b05
max_word_l fix
Aug 12, 2015
7b097fb
max_word_l fix
Aug 12, 2015
3d52e85
readme
Aug 12, 2015
7032b26
readme
Aug 12, 2015
d1aeb52
readme
Aug 12, 2015
73b990e
readme
Aug 12, 2015
473ae60
readme
Aug 12, 2015
b5a827f
readme
Aug 12, 2015
63131c5
readme
Aug 12, 2015
3183e82
readme
Aug 12, 2015
fd08dd2
readme
Aug 12, 2015
e29701f
readme
Aug 12, 2015
3bc5e43
introspect
Aug 13, 2015
9ad40dd
introspect
Aug 13, 2015
32cbb8d
readme
Aug 14, 2015
98a5ea6
readme
Aug 14, 2015
f0a2ca6
more introspection
yoonkim Aug 14, 2015
cd18aaa
more introspection
yoonkim Aug 14, 2015
2b9f546
more introspection
yoonkim Aug 16, 2015
ebb830a
more introspection
yoonkim Aug 16, 2015
f7e46c6
small typo
srush Aug 16, 2015
7ce9671
small typo
srush Aug 16, 2015
ff79ffd
Merge pull request #1 from srush/experiments
yoonkim Aug 16, 2015
87353e8
Merge pull request #1 from srush/experiments
yoonkim Aug 16, 2015
0006f92
more introspection
Aug 16, 2015
38f31e0
more introspection
Aug 16, 2015
05773e2
Adding cudnn temporal conv.
srush Aug 16, 2015
7433da6
Adding cudnn temporal conv.
srush Aug 16, 2015
3b7f827
Merge pull request #2 from srush/timing
yoonkim Aug 16, 2015
77f65b8
Merge pull request #2 from srush/timing
yoonkim Aug 16, 2015
27379a0
Fix memory issue
srush Aug 17, 2015
e823f1a
Fix memory issue
srush Aug 17, 2015
3603807
readme
Aug 17, 2015
f869a58
readme
Aug 17, 2015
4414dc1
readme
Aug 17, 2015
904f731
readme
Aug 17, 2015
1a03cfd
readme
yoonkim Aug 17, 2015
98f873a
readme
yoonkim Aug 17, 2015
53ff990
readme
yoonkim Aug 17, 2015
6bcc4d2
readme
yoonkim Aug 17, 2015
a957998
licence
yoonkim Aug 17, 2015
568d1b4
licence
yoonkim Aug 17, 2015
0d6fb57
licence
yoonkim Aug 17, 2015
220c51c
licence
yoonkim Aug 17, 2015
d7dcef3
readme
yoonkim Aug 17, 2015
872f681
readme
yoonkim Aug 17, 2015
e26ea49
Merge pull request #3 from srush/memory
yoonkim Aug 17, 2015
c1c1aea
Merge pull request #3 from srush/memory
yoonkim Aug 17, 2015
d72d88b
batch run loader
Aug 17, 2015
dfd3cbf
batch run loader
Aug 17, 2015
11fd291
readme
Aug 17, 2015
5d380f5
readme
Aug 17, 2015
3ac4467
readme 2
Aug 17, 2015
013b5e6
readme 2
Aug 17, 2015
5381f4a
faster BatchLoaderUnk
Aug 17, 2015
9e27c6d
faster BatchLoaderUnk
Aug 17, 2015
c0dac23
Merge pull request #1 from yoonkim/master
srush Aug 18, 2015
e71fb08
Morphology
srush Aug 18, 2015
4d7a8a7
word morpho
srush Aug 19, 2015
f480c72
fix max pooling for cudnn
Aug 19, 2015
00e0420
readme
Aug 19, 2015
7c57048
readme
Aug 20, 2015
0a89bdc
README
Aug 20, 2015
c4faf30
clean-up
Aug 20, 2015
fd9bd60
Changes
srush Aug 20, 2015
5e422f4
Changes
srush Aug 20, 2015
9427c73
Merge
srush Aug 20, 2015
e10f5f9
Merge branch 'yoonkim-master'
srush Aug 20, 2015
010d3f5
Merge
srush Aug 20, 2015
7bcd8ab
Merge
srush Aug 20, 2015
06f3470
Morphology
srush Aug 18, 2015
fae32d4
word morpho
srush Aug 19, 2015
604cefd
Changes
srush Aug 20, 2015
1b24d2f
Changes
srush Aug 20, 2015
6787a89
Morpho
srush Aug 20, 2015
0c18071
Morpho
srush Aug 20, 2015
a1a5095
Fix up morphology
srush Aug 20, 2015
5ca15b3
Fix up morphology
srush Aug 20, 2015
41320c3
updates
srush Aug 21, 2015
19f190d
updates
srush Aug 21, 2015
f42d933
.
srush Sep 10, 2015
8 changes: 8 additions & 0 deletions .gitignore
@@ -0,0 +1,8 @@
*.t7
*.sh
*.sh~
*.out
*.err
*.txt
*.zip
*.tsv
21 changes: 21 additions & 0 deletions LICENCE
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) <2015> <Yoon Kim>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
99 changes: 98 additions & 1 deletion README.md
@@ -1 +1,98 @@
# word-char-rnn
## Character-Aware Neural Language Models
A neural language model (NLM) built on character inputs only. Predictions
are still made at the word level. The model applies a convolutional neural network (CNN)
over the characters of each word and feeds the resulting features into a long short-term memory (LSTM)
recurrent neural network language model (RNN-LM). Optionally, the output of the CNN is
passed through a [Highway Network](http://arxiv.org/abs/1507.06228),
which improves performance.
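For intuition, a highway layer mixes a nonlinear transform of its input with the input itself via a learned gate. Below is a minimal NumPy sketch of that computation (random weights for illustration only; the actual model is implemented in Torch/`nngraph`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = t * g(W_h x + b_h) + (1 - t) * x, with gate t = sigmoid(W_t x + b_t)."""
    h = np.maximum(0.0, W_h @ x + b_h)   # candidate transform (ReLU nonlinearity here)
    t = sigmoid(W_t @ x + b_t)           # transform gate in (0, 1)
    return t * h + (1.0 - t) * x         # carry the input through where the gate is closed

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
y = highway_layer(x, rng.normal(size=(d, d)), np.zeros(d),
                  rng.normal(size=(d, d)), np.full(d, -2.0))  # negative gate bias favors carrying x
print(y.shape)  # output has the same dimensionality as the input
```

Note the output dimension equals the input dimension, which is why the CNN output can pass through any number of highway layers unchanged in size.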

Note: Paper will be posted on arXiv very soon.

Much of the base code is from Andrej Karpathy's excellent character RNN implementation,
available at https://github.com/karpathy/char-rnn

### Requirements
Code is written in Lua and requires Torch. It also requires
the `nngraph` and `optim` packages, which can be installed via:
```
luarocks install nngraph
luarocks install optim
```
GPU usage will additionally require `cutorch` and `cunn` packages:
```
luarocks install cutorch
luarocks install cunn
```

`cudnn` gives a substantial (8x-10x) speed-up for the convolutions, so it is
highly recommended. It makes the training time of a character-level model
somewhat competitive with that of a word-level model (0.5 secs/batch vs 0.25 secs/batch for
the large character/word-level models described below).

```
git clone https://github.com/soumith/cudnn.torch.git
luarocks make cudnn-scm-1.rockspec
```
### Data
Data should be put into the `data/` directory, split into `train.txt`,
`valid.txt`, and `test.txt`.

Each line of the .txt file should be a sentence. The English Penn
Treebank (PTB) data (Tomas Mikolov's pre-processed version with vocab size equal to 10K,
widely used by the language modeling community) is given as the default.

The paper also runs the models on non-English data (Czech, French, German, Russian, and Spanish), from the ICML 2014
paper [Compositional Morphology for Word Representations and Language Modelling](http://arxiv.org/abs/1405.4273)
by Jan Botha and Phil Blunsom. This can be downloaded from [Jan's website](https://bothameister.github.io).

#### Note on PTB
The PTB data above does not have end-of-sentence tokens, so these must be
appended manually. This can be done by adding `-EOS '+'` to the training command (you can
use a character other than `+` to represent the end-of-sentence token; we recommend a single
unused character).

Jan's datasets already have end-of-sentence tokens on each line, so you do not need the
`-EOS` flag (this is equivalent to `-EOS ''`, which is the default).
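Appending an end-of-sentence token is just per-line string concatenation during preprocessing. A small Python illustration (`append_eos` is a hypothetical helper for exposition, not part of this repo):

```python
def append_eos(lines, eos="+"):
    """Append an end-of-sentence token to each sentence (one sentence per line)."""
    return [line.rstrip("\n") + " " + eos for line in lines]

sentences = ["the cat sat on the mat", "a second sentence"]
print(append_eos(sentences))
# each sentence now ends with the chosen EOS token
```

Any single character unused elsewhere in the corpus works equally well as the token.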

### Model
Here are some example scripts. Add `-gpuid 0` to each line to use a GPU (which is
required to get any reasonable speed with the CNN), and `-cudnn 1` to use the
cudnn package.

#### Character-level models
Large character-level model (LSTM-CharCNN-Large in the paper).
This is the default configuration; it should get a perplexity of ~82 on valid and ~79 on test.
```
th main.lua -savefile char-large -EOS '+'
```
Small character-level model (LSTM-CharCNN-Small in the paper).
This should get ~96 on valid and ~93 on test.
```
th main.lua -savefile char-small -rnn_size 300 -highway_layers 1 \
-kernels '{1,2,3,4,5,6}' -feature_maps '{25,50,75,100,125,150}' -EOS '+'
```

#### Word-level models
Large word-level model (LSTM-Word-Large in the paper).
This should get ~89 on valid and ~85 on test.
```
th main.lua -savefile word-large -word_vec_size 650 -highway_layers 0 \
-use_chars 0 -use_words 1 -EOS '+'
```
Small word-level model (LSTM-Word-Small in the paper).
This should get ~101 on valid and ~98 on test.
```
th main.lua -savefile word-small -word_vec_size 200 -highway_layers 0 \
-use_chars 0 -use_words 1 -rnn_size 200 -EOS '+'
```

#### Combining both
Note that if `-use_chars` and `-use_words` are both set to 1, the model
will concatenate the output from the CNN with the word embedding. We've
found this model to underperform a purely character-level model, though.
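Conceptually, the combined model forms one input vector per word by concatenating the two representations before the LSTM. A NumPy sketch under assumed sizes (525 is the feature-map sum from the small character model above and 650 the large word-embedding size; both are purely illustrative here):

```python
import numpy as np

def combine(char_cnn_out, word_embed):
    """Concatenate the CharCNN feature vector with the word embedding for one word."""
    return np.concatenate([char_cnn_out, word_embed])

char_feats = np.zeros(525)   # CNN output: sum of feature maps (illustrative size)
word_vec = np.zeros(650)     # word embedding (illustrative size)
lstm_input = combine(char_feats, word_vec)
print(lstm_input.shape)      # the LSTM sees the concatenated vector
```

The LSTM input dimension is thus the sum of the two sizes, which is why enabling both flags changes the model's parameter count.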

### Licence
MIT



Binary file removed data/ptb/data.t7