LogicLDA - Topic modeling with First-Order Logic (FOL) domain knowledge
David Andrzejewski ([email protected])
Department of Computer Sciences
University of Wisconsin-Madison, USA
OVERVIEW
This code implements inference for the LogicLDA model [1]. LogicLDA
extends Latent Dirichlet Allocation (LDA) [2] by allowing the user to
specify a weighted first-order logic (FOL) knowledge base (KB), as in
Markov Logic Networks (MLN) [3]. The inferred topics are then
influenced both by document-word corpus statistics and by the
user-specified logical KB. The code is implemented in Java and
performs scalable MAP inference via a stochastic gradient descent
scheme.
BUILD
The LogicLDA Java project can be built with Maven:
$ mvn package
$ cp ./target/logiclda-0.0.1-SNAPSHOT-jar-with-dependencies.jar ./logiclda.jar
USAGE
Say that our dataset is named 'nyt' (see INPUT/OUTPUT FILES below).
Then we run LogicLDA inference with:
java -jar logiclda.jar nyt 500 100 10000 25 194582
which does the following:
- runs 500 iterations of LogicLDA collapsed Gibbs sampling to initialize
- runs 100 outer / 10000 inner iterations of mirror descent SGD inference
- prints the top 25 words for each topic to nyt.topics
- uses 194582 as the random number seed
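
More generally, the positional arguments are, in order: the dataset
name, the number of initialization Gibbs sampling iterations, the
number of outer and inner SGD iterations, the number of top words to
print per topic, and the random number seed. Schematically (the
bracketed names below are descriptive labels for this README, not
flags defined by the program):

java -jar logiclda.jar <dataset> <gibbs-iters> <outer-iters> <inner-iters> <topwords> <seed>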
An example dataset and bash script can be found in ./test
INPUT/OUTPUT FILES
The input and output files follow a naming convention where each file
name consists of the name of the dataset plus a filetype-specific
extension. For example, if we were processing a corpus of New York
Times articles, we would supply input files nyt.docs, nyt.vocab, ...
and LogicLDA would produce output files nyt.topics, nyt.sample, ...
Note that *.words should be the integer indices of word tokens, not
the tokens themselves. For example, if *.vocab is
foo
bar
baz
then the string "foo foo baz bar foo" should appear in *.words as: 0 0 2 1 0
These kinds of representations can be built from plaintext with the
textproc code (https://github.com/davidandrzej/textproc).
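
If you prefer not to use textproc, the index mapping is simple to
implement yourself. Below is a minimal, hypothetical Java sketch (the
class name and output layout are illustrative, not part of LogicLDA)
that builds a *.vocab file and a whitespace-separated *.words file
from whitespace-tokenized plaintext:

import java.io.*;
import java.util.*;

// Sketch only: build a vocabulary and emit word indices for a
// whitespace-tokenized corpus. Not part of LogicLDA itself.
public class BuildCorpus {
    public static void main(String[] args) throws IOException {
        // args[0] = plaintext input file, args[1] = dataset name (e.g. "nyt")
        Map<String, Integer> vocab = new LinkedHashMap<>();
        List<Integer> words = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String tok : line.trim().split("\\s+")) {
                    if (tok.isEmpty()) continue;
                    // Unseen tokens get the next 0-based index
                    vocab.putIfAbsent(tok, vocab.size());
                    words.add(vocab.get(tok));
                }
            }
        }
        // One word per line; LinkedHashMap preserves index order
        try (PrintWriter out = new PrintWriter(args[1] + ".vocab")) {
            for (String tok : vocab.keySet()) out.println(tok);
        }
        // Space-separated word index for each corpus position
        try (PrintWriter out = new PrintWriter(args[1] + ".words")) {
            StringBuilder sb = new StringBuilder();
            for (int w : words) sb.append(w).append(' ');
            out.println(sb.toString().trim());
        }
    }
}

For example, "java BuildCorpus corpus.txt nyt" would write nyt.vocab
and nyt.words.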
[input]
.rules FOL rules (see RULES for details)
.alpha [T] alpha hyperparameter
.beta [TxW] beta hyperparameter
.doclist [D] document names
.vocab [W] vocabulary (one word per line)
.words [N] word indices for each corpus position
.docs [N] document indices for each corpus position
(optional)
.sent [N] sentence indices for each corpus position
.init [N] initial z-sample state
[output]
.sample [N] latent topic indices for each corpus position
.phi [TxW] topic-word probabilities P(w|z)
.theta [DxT] document-topic probabilities P(z|d)
.topics plaintext summary of learned topics
.logic plaintext summary of logic rule satisfaction
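
As a quick sanity check on the inference output, a small hypothetical
Java snippet like the following can tally how many corpus positions
were assigned to each topic. It assumes *.sample contains
whitespace-separated integer topic indices; verify this against your
actual output before relying on it:

import java.io.*;
import java.util.*;

// Sketch only: count topic assignments in a *.sample file,
// assuming whitespace-separated integer topic indices.
public class TopicCounts {
    public static void main(String[] args) throws IOException {
        Map<Integer, Integer> counts = new TreeMap<>();
        try (Scanner sc = new Scanner(new File(args[0]))) {
            while (sc.hasNextInt()) {
                counts.merge(sc.nextInt(), 1, Integer::sum);
            }
        }
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            System.out.println("topic " + e.getKey() + ": "
                + e.getValue() + " tokens");
        }
    }
}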
EXTENDING LOGICLDA
This software is designed to allow the straightforward inclusion of
custom rule types and sources of side information (see EXTENDING for
details).
LICENSE
This software is open-source, released under the terms of the GNU
General Public License version 3, or any later version of the GPL (see
COPYING).
REFERENCES
[1] Andrzejewski, D., Zhu, X., Craven, M., and Recht, B. (2011). A
Framework for Incorporating General Domain Knowledge into Latent
Dirichlet Allocation Using First-Order Logic. In Proceedings of the
22nd International Joint Conference on Artificial Intelligence (IJCAI
2011).
[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet
Allocation. Journal of Machine Learning Research (JMLR) 3
(Mar. 2003), 993-1022.
[3] Domingos, P. and Lowd, D. (2009). Markov Logic: An Interface
Layer for Artificial Intelligence. Synthesis Lectures on Artificial
Intelligence and Machine Learning, Morgan and Claypool Publishers.