TODO Wed, 14 Sep 2016 14:02:58 +0200
+ implement
x- extend(): fix for DDC (max ddc-qstr len = DDC_STATIC_BUFLEN = 4096) --> NOT EASY (no "OR" for both token and metadata filters!)
x- load client config from rc-file
- lazy union() ?
+ document
x- extend(): #DiaColloDB, #Relation, #Relation::*, #Client, #Client::*
x- client rcfile:// scheme
#DONE TODO Wed, 11 May 2016 14:27:14 +0200
x+ update kaskade DBs
1) update DBs from not-yet-installed new DiaColloDB build dir
./dcdb-upgrade.perl -u ~/dstar/corpora/$corpus/web/diacollo/data
2) update DiaColloDB distribution
(mostly DONE) TODO Mon, 14 Mar 2016 12:31:35 +0100
x+ Client::list: get working correctly!
x+ add dbinfo() (and dbheader()?) client method(s)
x+ Client::list: allow parallel queries of sub-clients (via fork or threads)
+ DiaColloDB: allow metadata constants in local indices (for subcorpus selection via list clients)
+ Client::list: check/allow null-results in sub-clients (for subcorpus selection)
#DONE Wed, 20 Apr 2016 14:12:49 +0200
x+ update DB create() and union() methods: x->t
x- text file format for Relation::create() changed
x+ update DB export() method
x+ update relation create() and union() methods: x->t
x- DiaColloDB::create
x- Cofreqs::create
x- Unigrams::create
x- TDF::create ?
x- DiaColloDB::union
x- Cofreqs::union
x- Unigrams::union
x- TDF::union() ?
x+ test unigrams relation
x+ DEBUG: wonky non-integer size[123] in cof.hdr, ug.hdr after upgrade
- PackedFile size() doesn't handle recent writes correctly
- workaround uses seek($fh,0,SEEK_END) and truncate() in write mode
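The seek-to-end workaround above can be illustrated outside Perl. This is only a minimal Python sketch, assuming fixed-length records and a buffered read/write handle; `RECLEN`, `record_count()`, and the temp file are hypothetical names for illustration, not part of the PackedFile API:

```python
import os
import struct
import tempfile

RECLEN = 4  # hypothetical fixed record length in bytes (one 32-bit integer)

def record_count(fh):
    """Count records by seeking to EOF, which flushes pending buffered writes."""
    pos = fh.tell()                  # remember the current position
    end = fh.seek(0, os.SEEK_END)    # byte offset of EOF, including recent writes
    fh.seek(pos, os.SEEK_SET)        # restore the position
    return end // RECLEN

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "r+b") as fh:
    for value in (1, 2, 3):
        fh.write(struct.pack(">I", value))
    # The on-disk size may lag behind writes still sitting in the buffer.
    stale = os.path.getsize(path) // RECLEN
    # Seeking on the handle itself sees all three records.
    fresh = record_count(fh)
os.remove(path)
```

The point of the sketch is only that a size taken from the filesystem can disagree with a size taken through the open handle while writes are still buffered.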
x+ update Unigrams relation
x+ throw out or port unused f12(), f1() in Cofreqs --> throw an error
x+ change Upgrade API: pass around $header and allow destructive changes in e.g. upgrade() method?
- no, but DO allow backup (& maybe revert?)
x+ re-factor compatibility wrappers into DiaColloDB/Compat/vx_y_z/...
x+ allow un-collocated f1 value pass-through for Cofreqs::(load|save)TextFh()
x+ DEBUG: weird -1pass inconsistencies for ./dcdb-query.perl kern.d Mann -ds=0 -1pass
v0.09.x e.g. (N f1 f2 f12 score slice lemma):
289801166 487461 463195 8513 8.196888 0 Frau
289801166 487461 485656 4666 7.295717 0 Mann
v0.10.x e.g. (N f1 f2 f12 score slice lemma):
289801166 487461 479230 8513 8.172757 0 Frau
289801166 487461 510840 4666 7.258855 0 Mann
- QUESTION: why are f2 values different across DiaColloDB versions?
ANSWER : we were missing %i2 double-count checks in v0.10.x code
- QUESTION: why is f2(Mann) > f1 in v0.10.x ?
ANSWER : we were double-counting
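One way such double-counting can arise is sketched below in Python. This is an assumption about the general mechanism, not the actual v0.10.x code: the record layout, `f2_sum()`, and the toy frequencies are hypothetical. If several item1 ids match the query and share a collocate, that collocate's independent frequency is added once per matching item1 unless a "seen" check (the role played by `%i2` above) skips repeats:

```python
def f2_sum(pairs, freq, dedupe=True):
    """Sum independent frequencies freq[i2] over collocates of the matching item1 ids.

    pairs: iterable of (i1, i2) co-occurrence records (hypothetical layout)
    freq:  dict mapping item id -> independent frequency
    """
    seen = set()
    total = 0
    for _i1, i2 in pairs:
        if dedupe and i2 in seen:
            continue              # i2 was already counted via another i1
        seen.add(i2)
        total += freq[i2]
    return total

pairs = [(1, 10), (2, 10), (2, 11)]      # item 10 co-occurs with both i1=1 and i1=2
freq = {10: 5, 11: 7}
naive = f2_sum(pairs, freq, dedupe=False)   # counts item 10 twice
checked = f2_sum(pairs, freq)               # counts item 10 once
```

Without the check, the inflated total can exceed what is logically possible, which would be consistent with the f2(Mann) > f1 symptom.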
x+ test diff
x+ document new/changed modules:
#Compat/*
#PackedFile/MMap : added optimized bsearch
#Upgrade/Base : extra header data, revert, instance conventions
#- Upgrade/v0_04
#- Upgrade/v0_09
#- Upgrade/v0_10_x2t
#Relation (subprofile[12] calling conventions changed)
#- Unigrams
#- Cofreqs
#DiaColloDB : parseDateRequest() conventions changed, x->t
#Persistent::copyto, moveto, copyto_a
#Utils::copyto, moveto, copyto_a, cp_a
#- xluniq xluniq
#- :pdl, :temp
#DONE Mon, 09 May 2016 15:05:50 +0200 (was: TODO Tue, 26 Apr 2016 08:59:52 +0200)
#+ f2 bug / optimization: try re-factoring db structure from xenum (+date) to tenum a la tdf (-date)
- reduces number of items for iteration in f2 loop --> reduce number of expensive calls
: nytprof.kern-f2bug-Mann-packed+mmap.d / DiaColloDB::Relation::Cofreqs::subprofile2()
-> 5.91s making 1357520 calls to ANON ($groupby->{xs2g})
-> 2.90s making 1357520 calls to DiaColloDB::PackedFile::MMap::fetchraw()
#DONE work-in-progress Thu, 21 Apr 2016 16:56:00 +0200
#+ implement cofreqs 2-phase lookup
#-> refactor: move intermediate numeric groupby keys to pack()-strings rather than join()-strings
#-> allow old join()-style strings in output a la tdf with relation-side recoding
#-> allow groupby sub to work on pre-extracted $x tuple-string
#+ DONE Tue, 26 Apr 2016 13:54:31 +0200: better, but still ca. 10x slower than (old, incorrect) single-pass variant
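The pack()-string versus join()-string distinction in the refactoring notes above can be sketched as follows. This is a Python analogue under stated assumptions: the tab separator, the 32-bit big-endian layout (the equivalent of Perl's `pack("N*", ...)`), and all function names are hypothetical, not taken from the DiaColloDB source:

```python
import struct

def join_key(ids):
    """Old-style key: decimal ids joined with a separator (tab assumed here)."""
    return "\t".join(str(i) for i in ids)

def pack_key(ids):
    """Packed key: fixed-width big-endian unsigned 32-bit integers."""
    return struct.pack(f">{len(ids)}I", *ids)

def unpack_key(key):
    """Recover the id tuple from a packed key."""
    return struct.unpack(f">{len(key) // 4}I", key)

def recode(key):
    """Recode a packed key into the old join()-style string for output."""
    return join_key(unpack_key(key))

ids = (124823, 124895, 1915)
packed = pack_key(ids)
```

Packed keys have a fixed width per component and need no number-to-string conversion inside the lookup loop; `recode()` shows how old-style strings could still be produced once, at output time.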
x+ DEBUG: 'stark/ADJA' getting bogus counts in list mode for 'Mann' (url="list://kern01-1ka.d kern01-1kb.d ?fudge=0")
---> pretty much DONE
+ looks like kern01-1ka.d "stark/ADJA/1915" isn't getting added to f2 (=30, xid=124823), src=Mann/NN/1915; id=124895
+ ... and it isn't! b/c "stark+Mann@1915" DOES NOT HAPPEN in kern01-1ka.d subcorpus
~ solution here for list-clients would seem to be an additional round-trip to get proper "f2" values
+ see Client/list.pod section "Incorrect Independent Collocate Frequencies" for description of the situation
~ in fact, it's worse than depicted there, since it's missing ($xid1,$xid2) pairs which cause item2 frequencies
to be ignored -- we don't have "real" item2 frequencies for the actual (projected) keys except in ddc mode
(and tdf mode too), since we do use a 2-phase lookup strategy in those relations; CoFreqs just looks at the
stored frequencies for the actual collocates, which are indexed BY FULL XID TUPLE, including year and
non-projected attributes. best solution might be to chuck out $f2 storage in Cofreqs index and use
ddc-style $fcoef (computed via $dmax) to tweak $N, $f1, and $f2 values from Unigrams index.
+ reformulation: milder form of this bug applies even to single native CoFreqs indices, since f2
are computed there by summing over $xid2 with nonzero f($xid1,$xid2), but
(a) we don't always project all $xid2 attributes, and
(b) we don't always project single-year slices, so
we're missing f2 counts within slices for $xid2 items which match some key
but don't always occur
#DONE Wed, 27 Jan 2016 11:11:56 +0100
x+ rename Relation::Vsem -> Relation::TDM
x- generate native-compatible profiles in vprofile() (wip)
x- fix create() code
x* remove tfidf stuff
x* compile tym, cf
x- fix Vsem::Query code (remove obsolete compileSlices etc)
x- remove stale Profile::Pdl, PdlDiff, etc. classes
x- handle groupby for term-attrs-only (ok), doc-attrs-only (ok), {term+doc}-attrs (ok)
x- add implicit 'genre' field to vsem meta-index (extract 1st component of textClass, since vsem groupby can't handle regex transformations)
x- remove stale EnumFile::Identity
x+ re-build & re-publish dstar indices
x- first: dta, kern, zeit (beta-test) [wip]
x- next: dta+dwds (test union())
x- later: others
x+ debug/correct
x- tdf create(): memory-optimize tym construction
x+ zeit.d-p on kira: mem usage spikes from 9G to 29G between tym and ptrs (first plateau at 15G, then spike)
x- ?suppress target-term output in tdm cofreq profiles (tricky)?
x- implement tdf union() method
x- implement tdf export() method
x- document, package, & upload to CPAN
x+ Alien::* module(s) for DDC, gfsm, moot?
x+ debug/fixed
x- dstar: update "install" rule for dstar build/diacollo (see dta build dir)
+ do this in-place on next test build (something small, e.g. pnn)
x- bug?: no tdf docs found for "Katze && Hund && Maus" in zeit.d-f, but ddc finds 131 matches with "#in file": why?
+ fixed: problem was bogus list-context for _intersect_p() as called by (new) TDF::catSubset()
x- tdf: auto-detect minimum 'itype' during create() and/or union()
x- fix boolean query eval (&& vs ||): maybe allow 'tdm' component in TDF::Query?
+ LATER: we're already performing && on cat-subset
x- incorrect results e.g. for "Obst" in 1998 with co-occurrent "Pak"
+ 1 shared paragraph "doc" d, f(Obst,d)=1, f(Pak,d)=2
+ we SHOULD get f12(Obst,Pak|year) = \sum_{d \in year} min{f(Obst,d),f(Pak,d)} = 1, but we get f12=2
- reason seems to be that "Pak" gets 2 different tags, so it's counted as 2 different terms
- each term adds only the min{f(w1,d),f(w2,d)}=1, but we're counting the same GROUP twice
- don't know how to handle this right except for maybe creating a temp-piddle and running
ccs_accum_minimum() or similar over it
- since we don't know output size in advance, we either need to (a) operate block-wise
and pass computation state in and out (ugly), or (b) write results to a tempfile
and then read (mmap?) that in.
- ignoring for now, since results look basically ok
- fixed: use temporary doc-local hash in pdlutils diacollo_cof_t_TYPE()
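The "Obst"/"Pak" case above can be reproduced in a few lines. This is a hedged Python sketch of the idea behind the doc-local hash, not the actual `diacollo_cof_t_TYPE()` code: the tags `NN`/`NE`, the function names, and the dict-based document are assumptions for illustration. The buggy variant takes `min()` per collocate term and then sums within the group; the fixed variant first sums term frequencies per group inside the document and only then takes `min()`:

```python
from collections import defaultdict

def f12_per_term(doc, target, group_of):
    """Buggy variant: min() per collocate term, summed per output group."""
    out = defaultdict(int)
    for term, freq in doc.items():
        if term != target:
            out[group_of(term)] += min(doc[target], freq)
    return dict(out)

def f12_per_group(doc, target, group_of):
    """Fixed variant: accumulate a doc-local hash by group, then take min()."""
    local = defaultdict(int)
    for term, freq in doc.items():
        if term != target:
            local[group_of(term)] += freq
    return {g: min(doc[target], f) for g, f in local.items()}

# One shared paragraph "doc": f(Obst,d)=1, f(Pak,d)=2 split over two tags.
doc = {("Obst", "NN"): 1, ("Pak", "NN"): 1, ("Pak", "NE"): 1}
lemma = lambda term: term[0]          # group by lemma only
buggy = f12_per_term(doc, ("Obst", "NN"), lemma)
fixed = f12_per_group(doc, ("Obst", "NN"), lemma)
```

Here `buggy` yields f12=2 for "Pak" and `fixed` yields f12=1, matching the observed and expected values in the notes.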
x- bug symptom: [dta,dbreak=p]: q=Obst gb=l,p date=1900-1999, ds=100
+ item2=herabschauen/VVPP gets f2=2 f12=2, but ddc only finds it once in slice (and that once with 'Obst')
+ shouldn't be a grouping problem here, since we're grouping by whole term-tuples (l,p)
+ problem was stale dta_[56].files being used in diacollo index generation, some cats were doubled
x- dstar: synchronize pdl versions on BUILDHOST (kira,kaskade) and WEBHOST (kaskade)
+ incompatible pdl type-enums save raw headers using float type = 5 or 6 depending on PDL_Indx availability
- this causes "Bus error" pukes on old PDL distros without PDL_Indx (type 6 -> double)
- can be hacked with 'pdl-raw-settype.perl'
+ plato (workstation): 2.007 (dist) / 2.014 (local)
+ kira (buildhost): 2.007 (dist)
+ kaskade (runhost): 2.4.11 (dist) --> 2.007 (from wheezy, built as deb)
x- fix kwic-search links for Kant/Hegel example
+ (* #has[author,/Kant/]) should NOT link to all "Kant"-tokens, just the relevant item2
?+ maybe fix query-parsing to allow token-attribute syntax for meta-field queries (e.g. boolean expressions?)
x- merge into trunk (probably best to branch trunk out again from current state and just replace old trunk with current branch)
+ disk usage stats (in MB; percentages are relative to DDC size)
CORPUS      TEI  DDC    NTOK       DIACOLLO-TDF  DIACOLLO+TDF  DIACOLLO=TDF
dta/#p      -    17100  182882418  352 ~2.1%     1217 ~ 7.1%    865 ~ 5.1%
kern/#p     -     4812  121559727  439 ~9.1%     1092 ~22.7%    653 ~13.6%
zeit/#file  -    20084  504304208  679 ~3.4%     3341 ~16.6%   2662 ~13.3%
zeit/#p     -    20084  504304208  679 ~3.4%     3840 ~19.1%   3161 ~15.7%
x+ merge in log-likelihood stuff from trunk (?)
BRANCH "trunk" = svn+ssh://odo.dwds.de/home/svn/dev/DiaColloDB/trunk
BRANCH "vsem" = svn+ssh://odo.dwds.de/home/svn/dev/DiaColloDB/branches/diacollo-0.07.006+vsem
BRANCH "native" = svn+ssh://odo.dwds.de/home/svn/dev/DiaColloDB/branches/diacollo-0.07.006+vsem-native
+ 15592 : HEAD
+ 15509 : branched vsem -> native
+ 15069 : merged -r 15066:15068 vsem -> trunk (Relation.pm, Cofreqs.pm, DDC.pm, Unigrams.pm, DiaColloDB.pm)
+ 15023 : merged -r 15021:15022 vsem -> trunk (DDC.pm)
+ 15015 : merged -r 15013:15014 vsem -> trunk (DDC.pm)
+ 15013 : branched trunk -> vsem