todo.txt

TODO Wed, 14 Sep 2016 14:02:58 +0200
+ implement
  x- extend(): fix for DDC (max ddc-qstr len = DDC_STATIC_BUFLEN = 4096) --> NOT EASY (no "OR" for both token and metadata filters!)
  x- load client config from rc-file
  - lazy union() ?
+ document
  x- extend(): #DiaColloDB, #Relation, #Relation::*, #Client, #Client::*
  x- client rcfile:// scheme

#DONE TODO Wed, 11 May 2016 14:27:14 +0200
x+ update kaskade DBs
  1) update DBs from not-yet-installed new DiaColloDB build dir
     ./dcdb-upgrade.perl -u ~/dstar/corpora/$corpus/web/diacollo/data
  2) update DiaColloDB distribution

(mostly DONE) TODO Mon, 14 Mar 2016 12:31:35 +0100
x+ Client::list: get working correctly!
x+ add dbinfo() (and dbheader()?) client method(s)
x+ Client::list: allow parallel queries of sub-clients in (fork or thread)
+ DiaColloDB: allow metadata constants in local indices (for subcorpus selection via list clients)
+ Client::list: check/allow null-results in sub-clients (for subcorpus selection)

#DONE Wed, 20 Apr 2016 14:12:49 +0200
x+ update DB create() and union() methods: x->t
  x- text file format for Relation::create() changed
x+ update DB export() method
x+ update relation create() and union() methods: x->t
  x- DiaColloDB::create
  x- Cofreqs::create
  x- Unigrams::create
  x- TDF::create ?
  x- DIaColloDB::union
  x- Cofreqs::union  
  x- Unigrams::union
  x- TDF::union() ?

x+ test unigrams relation

x+ DEBUG: wonky non-integer size[123] in cof.hdr, ug.hdr after upgrade
  - PackedFile size() doesn't handle recent writes correctly
  - workaround uses seek($fh,0,SEEK_END) and truncate() in write mode
x+ update Unigrams relation
x+ throw out or port unused f12(), f1() in Cofreqs --> throw an error
x+ change Upgrade API: pass around $header and allow destructive changes in e.g. upgrade() method?
  - no, but DO allow backup (& maybe revert?)
x+ re-factor compatibility wrappers into DiaColloDB/Compat/vx_y_z/...
x+ allow un-collocated f1 value pass-through for Cofreqs::(load|save)TextFh()
x+ DEBUG: weird -1pass inconsistencies for ./dcdb-query.perl kern.d Mann -ds=0 -1pass
  v0.09.x e.g. (N f1 f2 f12 score slice lemma):
    289801166	487461	463195	8513	8.196888	0	Frau
    289801166	487461	485656	4666	7.295717	0	Mann
  v0.10.x e.g. (N f1 f2 f12 score slice lemma):
    289801166	487461	479230	8513	8.172757	0	Frau
    289801166	487461	510840	4666	7.258855	0	Mann
  - QUESTION: why are f2 values different across DiaColloDB versions?
    ANSWER  : we were missing %i2 double-count checks in v0.10.x code
  - QUESTION: why is f2(Mann) > f1 in v0.10.x ?
    ANSWER  : we were double-counting

x+ test diff
 
x+ document new/changed modules:
  #Compat/*
  #PackedFile/MMap : added optimized bsearch
  #Upgrade/Base : extra header data, revert, instance conventions
  #- Upgrade/v0_04
  #- Upgrade/v0_09
  #- Upgrade/v0_10_x2t
  #Relation (subprofile[12] calling conventions changed)
  #- Unigrams
  #- Cofreqs
  #DiaColloDB : parseDateRequest() conventions changed, x->t
  #Persistent::copyto, moveto, copyto_a
  #Utils::copyto, moveto, copyto_a, cp_a
  #- xluniq xluniq
  #- :pdl, :temp

#DONE Mon, 09 May 2016 15:05:50 +0200 (was: TODO Tue, 26 Apr 2016 08:59:52 +0200)
#+ f2 bug / optimization: try re-factoring db structure from xenum (+date) to tenum a la tdf (-date)
  - reduces number of items for iteration in f2 loop --> reduce number of expensive calls
    : nytprof.kern-f2bug-Mann-packed+mmap.d / DiaColloDB::Relation::Cofreqs::subprofile2()
      -> 5.91s making 1357520 calls to ANON ($groupby->{xs2g})
      -> 2.90s making 1357520 calls to DiaColloDB::PackedFile::MMap::fetchraw()

#DONE work-in-progress Thu, 21 Apr 2016 16:56:00 +0200
#+ implement cofreqs 2-phase lookup
  #-> refactor: move intermediate numeric groupby keys to pack()-strings rather than join()-strings
  #-> allow old join()-style strings in output a la tdf with relation-side recoding
  #-> allow groupby sub to work on pre-extracted $x tuple-string
#+ DONE Tue, 26 Apr 2016 13:54:31 +0200: better, but still ca. 10x slower than (old, incorrect) single-pass variant

x+ DEBUG: 'stark/ADJA' getting bogus counts in list mode for 'Mann' (url="list://kern01-1ka.d kern01-1kb.d ?fudge=0")
   ---> pretty much DONE
 + looks like kern01-1ka.d "stark/ADJA/1915" isn't getting added to f2 (=30, xid=124823), src=Mann/NN/1915; id=124895
 + ... and it isn't! b/c "stark+Mann@1915" DOES NOT HAPPEN in kern01-1ka.d subcorpus
   ~ solution here for list-clients would seem to be an additional round-trip to get proper "f2" values
 + see Client/list.pod section "Incorrect Independent Collocate Frequencies" for description of the situation
   ~ in fact, it's worse than depicted there, since it's missing ($xid1,$xid2) pairs which cause item2 frequencies
     to be ignored -- we don't have "real" item2 frequencies for the actual (projected) keys except in ddc mode
     (and tdf mode too), since we do use a 2-phase lookup strategy in those relations; CoFreqs just looks at the
     stored frequencies for the actual collocates, which are indexed BY FULL XID TUPLE, including year and
     non-projected attributes.  best solution might be to chuck out $f2 storage in Cofreqs index and use
     ddc-style $fcoef (computed via $dmax) to tweak $N, $f1, and $f2 values from Unigrams index.
 + reformulation: milder form of this bug applies even to single native CoFreqs indices, since f2
   are computed there by summing over $xid2 with nonzero f($xid1,$xid2), but
   (a) we don't always project all $xid2 attributes, and
   (b) we don't always project single-year slices, so
   we're missing f2 counts within slices for $xid2 items which match some key
   but don't always occur


#DONE Wed, 27 Jan 2016 11:11:56 +0100
x+ rename Relation::Vsem -> Relation::TDM
  x- generate native-compatible profiles in vprofile() (wip)
  x- fix create() code
    x* remove tfidf stuff
    x* comple tym, cf
  x- fix Vsem::Query code (remove obsolete compileSlices etc)
  x- remove stale Profile::Pdl, PdlDiff, etc. classes
  x- handle groupby for term-attrs-only (ok), doc-attrs-only (ok), {term+doc}-attrs (ok)
  x- add implicit 'genre' field to vsem meta-index (extract 1st component of textClass, since vsem groupby can't handle regex transformations)
  x- remove stale EnumFile::Identity

x+ re-build & re-publish dstar indices
  x- first: dta, kern, zeit (beta-test) [wip]
  x- next: dta+dwds (test union())
  x- later: others

x+ debug/correct
  x- tdf create(): memory-optimize tym construction
    x+ zeit.d-p on kira: mem usage spikes from 9G to 29G between tym and ptrs (first plateau at 15G, then spike)
  x- ?suppress target-term output in tdm cofreq profiles (tricky)?
  x- implement tdf union() method
  x- implement tdf export() method
  x- document, package, & upload to CPAN
    x+ Alien::* module(s) for DDC, gfsm, moot?

x+ debug/fixed
  x- dstar: update "install" rule for dstar build/diacollo (see dta build dir)
    + do this in-place on next test build (something small, e.g. pnn)
  x- bug?: no tdf docs found for "Katze && Hund && Maus" in zeit.d-f, but ddc finds 131 matches with "#in file": why?
     + fixed: problem was bogus list-context for _intersect_p() as called by (new) TDF::catSubset()
  x- tdf: auto-detect minimum 'itype' during create() and/or union()
  x- fix boolean query eval (&& vs ||): maybe allow 'tdm' component in TDF::Query?
    + LATER: we're already performing && on cat-subset
  x- incorrect results e.g. for "Obst" in 1998 with co-occurent "Pak"
    + 1 shared paragraph "doc" d, f(Obst,d)=1, f(Pak,d)=2
    + we SHOULD get f12(Obst,Pak|year) = \sum_{d \in year} min{f(Obst,d),f(Pak,d)} = 1, but we get f12=2
      - reason seems to be that "Pak" gets 2 different tags, so it's counted as 2 different terms
      - each term adds only the min{f(w1,d),f(w2,d)}=1, but we're counting the same GROUP twice
      - don't know how to handle this right except for maybe creating a temp-piddle and running
        ccs_accum_minimum() or similar over it
      - since we don't know output size in advance, we either need to (a) operate block-wise
        and pass computation state in and out (ugly), or (b) write results to a tempfile
	and then read (mmap?) that in.
      - ignoring for now, since results look basically ok
      - fixed: use temporary doc-local hash in pdlutils diacollo_cof_t_TYPE()
  x- bug symptom: [dta,dbreak=p]: q=Obst gb=l,p date=1900-1999, ds=100
     + item2=herabschauen/VVPP gets f2=2 f12=2, but ddc only finds it once in slice (and that once with 'Obst')
     + shouldn't be a grouping problem here, since we're grouping by whole term-tuples (l,p)
     + problem was stale dta_[56].files being used in diacollo index generation, some cats were doubled
  x- dstar: synchronize pdl versions on BUILDHOST (kira,kaskade) and WEBHOST (kaskade)
    + incompatible pdl type-enums save raw headers using float type = 5 or 6 depending on PDL_Indx availability
      - this causes "Bus error" pukes on old PDL distros without PDL_Indx (type 6 -> double)
      - can be hacked with 'pdl-raw-settype.perl'
    + plato (workstation): 2.007 (dist) / 2.014 (local)
    + kira (buildhost): 2.007 (dist)
    + kaskade (runhost): 2.4.11 (dist) --> 2.007 (from wheezy, built as deb)
  x- fix kwic-search links for Kant/Hegel example
    + (* #has[author,/Kant/]) should NOT link to all "Kant"-tokens, just the relevant item2
    ?+ maybe fix query-parsing to allow token-attribute syntax for meta-field queries (e.g. boolean expressions?)
  x- merge intro trunk (probably best to branch trunk out again from current state and just replace old trunk with current branch)

+ disk usage stats (in MB)
  CORPUS	TEI	DDC	NTOK		DIACOLLO-TDF	DIACOLLO+TDF	DIACOLLO=TDF
  dta/#p	-	17100	182882418	352 ~2.1%    	1217 ~ 7.1%    	 865 ~ 5.1%
  kern/#p	-	 4812	121559727	439 ~9.1%	1092 ~22.7%	 653 ~13.6%
  zeit/#file	-	20084	504304208	679 ~3.4%	3341 ~16.6%	2662 ~13.3%
  zeit/#p	-	20084	504304208	679 ~3.4%	3840 ~19.1%	3161 ~15.7%

x+ merge in log-likelihood stuff from trunk (?)
  BRANCH "trunk" = svn+ssh://odo.dwds.de/home/svn/dev/DiaColloDB/trunk
  BRANCH "vsem" = svn+ssh://odo.dwds.de/home/svn/dev/DiaColloDB/branches/diacollo-0.07.006+vsem
  BRANCH "native" = svn+ssh://odo.dwds.de/home/svn/dev/DiaColloDB/branches/diacollo-0.07.006+vsem-native
  + 15592 : HEAD
  + 15509 : branched vsem -> native
  + 15069 : merged -r 15066:15068 vsem -> trunk (Relation.pm, Cofreqs.pm, DDC.pm, Unigrams.pm, DiaColloDB.pm)
  + 15023 : merged -r 15021:15022 vsem -> trunk (DDC.pm)
  + 15015 : merged -r 15013:15014 vsem -> trunk (DDC.pm)
  + 15013 : branched trunk -> vsem