##-*- Mode: Change-Log; coding: utf-8; -*-
##
## Change log for perl distribution DTA::CAB
v1.115 Thu, 04 Mar 2021 14:06:40 +0100 moocow
* Morph::Helsinki::DE updated tag-extraction heuristics (use original FST label conventions <+NN>, <+V> etc.)
* added Morph::SMOR (just an alias for Morph::Helsinki::DE), Morph::TAGH (just an alias for Morph)
v1.114 Mon, 01 Mar 2021 15:45:44 +0100 moocow
* added DTA::CAB::Chain::DE_free
- for use with (modified) Helsinki hfst-german FST from https://sourceforge.net/projects/hfst/files/resources/morphological-transducers
* refactored DTA::CAB::Morph::Helsinki -> DTA::CAB::Morph::Helsinki::EN + DTA::CAB::Morph::Helsinki::DE
v1.113 Thu, 16 Jul 2020 15:42:53 +0200 moocow
* added first stab at DTA::CAB::Format::CONLLU conforming to https://universaldependencies.org/format.html
- optional special handling for some 'MISC' fields, including "json=JSON" for embedded TJ-format
- mostly based on TJ-format, with some "special fields"
- UNTESTED: development interrupted by svn server crash
v1.112 Mon, 30 Mar 2020 13:49:13 +0200 moocow
* added %DTA::CAB::Analyzer::CLOSURE_CACHE - avoid re-compiling accessClosure() subs
- appears to fix memory-leak symptoms for Analyzer::Moot on debian stretch (perl 5.24.1)
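The caching idea behind the v1.112 change can be sketched as follows. This is a minimal illustration, assuming closures are compiled from source strings with eval; %CACHE, $n_compiled and compile_cached() are hypothetical names, not the actual DTA::CAB::Analyzer API.

```perl
use strict;
use warnings;

# Hypothetical cache: source string => compiled code reference.
my %CACHE;
my $n_compiled = 0;   # counts real eval() compilations, for illustration only

# Compile $src into a sub once; later calls with the same source reuse it.
sub compile_cached {
    my ($src) = @_;
    return $CACHE{$src} //= do {
        ++$n_compiled;
        my $code = eval "sub { $src }";
        die "compile failed: $@" if !$code;
        $code;
    };
}

my $get_text_a = compile_cached('$_[0]{text}');
my $get_text_b = compile_cached('$_[0]{text}');   # cache hit, no second eval
print $get_text_a->({ text => 'foo' }), "\n";
```

Keying the cache on the source string means that repeated requests for the same accessor return the same code reference instead of allocating a new sub each time.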
v1.111 Thu, 23 Jan 2020 10:15:17 +0100 moocow
* fixed normalized-key bug in DTA::CAB::Analyzer::DmootSub, ::MootSub
- earlier versions used normalized text as type-key ($dmoot->{tag},$moot->{word})
- old code caused bogus data-sharing for distinct input types mapped to same normalized form
- bug observed for type-wise analysis of dwb1 lemmata ("aasz","asz") both mapped to "As", both assigned lemma ("aasz")
if processed in the same document, otherwise "asz" gets assigned lemma "asz"
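The data-sharing effect described for v1.111 can be reproduced with a toy model. This is only a sketch: %NORM, lemmatize_types() and the "first text seen" lemma rule are hypothetical stand-ins, not the real DmootSub/MootSub logic.

```perl
use strict;
use warnings;

# Toy stand-in for the normalizer: both dwb1 spellings map to "As".
my %NORM = (aasz => 'As', asz => 'As');

# Type-wise analysis: one result per distinct key, first writer wins.
sub lemmatize_types {
    my ($key_by, @texts) = @_;
    my (%by_key, %lemma);
    for my $text (@texts) {
        my $key = $key_by eq 'norm' ? $NORM{$text} : $text;
        $by_key{$key} //= { lemma => $text };   # toy "lemma": the first text seen
        $lemma{$text} = $by_key{$key}{lemma};
    }
    return \%lemma;
}

my $buggy = lemmatize_types('norm', qw(aasz asz));  # keyed by normalized form
my $fixed = lemmatize_types('text', qw(aasz asz));  # keyed by input text
print "buggy: asz => $buggy->{asz}\n";   # shares the entry created for "aasz"
print "fixed: asz => $fixed->{asz}\n";   # keeps its own entry
```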
v1.110 Mon, 23 Sep 2019 15:57:09 +0200 moocow
* added http+unix support to dta-cab-check.perl
* added CAB::Server::HTTP::Handler::sanitizeCharset() : tweak UTF-8 charset args to browser-friendly "utf-8"
v1.109 Tue, 26 Mar 2019 13:52:49 +0100 moocow
* added Analyzer::DTAClean 'cleanPublic' option
- implements 'clean' option for public web-service
- still prunes 'unsafe' attributes (morph) including rwsub and dmootsub
- interest expressed by Graz users (G. Vogeler), 2019-03-19
* added Format::CorpusExplorerPlugin (aliases "ceplugin", "ceplug")
* added Format::Raw::Base
- common base class for Format::Raw::*, supports simple "normalized plain text" output
v1.108 Thu, 21 Mar 2019 12:59:49 +0100 moocow
* added ling2norm.xsl, ling2plain.xsl
* added format aliases teiws-names, teiws-ling-names (input option {teinames=>1})
v1.107 Fri, 22 Feb 2019 09:40:00 +0100 moocow
* added DTA::TokWrap and GermaNet::Flat dependencies (used by built-in analyzer classes)
v1.106 2019-02-12 moocow
* improved Version.pm (re-)generation: only if this looks like a "proper" checkout
* added Changes (this file: extracted from SVN logs & reformatted)
* cleanup for CPAN release
* SVNVERSION tweaks (revision only, no root URL)
* find.hack: File::Find hacks for ExtUtils::Manifest
* removed some (but not all) doubled and/or recursive symlinks from SVN
- they don't play nicely with ExtUtils::Manifest / MakeMaker / File::Find
v1.105 2019-01-09 moocow
* added ddc full lemma-list (LemmaListAll LemmasAll llist-all ll-all lla lemmas lemmata)
v1.104 2018-12-17 moocow
* default -log-watch=USR1 for dta-cab-server.sh
* added server logInitAnalyzer option
* added -log-watch=SIGNAL syntax (reload log-config on user signal, e.g. -log-watch=USR1)
v1.103 2018-12-06 moocow
* XmlLing : escape token text if not running in twcompat mode
* syslog debugging
* added cab-syslog.l4p -- getting weird rsyslog errors
> Oct 25 13:55:12 plato liblogging-stdlog: action 'action 0' resumed (module 'builtin:ompipe') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
> Oct 25 13:55:51 plato liblogging-stdlog: action 'action 1' resumed (module 'builtin:ompipe') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
... on every message; not pretty
* systemd-friendliness for cab sysv-scripts (control groups, etc)
* dta-cab.sh: merged changes from bogus fork in dstar/cabx/
* added cgiwrap for version
* web-howto typos
* updated 'fliegen' example in web-howto
* clean Version.pm
* WebServiceHowto updates for XmlLing
* alias tweaks
* XmlLing for server mode
* added support for TEI att.linguistic features
- new formatter Format::XmlLing (flat att.linguistic features, with optional TokWrap compatibility for later spliceback)
- new TEI and TEIws options 'att.linguistic=bool' : force use of XmlLing sub-formatter with appropriate options
- new TEI and TEIws aliases (ltei ... ling-tei-xml, lteiws ... ling-tei-ws)
- updated Format SUBCLASSES docs and examples
- still TODO: integrate new formats into CAB demo web-GUI and HOWTO
* added format XmlLing: use TEI att.linguistic attributes
v1.102 2018-06-20 moocow
* howto updates for spliced2ling
* added spliced2ling xsl stuff
* HttpProtocol.pod: added explicit 'xpost' reference
* DSGVO stuff
* clean Version.pm
* attempt to ensure Listen=SOMAXCONN for DTA::CAB::Server::HTTP::UNIX
v1.101 2018-04-13 moocow
* dta-cab-server.sh: handle tcp<->unix relay via new variables
+ added -verbose LEVEL option for debugging
+ added 'config|debug' action to view configuration variables
* system/xlit-unix.plm: test tcp relay handling by sysv-like dta-cab-server.sh
* more cab-v1.101 check tweaks
(icinga/pnp4nagios doesn't like floats in engineering notation)
* dta-cab-http-check.perl: v1.101 perfdata fixes
* status.html.tpl: compatibility fixes for transition
* added rss and exponential moving average query times to CAB status output
- implements mantis #26054
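An exponential moving average of query times, as added to the status output in v1.101, can be sketched like this. The update rule is the standard one; ema_update() and the smoothing factor 0.5 are assumptions for illustration, not values taken from the CAB server code.

```perl
use strict;
use warnings;

# Exponential moving average: avg' = alpha*x + (1-alpha)*avg.
# $alpha (0 < alpha <= 1) is a hypothetical smoothing factor.
sub ema_update {
    my ($avg, $x, $alpha) = @_;
    return $x if !defined $avg;          # first sample seeds the average
    return $alpha * $x + (1 - $alpha) * $avg;
}

my $avg;
$avg = ema_update($avg, $_, 0.5) for (0.2, 0.4, 0.8);   # query times in seconds

# Fixed-point formatting avoids exponent notation for very small values.
printf "%.3f\n", $avg;
```

Only the running average has to be stored between requests, which keeps the per-query bookkeeping constant.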
v1.100 2018-03-21 moocow
* dta-cab-server.sh:
- disable watchdog by default (let icinga do this)
- use administrative lock-files to avoid concurrent operations
* minor tempfile tweaks attempting to get at mantis #25739
v1.99 2018-03-07 moocow
* wd_verbose=1 after r27799 debugging left it at 2
* dta-cab-server.sh: tweaks for process groups (UNIX socket server + socat relay)
* clean Version.pm
* UNIX process group tweaks
* dta-cab-server.sh: kill whole process group on 'stop'
* clean Version.pm
* v1.99: improved handling for pathological Server::HTTP::UNIX conditions
(stale unix socket, stale relay process)
- server now only WARNs for stale relay sockets; dodgy 'fix' for
mantis bug #25326 (should be a valid fix for identical relay
command-lines as in bug #25326)
v1.98 2018-02-21 moocow
* moot langid FM.* pseudo-tags: keep CARD analyses too
* check for undef pid_cmd() output in Server::UNIX -- avoid heinous death in File::Basename::basename()
v1.97 2018-02-12 moocow
* v1.97: peerenv() optimization for DTA::CAB::Server::HTTP::UNIX::ClientConn
- only call peerenv() for peer command 'socat'
+ support http+unix:// scheme in DTA::CAB::Client::HTTP::lwpUrl()
v1.96 2018-02-09 moocow
* check for existing rc-file
* clean Version.pm
* tweaks for implicit creation of parent directories for unix sockets
* fixed Server::HTTP::UNIX destructor code
- was killing off relay process via signal for post-on-fork destruction
* documented new UNIX socket stuff
* added support for UNIX server sockets in CAB/Client/HTTP.pm, dta-cab-http-client.perl
* DTA::CAB::Server::HTTP::UNIX seems to be working
- built-in socat relay
- emulation of peerhost() and peerport() for relayed sockets via socat EXEC:'socat - UNIX-CLIENT:/socket/path' idiom + /proc/PEERPID/environ
* removed stale t.t
* xlit-http: disable cache again
* svn:ignore cleanup on plato
* started working on Server::HTTP::UNIX (should work more or less transparently with dta-cab-http-server.perl)
v1.95 2018-01-15 moocow
* Unicode::CharName version fix
* report memory usage in kB, not pages
v1.94 2017-11-13 moocow
* fix mantis bug #23127, introduced in v1.93
v1.93 2017-11-10 moocow
* dta-cab-analyze.perl: removed debug code
* db flags O_RDONLY fix for Dict::DBD
* don't include 'mhessen' in dmoot/morph
- if we've non-trivially normalized via dmoot, we probably don't want it
- plus, we're not sure if it's enabled anyways
* added Analyzer/Morph/Extra hacks; based on Morph/Latin/*, tested with Morph/Extra/OrtLexHessen
v1.92 2017-11-09 moocow
* *.cmdi-xml: added 'landing pages'
* added getcmdi.sh: fetch current CMDI record
* Raw::Waste utf8 handling woes
* check defined(ENV{HOME}) for Format::Raw::Waste (docker irritations)
* debugging for Format::Raw::Waste cache-clearance
* new default raw subclass=Raw::Waste; added shared model caching and auto-update to Format::Raw::Waste
* added support for environment variable DTA_CAB_FORMAT_RAW_DEFAULT_SUBCLASS
v1.91 2017-09-05 moocow
* removed stale test data cz.*
* cab-demo script cab.perl : updated target server to 194.95.188.42:9099 (data.dwds.de:9099)
* hack to allow global alternate default waste config dir (for cabx servers)
+ 'raw' input still uses default HTTP subclass
v1.90 2017-05-24 moocow
* blockscan debugging / kira
* cleaned up some debugging code
* fix optimization for Format::XmlNative::blockScanBody()
* optimization for Format::XmlNative::blockScanBody()
v1.89 2017-05-19 moocow
* v1.89: new default labenc=>auto (utf8 > latin1) for Analyzer::Automaton
v1.88 2017-05-18 moocow
* fixes for new Chain::Multi::getChain() method
* Makefile.PL workarounds for broken EUMM on kira (ubuntu 16.04 LTS / EUMM v7.0401)
* Chain::Multi::getChain() method (useful with dta-cab-analyze.perl -onload option)
v1.87 2017-05-16 moocow
* added -onload option for dta-cab-analyze.perl (porting dta cab_dbs builds to generic dstar)
v1.86 2017-05-12 moocow
* cabx server debugging, preparing for merge
* report top-level analyzer version in 'status' output
* Analyzer::versionInfo(): include rcfile
* version template fix
* better chain-handling for DTA::CAB::Analyzer
* cab server /version handler: analyzer options
* added cab server /version wrapper
* en-chain: remove msafe?
* DTA::CAB::moduleVersions(): renamed match/ignore options to moduleMatch, moduleIgnore
* DTA::CAB::moduleVersions(): return all version identifiers as strings
* DTA::CAB::moduleVersions() option changes
* honor 'chain' option in Analyzer::versionInfo() [hack]
* added options for Analyzer::versionInfo():
- don't report timestamps for disabled analyzers (allow user selection)
* updates for dta-cab-version.perl
* various version tweaks; added DTA::CAB->moduleVersions()
v1.85 2017-04-28 moocow
* teiws ner-parsing: more fixes for old libxml (kaskade)
* clean Version.pm
* tcf+ner: attribute-order tweaks
* more fixes for tcf+ner on kaskade
* teiws ner-parsing: fixes for old libxml (kaskade)
* v1.85: teiws, tcf ner support
- teiws: added support for parsing $w->{ner} from input //(persName|placeName|orgName|name); use -fo=teinames=1
- tcf: added support for output //namedEntities layer with
-fo=teilayers='... names ...', class alias -fc=tcf+ner
v1.84 2017-04-27 moocow
* fast version checking for CAB configurations with dta-cab-version.perl
* lemmatizer updates for taghm-2.5 lemma-internal 'diamond-tags'
* doc-extra/tcf-orthswap.xsl
v1.83 2017-04-25 moocow
* webservicehowto url tweaks (bbaw epub server URLs moved)
* WebServiceHowto: added tcf munger
* explicit 'please cite this' crap
* Analyzer::Automaton: tweaks for utf-8 encoded labels
* updates for tagh v2.5 (diamond-tags <A> etc.)
v1.82 2017-01-25 moocow
* removed @rendition=#aq heuristics in Analyzer::Moot::Boltzmann
(attempt to fix mantis bug #18392)
v1.81 2017-01-10 moocow
* updated taghx http config: logo, status
* dta-cab-http-check.perl: report n cached hits rather than hit rate in perfdata
* better logging for ignored connections
* clean Version.pm
* dta-cab-http-check.perl set svn:keywords
* dta-cab-http-check.perl tweaks
* tested dta-cab-http-check.perl: seems working
* added dta-cab-http-check.perl: nagios/icinga plugin
* CAB::Server::HTTP: hacks for handling chrome-style 'background connections'
- accept()ed sockets without any request on them
* added null-http.plm: dummy test server
* improved 'status' response
- cacheHitRate, nRequests, nErrors, memSize
v1.80 2016-12-02 moocow
* format docs
* dta-cab-server.sh: max 30 restart attempts (sleep=10)
* various lemmalist tweaks
* return all lemmata for function words in new specialized DDC-expansion format LemmaList
v1.79 2016-09-05 moocow
* fixed cab.plm eqphox reference
* added missing eqphox config to cab.plm
* cab-rc-update.sh: read local config file if present
* dta-cab-server.sh: fixed hanging when running via scripted ssh
- stdout/stderr for subprocesses was still bound on 'start', 'restart'
* updated http server docs
* added http server forkMax option
* added http server forkOn(Get|Post) options
v1.78 2016-06-16 moocow
* howto fixes
* updated web howto for date-dependent chains
* updated Chain::DTA docs for range-dependent chains
* auto-disable date-dependent rewrite transducers (e.g. for Dingler)
* removed debug code from dta-cab-analyze.perl
* added date-dependent rewrite models for DTA chain
v1.77 2016-06-13 moocow
* don't treat links as XY for LangId::Simple
v1.76 2016-06-09 moocow
* fixes and tweaks for en-wsj (english)
* added Morph/Helsinki.pm
- TAGH-simulation postprocessing for Helsinki-style morphological transducers
v1.75 2016-04-29 moocow
* updated cab howto for new server limit: 512KB -> 1MB
* updated WebServiceHowto: added screenshot
* pass error response through apache cgi wrappers
* more error tweak attempts
* fixed content-type: html for new error messages
* improved error reporting in Server::HTTP::clientError(), Server::HTTP::Handler::cerror()
- generate generic error responses and send them using
HTTP::Daemon::ClientConn send_response() method rather than its
send_error() method, since the latter generates html markup
without root element (may be a problem for weblicht)
- see mantis bug #12941
* http handler tweaks
* cab-http.plm: maxRequestSize 512KB -> 1MB
v1.74 2016-02-12 moocow
* more doc tweaks & fixes
* re-generated doc index
* updated HOWTO
* better checkbox value pass-in handling
* added SIGPIPE handler for Server::HTTP : avoid death with exit code 141
- following perlmonks suggestion
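Exit code 141 is how a shell reports death by SIGPIPE on Linux (128 + signal number 13). A minimal sketch of the general technique follows, assuming a plain %SIG handler; the handler actually installed in Server::HTTP is not shown in this log and may differ.

```perl
use strict;
use warnings;

# Without a handler, writing to a closed socket or pipe kills the process
# with SIGPIPE. With a handler installed, the write simply fails instead.
$SIG{PIPE} = sub { warn "caught SIGPIPE, continuing\n" };

pipe(my $r, my $w) or die "pipe failed: $!";
close($r);                                   # no reader left
my $ok = syswrite($w, "hello\n");            # returns undef rather than killing us
print "write failed, process still alive\n" if !defined $ok;
```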
v1.73 2015-11-16 moocow
* LangId::Simple: workaround for mantis bug #6737
v1.72 2015-11-12 moocow
* fixed double URL-encoding of query parameters on apache redirect (NE apache redirect option)
* file demo -> file upload
* symlinked tests/format-examples -> ../format-examples
* removed tests/format-examples (symlinking)
* moved tests/format-examples/ to top-level format-examples/
* renamed 'demo' to 'web service'
* Format/TEI: use tokwrap 'auto' low-level class by default, not 'http'
- should speed things up a bit; we're getting weird errors from kaskade http tokenizer for some reason
* web-service howto cleanup
* more cab-curl-*post.sh cleanup
* made cab-curl-*post.sh a bit more comfortable: allow omission of base URL
* htmlifypods fixes
* webservice howto re-formatting
* web howto; looks pretty much ok
* more web-service howto work, TEIws fixes
* TEIws fixes for missing @t or @text attributes
* xml-rpc: ignore textbufr, teibufr
* clean version.pm
* xml-rpc: ignore textbufr, teibufr
* doc fixes while writing web howto
v1.71 2015-11-10 moocow
* more format examples
* more format documentation: examples
* fixed some pod errors
* documented some more formats
* documented LangId::Simple
* Analyzer/Moot.pm set use_dmoot=1 by default (unless set explicitly in analysis opts)
v1.70 2015-10-02 moocow
* fixed morph+moot on csv1g files for dstar cab_eqlemma/corpus-csvx.1g
* v1.70: fixed 'Possible precedence issue with control flow operator' warnings from perl v5.20.2
v1.69 2015-08-06 moocow
* clean Version.pm
* fixed 'Possible precedence issue with control flow operator at DTA/CAB/Format/XmlTokWrapFast.pm line 147.' warning
* handle EINTR (interrupted system call) in sysread() calls from CAB::Socket
- used for parallel job-queues in dta-cab-analyze.perl as called in dstar build/cab_corpus/ subdirectory
* EINTR woes
* added cab-error-eintr.log: 'interrupted system call' during CAB analysis in dstar build
- probably resulting from a SIGCHLD handler getting called during a queue-socket read
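Handling EINTR usually means retrying the interrupted read. The sketch below shows that pattern under the assumption of a simple blocking read; sysread_retry() is a hypothetical helper, not the CAB::Socket implementation.

```perl
use strict;
use warnings;
use Errno qw(EINTR);

# Retry sysread() when a signal handler (e.g. for SIGCHLD) interrupts it.
sub sysread_retry {
    my ($fh, $len) = @_;
    my $buf = '';
    while (1) {
        my $n = sysread($fh, $buf, $len);
        return $buf if defined $n;        # success, or '' at end of file
        next if $! == EINTR;              # interrupted system call: try again
        die "sysread failed: $!";
    }
}

pipe(my $r, my $w) or die "pipe failed: $!";
syswrite($w, "job-1\n");
close($w);
print sysread_retry($r, 64);
```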
v1.68 2015-04-29 moocow
* fixes for LangId::Simple if no 'msafe' analysis is present (fixes bogus dstar FM.la tags)
v1.67 2015-03-25 moocow
* example: updated
* NE-tagging heuristics: don't force NE for placeName (e.g. 'Golf von Foo')
* v1.67: dmoot, moot heuristics for TEI <(pers|place)Name> and <foreign> tags
- doesn't work from straight-up TEI input, since 'xp' attribute is populated by build-time script dtatw-get-ddc-attrs.perl
v1.66 2015-03-06 moocow
* added weblicht -> cmdi
* fixed PatternLayoutl typo in Logger.pm (introduced in r5410)
* re-set CAB_SLEEP default to 3 (for watchdog)
* removed tokenizer-waste.xml (replaced by tokenizer-waste-update.xml)
* removed tagger-new.xml (replaced by tagger-update.xml)
* removed ddc-dstar-c4.cmdi-xml
- superseded by ddc-dstar-c4-update.cmdi-xml
* tiny tweaks
* dta-cab-server.sh robustness improvements
* more cab-server stuff (still wip)
* improved dta-cab-server.sh stuff
* added 'fmt=tcf' to 'Input Parameters' section for dstar/ddc services
- otherwise limit gets integrated with a '?'
- e.g. http://kaskade.dwds.de/dstar/dta/dstar.perl?fmt=tcf?limit=10 rather than ...?fmt=tcf&limit=10
* finer-grained sleep commands
* added updates
* added *update.cmdi-xml
* implemented WebLichtWebServices:N naming scheme in //CMD//ResourceProxy/@id
* added system/apache-cgi-wrap/.htcabrc-data-9096-autoclean
* added tcf+pos pseudo-formats to demo.html.tpl
* added tcf-pos pseudo-format
* added ddc-c4*.cmdi-xml
* moved dta corpus query to id=s070!
* added some more web services
* moved orig/cab.cmdi-xml back to .
* added WebLichtWebServices.url
* moved WebLichtWebServices.url -> WebLichtWebServices.url_old
* fixed TCF parsing bug
v1.65 2014-12-02 moocow
* don't let tokwrap ignore mapclass attribute in tei mode
* TEIws format update:
- allow #-prefixed IDs in @prev,@next attributes gracefully
* disabled debug code
* ignore some stuff
* tcf tweaks: encode tei in textCorpus/textSource as schema trunk describes
* tei-in-tcf embedding uses textSource element
v1.64 2014-11-27 moocow
* disable cab demo debug
* Format/JSON fix: don't output scalar references (e.g. teibufr, textbufr)
* tcf token id fix
* tcf sentence id fix
* fixed TCF typos
* always include //sentence/@ID for TCF format
v1.63 2014-11-25 moocow
* htdocs/demo.js fixes for implicit tokenization of un-tokenized tcf
- effectively ignore 'tokenize' checkbox for tcf
* clean Version.pm
* TCF format fixes and updates
- improved tcf parsing using getChildrenByLocalName() instead of findnodes()
- added tcf tokenization if only 'text' layer is present using DTA::CAB::Format::Raw
* ifmt is safe too
* improved tcf parsing
v1.62 2014-11-12 moocow
* added 'ofmt' to list of safe pass-through parameters
* status home link: .. (for demo)
* demo fix: disable raw text for live-mode
* demo.js fixes for inline return
* more tcf options
* output format option only for upload gui
* more tcf i/o tweaks
* more tei/tcf and server i/o format tweaks: looks good, go live on MONDAY
* different in- and output-formats for server, TEI, TCF format tweaks using doc->{textbufr}
v1.61 2014-10-16 moocow
* added eval files
* don't output sentence comments for ExpandList
* verbose logging options
* log-stderr typo
* added playground/logo as symlink
* removed old logo/ symlink ; replacing with real mccoy
* cabx directory basically in place
* automaton resultfst crashing
* added logos
* cab demo: added logo
* added 48p logo
* tag-hacks: added mathematical operators to 'punctuation-like' class
* MootSub tag-tweaking hacks: avoid 'normal' tags for non-wordlike tokens
v1.60 2014-08-22 moocow
* fixed DTA::CAB::Analyzer::_am_wordlike_regex() to allow combining diacriticals wherever [[:alpha:]] is included
- unicode should really call these things alphabetic, imho, but it doesn't
v1.59 2014-06-24 moocow
* added dta 'lemma', 'lemma1' chains (with exlex)
* sleep between stop and start actions on restart
* allow direct demo-gui display of xml responses
- fixed 'pretty' parameter pass-through bug in DTA::CAB::Format::Registry::newFormat()
- stop tcf format complaining about missing document for spliceback (avoid garbage in apache logs)
v1.58 2014-06-16 moocow
* added example scripts cab-curl-post.sh, cab-curl-xpost.sh
* reapClient chost fix2
* daemonMode=fork for DTA::CAB::Server::HTTP
- only for POST queries
* xlit-http.plm : turned down logLevel
* server status tweaks
v1.57 2014-06-13 moocow
* added OpenThesaurus expander to dta chain (uses Analyzer::GermaNet class)
* added OpenThesaurus expander
v1.56 2014-06-11 moocow
* GermaNet : allow synset names as 'lemma' queries
* apache-cgi-wrap default host = localhost
* ExpandList/LemmaList alias fixes (no CODE refs in default formats)
* v1.56: added ExpandList aliases LemmaList,llist,ll,lemmata,lemmas,lemma
+ added Chain::DTA analyzers default.lemma, default.lemma1
* added LemmaList|llist|ll|lemmata|lemmas alias for ExpandList
+ using CODE-ref hack to extract non-root attribute moot/lemma
+ better solution would be to polish up and use (something like) Data::ZPath
v1.55 2014-05-27 moocow
* moved tagh-http.plm to taghx-http-9098.plm
* eliminated 'ge|' prefix removal hack for tagh-lemmatization
- for compatibility with dwds-kc20 lemmatization
v1.54 2014-05-15 moocow
* updated format docs
* replace 'xml' with 'txml' in demo list
* allow lowercase letters in morph tags parsed by Analyzer.pm accessor macro am_tagh_fst2moota
- fixes bogus VV* tags for new [roman] pseudo-analyses from dta-morph-additions
v1.53 2014-03-16 moocow
* set default CAB_SLEEP=5
- try to avoid restart failures on services (Cannot bind socket 0.0.0.0 port 9099: Address already in use);
- but SO_REUSEADDR ought to be set - what gives?
* don't set ReusePort, since it gives errors: "Your vendor has not defined Socket macro SO_REUSEPORT"
* documented ExpandList
* added csv1g formatter
* added moot/details field: best analysis, for saving tagh analyses
- new moot/details should be swept by analyzeClean
v1.52 2014-01-31 moocow
* tei: disabled debug
* added twTokenizeClass pass-through to DTA::TokWrap
* fixed tei rmtree() bug on multiple processes
* apostrophe-s handling
* v1.52: updated 'word-like' regex to include 's suffixes
+ centralized word-like regex to DTA::CAB::Analyzer::_am_wordlike_regex()
+ updated/unified email address to [email protected]
v1.51 2014-01-13 moocow
* Cab/Analyzer/MootSub
- fixed bug assigning lowercase lemma 'urteilen' to urteil/NN~urteil~en[VVIMP]
- CAB/Format/TT : fixed (d|m)oot analysis parsing
* TokPP/Waste: fixed again
* TokPP/Waste-related segfaults on services
* CAB/Analyzer/TokPP/Waste.pm : don't try to store annot key (avoid segfaults)
* basic redundancy handling for moot/analysis and dmoot/morph (mostly just aesthetic)
* TokPP analyzer re-factored to use Moot::Waste::Annotator by default
v1.50 2013-12-10 moocow
* dmoot fix for list-valued $w->{lang}
* new raw input modes
* improved raw-text input using moot/waste
- either locally (CAB::Format::Raw::Waste)
- or via http (CAB::Format::Raw::HTTP)
* added CAB::Format::Raw::Waste : waste tokenization
- currently only works by writing a temporary string buffer and passing to Format::TT for final document construction: UGLY
- we should probably use the waste buffer classes for this (making these visible to perl)
- better yet, this is a poster child for perl-level TokenWriter subclassing
* XmlTokWrapFast: read //w/moot/@* into $w->{moot}{$_}
v1.49 2013-12-09 moocow
* updated to v1.49
v1.48 2013-12-06 moocow
* added capsFallback automaton option; set by default for Analyzer::Morph
* cab automaton-based analyzers: set check_symbols=>0
v1.47 2013-12-05 moocow
* added system/dwds/ and system/init/dwds-http-9096.rc
* added dwds-http-9096.plm wrapper
- removed request-size limit (maxRequestSize=undef)
- disable autoclean mode
* fewer unknown-symbol warnings (once per symbol per object)
- XmlTokWrapFast: output //s/@pn
* CAB/Format/TEI: default tokenizer class back to http
* fix warning for missing content-length
* TCF: default to format level=1
* Moot:
- compatibility fix: apply tag-translation table BEFORE model lookup
* set global server maxRequestSize=512k for cab-http.plm
* added maxRequestSize key to CAB::Server::HTTP and CAB::Server::HTTP::Handler::Query
* allow TEI to support -fo=txmlfmt=XmlTokWrapFast
- 2x faster than default, but doesn't support all keys
* CAB/Chain.pm: propagate logTrace from opts if set there
v1.46 2013-10-10 moocow
* edited cab.cmdi-xml with local export (Edmund): sending to Frank
* removed bogus debug code from dta-cab-analyze.perl
* cab.plm: moot,dmoot use 'dtiger' infix instead of tiger
- centralized training source in moot-models/dta-dtiger
* Format/Raw.pm : handle U+00AD (SOFT HYPHEN)
* LangId::Simple : don't output lang_counts by default
* cab-rc-update.sh: update from kaskade
* Raw tokenizer: handle '[Formel]'
* improved LangId::Simple
- now counts number of stopword CHARACTERS (vs tokens)
- added better 'xy' rules, also added an xy 'stopword' list in
cab_automata/langid/data/xy.t
v1.45 2013-09-03 moocow
* CAB::Analyzer::LangId : got working again; results not very encouraging
* special handling for double-initial caps in Analyzer::Unicruft: updated version
* special handling for double-initial caps
* re-built logos using inkscape
* added new compatibility symlink cab-favicon.png
* removed old cab-favicon.png
* added new logos
* added caberr-64.png
* updated cab favicon
* MorphSafe badTypes map now maps (text=>isGood) rather than (text=>isBad)
- fixes bug in which badMorph heuristics were overriding a
__good__ entry in badTypes file (Gutherzigkeit)
v1.44 2013-07-22 moocow
* tcf / format fixes
v1.43 2013-07-11 moocow
* TCF format fix: reset temp variables ($pos,$lemma,$orth) between words
* added TCF to demo formats
* default TOKENIZE_CLASS='auto' for TEI via TokWrap
* checkin with updated Version.pm
* first version with TCF support
- how finicky do we need to be with offset-based tokens, sentences, etc?
- and how do we handle metadata?
* added basic TCF format (output only atm)
v1.42 2013-06-23 moocow
* -fc option added to dta-cab-splice-syncope.perl
* better version check
* TEI format debugging and tweaks
- can now set -fo=txmlfmt=XmlTokWrapFast for e.g. fast TEI-format input, but this slows down TEI-format output
- best results seem to be with -io=txmlfmt=XmlTokWrapFast
-oo=XmlTokWrap for plain convert; ymmv with actual analysis going on
* lots of debugging code
* better TEI format debugging with e.g. -fo teilog=debug
* removed Format::TEI debug flag
* fixed ugly regex-slowing $POSTMATCH in CAB::Format::XmlNative::blockScanFoot()
- use perl 5.10 /p modifier and ${^POSTMATCH} instead
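The /p idiom mentioned for blockScanFoot() looks roughly like this. The input string and pattern are made up for illustration and are not the ones used in CAB::Format::XmlNative.

```perl
use strict;
use warnings;

# Hypothetical input; the real method scans XML document buffers.
my $xml = '<doc><body>text</body><!-- foot --></doc>';

# /p makes ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} available for this match,
# without the global regex slowdown that $POSTMATCH ($') caused in older perls.
my $foot;
if ($xml =~ m{</body>}p) {
    $foot = ${^POSTMATCH};
    print "$foot\n";
}
```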
v1.41 2013-06-05 moocow
* default xml format now resolves to tei
* cab.perl: read dirname($0)/.htcabrc for local overrides
* cab.perl: read cab.perl.rc
* demo.js: fix cab_url_base guessing regex if parameters are specified
- e.g. http://localhost:9099/?q=foo
* MootSub lemmatization: honor 'FM.*' tags
* cab demo: pass through 'file' parameter
* demo links seem to work now!
* demo init: fix links
* demo.js &-expansion woes
* workaround for Unify.pm choking on REGEXPs in Format::Registry
- implement STORABLE_(freeze|thaw) for Format::Registry
- allows rollback of Unify.pm changes in r9738 (explicit
DS-traversal with potential cycles, caused infinite allocation
loop and memory explosion in 'real' CAB servers)
* added /upload and /file paths to cab-http.plm
* demo/upload tweaks (don't call it 'upload')
* file upload updates
* merged in branch htdocs-1.41-upload -r9728:9736
* fixed YAML dispatch
* updated demo.js: make traffic-light frame work in proxy mode
* language guesser tests
* wrap various YAML implementations directly in YAML.pm (rather than subclass hacks)
* LangId::Simple: only use unicode character block hacks for words of length >= 2
* hasmorph for text-mode output
* updated DTAClean: added 'hasmorph' key
* prune analyzers in cab.perl wrapper
* dingler: try to enable autoclean
* cab-http-9099: auto-clean on
* trimmed cab-http-9099.plm to ignore authentication
* updates from kaskade2 for debian/wheezy
* lang-guesser updates: unicode hacks
* Morph::Latin : only analyze if isLatinExt
* Moot: use FM.$lang as tag for language-guesser hack
* XML formatting woes
* built in langid heuristics to Moot/Boltzmann and Moot
* added LangId::Simple analyzer, built into DTA chain as 'langid'
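Note on the Format::Registry workaround above: STORABLE_freeze and STORABLE_thaw are Storable's per-class serialization hooks. The sketch below shows the general idea of keeping a compiled regex away from the serializer by storing its source string instead. Demo::Registry and its fields are hypothetical stand-ins, and how Unify.pm actually invokes the hooks is not shown in this log.

```perl
use strict;
use warnings;

package Demo::Registry;   # hypothetical stand-in, not DTA::CAB::Format::Registry

sub new {
  my ($class, %args) = @_;
  return bless { name => $args{name}, re => $args{re} }, $class;
}

# Storable calls this instead of traversing the object itself:
# return a plain string, so the compiled regex is never serialized.
# Assumption for this sketch: 'name' contains no tab character.
sub STORABLE_freeze {
  my ($self, $cloning) = @_;
  return join("\t", $self->{name}, "$self->{re}");
}

# Storable calls this on an empty, already-blessed object with the frozen string.
sub STORABLE_thaw {
  my ($self, $cloning, $frozen) = @_;
  my ($name, $re_src) = split(/\t/, $frozen, 2);
  $self->{name} = $name;
  $self->{re}   = qr/$re_src/;   # recompile from the stringified pattern
  return;
}

package main;
use Storable qw(freeze thaw);

my $reg  = Demo::Registry->new(name => 'tei', re => qr/\.tei\.xml$/i);
my $copy = thaw(freeze($reg));
print $copy->{name}, ': ', ('FOO.TEI.XML' =~ $copy->{re} ? 'match' : 'no match'), "\n";
```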
v1.40 2013-04-30 moocow
* smarter verbosity for cab-rc-update.sh
* updated to use (my own) GermaNet::Flat API module, rather than clunky google code variant
* added -begin and -end CODE options to dta-cab-analyze.perl
* Format::Raw : parse underscores as word-like
v1.39 2013-04-24 moocow
* removed xlemma stuff again
* MootSub: generate moot/xlemma field: raw TAGH segmentation for best lemma
* bugfix lemma(Christentum) -> Christenenum (cab lemmatizer ~e)
* lemmatizer: rename verb inflections
* GermaNet runs sentence-wise, in order to access moot/lemma
+ added GermaNet::Synonyms
+ changed GermaNet labels to:
- gn-syn (Synonyms)
- gn-isa (Hyperonyms~superclasses)
- gn-asi (Hyponyms~subclasses)
+ added GermaNet analyzer option LABEL_max_depth e.g. gn-syn_max_depth for some control of resolution
* oops: fixed multi-load of GermaNet and descendants
* added germanet hyponyms to DTA
* added and tested basic GermaNet relation closures
* added GermaNet/{RelationClosure,Hyperonyms,Hyponyms}.pm
* added Analyzer::GermaNet.pm
v1.38 2013-03-11 moocow
* added xlist format to demo
* ExpandList fix
* pretty-printing for ExpandList
* TokPP: replaced some bad [[:digit:]]* with [[:digit:]]+ regexes
- upshot: don't analyze empty string as CARD
* Analyzer::Morph::Latin::CDB : use _am_xlit rather than $_->{text} as key
- fixes caberr bug #66980 (Phaſmate -> Faßmate != Phasmate) b/c utf8 variant isn't in latin lexicon
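Note on the TokPP quantifier fix above: a pattern built only from [[:digit:]]* also matches the empty string, which is how an empty token could be classified as CARD. A minimal sketch with hypothetical patterns (not the actual TokPP rules):

```perl
use strict;
use warnings;

# Hypothetical patterns illustrating the quantifier change; not the actual TokPP rules.
my $card_old = qr/^[[:digit:]]*$/;   # '*' also accepts zero digits
my $card_new = qr/^[[:digit:]]+$/;   # '+' requires at least one digit

# Compare both patterns on an empty token, a number, and a mixed token.
for my $tok ('', '42', '4a') {
  printf "token='%s' old=%d new=%d\n",
    $tok, ($tok =~ $card_old ? 1 : 0), ($tok =~ $card_new ? 1 : 0);
}
```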
v1.37 2013-03-08 moocow
* added dingler server, running on kaskade @ port 9097
* added dingler server configs
* fix typo
* add FM,XY moot analyses for words with non-latin characters
* v1.37: dmoot: leave as-is if !isLatinExt
v1.36 2013-02-22 moocow
* syncope csv format: let "'s" be LOWERCASE_WORD (python regex compatibility hack)
* v1.36: fixed moot bug resulting in e.g. --/NE
- problem was bad propagation of tokenizer (toka) tags of the form [$(] through _am_tagh_list2moota resp. _am_tagh_fst2moota
v1.35 2013-02-11 moocow
* updated lemmatization heuristics: punish orgnames
v1.34 2013-02-05 moocow
* format/syncope/csv: 'digit' type now includes dotted numerics
* ignore dta-syncope-ner.*
* remove debug code from dta-cab-convert.perl
* Format::TEI fix: include PID in tmpdir name so parallelization works
* morph fst: check_symbols=>0
* Format/XmlXsl gone
* removed some debug code from cab.plm
* resource changes (dta-cabopt.mak: eqphox_xocoef* -> eqp_xocoef_*)
* ignore dta-cabopt.mak
* set dta-cabopt.mak.v0
* added dta-cabopt.mak.v0 (original parameters)
* cab.plm: parse RCDIR/cabopt.mak for cab-optimization parameters
* added Utils::(min2|max2)
* added missing chomp() to repaired tj
* fixed non-linear slowdown for Format::TJ
- problem seems to have been buffer-and-parse-string strategy
- likely related to the bizarre non-linear slow-regex-match-on-large-buffers we saw in TokWrap::tokenize1
- fix is to avoid buffer and parse filehandles directly
- TODO: port this approach to TT and Text
* Format.pm: pre-allocation string hacks for fromFh_str(): no joy
- problem is major non-linear slow-down for large TT-based formats (including TJ)
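Note on the Format::TJ slowdown fix above: the idea of parsing the filehandle directly, rather than slurping the input into one large buffer and regex-scanning it, can be sketched as follows. read_sentences() and its simplified TT-like layout (one token per line, tab-separated fields, blank line as sentence break, %% comment lines) are assumptions for illustration, not the Format::TJ implementation.

```perl
use strict;
use warnings;

# Hypothetical reader for a simplified TT-like layout (assumption, see note above).
# Each line is handled as it is read, so no regex ever runs over a huge buffer.
sub read_sentences {
  my ($fh) = @_;
  my (@sents, @cur);
  while (defined(my $line = <$fh>)) {
    chomp($line);
    next if $line =~ /^%%/;             # skip comment lines
    if ($line eq '') {                  # blank line closes the current sentence
      push(@sents, [@cur]) if @cur;
      @cur = ();
      next;
    }
    my ($text, @fields) = split(/\t/, $line);
    push(@cur, { text => $text, fields => \@fields });
  }
  push(@sents, [@cur]) if @cur;         # final sentence without trailing blank line
  return \@sents;
}

# Usage with an in-memory filehandle
my $data = "Dies\tPDS\nist\tVAFIN\n\nEnde\tNN\n";
open(my $fh, '<', \$data) or die "open failed: $!";
my $sents = read_sentences($fh);
close($fh);
print scalar(@$sents), " sentences\n";
```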
v1.33 2012-11-02 moocow
* better analyzePost fixes
* Analyzer::Automaton::analyzePost : run after analyzeSet() closure
+ Analyzer::accessClosure(): allow passing of HASH-refs for more flexibility in config-files
* added Format::TT I/O for raw-sentence text (either in sentence id-line with "\t=TEXT" or in dedicated "%% $stxt=TEXT" line)
* high-level I/O wrappers DTA::CAB::Document::(from|to)(File|Fh|String)
* updated XmlTokWrapFast : include xb attribute if available
* updated for dta-tokwrap v0.37 - v0.38
v1.32 2012-10-04 moocow
* fixed more tokwrap v0.37 bugs (explicit <toka> grouping now output by tokwrap)
* fixes for dta-tokwrap v0.37
* updated Client::HTTP docs
* added 'ws' attribute to XmlTokWrapFast
* got Format::TEIws working
+ updated for dta-tokwrap v0.36
v1.31 2012-09-24 moocow
* moved gfsmxl parameters from old setLookupOptions() API to new 'analyzePre' key for Analyzer::Automaton subclasses
+ more flexible in general
+ updated cab.plm to reflect changes in semantics
+ old-style code using max_paths, max_weight, and max_ops should still work if no 'analyzePre' key is present
* updated cab-rc-update.sh: changed source url from 'dta2012' back to 'dta'
v1.30 2012-09-18 moocow
* content-length fixes for kaskade
* updated demo.js, demo.html.tpl: fixes for apache-cgi-wrap/
* added generic apache cgi wrapper dir: system/apache-cgi-wrap
* updated CAB::Format::TEI for dta-tokwrap v0.35
v1.29 2012-09-05 moocow
* Format::SQLite updates for almost-ready eval-corpus
* syncope-tab alias for SynCoPe::CSV
* another name change: now in XmlTokWrapFast
* oops: another id->nid rename
* syncope/ner fixes: 'id' is a bad attribute name for subsequent splice
* syncope splice fixes
* added dta-cab-splice-syncope.perl
* use HYPHEN-MINUS instead of HYPHEN_MINUS for syncope csv
* add sid,wid numeric suffixes to syncope-csv location
* oops: mapclass was already in XmlTokWrapFast
* added mapclass attribute to Format::XmlTokWrapFast
* removed analyzeDebug option from Analyzer::Moot::Boltzmann
* copy fixes for dmoot
* empty sentence fix for moot,dmoot
* added dmoot flag 'lctags': bash dmoot tags to lower case
+ added moot flag 'lctext': bash text to lower-case
+ for use with new build hmms '*.lc.(1|12|123).hmm'
* abs() rule for TJ : level=-2 --> -text, +canonical
* added dta-cab-eval.perl
v1.28 2012-07-23 moocow
* SQLite changes: history now stored directly as json (TODO: move to version control)
* improved Format/SQLite parsing -- throughput up from <100 tok/sec to >15k tok/sec
* added CAB::Format::SQLite.pm for EvalCorpus
v1.27 2012-07-18 moocow
* updated default.(base|type) chains in CAB/Chain/DTA.pm
* map 'old' key to 'text' in Format::XmlTokWrap
* v1.27: blockScan fixes for Format::XmlNative (and by inheritance Format::XmlTokWrapFast)
- fixes mantis bug #543 : disappearing pages
- this worked with negative lookahead regexes, but those crash perl on some inputs (grr....)
v1.26 2012-07-06 moocow
* debug
* cab-rc-update.sh: pull from dta2012/cab rather than ddc/cab
* real new DTA-unknown-char U+FFFC (object replacement character), various bugfixes
v1.25 2012-07-04 moocow
* cab improvements for dealing with unicode replacement character (U+FFFD) as unknown-text marker
* workaround for blockScan() segfault: slower but works on plato
* segfault bughunt / kaskade:
- dying at Format/XmlNative.pm line 146 (regex match in blockScanFoot) for
ddc/dta2012/build/xml_tok/campe_robinson02_1780.TEI-P5.chr.ddc.t.xml
in build/cab_corpus
- only dying under make (make -j , -blockSize don't matter)
- segfault backtrace:
0x00002b26f788ef77 in ?? () from /usr/lib/libperl.so.5.10
(gdb) bt
#0 0x00002b26f788ef77 in ?? () from /usr/lib/libperl.so.5.10
#1 0x00002b26f7896fd0 in ?? () from /usr/lib/libperl.so.5.10
#2 0x00002b26f789ad29 in Perl_regexec_flags () from
/usr/lib/libperl.so.5.10
#3 0x00002b26f7837e76 in Perl_pp_match () from
/usr/lib/libperl.so.5.10
#4 0x00002b26f7831392 in Perl_runops_standard () from
/usr/lib/libperl.so.5.10
#5 0x00002b26f782c5df in perl_run () from
/usr/lib/libperl.so.5.10
#6 0x0000000000400d0c in main ()
* more choice stuff!
* 'null' analyzer fix
* add explicit 'null' analyzer (not just empty chain) to DTA
* tei re-fix (revision 7415:7416 broke DTAQ)
* added DTA pseudo-analyzer 'null'
* tei fix
* ner fix
* added NER to DTA chain
* moved nerec/ into tests/
* added nerec/ test directory for syncope ne-recognition
* added Analyzer::SynCoPe::NER : named-entity recognition via SynCoPe XML-RPC server
v1.24 2012-03-28 moocow
* dta-cab-analyze.perl -fo option fix
* even more msafe adaptation; use unicode class \p{Letter}
* more msafe adaptation
* typo fix
* updated MorphSafe:
- all-non-alphabetic tokens are now considered "safe" (replaces /^[[:punct:][:digit:]]*$/ heuristic)
* add U+A75B (r rotunda) to latin1x-safe symbols
* added rudimentary query handling to cab demo.js, demo.html.tpl
* improved lemmatization for XY (no lower-case bashing)
* added canonical option to Format::TJ if level>=0
* hack: remove ge\| prefixes in lemmatizer
* added live javascript demo.js to taghx-http.plm
* updated MANIFEST: remove CAB/Format/JSON/*.pm, CAB/Format/YAML/*.pm
* fixed cab/moot bug 'nachgesucht->VVFIN'
- problem was inconsistency between model (uses TAGH tags for lex
classes e.g. VVPP2) and CAB-generated input (used translated
tags, VVPP2->VVPP)
- CAB now uses raw (tagh) tags for input and applies the tag
translation dict __after__ tagging (so lemmatization should still work)
* fixed utf-8 bug in dta-cab-http-client.perl
v1.23 2012-01-17 moocow
* sysv-ified dta-cab.sh
* improved demo: added arbitrary user options (JSON-encoded)
* allow non-refs in JSON input
+ also updated demo page to use backgrounded javascript-based queries a la cab error db
v1.22 2011-12-16 moocow
* services fixes
+ http server response logging option (srv->{logResponse})
* fixed "'frobble' is not a HASH reference in Format/TT.pm" bug with eqlemma as array-of-strings
v1.21 2011-12-09 moocow
* changed undef to 'off' in cab-http.plm (avoid unification glitch)
* fixed rmlog actions on check-ok
* improved cab-rc-update.sh cron script
* added caberr1, norm1 chains
* removed local ssh keys; use id_dsa by default
* changed default actions for cab-rc-update.sh to 'check update': no implicit restart
* fixed JSON format bug blowing up logs e.g. on services
* updated cab-rc-update.sh script for resources.new->resources renaming
* rc changes (services)
* moved resources.new/ pointers to resources/
* moved resources.new/ -> resources/
* removed stale resources/ dir
* turned up CAB_SLEEP to 3 in dta-cab-server.sh: auto-restart was failing
* cabEval fix (global %::analyzeOpts)
* added logResponse option to cab-http.plm
* default restartable servers
* TEI format fixes
* updated cab-rc-update.sh (added basic actions to command-line)
* added and tested CAB/Analyzer/EqRW/JsonCDB.pm
* added and tested CAB/Analyzer/EqPho/JsonCDB.pm
* added CAB/Analyzer/EqLemma/JsonCDB : new moot-only lemma-equivalence
v1.20 2011-09-15 moocow
* explicitly set static type keys
* static typeKeys fixes: auto-scan on prepareLoaded()
+ MootSub bug fix
* lemmatizer fixes
* updated MootSub: now basically tomasotath-compatible
* added stringsim/testme.perl : string similarity benchmarking
* more best-lemma updates:
- slowdown from 3.3 tok/sec to 2.9 tok/sec in dta/build/cab_corpus
* updated MootSub: added stupid unigram-based edit-similarity in best-lemma heuristics
* more lemmatizer fixes
* lemmatizer fix: remove '/p' infixes
* fixed typo in taghx-9098.rc server rc file
* added simple tagh expander class (EqTagh), server taghx-server.plm, init file taghx-9098.plm
* added taghx-http.plm: tagh expander
* added some deps to Makefile.PL for build on new services2
* added CDB_File dep to Makefile.PL
* ignore some stuff
* fixed list-mode argument parsing bug
* fixed stdin auto-spooling bug
* leak tests: inconclusive
+ installing to kaskade...
* json doesn't leak much at all
* added expat-base input to Format::XmlTokWrapFast
+ looks good, leaking some memory though (ftxml,txml,tj formats; even with Null analyzer)
* got Xml(Native|TokWrap) block-scanning working
+ TODO (?): write XmlTokWrapFast input mode using expat?
* tested api cleanup from carrot: scan seems to be working again
* block api cleanup from carrot (untested)
+ still todo: TT::blockFinish() override for block-final eos newline scanning
+ still todo: XmlNative::blockFinish() ? or can we use the defaults
+ todo: block testing?
* more block-scanning tests
- sentence-level blocking should work for XmlNative, XmlTokWrap
* moved block tests to tests/blockscan
* more block-scanning tests: moving to tests/blockscan/
* added test xmlbscan.perl: try to get blockScan(), blockMerge() working for flat XML files
* got cab-analyze.perl working with new UNIX-socket based queue
- block scan & merge works with TT, TJ formats, even in -list mode
- TODO (?): extend blockScan() + blockAppend() API to other (e.g. xml-based) formats?
v1.19 2011-08-31 moocow
* revised CAB/Fork/Pool.pm to use new CAB/Queue/Server.pm rather than clunky Queue::File
- started working new Fork/Pool.pm stuff into dta-cab-analyze.perl
- continue at or around line 407 (post queue population)
* more queue tests in (increasingly poorly-named) tests/sysv
+ looks good: should be ready to integrate into command-line analyzer
* JobManager update
- todo: JobManager::Client (in JobManager.pm), update analyze script
* added CAB/Queue/JobManager.pm for block-savvy DTA::CAB::Analyze queue management
* got basic blockScan(), blockAppend() APIs in place for Format::TT
* added tt-blockscan.perl
* got dta-cab-analyze.perl working with new format semantics
+ todo: UNIX socket queue, better block handling
* got HTTP, XmlRpc server and client working with new format semantics
* updated dta-cab-(http|xmlrpc)-client.perl to use new format semantics
* removed stale dta-cab-xml-format.perl
* removed stale cachegen, compile, dict-convert scripts
* removed old YAML directory: stick to YAML::XS
* finished updating toString,toFile,toFh semantics in CAB formats
* re-working CAB::Format API: toFh(), toString()
- done formats: JSON, Null, Storable, ExpandList, TJ, Text, TT, Raw, CSV, Perl
- todo: YAML, Xml*
+ next: kludge a generic block-handling API into DTA::CAB::Format (@blocks=->block_scan(); ->block_append(,))
* re-factored CAB/Queue/(Socket|Client|Server) to CAB/Socket, CAB/Socket/UNIX, CAB/Queue/(Client|Server)
* more UNIX socket queue tests
* more tests: tests/sysv/cq(test|client).perl -- working again (it seems)
* broke things
* socket queue-server work
* more queue tests
- best candidate so far: qsrv.perl : dedicated 'master' queue server using UNIX sockets
- idea: separate scan- and process- fork-pools (like now)
- scan pool scans for block boundaries (test: blockscan.perl: use byte offsets, lengths, seek(), tell())
- process pool does actual processing
(like current dta-cab-analyze.perl, but must send data BACK to server; see qsrv.perl)
- master process maintains queue (qsrv.perl) and merges processed blocks into final output files (blockmerge.perl)
* added qtest.perl: works (single-file binary-safe message queue using flock)
* more bdb/cdb fixes
* added sysv tests: semaphores ought to work; message queues look a bit dodgy...
* added Cache::Static; moved bdb->cdb
* added Analyzer::Cache::Static sub-hierarchy
* bdb->cdb: system/cab.plm
* bdb->cdb: analyzer aliases
v1.18 2011-08-22 moocow
* split ExLex into {BDB,CDB} subclasses: todo: replace BDB by CDB for db-based lookups (ca 25% faster)
* removed stale BDB directory
* added Format::XmlTokWrapFast : quick+dirty fast output for feeding to dtatw-xml2ddc.perl
* more fixes (short format alias 'bin' for Storable)
* kaskade fixes for big dta build
* fixed wide-character bug in tj output
* update script debugging
* added documentation to README.update
* changed alias structure in Chain::DTA (default->norm rather than norm->default)
- no functional difference
* don't start langid server by default
* README: newline at EOF
* fixed CAB_RCDIR
* cab_corpus/ build: fixes & adjustments
* fixed TJ format bug for sentence attributes
* version, analyze verbosity for spawn
* got forked block-processing working
* pre-split blocks in dta-cab-analyze.perl
v1.17 2011-08-12 moocow
* work on new system/resources/ dir (as system/resources.new)
* default update from kaskade
* added ssh keypair cab-rc-update.dsa
- pubkey must be authorized for update user on build host