TODO:
==============
- Force the crawler to stop after a configurable amount of time without
  progress.
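  A minimal sketch, assuming a hypothetical lastProgress field updated by
  the processing loop (uses java.util.concurrent.atomic.AtomicLong):

      // Record a timestamp whenever a document makes progress, and have
      // a watchdog compare it against a configurable no-progress limit.
      private final AtomicLong lastProgress =
              new AtomicLong(System.currentTimeMillis());

      void onDocumentProcessed() {
          lastProgress.set(System.currentTimeMillis());
      }

      boolean isStalled(long maxNoProgressMillis) {
          return System.currentTimeMillis() - lastProgress.get()
                  > maxNoProgressMillis;
      }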
- Have the GenericURLNormalizer offer a mix of replacements and canned rules
  in the desired order of execution.
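  A sketch of one possible shape (not the current GenericURLNormalizer API):
  canned rules and custom replacements share one interface so they can be
  interleaved in any order.

      // Hypothetical ordered pipeline; each step is either a canned rule
      // or a user-supplied regex replacement. Rules shown are illustrative.
      interface UrlNormalization {
          String apply(String url);
      }

      private final List<UrlNormalization> steps = List.of(
          url -> url.replaceFirst("/+$", ""),          // canned: trim trailing slash
          url -> url.replaceAll("[?&]utm_[^&]*", ""),  // replacement: drop tracking params
          url -> url.replace("http://", "https://")    // canned: secure scheme
      );

      String normalize(String url) {
          for (UrlNormalization step : steps) {
              url = step.apply(url);
          }
          return url;
      }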
- Performance:
  - Keep track of counts with in-memory counters instead of querying the
    store for each count (see sketch after this list). Also have maxXXX
    settings for different types instead of just "maxDocuments", which can
    be ambiguous.
  - Have options for tracking progress:
    - Do not track.
    - Track only the percentage.
    - Track detailed progress (what we have now).
    - Track full/verbose progress (adding counts for each state).
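  A sketch of the counters and tracking levels, with hypothetical names
  (uses java.util.concurrent ConcurrentHashMap/AtomicLong):

      // Tracking granularity matching the options above.
      enum TrackingLevel { NONE, PERCENT_ONLY, DETAILED, VERBOSE }

      // One in-memory counter per document state; no store queries needed.
      private final Map<String, AtomicLong> counts = new ConcurrentHashMap<>();

      void increment(String state) {
          counts.computeIfAbsent(state, k -> new AtomicLong()).incrementAndGet();
      }

      long count(String state) {
          AtomicLong c = counts.get(state);
          return c == null ? 0L : c.get();
      }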
- Have the Collector add these new default fields:
  - Collector start date
  - Crawler start date
  - Document fetch date
  - Collector ID
  - Crawler ID
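  A sketch of setting those fields, assuming a hypothetical metadata map
  on the document (field names are illustrative):

      // Stamp collector/crawler identity and timing on every document.
      Map<String, String> meta = doc.getMetadata();  // hypothetical accessor
      meta.put("collector.start-date", collectorStart.toString());
      meta.put("crawler.start-date", crawlerStart.toString());
      meta.put("document.fetch-date", Instant.now().toString());
      meta.put("collector.id", collectorId);
      meta.put("crawler.id", crawlerId);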
- Document that the data store database should be dedicated to a collector.
- Have a new command-line option for producing useful stats from the crawl
  store, such as the number of documents in each crawl state found in the
  store.
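  A sketch of the tally, assuming a hypothetical store iteration API:

      // Count documents per crawl state and print a small report.
      Map<String, Long> byState = new TreeMap<>();
      store.forEach(ref ->
          byState.merge(ref.getState(), 1L, Long::sum));  // hypothetical API
      byState.forEach((state, n) ->
          System.out.printf("%-12s %d%n", state, n));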
- Put back the previous data store tests that now apply to
  CrawlReferenceService.
- Re-introduce CommitCommand? Or is it no longer applicable?
- Rename CrawlReference* to the shorter CrawlObject (preferred) or CrawlItem.
- Create a MemoryDataStore for testing only.
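  A minimal sketch (the interface shown is illustrative, not the project's
  actual data store contract):

      // Map-backed store for unit tests; no persistence, no drivers.
      class MemoryDataStore<T> {
          private final Map<String, T> items = new ConcurrentHashMap<>();

          void save(String id, T item) { items.put(id, item); }
          Optional<T> find(String id)  { return Optional.ofNullable(items.get(id)); }
          boolean delete(String id)    { return items.remove(id) != null; }
          long size()                  { return items.size(); }
      }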
- Consider Lucene as a data store.
- Add the ability to have multiple crawlers talk to the same crawl store
  for managing their queue (maybe Kafka would be best?).
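  If Kafka were used, crawler instances could share one consumer group so
  queued references get balanced across them. A sketch with the standard
  Kafka Java client (topic/group names and processReference are
  illustrative):

      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("group.id", "crawl-queue");
      props.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
      props.put("value.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(List.of("crawl-references"));
          while (running) {
              for (ConsumerRecord<String, String> rec
                      : consumer.poll(Duration.ofSeconds(1))) {
                  processReference(rec.value());  // hypothetical handler
              }
          }
      }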
- AbstractCrawlerConfig.xsd has the anyComplexRequiredClassType "class"
  attribute as optional. See if we can make it required, except for
  self-closing tags.
- Similar to the above, maybe create a FileResource object and provide a
  way to "register" it when classes need to write files, labeling each
  file as "backup", "delete", or "keep" when the crawler starts/finishes,
  with that cleanup managed automatically by the crawler/collector. Also
  have a flag on that object indicating its scope, i.e., whether it can be
  shared between threads, crawlers, all, or something else (multiple
  collectors??).
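  A sketch of one possible FileResource shape (all names hypothetical;
  uses java.nio.file.Path):

      class FileResource {
          // What to do with the file when the crawler starts/finishes.
          enum Disposition { BACKUP, DELETE, KEEP }
          // How widely the resource may be shared.
          enum Scope { THREAD, CRAWLER, COLLECTOR, ALL }

          private final Path path;
          private final Disposition disposition;
          private final Scope scope;

          FileResource(Path path, Disposition disposition, Scope scope) {
              this.path = path;
              this.disposition = disposition;
              this.scope = scope;
          }
          Path getPath()               { return path; }
          Disposition getDisposition() { return disposition; }
          Scope getScope()             { return scope; }
      }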
- Rename RegexReferenceFilter to avoid confusion with the class of the same
  name in Importer.
- Allow the crawler to "expire" after a configurable delay if activeCount
  in AbstractCrawler#processNextReference is equal to or less than the
  number of threads and the crawler has been idle for too long.
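  A sketch of the expiry test described above (all names hypothetical):

      // True when no more work appears to be coming and we have been
      // idle beyond the configured delay.
      boolean shouldExpire(int activeCount, int threadCount,
              long idleSinceMillis, long maxIdleMillis) {
          return activeCount <= threadCount
                  && System.currentTimeMillis() - idleSinceMillis > maxIdleMillis;
      }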
- Refactor the whole approach of passing whether a document is new or
  modified, to simplify it.
- Introduce a full/incremental flag as part of the collector framework.
- Have the document default value be something other than NEW
  (e.g., UNKNOWN, UNPROCESSED, etc.).
- Consider using Hibernate for the JDBC data store, for both embedded and
  client-server databases. Ship with no drivers, except maybe for testing
  (or one for convenience, like H2).
- Consider a way to merge documents by temporarily storing mergeable docs
  in a queue until all mergeable siblings are encountered. Maybe this
  should be made a wrapping committer instead?
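  A sketch of the buffering idea, assuming hypothetical group/sibling
  accessors on the document:

      // Hold mergeable docs per group until all siblings have arrived,
      // then merge and commit them as one document.
      private final Map<String, List<Doc>> pending = new ConcurrentHashMap<>();

      void queue(Doc doc) {
          List<Doc> group = pending.computeIfAbsent(
                  doc.getMergeGroupId(), k -> new ArrayList<>());
          synchronized (group) {
              group.add(doc);
              if (group.size() == doc.getExpectedSiblingCount()) {
                  commit(merge(pending.remove(doc.getMergeGroupId())));
              }
          }
      }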