TODO:
==============
- Force the crawler to stop after a configurable amount of time without
  progress.
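  A minimal sketch, assuming a hypothetical lastProgress field updated by
  the processing loop (uses java.util.concurrent.atomic.AtomicLong):

      // Record a timestamp whenever a document makes progress, and have
      // a watchdog compare it against a configurable no-progress limit.
      private final AtomicLong lastProgress =
              new AtomicLong(System.currentTimeMillis());

      void onDocumentProcessed() {
          lastProgress.set(System.currentTimeMillis());
      }

      boolean isStalled(long maxNoProgressMillis) {
          return System.currentTimeMillis() - lastProgress.get()
                  > maxNoProgressMillis;
      }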
- Have the GenericURLNormalizer offer a mix of replacements and canned rules
  in the desired order of execution.
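  A sketch of one possible shape (not the current GenericURLNormalizer API):
  canned rules and custom replacements share one interface so they can be
  interleaved in any order.

      // Hypothetical ordered pipeline; each step is either a canned rule
      // or a user-supplied regex replacement. Rules shown are illustrative.
      interface UrlNormalization {
          String apply(String url);
      }

      private final List<UrlNormalization> steps = List.of(
          url -> url.replaceFirst("/+$", ""),          // canned: trim trailing slash
          url -> url.replaceAll("[?&]utm_[^&]*", ""),  // replacement: drop tracking params
          url -> url.replace("http://", "https://")    // canned: secure scheme
      );

      String normalize(String url) {
          for (UrlNormalization step : steps) {
              url = step.apply(url);
          }
          return url;
      }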
- Performance:
  - Keep track of counts with in-memory counters instead of querying the
    store for each count (see sketch after this list). Also have maxXXX
    settings for different types instead of just "maxDocuments", which can
    be ambiguous.
  - Have options for tracking progress:
    - Do not track.
    - Track only the percentage.
    - Track detailed progress (what we have now).
    - Track full/verbose progress (adding counts for each state).
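  A sketch of the counters and tracking levels, with hypothetical names
  (uses java.util.concurrent ConcurrentHashMap/AtomicLong):

      // Tracking granularity matching the options above.
      enum TrackingLevel { NONE, PERCENT_ONLY, DETAILED, VERBOSE }

      // One in-memory counter per document state; no store queries needed.
      private final Map<String, AtomicLong> counts = new ConcurrentHashMap<>();

      void increment(String state) {
          counts.computeIfAbsent(state, k -> new AtomicLong()).incrementAndGet();
      }

      long count(String state) {
          AtomicLong c = counts.get(state);
          return c == null ? 0L : c.get();
      }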
- Have the Collector add these new default fields:
  - Collector start date
  - Crawler start date
  - Document fetch date
  - Collector ID
  - Crawler ID
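  A sketch of setting those fields, assuming a hypothetical metadata map
  on the document (field names are illustrative):

      // Stamp collector/crawler identity and timing on every document.
      Map<String, String> meta = doc.getMetadata();  // hypothetical accessor
      meta.put("collector.start-date", collectorStart.toString());
      meta.put("crawler.start-date", crawlerStart.toString());
      meta.put("document.fetch-date", Instant.now().toString());
      meta.put("collector.id", collectorId);
      meta.put("crawler.id", crawlerId);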
- Document that the data store database should be dedicated to a collector.
- Have a new command-line option for producing useful stats from the crawl
  store, such as the number of documents in each crawl state found in the
  store.
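  A sketch of the tally, assuming a hypothetical store iteration API:

      // Count documents per crawl state and print a small report.
      Map<String, Long> byState = new TreeMap<>();
      store.forEach(ref ->
          byState.merge(ref.getState(), 1L, Long::sum));  // hypothetical API
      byState.forEach((state, n) ->
          System.out.printf("%-12s %d%n", state, n));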
- Put back the previous data store tests that now apply to
  CrawlReferenceService.
- Re-introduce CommitCommand? Or is it no longer applicable?
- Rename CrawlReference* to the shorter CrawlObject (preferred) or CrawlItem.
- Create a MemoryDataStore for testing only.
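  A minimal sketch (the interface shown is illustrative, not the project's
  actual data store contract):

      // Map-backed store for unit tests; no persistence, no drivers.
      class MemoryDataStore<T> {
          private final Map<String, T> items = new ConcurrentHashMap<>();

          void save(String id, T item) { items.put(id, item); }
          Optional<T> find(String id)  { return Optional.ofNullable(items.get(id)); }
          boolean delete(String id)    { return items.remove(id) != null; }
          long size()                  { return items.size(); }
      }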
- Consider Lucene as a data store.
- Add the ability to have multiple crawlers talk to the same crawl store
  for managing their queue (maybe Kafka would be best?).
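  If Kafka were used, crawler instances could share one consumer group so
  queued references get balanced across them. A sketch with the standard
  Kafka Java client (topic/group names and processReference are
  illustrative):

      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("group.id", "crawl-queue");
      props.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
      props.put("value.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(List.of("crawl-references"));
          while (running) {
              for (ConsumerRecord<String, String> rec
                      : consumer.poll(Duration.ofSeconds(1))) {
                  processReference(rec.value());  // hypothetical handler
              }
          }
      }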
- AbstractCrawlerConfig.xsd has the anyComplexRequiredClassType "class"
  attribute as optional. See if we can make it required, except for
  self-closing tags.
- Similar to the above, maybe create a FileResource object and provide a
  way to "register" it when classes need to write files, labeling each
  file as "backup", "delete", or "keep" when the crawler starts/finishes,
  with that cleanup managed automatically by the crawler/collector. Also
  have a flag on that object indicating its scope, i.e., whether it can be
  shared between threads, crawlers, all, or something else (multiple
  collectors??).
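  A sketch of one possible FileResource shape (all names hypothetical;
  uses java.nio.file.Path):

      class FileResource {
          // What to do with the file when the crawler starts/finishes.
          enum Disposition { BACKUP, DELETE, KEEP }
          // How widely the resource may be shared.
          enum Scope { THREAD, CRAWLER, COLLECTOR, ALL }

          private final Path path;
          private final Disposition disposition;
          private final Scope scope;

          FileResource(Path path, Disposition disposition, Scope scope) {
              this.path = path;
              this.disposition = disposition;
              this.scope = scope;
          }
          Path getPath()               { return path; }
          Disposition getDisposition() { return disposition; }
          Scope getScope()             { return scope; }
      }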
- Rename RegexReferenceFilter to avoid confusion with the class of the same
  name in Importer.
- Allow the crawler to "expire" after a configurable delay if activeCount
  in AbstractCrawler#processNextReference is equal to or less than the
  number of threads and the crawler has been idle for too long.
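  A sketch of the expiry test described above (all names hypothetical):

      // True when no more work appears to be coming and we have been
      // idle beyond the configured delay.
      boolean shouldExpire(int activeCount, int threadCount,
              long idleSinceMillis, long maxIdleMillis) {
          return activeCount <= threadCount
                  && System.currentTimeMillis() - idleSinceMillis > maxIdleMillis;
      }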
- Refactor the whole approach of passing whether a document is new or
  modified, to simplify it.
- Introduce a full/incremental flag as part of the collector framework.
- Have the document default value be something other than NEW
  (e.g., UNKNOWN, UNPROCESSED, etc.).
- Consider using Hibernate for the JDBC data store, for both embedded and
  client-server databases. Ship with no drivers, except maybe for testing
  (or one for convenience, like H2).
- Consider a way to merge documents by temporarily storing mergeable docs
  in a queue until all mergeable siblings are encountered. Maybe this
  should be made a wrapping committer instead?
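  A sketch of the buffering idea, assuming hypothetical group/sibling
  accessors on the document:

      // Hold mergeable docs per group until all siblings have arrived,
      // then merge and commit them as one document.
      private final Map<String, List<Doc>> pending = new ConcurrentHashMap<>();

      void queue(Doc doc) {
          List<Doc> group = pending.computeIfAbsent(
                  doc.getMergeGroupId(), k -> new ArrayList<>());
          synchronized (group) {
              group.add(doc);
              if (group.size() == doc.getExpectedSiblingCount()) {
                  commit(merge(pending.remove(doc.getMergeGroupId())));
              }
          }
      }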