turicas edited this page Jul 22, 2012 · 1 revision

The store implemented in pypln/stores/mongo.py can currently store corpora, documents and analyses. The specification is as follows:

Corpus

Given a collection to store corpora (default: corpora), each new corpus is added as a document in that collection with these fields:

  • name: type: str (UTF-8), corpus name.
  • slug: type: str (UTF-8), slug of the name (generated automatically by pypln.utils.slug).
  • description: type: str (UTF-8), short description of the corpus.
  • owner: an identifier for the owner of this corpus (what this identifier is should be specified by the user of this collection - pyplnweb).
  • date_created: type: datetime, date when corpus was created.
  • last_modified: type: datetime, equal to date_created when the corpus is created, then updated whenever the corpus changes (for instance, when a document is added or deleted).
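
The corpus fields above can be sketched as a plain dictionary builder. This is only an illustration under stated assumptions: `slugify` is a hypothetical stand-in for pypln.utils.slug (whose real behaviour may differ), and `make_corpus` and `touch_corpus` are names invented for this example, not part of the store.

```python
import re
import unicodedata
from datetime import datetime, timezone


def slugify(text):
    """Hypothetical stand-in for pypln.utils.slug; the real helper may differ."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def make_corpus(name, description, owner):
    """Build a corpus document with the fields listed in the specification."""
    now = datetime.now(timezone.utc)
    return {
        "name": name,
        "slug": slugify(name),
        "description": description,
        "owner": owner,           # meaning of this value is defined by the caller
        "date_created": now,
        "last_modified": now,     # equal to date_created at creation time
    }


def touch_corpus(corpus):
    """Refresh last_modified, e.g. after a document is added or deleted."""
    corpus["last_modified"] = datetime.now(timezone.utc)
    return corpus
```

The resulting dictionary could then be passed to the corpora collection's insert method; that step is left out here so the sketch does not depend on a running MongoDB server.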

Document

Documents are stored in 3 collections: 1 for meta-data (default: documents) and 2 for GridFS (defaults: files.files and files.chunks). This section covers only the meta-data collection and how it relates to GridFS (see the GridFS specification for more details on how it stores data).

The document meta-data collection should have these fields:

  • filename: type: str (UTF-8), original filename of uploaded file.
  • slug: type: str (UTF-8), slug of the filename (generated automatically by pypln.utils.slug).
  • owner: an identifier for the owner of this document (what this identifier is should be specified by the user of this collection - pyplnweb).
  • corpora: type: list of ObjectId, _ids of the corpora that contain this document.
  • date_created: type: datetime, date when document was created.

GridFS should store the document's contents, with the optional filename field set to the _id of that document's meta-data entry.
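
A minimal sketch of the two-step storage described above: insert the meta-data first, then store the contents in GridFS under the meta-data's _id. The function name `store_document` and the `slugify` helper are hypothetical. The sketch is duck-typed: it assumes `collection` offers `insert_one` (as a pymongo `Collection` does) and `fs` offers `put` (as `gridfs.GridFS(db, collection="files")` does). Converting the _id to a string for the GridFS filename is also an assumption; the specification only says the filename is the meta-data's _id.

```python
import re
from datetime import datetime, timezone


def slugify(text):
    # Hypothetical stand-in for pypln.utils.slug; the real helper may differ.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def store_document(collection, fs, filename, data, owner, corpora_ids):
    """Insert the meta-data, then store the contents in GridFS.

    `collection` is assumed to behave like a pymongo Collection and
    `fs` like a gridfs.GridFS instance; neither is imported here.
    """
    metadata = {
        "filename": filename,              # original name of the uploaded file
        "slug": slugify(filename),
        "owner": owner,
        "corpora": list(corpora_ids),      # _ids of corpora containing this document
        "date_created": datetime.now(timezone.utc),
    }
    doc_id = collection.insert_one(metadata).inserted_id
    # Assumption: the _id is stored as a string in the GridFS filename field.
    fs.put(data, filename=str(doc_id))
    return doc_id
```

Inserting the meta-data first means the _id is known before the GridFS write, at the cost of a possible orphaned meta-data entry if the second step fails.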

Analysis

Given a collection to store document and corpus analyses (default: analysis), each new analysis is added as a document in that collection with these fields:

  • name: type: str (UTF-8), name/type of analysis (examples: tokens, part-of-speech, tfidf etc.).
  • value: type: (ObjectId, int, float, str (UTF-8), list or dict), value returned by the worker.
  • document: type: ObjectId, _id of the document this analysis belongs to.
  • date_created: type: datetime, date when analysis was created.
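
The analysis fields above can be sketched as a small builder that also checks the type of `value` against the list in the specification. `make_analysis` is a name invented for this example, and rejecting other types with `TypeError` is an assumption, since the specification does not say how invalid values are handled. `ObjectId` is only included in the check when the `bson` package is installed.

```python
from datetime import datetime, timezone

try:
    from bson import ObjectId  # shipped with pymongo; optional for this sketch
    ALLOWED_VALUE_TYPES = (ObjectId, int, float, str, list, dict)
except ImportError:
    ALLOWED_VALUE_TYPES = (int, float, str, list, dict)


def make_analysis(name, value, document_id):
    """Build an analysis document with the fields listed in the specification."""
    if not isinstance(value, ALLOWED_VALUE_TYPES):
        # Assumption: unsupported value types are rejected up front.
        raise TypeError("unsupported analysis value type: %s" % type(value).__name__)
    return {
        "name": name,                  # e.g. "tokens", "part-of-speech", "tfidf"
        "value": value,                # whatever the worker returned
        "document": document_id,       # _id of the analysed document
        "date_created": datetime.now(timezone.utc),
    }
```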