MongoDB store

The store implemented in pypln/stores/mongo.py can currently store corpora, documents and analyses. The specification is as follows:

Corpus

Given a collection that stores corpora (default: corpora), each new corpus is added to the collection as a document with these fields:

name: type: str (UTF-8), corpus name.
slug: type: str (UTF-8), slug of the name (generated automatically by pypln.utils.slug).
description: type: str (UTF-8), short description of the corpus.
owner: identifier of the owner of this corpus (what this identifier is should be defined by the user of this collection, pyplnweb).
date_created: type: datetime, date when corpus was created.
last_modified: type: datetime, equal to date_created when the corpus is created, and updated whenever the corpus changes (for instance, when a document is added or deleted).
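
As an illustration of the fields above, here is a minimal sketch of how a corpus document could be assembled before insertion. The names `simple_slug` and `build_corpus_document` are hypothetical and not part of pypln; in particular, `simple_slug` only stands in for pypln.utils.slug, whose exact behaviour is not described here.

```python
import re
from datetime import datetime, timezone


def simple_slug(text):
    """Hypothetical stand-in for pypln.utils.slug (an assumption, not the real helper)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_corpus_document(name, description, owner, now=None):
    """Build a dict with the corpus fields listed in this specification."""
    now = now or datetime.now(timezone.utc)
    return {
        "name": name,
        "slug": simple_slug(name),
        "description": description,
        "owner": owner,            # opaque identifier chosen by the caller
        "date_created": now,
        "last_modified": now,      # equal to date_created at creation time
    }


corpus = build_corpus_document("My First Corpus", "Sample texts", owner="user-1")
```

With pymongo, a dict like this could then be passed to `insert_one` on the corpora collection; the collection handle itself is left out of the sketch so that it runs without a database.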

Document

Documents are stored in 3 collections: 1 for meta-data (default: documents) and 2 for GridFS (defaults: files.files and files.chunks). This section only covers the meta-data collection and how it relates to "GridFS" (see the GridFS specification for more details on how it stores data).

The document meta-data collection should have these fields:

filename: type: str (UTF-8), original filename of uploaded file.
slug: type: str (UTF-8), slug of the filename (generated automatically by pypln.utils.slug).
owner: identifier of the owner of this document (what this identifier is should be defined by the user of this collection, pyplnweb).
corpora: type: list of ObjectId, _ids of the corpora that contain this document.
date_created: type: datetime, date when document was created.

"GridFS" should store the document's contents, with the optional GridFS field filename set to the _id of that document's meta-data entry.
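
The two-step storage described above (meta-data first, then contents in GridFS keyed by the meta-data _id) can be sketched as follows. This is only an illustration under stated assumptions: `store_document` and `simple_slug` are hypothetical names, `documents` is expected to behave like a pymongo collection (`insert_one`), and `fs` like a `gridfs.GridFS` instance (`put`). The _id is passed to GridFS unchanged; whether it should be converted to a string is not specified here.

```python
import re
from datetime import datetime, timezone


def simple_slug(text):
    """Hypothetical stand-in for pypln.utils.slug (an assumption, not the real helper)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_document_metadata(filename, owner, corpora_ids, now=None):
    """Build a dict with the document meta-data fields listed in this specification."""
    return {
        "filename": filename,
        "slug": simple_slug(filename),
        "owner": owner,
        "corpora": list(corpora_ids),   # _ids of the corpora containing this document
        "date_created": now or datetime.now(timezone.utc),
    }


def store_document(documents, fs, data, metadata):
    """Insert the meta-data, then store the contents in GridFS under its _id.

    `documents` needs an insert_one() method and `fs` a put() method,
    as pymongo collections and gridfs.GridFS objects provide.
    """
    meta_id = documents.insert_one(metadata).inserted_id
    fs.put(data, filename=meta_id)      # GridFS filename = meta-data _id
    return meta_id
```

With pymongo, `fs` could be created as `gridfs.GridFS(db, collection="files")`, which uses the files.files and files.chunks collections mentioned above, and `documents` would be `db["documents"]`.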

Analysis

Given a collection that stores document and corpus analyses (default: analysis), each new analysis is added to the collection as a document with these fields:

name: type: str (UTF-8), name/type of analysis (examples: tokens, part-of-speech, tfidf etc.)
value: type: (ObjectId, int, float, str (UTF-8), list or dict), value returned by the worker.
document: type: ObjectId, _id of the document this analysis belongs to.
date_created: type: datetime, date when analysis was created.
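
For illustration, an analysis entry with the fields above could be assembled like this. `build_analysis` and `analysis_filter` are hypothetical helper names, and the document _id is shown as a plain placeholder value rather than a real ObjectId so that the sketch runs without pymongo.

```python
from datetime import datetime, timezone


def build_analysis(name, value, document_id, now=None):
    """Build a dict with the analysis fields listed in this specification."""
    return {
        "name": name,                 # e.g. "tokens", "part-of-speech", "tfidf"
        "value": value,               # whatever the worker returned
        "document": document_id,      # _id of the analysed document
        "date_created": now or datetime.now(timezone.utc),
    }


def analysis_filter(document_id, name):
    """Query filter selecting one kind of analysis for one document."""
    return {"document": document_id, "name": name}


entry = build_analysis("tokens", ["The", "store", "works"], document_id="doc-id-placeholder")
query = analysis_filter("doc-id-placeholder", "tokens")
```

With pymongo, `entry` could be passed to `insert_one` and `query` to `find` on the analysis collection.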
