MongoDB store
The store implemented in `pypln/stores/mongo.py` can currently store corpora, documents, and analyses. The specification is as follows.

Given a collection to store corpora (default: `corpora`), each new corpus is added as a document in that collection with these fields:
- `name`: type `str` (UTF-8), the corpus name.
- `slug`: type `str` (UTF-8), slug of the name (generated automatically by `pypln.utils.slug`).
- `description`: type `str` (UTF-8), a short description of the corpus.
- `owner`: an identifier for the owner of this corpus (what this identifier is should be specified by the user of this collection, i.e. pyplnweb).
- `date_created`: type `datetime`, the date the corpus was created.
- `last_modified`: type `datetime`; equal to `date_created` when the corpus is created, and updated whenever something happens to the corpus (for instance, a document is added or deleted).
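
As an illustration of the corpus fields above, here is a minimal sketch that builds such a document as a plain `dict`. The `slugify` helper is a hypothetical stand-in for `pypln.utils.slug` (whose exact output may differ), and `new_corpus` and the `"user-42"` owner value are names invented for this example.

```python
import re
import unicodedata
from datetime import datetime, timezone


def slugify(text):
    """Hypothetical stand-in for pypln.utils.slug; the real one may differ."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def new_corpus(name, description, owner):
    """Build a corpus document with the fields listed in the specification."""
    now = datetime.now(timezone.utc)
    return {
        "name": name,
        "slug": slugify(name),
        "description": description,
        "owner": owner,
        "date_created": now,
        # Equal to date_created at creation time; updated on later changes.
        "last_modified": now,
    }


corpus = new_corpus("My First Corpus", "Sample texts", owner="user-42")
```

With a pymongo collection at hand, the resulting `dict` could then be passed to `insert_one`; updating `last_modified` on later changes is left to the caller.
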
Documents are stored in three collections: one for meta-data (default: `documents`) and two for GridFS (defaults: `files.files` and `files.chunks`). This page covers only the meta-data collection and GridFS as a whole (see the GridFS specification for more details on how it stores data).
The document meta-data collection should have these fields:
- `filename`: type `str` (UTF-8), original filename of the uploaded file.
- `slug`: type `str` (UTF-8), slug of the filename (generated automatically by `pypln.utils.slug`).
- `owner`: an identifier for the owner of this document (what this identifier is should be specified by the user of this collection, i.e. pyplnweb).
- `corpora`: type `list` of `ObjectId`, the `_id`s of the corpora that contain this document.
- `date_created`: type `datetime`, the date the document was created.
GridFS should store the document's contents, with the optional `filename` field set to the `_id` of that document's meta-data entry.
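
The link between the meta-data collection and GridFS could be sketched as below. This is only an illustration under stated assumptions: `store_document` is a hypothetical helper, `collection` is assumed to behave like a pymongo `Collection` (providing `insert_one`), `fs` like a `gridfs.GridFS` instance (providing `put`), and converting the `_id` to a string for the GridFS `filename` is a choice made for this example, not something the specification states.

```python
from datetime import datetime, timezone


def store_document(collection, fs, filename, slug, owner, corpora, contents):
    """Insert the meta-data first, then store the contents in GridFS.

    `collection` is assumed to behave like a pymongo Collection and `fs`
    like a gridfs.GridFS instance; `slug` is expected to come from
    pypln.utils.slug.
    """
    meta = {
        "filename": filename,
        "slug": slug,
        "owner": owner,
        "corpora": list(corpora),  # _ids of the corpora containing the document
        "date_created": datetime.now(timezone.utc),
    }
    meta_id = collection.insert_one(meta).inserted_id
    # Assumption: the meta-data _id is stored as a string in the GridFS
    # `filename` field, so the file can be looked up from the meta-data.
    fs.put(contents, filename=str(meta_id))
    return meta_id
```

Inserting the meta-data before the file means the `_id` is known by the time the GridFS entry is written; the trade-off is that a failed `put` leaves a meta-data entry without contents, which a caller may want to clean up.
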
Given a collection to store document and corpus analyses (default: `analysis`), each new analysis is added as a document in that collection with these fields:
- `name`: type `str` (UTF-8), name/type of the analysis (examples: `tokens`, `part-of-speech`, `tfidf` etc.).
- `value`: type `ObjectId`, `int`, `float`, `str` (UTF-8), `list` or `dict`; the value returned by the worker.
- `document`: type `ObjectId`, the `_id` of the document this analysis belongs to.
- `date_created`: type `datetime`, the date the analysis was created.
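
For illustration, an analysis entry with the fields above might be built as follows. `new_analysis` is a hypothetical helper, and the `"doc-id"` placeholder stands in for a real `ObjectId` taken from the meta-data collection.

```python
from datetime import datetime, timezone


def new_analysis(name, value, document_id):
    """Build an analysis document with the fields listed in the specification."""
    return {
        "name": name,             # e.g. "tokens", "part-of-speech", "tfidf"
        "value": value,           # whatever the worker returned
        "document": document_id,  # _id of the analysed document
        "date_created": datetime.now(timezone.utc),
    }


# "doc-id" is a placeholder; in practice this would be an ObjectId.
analysis = new_analysis("tokens", ["a", "short", "text"], "doc-id")
```
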