FileStore
Stores are Catmandu packages to store Catmandu Items in a database. A FileStore is a Store where you can store binary content (unstructured data). Out of the box, one FileStore implementation is provided: File::Simple, which stores files in a directory structure on the local file system.
The command below stores /tmp/myfile.txt in the File::Simple FileStore, in the "container" 1234, with the file identifier myfile.txt:
$ catmandu stream /tmp/myfile.txt to File::Simple --root t/data --bag 1234 --id myfile.txt
The root parameter is mandatory for the File::Simple FileStore. It defines the location where all stored files are written. The other two parameters, bag and id, are mandatory for every FileStore (see below).
To extract a file from a FileStore, the stream command can be used in the opposite direction:
$ catmandu stream File::Simple --root t/data --bag 1234 --id myfile.txt to /tmp/myfile.txt
This extracts the file myfile.txt from the container with identifier 1234 in the File::Simple FileStore.
Every FileStore inherits the functionality of a Store. In this way the drop and delete commands can be used to delete data from a FileStore:
# Delete a "file"
$ catmandu delete File::Simple --root t/data --bag 1234 --id myfile.txt
# Delete a "folder"
$ catmandu drop File::Simple --root t/data --bag 1234
A FileStore contains one or more Bags. These Bags are containers (or "folders") to store zero or more files.
The name of a container, indicated with the bag option in the Catmandu commands, is an identifier. In the case of File::Simple this identifier needs to be a number or, when the uuid option is set, a UUID.
The binary data (files) stored in these Bags also need an identifier, indicated with the id option. Usually the file name is a good choice.
Both the bag and id options are required when uploading data to or streaming data from a FileStore.
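The identifier rule for File::Simple can be checked before catmandu is called. The following is a minimal sketch and not part of Catmandu: is_valid_bag_id is a hypothetical helper, and it assumes that "a number" means decimal digits only and that a UUID has the usual 8-4-4-4-12 hexadecimal form. The checks performed inside File::Simple itself may differ.

```shell
# Hypothetical helper (not part of Catmandu): pre-check a bag name.
# Usage: is_valid_bag_id NAME [number|uuid]
# Assumption: "number" = decimal digits only; "uuid" = 8-4-4-4-12 hex digits.
is_valid_bag_id() {
    bag_name=$1
    bag_mode=${2:-number}
    case $bag_mode in
        number) bag_pattern='^[0-9]+$' ;;
        uuid)   bag_pattern='^[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}$' ;;
        *)      return 2 ;;   # unknown mode
    esac
    # grep -Eq exits 0 only when the name matches the pattern.
    printf '%s\n' "$bag_name" | grep -Eq "$bag_pattern"
}

# Only call catmandu when the bag name passes the check.
if is_valid_bag_id 1234; then
    echo "1234 can be used as a bag name"
fi
```

With the uuid option set on the store, the same helper would be called as is_valid_bag_id 78F2BECE-5730-11E7-B547-1F9DBF8A3C05 uuid.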
Within a FileStore Bag there is no deeper hierarchy possible. A Bag contains a flat list of files. To store deeply nested folders and files, bundle them into an archive such as a ZIP file and import that archive as a single file.
$ zip -r /tmp/files.zip /mnt/data/files
$ catmandu stream /tmp/files.zip to File::Simple --root t/data --bag 1234 --id files.zip
Every FileStore has a default Bag called index, which contains a list of all available Bags in the store (like the listing of all folders). Using the export command, a listing of bags can be requested from the FileStore:
$ catmandu export File::Simple --root t/data to YAML
To retrieve a listing of all files stored in a bag, the bag option needs to be provided:
$ catmandu export File::Simple --root t/data --bag 1234 to YAML
Each Bag ("container") in a FileStore contains at least the _id as metadata. Some FileStores may contain more metadata. To retrieve a listing of all containers, use the export command on the FileStore:
$ catmandu export File::Simple --root t/data
[{"_id":"1234"},{"_id":"1235"},{"_id":"1236"}]
Every "file" in a FileStore contains at least the following fields:
- _id : the name of the file
- _stream : a callback function to download the contents of the file (pass it an IO::Handle)
- created : the creation date time of the file as a UNIX timestamp
- modified : the last modification date time of the file as a UNIX timestamp
- content_type : the content type of the file
- size : the file size in bytes
- md5 : an MD5 checksum if the FileStore supports it, or an empty string
NOTE: Not every exporter can serialise the code reference in the _stream field. For instance, when exporting to JSON this error message will show up:
$ catmandu export File::Simple --root t/data --bag 1234
Oops! encountered CODE(0x7f99685f4390), but JSON can only represent references to arrays or hashes at /Users/hochsten/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/Catmandu/Exporter/JSON.pm line 36.
This field can be omitted from the output using the remove_field fix:
$ catmandu export File::Simple --root t/data --bag 1234 --fix 'remove_field(_stream)'
[{"_id":"files.pdf","content_type":"application/pdf","modified":1498122646,"md5":"","size":883202,"created":1498122646}]
Always use the stream command in Catmandu to extract files from a FileStore:
$ catmandu stream File::Simple --root t/data --bag 1234 --id 'files.pdf' > output.pdf
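As a quick check that a stored file comes back unchanged, the two directions of the stream command can be combined into a round trip. The sketch below is only an illustration: filestore_roundtrip is a hypothetical helper, not a Catmandu command, and it assumes a File::Simple store and the "to <file>" form of the stream command shown earlier.

```shell
# Hypothetical helper (not part of Catmandu): upload a file, download it
# again, and compare the two copies byte by byte.
# Usage: filestore_roundtrip ROOT BAG ID FILE
filestore_roundtrip() {
    rt_root=$1
    rt_bag=$2
    rt_id=$3
    rt_file=$4
    rt_copy=$(mktemp) || return 1

    # Store the file, stream it back into a temporary file, then compare.
    catmandu stream "$rt_file" to File::Simple --root "$rt_root" --bag "$rt_bag" --id "$rt_id" &&
    catmandu stream File::Simple --root "$rt_root" --bag "$rt_bag" --id "$rt_id" to "$rt_copy" &&
    cmp -s "$rt_file" "$rt_copy"
    rt_status=$?   # 0 only if every step succeeded and the copies are identical

    rm -f "$rt_copy"
    return "$rt_status"
}
```

A call such as filestore_roundtrip t/data 1234 myfile.txt /tmp/myfile.txt returns a zero exit status only when both stream commands succeed and the downloaded copy matches the original.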
As for Stores, the configuration parameters for a FileStore can be written in a catmandu.yml configuration file. In this way the Catmandu commands can be shortened:
$ cat catmandu.yml
---
store:
  files:
    package: File::Simple
    options:
      root: t/data
# Get a "directory" listing
$ catmandu export files to YAML
# Get a "file" listing
$ catmandu export files --bag 1234 to YAML
# Add a file
$ catmandu stream /tmp/myfile.txt to files --bag 1234 --id myfile.txt
# Download a file
$ catmandu stream files --bag 1234 --id myfile.txt to /tmp/myfile.txt
FileStores usually contain only a limited amount of technical metadata, and metadata Stores can't be used to store unstructured data. Using the SideCar plugin, FileStores and metadata Stores can be combined and presented as one endpoint to store and retrieve both structured and unstructured data. As an example, the configuration below stores metadata in an ElasticSearch engine and files in a File::Simple store:
$ cat catmandu.yml
---
store:
  combined:
    package: ElasticSearch
    options:
      client: '1_0::Direct'
      index_name: catmandu
      bags:
        data:
          plugins:
            - SideCar
          sidecar:
            package: File::Simple
            options:
              root: /data/test123
              uuid: 1
          sidecar_bag: index
...
In the example above a SideCar has been added to the ElasticSearch store for dealing with unstructured data. The uuid parameter of the File::Simple store has been set to 1 because ElasticSearch requires UUID identifiers. Given the configuration above, structured metadata can be added to the store as usual:
$ cat test.yml
---
name: Mary
hobbies:
- Coding
- Sports
- Piano
...
$ catmandu import YAML to combined < test.yml
$ catmandu export combined to YAML
---
_id: 78F2BECE-5730-11E7-B547-1F9DBF8A3C05
hobbies:
- Coding
- Sports
- Piano
name: Mary
...
Files can be added to the record with identifier 78F2BECE-5730-11E7-B547-1F9DBF8A3C05 using the stream command:
$ catmandu stream PHYSICS/0901.0241v1.pdf to combined --bag 78F2BECE-5730-11E7-B547-1F9DBF8A3C05 --id 0901.0241v1.pdf