- index.py: Index UniProtKB XML files
  Tested with the Swiss-Prot dataset only (April 2021 release)
    ./nosqlbiosets/uniprot/index.py --help
    usage: index.py [-h] [--index INDEX] [--doctype DOCTYPE] [--host HOST]
                    [--port PORT] [--db DB]
                    infile

    Index UniProt xml files, with Elasticsearch or MongoDB

    positional arguments:
      infile             Input file name for UniProt Swiss-Prot compressed xml
                         dataset

    optional arguments:
      -h, --help         show this help message and exit
      --index INDEX      Name of the Elasticsearch index or MongoDB database
      --doctype DOCTYPE  Document type name for Elasticsearch, collection name
                         for MongoDB
      --host HOST        Elasticsearch or MongoDB server hostname
      --port PORT        Elasticsearch or MongoDB server port number
      --db DB            Database: 'Elasticsearch' or 'MongoDB'
- query.py: Query API, at an early stage of development
- ../../tests/test_uniprot_queries.py: Tests for the query API
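
The help output of index.py listed above could be produced by an argparse parser along the following lines. This is a minimal sketch reconstructed from the help text only, not the script's actual source: the function name `build_parser`, the `choices` restriction on `--db`, the integer type for `--port`, and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options shown in the index.py help text."""
    parser = argparse.ArgumentParser(
        prog="index.py",
        description="Index UniProt xml files, with Elasticsearch or MongoDB")
    parser.add_argument(
        "infile",
        help="Input file name for UniProt Swiss-Prot compressed xml dataset")
    parser.add_argument(
        "--index",
        help="Name of the Elasticsearch index or MongoDB database")
    parser.add_argument(
        "--doctype",
        help="Document type name for Elasticsearch,"
             " collection name for MongoDB")
    parser.add_argument(
        "--host", help="Elasticsearch or MongoDB server hostname")
    # Assumption: the port is parsed as an integer
    parser.add_argument(
        "--port", type=int,
        help="Elasticsearch or MongoDB server port number")
    # Assumption: only the two database names from the help text are accepted
    parser.add_argument(
        "--db", choices=["Elasticsearch", "MongoDB"],
        help="Database: 'Elasticsearch' or 'MongoDB'")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration
args = build_parser().parse_args(
    ["./data/uniprot_sprot.xml.gz", "--db", "MongoDB", "--index", "biosets"])
print(args.infile, args.db, args.index)
```

Options that are not given on the command line come back as `None` in this sketch; the real script fills such gaps from its server configuration file.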
Example command lines for downloading the uniprot_sprot.xml file and for indexing:
mkdir -p data
# ~760 MB (compressed), ~173.5 million lines, ~565,000 entries
wget -nc -P ./data ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/\
knowledgebase/complete/uniprot_sprot.xml.gz
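
Because the uncompressed file is very large, it is usually read as a stream rather than loaded whole. The sketch below shows one way to iterate over `<entry>` elements of a gzip-compressed UniProt XML file with the standard library; it is not the parsing approach used by index.py. The function name `iter_entries` and the two sample entries (with made-up accessions) are hypothetical, and the in-memory sample stands in for the downloaded file.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# XML namespace used by UniProt XML files
NS = "{http://uniprot.org/uniprot}"


def iter_entries(stream):
    """Yield (first accession, entry name) for each <entry> element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "entry":
            accession = elem.findtext(NS + "accession")
            name = elem.findtext(NS + "name")
            yield accession, name
            # Drop the children of the processed entry to limit memory use
            elem.clear()


# Tiny made-up sample standing in for uniprot_sprot.xml.gz
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>P00001</accession>
    <name>TEST1_HUMAN</name>
  </entry>
  <entry dataset="Swiss-Prot">
    <accession>P00002</accession>
    <accession>Q00002</accession>
    <name>TEST2_HUMAN</name>
  </entry>
</uniprot>"""

with gzip.open(io.BytesIO(gzip.compress(SAMPLE)), "rb") as stream:
    entries = list(iter_entries(stream))

print(entries)
```

For the real dataset, the `io.BytesIO(...)` argument would be replaced by the path of the downloaded file, for example `gzip.open("./data/uniprot_sprot.xml.gz", "rb")`.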
If you have not yet installed the nosqlbiosets project, see the Installation section of the readme.md file in the project's main folder.
Default server connection settings are read from ../../conf/dbservers.json
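
The interplay between configured defaults and command-line options can be pictured as follows. This is only an illustration: the layout of conf/dbservers.json is not described here, so the key names in `EXAMPLE_CONF` and the helper `resolve_connection` are hypothetical. The port numbers are the standard defaults of Elasticsearch (9200) and MongoDB (27017).

```python
import json

# Hypothetical layout; the real conf/dbservers.json may be organized differently
EXAMPLE_CONF = """
{
  "es_host": "localhost",
  "es_port": 9200,
  "mongodb_host": "localhost",
  "mongodb_port": 27017
}
"""


def resolve_connection(conf, db, host=None, port=None):
    """Return (host, port), preferring explicit values over configured ones."""
    prefix = "es" if db == "Elasticsearch" else "mongodb"
    return (host or conf[prefix + "_host"],
            port or conf[prefix + "_port"])


conf = json.loads(EXAMPLE_CONF)
# No --host/--port given: fall back to the configured defaults
print(resolve_connection(conf, "MongoDB"))
# Explicit values override the configuration
print(resolve_connection(conf, "Elasticsearch", host="example.org", port=9201))
```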
# Index with Elasticsearch, typically takes about 1 to 8 hours
./nosqlbiosets/uniprot/index.py ./data/uniprot_sprot.xml.gz \
 --host localhost --db Elasticsearch --index uniprot
# Index with MongoDB, typically takes about 1 to 2 hours
./nosqlbiosets/uniprot/index.py ./data/uniprot_sprot.xml.gz \
 --host localhost --db MongoDB --index biosets
- interpro.py: Index InterPro XML file [https://www.ebi.ac.uk/interpro/download/]
Elasticsearch, ~10 minutes
./nosqlbiosets/uniprot/interpro.py \
 ~/data/interpro/interpro.xml.gz \
 --esindex interpro \
 --dbtype Elasticsearch --recreateindex true \
 --host localhost
MongoDB, ~3 minutes
./nosqlbiosets/uniprot/interpro.py \
 ~/data/interpro/interpro.xml.gz \
 --dbtype MongoDB --recreateindex true \
 --mdbdb=biosets --mdbcollection interpro \
 --host localhost
This folder also includes an index script for PSI-MI TAB protein interaction data files
- index_mitab.py: Index PSI-MI TAB data files with Elasticsearch or MongoDB
- At an early stage of development; field names were chosen to be similar to the field names in the Molecular Interactions Query Language
- Tested with the HIPPIE (Human Integrated Protein-Protein Interaction rEference) database only
- https://wiki.thebiogrid.org/doku.php/psi_mitab_file
- http://psicquic.github.io/MITAB27Format.html
- https://wiki.reactome.org/index.php/PSI-MITAB_interactions
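
For orientation, a PSI-MI TAB 2.5 record is one tab-separated line with 15 columns, where `|` separates multiple values within a column and `-` marks an empty column. The sketch below splits one such line into a dictionary. It is not the code of index_mitab.py: the column keys in `MITAB25_COLUMNS` are illustrative names rather than the field names the script uses, the sample line is made up, and quoted values that themselves contain `|` are not handled.

```python
# Illustrative keys for the 15 columns of PSI-MI TAB 2.5, in file order
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_databases",
    "interaction_ids", "confidence_values",
]


def parse_mitab_line(line):
    """Split one MITAB line into a dict mapping column key to value list."""
    values = line.rstrip("\r\n").split("\t")
    record = {}
    for name, value in zip(MITAB25_COLUMNS, values):
        # "-" denotes an empty column; "|" separates multiple values
        record[name] = [] if value == "-" else value.split("|")
    return record


# Made-up example line with 15 tab-separated columns
SAMPLE_LINE = "\t".join([
    "uniprotkb:P00001", "uniprotkb:P00002", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "-", "pubmed:12345|pubmed:67890",
    "taxid:9606(human)", "taxid:9606(human)", "-", "-", "-", "score:0.5",
]) + "\n"

record = parse_mitab_line(SAMPLE_LINE)
print(record["id_a"], record["publication_ids"])
```

Files in later format versions (2.6, 2.7) carry additional columns; `zip` simply ignores them in this sketch.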
wget -P ./data http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/HIPPIE-current.mitab.txt
# Index with Elasticsearch
./nosqlbiosets/uniprot/index_mitab.py --infile ./data/HIPPIE-current.mitab.txt \
 --db Elasticsearch
# Index with MongoDB
./nosqlbiosets/uniprot/index_mitab.py --infile ./data/HIPPIE-current.mitab.txt \
 --db MongoDB
HIPPIE indexing takes ~8 minutes with MongoDB, ~2 minutes with Elasticsearch