Open source software packages to parse files in various formats from the Protein Data Bank (PDB) and manipulate protein structures exist in many languages, often as part of Bio* projects.
This repository aims to collate benchmarks for common tasks across various languages and packages. The collection of scripts may also be useful to get an idea of how each package works.
Please feel free to contribute scripts from other packages, or submit improvements to the scripts already present. I'm looking for the fastest implementation for each package that makes use of the provided API.
Disclosure: I contributed the BioStructures.jl package to BioJulia and have made contributions to Biopython.
- Parsing 2 PDB entries (1CRN and 1HTQ), taken from the benchmarking in [1], in the PDB, mmCIF and MMTF formats.
- Counting the number of alanine residues in adenylate kinase (1AKE).
- Calculating the distance between residues 50 and 60 of chain A in adenylate kinase (1AKE).
- Calculating the Ramachandran phi/psi angles in adenylate kinase (1AKE).
[1] Gajda MJ. hPDB - Haskell library for processing atomic biomolecular structures in protein data bank format. BMC Research Notes 6:483 (2013)
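As a rough illustration of what the three analysis tasks above involve, here is a minimal pure-Python sketch. It is not one of the benchmark scripts and uses none of the benchmarked packages. The helper names (`atom_line`, `parse_atoms`, `count_residues`, `residue_distance`, `dihedral`) are hypothetical, the coordinates are synthetic rather than taken from 1AKE, and the residue distance is assumed to mean the minimum inter-atomic distance, which may differ from what a given package computes.

```python
import math
from collections import namedtuple

Atom = namedtuple("Atom", "name resname chain resseq coord")


def atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Format one ATOM record in the fixed-column PDB layout (toy helper)."""
    return (f"ATOM  {serial:5d}  {name:<3} {resname:>3} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}  1.00  0.00")


def parse_atoms(pdb_text):
    """Read ATOM records by column position; other record types are skipped."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            coord = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append(Atom(line[12:16].strip(), line[17:20].strip(),
                              line[21], int(line[22:26]), coord))
    return atoms


def count_residues(atoms, resname):
    """Count distinct (chain, residue number) pairs with the given residue name."""
    return len({(a.chain, a.resseq) for a in atoms if a.resname == resname})


def residue_distance(atoms, chain, res_a, res_b):
    """Minimum inter-atomic distance between two residues (assumed definition)."""
    first = [a.coord for a in atoms if a.chain == chain and a.resseq == res_a]
    second = [a.coord for a in atoms if a.chain == chain and a.resseq == res_b]
    return min(math.sqrt(sum((p - q) ** 2 for p, q in zip(c1, c2)))
               for c1 in first for c2 in second)


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    def sub(a, b):
        return tuple(i - j for i, j in zip(a, b))

    def dot(a, b):
        return sum(i * j for i, j in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0, b1, b2 = sub(p0, p1), sub(p2, p1), sub(p3, p2)
    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(i / norm for i in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond
    v = sub(b0, tuple(dot(b0, b1) * i for i in b1))
    w = sub(b2, tuple(dot(b2, b1) * i for i in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))


# Synthetic coordinates for illustration only, not taken from 1AKE
toy_pdb = "\n".join([
    atom_line(1, "N", "ALA", "A", 50, 0.0, 0.0, 0.0),
    atom_line(2, "CA", "ALA", "A", 50, 1.5, 0.0, 0.0),
    atom_line(3, "N", "GLY", "A", 60, 4.5, 4.0, 0.0),
    atom_line(4, "CA", "GLY", "A", 60, 6.0, 4.0, 0.0),
    atom_line(5, "CA", "ALA", "A", 61, 10.0, 0.0, 0.0),
])
atoms = parse_atoms(toy_pdb)
print(count_residues(atoms, "ALA"))
print(residue_distance(atoms, "A", 50, 60))
print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
```

For a Ramachandran calculation, `dihedral` would be applied to backbone atoms: phi uses C(i-1), N(i), CA(i), C(i) and psi uses N(i), CA(i), C(i), N(i+1). The sketch ignores insertion codes, alternate locations and multiple models, all of which the real packages have to handle.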
The PDB files can be downloaded to the `data` directory by running `julia tools/download_data.jl` from this directory. If you have all the software installed, and compiled where applicable, you can run `sh tools/run_benchmarks.sh` from this directory to run the benchmarks and store the results in `benchmarks.csv`. The mean over a number of runs is taken for each benchmark to obtain the values below.
Benchmarks were carried out on an Intel Xeon CPU E5-1620 v3 3.50GHz x 8 processor with 32 GB 2400 MHz DDR4 RAM. The operating system was CentOS v8.1. Time is the elapsed time.
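The idea of reporting the mean elapsed time over several runs can be sketched as follows. This is a minimal illustration, not the repository's benchmarking code: the `benchmark` function, its `runs` and `warmup` parameters, and the choice to discard warm-up calls are all assumptions made for the example.

```python
import time
from statistics import mean


def benchmark(func, runs=10, warmup=1):
    """Return the mean elapsed (wall-clock) time in seconds over `runs` calls."""
    # Warm-up calls are made but not timed (an assumption of this sketch)
    for _ in range(warmup):
        func()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return mean(times)


# Example workload standing in for a parsing call
elapsed = benchmark(lambda: sum(range(100_000)), runs=5)
print(f"mean elapsed time: {elapsed * 1000:.3f} ms")
```

Discarding the first call matters most for languages with JIT compilation or heavy import costs, where the first run can be much slower than later ones.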
Currently, 16 packages across 7 programming languages are included in the benchmarks:
- BioStructures v0.10.1 running on Julia v1.3.1; times measured after JIT compilation.
- MIToS v2.4.0 running on Julia v1.3.1; times measured after JIT compilation.
- Biopython v1.76 running on Python v3.7.6.
- ProDy v1.10.11 running on Python v3.7.6.
- MDAnalysis v0.20.1 running on Python v3.7.6.
- biotite v0.20.1 running on Python v3.7.6.
- atomium v1.0.2 running on Python v3.7.6.
- Bio3D v2.4.1 running on R v3.6.2.
- Rpdb v2.3 running on R v3.6.2.
- BioJava v5.3.0 running on Java v1.8.0.
- BioPerl v1.007002 running on Perl v5.26.3.
- BioRuby v2.0.1 running on Ruby v2.5.5.
- GEMMI v0.3.6 compiled with gcc v8.3.1; there is also a Python interface but benchmarking was done in C++.
- Victor v1.0 compiled with gcc v7.3.1.
- ESBTL v1.0-beta01 compiled with gcc v7.3.1.
- chemfiles v0.9.3 compiled with gcc v7.3.0 (C++ version) or running on Python v3.7.6 (Python version).
Note that direct comparison between these times should be treated with caution, as each package does something slightly different. For example, things that increase parsing time include:
- Parsing the header information.
- Accounting for disorder at both the atom and residue (point mutation) level.
- Forming a hierarchical model of the protein that makes access to specific residues, atoms etc. easier and faster after parsing.
- Allowing models in a file to have different atoms present.
- Checking that the file format is adhered to at various levels of strictness.
Each package supports these to varying degrees.
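To make two of these points concrete, the sketch below builds a nested chain/residue/atom model from flat records and keeps every alternate location of a disordered atom. It is an illustration under stated assumptions, not how any of the listed packages is implemented: the record layout, the `build_hierarchy` and `preferred_altloc` names, and the rule of preferring the highest-occupancy alternate location are all hypothetical choices for the example.

```python
# Hypothetical flat records:
# (chain, residue number, residue name, atom name, alt loc ID, occupancy)
records = [
    ("A", 1, "THR", "N", "", 1.0),
    ("A", 1, "THR", "CA", "", 1.0),
    ("A", 2, "SER", "CA", "A", 0.6),
    ("A", 2, "SER", "CA", "B", 0.4),
    ("B", 1, "THR", "CA", "", 1.0),
]


def build_hierarchy(records):
    """Nest records as chain -> (resseq, resname) -> atom name -> alt loc."""
    model = {}
    for chain, resseq, resname, atom, altloc, occupancy in records:
        residue = model.setdefault(chain, {}).setdefault((resseq, resname), {})
        # Every alternate location is stored rather than overwritten
        residue.setdefault(atom, {})[altloc] = occupancy
    return model


def preferred_altloc(model, chain, res_key, atom):
    """Pick the alt loc ID with the highest occupancy (one possible convention)."""
    locations = model[chain][res_key][atom]
    return max(locations, key=locations.get)


model = build_hierarchy(records)
print(sorted(model))                                    # chain IDs
print(list(model["A"]))                                 # residues in chain A
print(preferred_altloc(model, "A", (2, "SER"), "CA"))   # chosen alt loc ID
```

Building the nested structure costs extra work per atom at parse time, but afterwards a specific residue or atom is a dictionary lookup rather than a scan over all atoms.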
| | BioStructures | MIToS | Biopython | ProDy | MDAnalysis | biotite | atomium | Bio3D | Rpdb | BioJava | BioPerl | BioRuby | GEMMI | Victor | ESBTL | chemfiles-python | chemfiles-cxx |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parse PDB 1CRN / ms | 0.75 | 0.63 | 7.3 | 3.1 | 4.2 | 4.4 | 7.0 | 10.0 | 9.5 | 8.1 | 43.0 | 21.0 | 0.24 | 7.6 | 2.4 | 4.5 | 0.67 |
Parse PDB 1HTQ / s | 2.6 | 2.8 | 16.0 | 2.1 | 1.5 | 4.8 | 20.0 | 2.9 | 14.0 | 1.3 | 49.0 | 13.0 | 0.36 | 11.0 | - | - | - |
Parse mmCIF 1CRN / ms | 2.0 | - | 16.0 | - | - | 4.8 | 13.0 | - | - | 40.0 | - | - | 0.97 | - | - | 3.8 | 0.99 |
Parse mmCIF 1HTQ / s | 8.0 | - | 45.0 | - | - | 9.0 | 36.0 | - | - | 17.0 | - | - | 1.5 | - | - | 2.0 | 2.0 |
Parse MMTF 1CRN / ms | 1.1 | - | 4.5 | - | - | 1.2 | 4.6 | - | - | 4.1 | - | - | - | - | - | 3.2 | 0.44 |
Parse MMTF 1HTQ / s | 3.6 | - | 16.0 | - | - | 0.16 | 43.0 | - | - | 0.74 | - | - | - | - | - | - | - |
Count / ms | 0.17 | 0.017 | 0.21 | 8.8 | 0.068 | - | - | 0.16 | 0.2 | - | 0.42 | 0.073 | 0.004 | - | - | 0.75 | 0.092 |
Distance / ms | 0.012 | 0.0044 | 0.25 | 50.0 | 0.62 | - | - | 19.0 | 1.3 | - | 0.53 | 0.32 | 0.001 | - | - | 0.55 | 0.19 |
Ramachandran / ms | 1.4 | - | 120.0 | 210.0 | 1200.0 | - | - | - | - | - | - | - | - | - | - | 7.4 | 2.1 |
Language | Julia | Julia | Python | Python | Python | Python | Python | R | R | Java | Perl | Ruby | C++/Python | C++ | C++ | Python | C++ |
License | MIT | MIT | Biopython | MIT | GPLv2 | BSD 3-Clause | MIT | GPLv2 | GPLv2/GPLv3 | LGPLv2.1 | GPL/Artistic | Ruby | MPLv2/LGPLv3 | GPLv3 | GPLv3 | BSD 3-Clause | BSD 3-Clause |
Hierarchical parsing | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
Supports disorder | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ |
Writes PDBs | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
Parses PDB header | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
Superimposition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
PCA | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
Benchmarks as a plot, sorted by increasing time to parse PDB 1CRN:
It is instructive to run parsers over the whole PDB to see where errors arise. This approach has led me to submit corrections for small mistakes (e.g. duplicate atoms, residue number errors) in a few PDB structures. As of July 2018, the PDB entries that produce errors with the Biopython (permissive mode) and BioJulia parsers are:
- 4UDF - mmCIF file errors in Biopython and BioJulia due to duplicate C and O atoms in Lys91 of chains B, F etc.
- 1EJG - mmCIF file errors in Biopython due to blank and non-blank alt loc IDs at residue Pro22/Ser22.
- 5O61 - mmCIF file errors in Biopython due to an incorrect residue number at line 165,223.
Running Biopython in non-permissive mode picks up more potential problems such as broken chains and mixed blank/non-blank alt loc IDs. For further discussion on errors in PDB files see the Biopython documentation. The scripts to reproduce the whole PDB checking can be found in `checkwholepdb`. There is also a script to check recent PDB changes that can be run as a cron job.
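The two kinds of problem mentioned above, duplicate atoms and mixed blank/non-blank alt loc IDs, can be detected with simple bookkeeping. The following is a minimal sketch, not the `checkwholepdb` scripts and not the checks Biopython or BioJulia actually perform. The record layout and the `find_problems` function are hypothetical, and the example records are made up rather than copied from the entries listed.

```python
from collections import defaultdict


def find_problems(records):
    """Report duplicate atoms and atoms mixing blank and non-blank alt loc IDs.

    Each record is (chain, residue number, residue name, atom name, alt loc ID).
    """
    seen = set()
    altlocs = defaultdict(set)
    problems = []
    for chain, resseq, resname, atom, altloc in records:
        key = (chain, resseq, resname, atom)
        if (key, altloc) in seen:
            problems.append(("duplicate atom", key))
        seen.add((key, altloc))
        altlocs[key].add(altloc)
    for key, locations in altlocs.items():
        if "" in locations and len(locations) > 1:
            problems.append(("mixed blank/non-blank alt loc", key))
    return problems


# Made-up records for illustration only
example = [
    ("B", 91, "LYS", "C", ""),
    ("B", 91, "LYS", "O", ""),
    ("B", 91, "LYS", "C", ""),    # same atom listed twice
    ("A", 22, "PRO", "CA", ""),
    ("A", 22, "PRO", "CA", "A"),  # blank and non-blank alt loc for one atom
]
for problem in find_problems(example):
    print(problem)
```

A permissive parser might warn about such cases and continue, whereas a strict one would raise an error; which behaviour is appropriate depends on the use case.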
- For most purposes, particularly work on small numbers of files, the speed of the programs will not hold you back. In this case use the language/package you are most familiar with.
- For fast parsing, use a binary format such as MMTF or binaryCIF.
- Whilst mmCIF became the standard PDB archive format in 2014, and is a very flexible archive format, that does not mean that it is the best choice for all of bioinformatics. mmCIF files take up a lot of space on disk, are the slowest to read and do not yet work with many bioinformatics tools.
- If you are analysing ensembles of proteins then use packages with that functionality, such as ProDy or Bio3D, rather than writing the code yourself.
If you use these benchmarks, please cite the BioStructures.jl paper where they appear:
Greener JG, Selvaraj J and Ward BJ. BioStructures.jl: read, write and manipulate macromolecular structures in Julia. Bioinformatics 36(14):4206-4207 (2020)
If you want to contribute benchmarks for a package, please make a pull request with the script(s) in a directory like the other packages. I will then rerun the benchmarks and update the README. Thanks.
- Information on file formats for PDB, mmCIF and MMTF.
- Benchmarks for mmCIF parsing can be found here.
- A list of PDB parsing packages, particularly in C/C++, can be found here.
- The Biopython documentation has a useful discussion on disorder at the atom and residue level.
- Sets of utility scripts exist including pdbtools, pdb-tools and PDBFixer.