Commit

INit
ivargr committed Dec 6, 2023
1 parent 573e70b commit c94c962
Showing 7 changed files with 10,116 additions and 382 deletions.
64 changes: 1 addition & 63 deletions content/00.front-matter.md
@@ -9,71 +9,9 @@ _A DOI-citable version of this manuscript is available at <https://doi.org/DOI_H
##}

{## Template to insert build date and source ##}
<small><em>
This manuscript
{% if manubot.ci_source is defined and manubot.ci_source.provider == "appveyor" -%}
([permalink]({{manubot.ci_source.artifact_url}}))
{% elif manubot.html_url_versioned is defined -%}
([permalink]({{manubot.html_url_versioned}}))
{% endif -%}
was automatically generated
{% if manubot.ci_source is defined -%}
from [{{manubot.ci_source.repo_slug}}@{{manubot.ci_source.commit | truncate(length=7, end='', leeway=0)}}](https://github.com/{{manubot.ci_source.repo_slug}}/tree/{{manubot.ci_source.commit}})
{% endif -%}
on {{manubot.generated_date_long}}.
</em></small>


{% if manubot.date_long != manubot.generated_date_long -%}
Published: {{manubot.date_long}}
{% endif %}

## Authors

{## Template for listing authors ##}
{% for author in manubot.authors %}
+ **{{author.name}}**
{% if author.corresponding is defined and author.corresponding == true -%}^[](#correspondence)^{%- endif -%}
<br>
{%- set has_ids = false %}
{%- if author.orcid is defined and author.orcid is not none %}
{%- set has_ids = true %}
![ORCID icon](images/orcid.svg){.inline_icon width=16 height=16}
[{{author.orcid}}](https://orcid.org/{{author.orcid}})
{%- endif %}
{%- if author.github is defined and author.github is not none %}
{%- set has_ids = true %}
· ![GitHub icon](images/github.svg){.inline_icon width=16 height=16}
[{{author.github}}](https://github.com/{{author.github}})
{%- endif %}
{%- if author.twitter is defined and author.twitter is not none %}
{%- set has_ids = true %}
· ![Twitter icon](images/twitter.svg){.inline_icon width=16 height=16}
[{{author.twitter}}](https://twitter.com/{{author.twitter}})
{%- endif %}
{%- if author.mastodon is defined and author.mastodon is not none and author["mastodon-server"] is defined and author["mastodon-server"] is not none %}
{%- set has_ids = true %}
· ![Mastodon icon](images/mastodon.svg){.inline_icon width=16 height=16}
[\@{{author.mastodon}}@{{author["mastodon-server"]}}](https://{{author["mastodon-server"]}}/@{{author.mastodon}})
{%- endif %}
{%- if has_ids %}
<br>
{%- endif %}
<small>
{%- if author.affiliations is defined and author.affiliations|length %}
{{author.affiliations | join('; ')}}
{%- endif %}
{%- if author.funders is defined and author.funders|length %}
· Funded by {{author.funders | join('; ')}}
{%- endif %}
</small>
{% endfor %}

::: {#correspondence}
✉ — Correspondence possible via {% if manubot.ci_source is defined -%}[GitHub Issues](https://github.com/{{manubot.ci_source.repo_slug}}/issues){% else %}GitHub Issues{% endif %}
{% if manubot.authors|map(attribute='corresponding')|select|max -%}
or email to
{% for author in manubot.authors|selectattr("corresponding") -%}
{{ author.name }} \<{{ author.email }}\>{{ ", " if not loop.last else "." }}
{% endfor %}
{% endif %}
:::
41 changes: 40 additions & 1 deletion content/01.abstract.md
@@ -1,3 +1,42 @@
## Abstract {.page_break_before}
Supplementary Material
======================================


Benchmarks
---------------------

We compare the speed of BioNumPy against other existing Python packages and commonly used non-Python tools on a set of typical bioinformatics tasks. As seen in Figure @fig:benchmarks, BioNumPy is generally considerably faster than vanilla Python solutions, as well as the commonly used Python packages BioPython and Biotite, which mostly rely on Python for-loops to perform operations on datasets. On problems where dedicated, efficient bioinformatics tools are widely used (intersection of BED files, k-mer counting and VCF operations), we find that BioNumPy is close to, or as efficient as, tools written in C/C++ (BEDTools [@bedtools], Jellyfish [@jellyfish] and BCFTools [@bcftools]). A Snakemake pipeline for reproducing the results can be found at <https://github.com/bionumpy/bionumpy/tree/master/benchmarks>, along with an open invitation to expand the benchmark with additional tools and cases.

![**Benchmarking BioNumPy against other tools and methods on various typical bioinformatics tasks.**](images/benchmarks.png){#fig:benchmarks}

Reproducing a machine learning benchmark using BioNumPy
------------------------------------------------------------------------------------
TODO

To show how BioNumPy can be used to easily parse and process various biological data formats, we used BioNumPy to reproduce a recently published benchmark of a machine learning method [@sasse]. In the original work, the authors use a combination of custom Python code and common bioinformatics tools (such as BCFTools [@bcftools]) to extract sequences around the transcription start sites of genes. We have forked the original repository and replaced all this code, which consisted of X lines of Python code and X lines of shell scripts, with only a few calls to BioNumPy (in total Y lines).

Our fork is available at <https://….> We believe this shows that BioNumPy can quite easily be used to cleanly integrate various biological file formats.

BioNumPy Implementation details
----------------------------------------------------

BioNumPy internally stores sequence data (e.g. nucleotides or amino acids) as numeric values, allowing the use of standard NumPy arrays for data representation and processing. A key way BioNumPy achieves high performance is by storing multiple data entries in shared NumPy arrays. To illustrate the benefit of this approach, consider the example where we want to count the number of Gs and Cs in each sequence of a large set of DNA sequences. Existing Python packages like BioPython and Biotite do this by iterating over the sequences using Python for-loops, which is slow when the number of sequences is large. BioNumPy, however, stores all sequences in only one or a few shared NumPy arrays (Figure @fig:ragged_array a), meaning that vectorized NumPy operations can be used to do the counting in a fraction of the time.
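
The difference between the two approaches can be sketched with plain NumPy. This is only an illustration of the shared-array idea, not BioNumPy's actual implementation: the toy `sequences` list, the ASCII byte encoding, and all variable names are assumptions made for this example, and the segmented sum assumes that no sequence is empty.

```python
import numpy as np

# Hypothetical toy input; real data would be read from a FASTA/FASTQ file.
sequences = ["ACGT", "GGCCA", "TTT"]

# Baseline: one Python-level loop iteration per sequence (and per base).
gc_loop = [sum(base in "GC" for base in seq) for seq in sequences]

# Shared-array layout: every base in one flat uint8 array,
# plus the start offset of each sequence within that array.
lengths = np.array([len(seq) for seq in sequences], dtype=np.int64)
starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
flat = np.frombuffer("".join(sequences).encode("ascii"), dtype=np.uint8)

# One vectorized comparison over all bases at once ...
is_gc = (flat == ord("G")) | (flat == ord("C"))

# ... followed by one segmented sum per sequence.
# Note: np.add.reduceat needs non-empty segments to give correct sums.
gc_vectorized = np.add.reduceat(is_gc.astype(np.int64), starts)

print(gc_loop)                 # per-sequence GC counts from the loop
print(gc_vectorized.tolist())  # the same counts from the vectorized version
```

Both versions give the same per-sequence counts, but the second performs a fixed number of NumPy calls regardless of how many sequences there are.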

Storing multiple elements in shared arrays is trivial if the elements all have the same size, since a matrix representation can be used. However, for biological data, it is common that data elements vary in size. For instance, sequences in FASTA files rarely all have exactly the same length. BioNumPy uses the RaggedArray data structure from the npstructures package (<https://github.com/bionumpy/npstructures>, developed in tandem with BioNumPy) to tackle this problem (Figure @fig:ragged_array). The RaggedArray can be seen as a matrix where rows can have different lengths. The npstructures RaggedArray implementation is compatible with most common NumPy operations, like indexing (Figure @fig:ragged_array b), vectorized operations (Figure @fig:ragged_array c), and reductions (Figure @fig:ragged_array d). As far as possible, objects in BioNumPy follow the array interoperability protocols defined by NumPy (<https://numpy.org/doc/stable/user/basics.interoperability.html>).
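
To make the idea of a matrix with rows of different lengths concrete, the following is a minimal sketch using only NumPy. It deliberately does not use the npstructures API; the flat `data` array, the `starts`/`lengths` bookkeeping, and the helper `get_row` are hypothetical names chosen for this illustration, and the reduction again assumes that every row is non-empty.

```python
import numpy as np

# Hypothetical ragged input: three rows of different lengths.
rows = [[1, 2, 3], [4], [5, 6]]

# Store all values in one flat array, and remember where each row begins.
lengths = np.array([len(row) for row in rows], dtype=np.int64)
starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
data = np.concatenate([np.asarray(row) for row in rows])


def get_row(i):
    """Indexing: return row i as a view into the shared flat array."""
    return data[starts[i]:starts[i] + lengths[i]]


# Vectorized operation: applied once to the flat data; row layout is unchanged.
doubled = data * 2

# Reduction: per-row sums via a segmented sum, then per-row means.
row_sums = np.add.reduceat(data, starts)
row_means = row_sums / lengths

print(get_row(2))  # the last row
print(row_sums)    # one sum per row
print(row_means)   # one mean per row
```

The design choice to illustrate is that element-wise work never needs to know about row boundaries; only indexing and reductions consult the offsets.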


![**Overview of the RaggedArray and EncodedRaggedArray data structures**. A RaggedArray is similar to a NumPy array/matrix but can represent a matrix consisting of rows with varying lengths (a). This allows it to represent data of varying lengths efficiently in a shared data structure. A RaggedArray supports many of the same operations as NumPy arrays, such as indexing (b), vectorization (c) and reduction (d). An EncodedRaggedArray is a RaggedArray that supports storing and operating on non-numeric data (e.g. DNA sequences) by encoding the data and keeping track of the encoding (e). An EncodedRaggedArray supports the same operations as RaggedArrays (f). This figure is an adapted and modified version of Figure 1 in [@numpy] and is licensed under a Creative Commons Attribution 4.0 International License (<http://creativecommons.org/licenses/by/4.0/>).
](images/ragged_array_figure.png){#fig:ragged_array}





BioNumPy has been developed following the principles of continuous integration and distribution. The codebase is thoroughly and automatically tested through an extensive collection of unit tests, application tests, integration tests and property-based tests [@hypothesis]. New code changes are automatically benchmarked and tested before being automatically published, ensuring that updates can be frequent while high code quality is maintained. This workflow makes it safe and easy to allow contributions from new contributors, which is important for longevity and community adoption of the package.



[@jellyfish]: doi:10.1093/bioinformatics/btr011
[@bedtools]: doi:10.1093/bioinformatics/btq033
[@bcftools]: doi:10.1093/gigascience/giab008
[@hypothesis]: doi:10.21105/joss.01891