Merge remote-tracking branch 'origin/staging'

griffithlab · Aug 7, 2019 · 75867e8
2 parents fdedebd + c79efab
Showing 1,880 changed files with 279,844 additions and 68,830 deletions.
diff --git a/docs/conf.py b/docs/conf.py
@@ -67,9 +67,9 @@
 # built documents.
 #
 # The short X.Y version.
-version = '1.4'
+version = '1.5'
 # The full version, including alpha/beta/rc tags.
-release = '1.4.5'
+release = '1.5.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

diff --git a/docs/images/pVACbind_logo_trans-bg_sm_v4b.png b/docs/images/pVACbind_logo_trans-bg_sm_v4b.png
diff --git a/docs/images/pVACbind_logo_trans-bg_v4b.png b/docs/images/pVACbind_logo_trans-bg_v4b.png
diff --git a/docs/index.rst b/docs/index.rst
@@ -5,7 +5,10 @@ pVACtools is a cancer immunotherapy tools suite consisting of the following
 tools:
 
 **pVACseq**
-   A cancer immunotherapy pipeline for identifying and prioritizing neoantigens from a list of tumor mutations.
+   A cancer immunotherapy pipeline for identifying and prioritizing neoantigens from a VCF file.
+
+**pVACbind**
+   A cancer immunotherapy pipeline for identifying and prioritizing neoantigens from a FASTA file.
 
 **pVACfuse**
    A tool for detecting neoantigens resulting from gene fusions.
@@ -28,6 +31,7 @@ tools:
    :maxdepth: 2
 
    pvacseq
+   pvacbind
    pvacfuse
    pvacvector
    pvacviz
@@ -44,48 +48,81 @@ tools:
    mailing_list
 
 
-New in release |release|
-------------------------
-
-This is a hotfix release. It fixes the following issues:
-
-- In a previous version we implemented a faster method for reading data from
-  the database in pVACapi. However, this would fail if the postgres user is
-  not a superuser. This version fixes this issue by using the previous
-  database file read method in this situation.
-- This version marks certain columns of the output reports as not visualizable
-  in pVACviz/pVACapi because they contain string content that cannot be
-  plotted in a scatterplot.
-
 New in version |version|
 ------------------------
 
 This version adds the following features:
 
-- pVACvector now tests spacers iteratively. During the first iteration, the
-  first spacer in the list of ``--spacers`` gets tested. In the next
-  iteration, the next spacer in the list gets added to the pool of spacers to
-  tests, and so on. If at any point a valid ordering is found, pVACvector will
-  finish its run and output the result. This might result in slightly
-  less optimal (but still valid) ordering but improves runtime significantly.
-- If, after testing all spacers, no valid ordering if found, pVACvector will
-  clip the beginning and/or ends of problematic peptides by one amino acid.
-  The ordering finding process is then repeated on the updated list of
-  peptides. This process may be repeated up to a maximum set by the
-  ``--max-clip-length`` parameter.
-- This version adds a standalone command to create the pVACvector
-  visualizations that can be run by calling ``pvacvector visualize`` using a
-  pVACvector result file as the input.
-- We removed the ``--aditional-input-file-list`` option to pVACseq. Readcount and
-  expression information are now taken directly from the VCF annotations.
-  Instructions on how to add these annotations to your input VCF can be found
-  on the :ref:`prerequisites_label` page.
-- We added support for variants to pVACseq that are only annotated as
-  ``protein_altering_variant`` without a more specific consequence of
-  ``missense_variant``, ``inframe_insertion``, ``inframe_deletion``, or ``frameshift_variant``.
-- We resolved some syntax differences that prevented pVACtools from being run
-  under python 3.6 or python 3.7. pVACtools should now be compatible with all
-  python3 versions.
+- This version introduces a new tool, ``pVACbind``, which can be used
+  to run our immunotherapy pipeline with a peptides
+  FASTA file as input. This new tool is similar to pVACseq but certain
+  options and filters are removed:
+
+  - All input sequences are interpreted in isolation so corresponding
+    wildtype sequence and score information are not assigned. As a consequence,
+    the filter threshold option on fold change is removed.
+  - Because the input format doesn't allow for association of readcount,
+    expression or transcript support level data, pVACbind doesn't run the coverage
+    filter or transcript support level filter.
+  - No condensed report is generated.
+
+  Please see the :ref:`pvacbind` documentation for more information.
+
+- pVACfuse now supports annotated fusion files from `AGFusion <https://github.com/murphycj/AGFusion>`_ as input. The
+  :ref:`pvacfuse` documentation has been updated with instructions on how to
+  run AGFusion in the Prerequisites section.
+- The top score filter has been updated to take into account alternative known
+  transcripts that might result in non-identical peptide sequences/epitopes.
+  The top score filter now picks the best epitope for every available transcript of a
+  variant. If the resulting epitopes for one variant are not identical,
+  the filter will output all epitopes. If the resulting epitopes for one
+  variant are identical, the filter only outputs the epitope for the transcript with the highest
+  transcript expression value. If no expression data is available, or if
+  multiple transcripts remain, the filter outputs the epitope for the
+  transcript with the lowest Ensembl transcript ID.
+- This version adds a few new options to the ``pvacseq
+  generate_protein_fasta`` command:
+
+  - The ``--mutant-only`` option can be used to only output mutant peptide
+    sequences instead of mutant and wildtype sequences.
+  - This command now has an option to provide a pVACseq all_epitopes or
+    filtered TSV file as an input (``--input-tsv``). This will limit the
+    output FASTA to only sequences that originated from the variants in that file.
+
+- This release adds a ``pvacfuse generate_protein_fasta`` command that works
+  similarly to the ``pvacseq generate_protein_fasta`` command but works with
+  Integrate-NEO or AGFusion input files.
+- We removed the sorting of the all_epitopes result file in order to reduce
+  memory usage. Only the filtered files will be sorted. This version also updates the sorting algorithm of the
+  filtered files as follows:
+
+  - If the ``--top-score-metric`` is set to ``median`` the results are first
+    sorted by the ``Median MT Score``. If multiple epitopes have the same
+    ``Median MT Score`` they are then sorted by the ``Corresponding Fold
+    Change``. The last sorting criterion is the ``Best MT Score``.
+  - If the ``--top-score-metric`` is set to ``lowest`` the results are first
+    sorted by the ``Best MT Score``. If multiple epitopes have the same
+    ``Best MT Score`` they are then sorted by the ``Corresponding Fold
+    Change``. The last sorting criterion is the ``Median MT Score``.
+
+- pVACseq, pVACfuse, and pVACbind now calculate manufacturability metrics
+  for the predicted epitopes. Manufacturability metrics are also
+  calculated for all protein sequences when running the ``pvacseq generate_protein_fasta``
+  and ``pvacfuse generate_protein_fasta`` commands. They are saved in the ``.manufacturability.tsv``
+  file alongside the result FASTA.
+- The pVACseq score that gets calculated for epitopes in the condensed report
+  is now converted into a rank. This will hopefully remove any confusion about
+  whether the previous score could be treated as an absolute measure of
+  immunogenicity, which it was not intended for. Converting this score to a
+  rank ensures that it gets treated in isolation for only the epitopes in the
+  condensed file.
+- The condensed report now also outputs the mutation position as well as the
+  full set of lowest and median wildtype and mutant scores.
+- This version adds a clear cache function to pVACapi that can be called by
+  running ``pvacapi clear_cache``. Sometimes pVACapi can get into a state
+  where the cache file contains conflicting data compared to the actual
+  process outputs which results in errors. Clearing the cache using the ``pvacapi clear_cache``
+  function can be used in that situation to resolve these errors.
 
 Past release notes can be found on our :ref:`releases` page.
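
The sort order described in the release notes above for the filtered files can be sketched in Python as a multi-key sort. This is an illustrative sketch, not the pVACtools implementation: the function name ``sort_filtered`` is hypothetical, and the sort directions (ascending for both score columns, descending for the fold change) are assumptions, since the notes only state the order of the criteria.

```python
from typing import Dict, List


def sort_filtered(rows: List[Dict[str, object]],
                  top_score_metric: str = "median") -> List[Dict[str, object]]:
    """Return rows ordered by the three criteria named in the release notes.

    Assumption: lower scores sort first and higher fold changes sort first.
    """
    if top_score_metric == "median":
        primary, last = "Median MT Score", "Best MT Score"
    elif top_score_metric == "lowest":
        primary, last = "Best MT Score", "Median MT Score"
    else:
        raise ValueError("top_score_metric must be 'median' or 'lowest'")

    def key(row: Dict[str, object]):
        # Negate the fold change so that larger values come first.
        return (
            float(row[primary]),
            -float(row["Corresponding Fold Change"]),
            float(row[last]),
        )

    return sorted(rows, key=key)


rows = [
    {"id": "a", "Median MT Score": 50.0, "Corresponding Fold Change": 2.0, "Best MT Score": 10.0},
    {"id": "b", "Median MT Score": 50.0, "Corresponding Fold Change": 5.0, "Best MT Score": 30.0},
    {"id": "c", "Median MT Score": 20.0, "Corresponding Fold Change": 1.0, "Best MT Score": 15.0},
]
median_order = [r["id"] for r in sort_filtered(rows, "median")]
lowest_order = [r["id"] for r in sort_filtered(rows, "lowest")]
```

With ``median``, rows ``a`` and ``b`` tie on the primary column and are separated by the fold change; with ``lowest``, the ``Best MT Score`` alone decides the order.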
 

diff --git a/docs/pvacbind.rst b/docs/pvacbind.rst
@@ -0,0 +1,20 @@
+.. image:: images/pVACbind_logo_trans-bg_sm_v4b.png
+    :align: right
+    :alt: pVACbind logo
+
+.. _pvacbind:
+
+pVACbind
+====================================
+
+This component of pVACtools is used to predict neoantigens for the peptides in a FASTA file.
+
+.. toctree::
+   :glob:
+
+   pvacbind/prerequisites
+   pvacbind/getting_started
+   pvacbind/run
+   pvacbind/output_files
+   pvacbind/filter_commands
+   pvacbind/additional_commands
diff --git a/docs/pvacbind/additional_commands.rst b/docs/pvacbind/additional_commands.rst
@@ -0,0 +1,25 @@
+.. image:: ../images/pVACbind_logo_trans-bg_sm_v4b.png
+    :align: right
+    :alt: pVACbind logo
+
+Additional Commands
+===================
+
+To make using pVACbind easier, several convenience methods are included in the package.
+
+.. _pvacbind_example_data:
+
+Download Example Data
+---------------------
+
+.. program-output:: pvacbind download_example_data -h
+
+List Valid Alleles
+------------------
+
+.. program-output:: pvacbind valid_alleles -h
+
+List Allele-Specific Cutoffs
+----------------------------
+
+.. program-output:: pvacbind allele_specific_cutoffs -h
diff --git a/docs/pvacbind/filter_commands.rst b/docs/pvacbind/filter_commands.rst
@@ -0,0 +1,50 @@
+.. image:: ../images/pVACbind_logo_trans-bg_sm_v4b.png
+    :align: right
+    :alt: pVACbind logo
+
+Filtering Commands
+=============================
+
+pVACbind currently offers two filters: a binding filter and a top score filter.
+
+These filters are always run automatically as part
+of the pVACbind pipeline using default cutoffs.
+
+All filters can also be run manually on the filtered.tsv file to narrow the results down further,
+or they can be run on the all_epitopes.tsv file to apply different filtering thresholds.
+
+The binding filter is used to remove neoantigen candidates that do not meet desired peptide:MHC binding criteria.
+The top score filter is used to select the most promising peptide candidate for each variant. 
+Multiple candidate peptides from a single somatic variant can be caused by multiple peptide lengths, registers, HLA alleles,
+and transcript annotations.
+
+Further details on each of these filters are provided below.
+
+Binding Filter
+--------------
+
+.. program-output:: pvacbind binding_filter -h
+
+The binding filter removes variants that don't pass the chosen binding threshold.
+The user can choose whether to apply this filter to the ``lowest`` or the ``median`` binding
+affinity score by setting the ``--top-score-metric`` flag. The ``lowest`` binding
+affinity score is recorded in the ``Best MT Score`` column and represents the lowest
+ic50 score of all prediction algorithms that were picked during the previous pVACbind run.
+The ``median`` binding affinity score is recorded in the ``Median MT Score`` column and
+corresponds to the median ic50 score of all prediction algorithms used to create the report.
+By default, the binding filter runs on the ``median`` binding affinity.
+
+By default, entries with ``NA`` values will be included in the output. This
+behavior can be turned off by using the ``--exclude-NAs`` flag.
+
+Top Score Filter
+----------------
+
+.. program-output:: pvacbind top_score_filter -h
+
+This filter picks the top epitope for a variant. By default the
+``--top-score-metric`` option is set to ``median`` which will apply this
+filter to the ``Median MT Score`` column and pick the epitope with the lowest
+median mutant ic50 score for each variant. If the ``--top-score-metric``
+option is set to ``lowest``, the ``Best MT Score`` column is instead used to
+make this determination.
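
The two filters described in this file can be sketched in Python as follows. This is a simplified illustration, not the pVACbind code: the function names are hypothetical, the score column is passed in by the caller, and three behaviours are assumptions the documentation does not spell out: grouping is done on the ``Mutation`` column, a score equal to the threshold passes, and rows with ``NA`` scores are skipped by the top score step.

```python
from typing import Dict, List

Row = Dict[str, str]


def is_na(value) -> bool:
    """Treat missing, empty, and literal 'NA' values as not available."""
    return value is None or str(value).strip() in ("", "NA")


def binding_filter(rows: List[Row], column: str, threshold: float,
                   exclude_nas: bool = False) -> List[Row]:
    """Keep rows whose ic50 score in `column` is at or below `threshold`."""
    kept = []
    for row in rows:
        value = row.get(column)
        if is_na(value):
            # NA entries are kept unless the caller asks to exclude them.
            if not exclude_nas:
                kept.append(row)
        elif float(value) <= threshold:  # assumption: inclusive comparison
            kept.append(row)
    return kept


def top_score_filter(rows: List[Row], column: str,
                     group_column: str = "Mutation") -> List[Row]:
    """Keep the row with the lowest score in `column` for each group."""
    best: Dict[str, Row] = {}
    for row in rows:
        value = row.get(column)
        if is_na(value):
            continue  # assumption: NA scores cannot win a group
        group = row[group_column]
        if group not in best or float(value) < float(best[group][column]):
            best[group] = row
    return list(best.values())


rows = [
    {"Mutation": "pep1", "Epitope Seq": "AAAAAAAAA", "Median Score": "120.0"},
    {"Mutation": "pep1", "Epitope Seq": "CCCCCCCCC", "Median Score": "40.0"},
    {"Mutation": "pep2", "Epitope Seq": "DDDDDDDDD", "Median Score": "900.0"},
    {"Mutation": "pep3", "Epitope Seq": "EEEEEEEEE", "Median Score": "NA"},
]
passing = binding_filter(rows, "Median Score", 500.0)
top = top_score_filter(passing, "Median Score")
```

Passing the column name explicitly keeps the sketch independent of whether the report labels the column ``Median Score`` or ``Median MT Score``.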
diff --git a/docs/pvacbind/getting_started.rst b/docs/pvacbind/getting_started.rst
@@ -0,0 +1,23 @@
+.. image:: ../images/pVACbind_logo_trans-bg_sm_v4b.png
+    :align: right
+    :alt: pVACbind logo
+
+Getting Started
+---------------
+
+pVACbind provides a set of example data to show the expected format of input and output files.
+You can download the data set by running the ``pvacbind download_example_data`` :ref:`command <pvacbind_example_data>`.
+
+The example data output can be reproduced by running the following command:
+
+.. code-block:: none
+
+   pvacbind run \
+   <example_data_dir>/input.fasta \
+   Test \
+   HLA-A*02:01,HLA-B*35:01,DRB1*11:01 \
+   MHCflurry MHCnuggetsI MHCnuggetsII NNalign NetMHC PickPocket SMM SMMPMBEC SMMalign \
+   <output_dir> \
+   -e 8,9,10
+
+A detailed description of all command options can be found on the :ref:`Usage <pvacbind_run>` page.
diff --git a/docs/pvacbind/output_files.rst b/docs/pvacbind/output_files.rst
@@ -0,0 +1,89 @@
+.. image:: ../images/pVACbind_logo_trans-bg_sm_v4b.png
+    :align: right
+    :alt: pVACbind logo
+
+Output Files
+============
+
+The pVACbind pipeline will write its results in separate folders depending on
+which prediction algorithms were chosen:
+
+- ``MHC_Class_I``: for MHC class I prediction algorithms
+- ``MHC_Class_II``: for MHC class II prediction algorithms
+- ``combined``: If both MHC class I and MHC class II prediction algorithms were run, this folder combines the neoepitope predictions from both
+
+Each folder will contain the same list of output files (listed in the order
+created):
+
+.. list-table::
+   :header-rows: 1
+
+   * - File Name
+     - Description
+   * - ``<sample_name>.tsv``
+     - An intermediate file with variant information parsed from the input files.
+   * - ``<sample_name>.tsv_<chunks>`` (multiple)
+     - The above file but split into smaller chunks for easier processing with IEDB.
+   * - ``<sample_name>.all_epitopes.tsv``
+     - A list of all predicted epitopes and their binding affinity scores, with
+       additional variant information from the ``<sample_name>.tsv``.
+   * - ``<sample_name>.filtered.tsv``
+     - The above file after applying all filters, with cleavage site and stability
+       predictions added.
+
+all_epitopes.tsv and filtered.tsv Report Columns
+------------------------------------------------
+
+.. list-table::
+   :header-rows: 1
+
+   * - Column Name
+     - Description
+   * - ``Mutation``
+     - The FASTA ID of the peptide sequence the epitope belongs to
+   * - ``HLA Allele``
+     - The HLA allele for this prediction
+   * - ``Sub-peptide Position``
+     - The one-based position of the epitope in the protein sequence used to make the prediction
+   * - ``Epitope Seq``
+     - The epitope sequence
+   * - ``Median Score``
+     - Median ic50 binding affinity of the epitope of all prediction algorithms used
+   * - ``Best Score``
+     - Lowest ic50 binding affinity of all prediction algorithms used
+   * - ``Best Score Method``
+     - Prediction algorithm with the lowest ic50 binding affinity for this epitope
+   * - ``Individual Prediction Algorithm Scores`` (multiple)
+     - ic50 scores for the ``Epitope Seq`` for the individual prediction algorithms used
+   * - ``cterm_7mer_gravy_score``
+     - Mean hydropathy of last 7 residues on the C-terminus of the peptide
+   * - ``max_7mer_gravy_score``
+     - Max GRAVY score of any kmer in the amino acid sequence. Used to determine if there are any extremely
+       hydrophobic regions within a longer amino acid sequence.
+   * - ``difficult_n_terminal_residue`` (T/F)
+     - Is N-terminal amino acid a Glutamine, Glutamic acid, or Cysteine?
+   * - ``c_terminal_cysteine`` (T/F)
+     - Is the C-terminal amino acid a Cysteine?
+   * - ``c_terminal_proline`` (T/F)
+     - Is the C-terminal amino acid a Proline?
+   * - ``cysteine_count``
+     - Number of Cysteines in the amino acid sequence. Problematic because they can form disulfide bonds across
+       distant parts of the peptide
+   * - ``n_terminal_asparagine`` (T/F)
+     - Is the N-terminal amino acid an Asparagine?
+   * - ``asparagine_proline_bond_count``
+     - Number of Asparagine-Proline bonds. Problematic because they can spontaneously cleave the peptide
+   * - ``Best Cleavage Position`` (optional)
+     - Position of the highest predicted cleavage score
+   * - ``Best Cleavage Score`` (optional)
+     - Highest predicted cleavage score
+   * - ``Cleavage Sites`` (optional)
+     - List of all cleavage positions and their cleavage score
+   * - ``Predicted Stability`` (optional)
+     - Stability of the pMHC-I complex
+   * - ``Half Life`` (optional)
+     - Half-life of the pMHC-I complex
+   * - ``Stability Rank`` (optional)
+     - The % rank stability of the pMHC-I complex
+   * - ``NetMHCstab allele`` (optional)
+     - Nearest neighbor to the ``HLA Allele``. Used for NetMHCstab prediction
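
The manufacturability columns in the table above can be approximated with a few lines of Python. This is a minimal sketch based only on the column descriptions, not the pVACtools implementation: the function names are hypothetical, GRAVY is assumed to be the mean Kyte-Doolittle hydropathy, the window for ``max_7mer_gravy_score`` is assumed to be 7 residues (taken from the column name), and an Asparagine-Proline bond is assumed to mean the residue pair ``NP``.

```python
from typing import Dict, Union

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean hydropathy of a sequence (assumed definition of GRAVY)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def manufacturability(peptide: str) -> Dict[str, Union[float, int, bool]]:
    """Compute the metrics listed in the report column table."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    # All 7-residue windows; a shorter peptide yields a single window.
    windows = [peptide[i:i + 7] for i in range(max(1, len(peptide) - 6))]
    return {
        "cterm_7mer_gravy_score": gravy(peptide[-7:]),
        "max_7mer_gravy_score": max(gravy(w) for w in windows),
        "difficult_n_terminal_residue": peptide[0] in "QEC",
        "c_terminal_cysteine": peptide[-1] == "C",
        "c_terminal_proline": peptide[-1] == "P",
        "cysteine_count": peptide.count("C"),
        "n_terminal_asparagine": peptide[0] == "N",
        "asparagine_proline_bond_count": peptide.count("NP"),
    }


metrics = manufacturability("NPCAAAAAAC")
```

Sequences containing non-standard residue codes would raise a ``KeyError`` in this sketch; a real pipeline would need to decide how to handle them.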
diff --git a/docs/pvacbind/prerequisites.rst b/docs/pvacbind/prerequisites.rst
@@ -0,0 +1,8 @@
+.. image:: ../images/pVACbind_logo_trans-bg_sm_v4b.png
+    :align: right
+    :alt: pVACbind logo
+
+Prerequisites
+=============
+
+The input to pVACbind is a FASTA file of peptide sequences.
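
To illustrate how a peptide FASTA file relates to candidate epitopes, the following Python sketch reads FASTA records and enumerates every sub-peptide of the requested lengths together with its one-based start position. It is a hedged illustration only: ``parse_fasta`` and ``sub_peptides`` are hypothetical helpers, not part of pVACbind, and the example record is made up.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA ID (first word after '>') to its full sequence."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Sequences may be wrapped across several lines.
            records[current] += line
    return records


def sub_peptides(sequence: str,
                 lengths: Iterable[int]) -> Iterator[Tuple[int, str]]:
    """Yield (one-based start position, sub-peptide) for each length."""
    for length in lengths:
        for start in range(len(sequence) - length + 1):
            yield start + 1, sequence[start:start + length]


example = ">pep1 example peptide\nMKTAY\nIAKQR\n"
records = parse_fasta(example)
candidates = list(sub_peptides(records["pep1"], (8, 9, 10)))
```

For a 10-residue input and lengths 8, 9, and 10, this yields three, two, and one candidates respectively.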
diff --git a/docs/pvacbind/run.rst b/docs/pvacbind/run.rst
@@ -0,0 +1,16 @@
+.. image:: ../images/pVACbind_logo_trans-bg_sm_v4b.png
+    :align: right
+    :alt: pVACbind logo
+
+.. _pvacbind_run:
+
+Usage
+====================================
+
+.. warning::
+   Using a local IEDB installation is strongly recommended for larger datasets
+   or when making predictions for many alleles, epitope lengths, or
+   prediction algorithms. More information on how to install IEDB locally can
+   be found on the :ref:`Installation <iedb_install>` page.
+
+.. program-output:: pvacbind run -h