update

milaboratory · May 9, 2024 · a146f9b
1 parent d0e6fd1
commit a146f9b
Showing 20 changed files with 2,982 additions and 32 deletions.
diff --git a/docs/mixcr/getting-started/docker.md b/docs/mixcr/getting-started/docker.md
@@ -69,6 +69,27 @@ For those who rely on other tools inside the image, beware, new build relies on
 
 `mixcr` startup script is added to `PATH` environment variable, so even if you specify custom entrypoint, there is no need in using of full path to run `mixcr`.
 
+## Using external libraries with docker
+
+To use an external library, place the `.json(.gz)` file in the same directory that you mount to `/work` inside Docker. Then use the `--library` parameter with the MiXCR command. In the example below, `/path/to/put/results` contains the `phocoena-IGH.json.gz` library file.
+
+```shell
+docker run --rm \
+ -e MI_LICENSE="...license-token..." \
+ -v /path/to/raw/data:/raw:ro \
+ -v /path/to/put/results:/work \
+ghcr.io/milaboratory/mixcr/mixcr:latest \
+mixcr analyze generic-amplicon \
+    --library phocoena-IGH \
+    --species phocoena \
+    --rna \
+    --rigid-left-alignment-boundary \
+    --floating-right-alignment-boundary C \
+    /raw/input_R1.fastq.gz \
+    /raw/input_R2.fastq.gz \
+    output
+```
+
 ## License notice for IMGT images
 
 Images with IMGT reference library contain data imported from IMGT and is subject to terms of use listed on http://www.imgt.org site.

diff --git a/docs/mixcr/reference/mixcr-assemble.md b/docs/mixcr/reference/mixcr-assemble.md
@@ -340,8 +340,11 @@ Parameters that control clustering procedure and determines the rules for the fr
 `-OcloneClusteringParameters.searchParameters=twoMismatchesOrIndels`
 : Parameters that control fuzzy match criteria between clones in adjacent layers. Available predefined values: `oneMismatch`, `oneIndel`, `oneMismatchOrIndel`, `twoMismatches`, `twoIndels`, `twoMismatchesOrIndels`,  ..., `fourMismatchesOrIndels`. By default, `twoMismatchesOrIndels` allows two mismatches or indels (not more than two errors of both types) between two adjacent clones (parent and direct child).
 
-`-OcloneClusteringParameters.clusteringFilter.specificMutationProbability=1E-3`
-: Probability of a single nucleotide mutation in clonal sequence which has non-hypermutation origin (i.e. PCR or sequencing error). This parameter controls relative counts between two clones in adjacent  layers: a smaller clone can be attached to a larger one if its count smaller than count of parent multiplied by `(clonalSequenceLength * specificMutationProbability) ^ numberOfMutations`
+`-OcloneClusteringParameters.clusteringFilter.backgroundSubstitutionRate=0.002` and `-OcloneClusteringParameters.clusteringFilter.backgroundIndelRate=0.0002`
+: These parameters set a lower bound on the error probability that is used when the Phred quality is high. For example, if `backgroundSubstitutionRate` is set to `0.002` and the Phred quality of a certain nucleotide is 20, which corresponds to an error probability of `0.01`, MiXCR will use `0.01`. If the Phred quality is 40 (error probability `0.0001`, which is lower than `backgroundSubstitutionRate`), MiXCR will use `backgroundSubstitutionRate` and set the error probability to `0.002`. The higher the `backgroundSubstitutionRate`, the more aggressive the correction will be.
+
+`-OcloneClusteringParameters.clusteringFilter.correctionPower=0.001`
+: Indicates the False Discovery Rate for the correction process, approximating the fraction of actual sequences that might be compromised during correction. The default value is `0.001`.
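
The lower-bound behaviour described for `backgroundSubstitutionRate` can be sketched as follows. This is a minimal illustration, assuming the effective error probability is simply the larger of the Phred-derived probability and the background rate, as the example in the parameter description suggests. The function names are hypothetical and this is not MiXCR's actual implementation.

```python
def phred_to_error_probability(phred: float) -> float:
    """Standard Phred definition: Q = -10 * log10(p)."""
    return 10 ** (-phred / 10)


def effective_error_probability(phred: float, background_rate: float = 0.002) -> float:
    """Assumed rule: never go below the background rate, even for very high quality."""
    return max(phred_to_error_probability(phred), background_rate)


# Phred 20 is above the background rate, Phred 40 is clamped to it.
for quality in (20, 40):
    print(quality, effective_error_probability(quality))
```

With a larger `background_rate`, more high-quality positions are treated as potentially erroneous, which matches the statement that a higher rate makes the correction more aggressive.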
 
 Usage example: change maximum allowed number of mutations:
 ```shell
@@ -355,4 +358,4 @@ Turn clustering off:
 
 ## Hardware recommendations
 
-Assembly step is memory consuming. Reading and decompression of `.vdjca` file is handled in parallel and highly efficient way. MiXCR needs amount of RAM sufficient to store clonotype table in memory. In an exterme case of one million of full-length UMI-assembled clonotypes, it is recommended to supply at least 32GB of RAM. Speed almost does not scale with the increase of CPU.
+Assembly step is memory consuming. Reading and decompression of `.vdjca` file is handled in parallel and highly efficient way. MiXCR needs amount of RAM sufficient to store clonotype table in memory. In an extreme case of one million of full-length UMI-assembled clonotypes, it is recommended to supply at least 32GB of RAM. Speed almost does not scale with the increase of CPU.
diff --git a/docs/mixcr/reference/mixcr-exportShmTrees.md b/docs/mixcr/reference/mixcr-exportShmTrees.md
@@ -646,9 +646,6 @@ for the full list and formatting)
 `-isotype [(primary|subclass|auto)]`
 : Export isotype for IGH chains if it's distinguishable. `primary` will resolve 'IgA', 'IgD', 'IgG', 'IgE', 'IgM'. `subtype` will try resolve isotypes like 'IgA1' or 'IgA2'. Default `auto` will automatically decide whether to resolve the primary or subtype isotype based on the level of detail distinguishable for each clone.
 
-`-topChains`
-: Top chains
-
 `-geneLabel <label>`
 : Export gene label (i.e. ReliableChain)
 

diff --git a/docs/mixcr/reference/mixcr-findAlleles.md b/docs/mixcr/reference/mixcr-findAlleles.md
@@ -1,16 +1,14 @@
 # `mixcr findAlleles`
 
-Finds V- and J-gene allelic variants in a given sample(s). As result MiXCR creates a new [repseq.io](ref-repseqio-json-format.md) reference library and re-aligns clonotypes against it.
+By default, all built-in MiXCR reference libraries have a single `*00` allele for each gene (e.g. `IGHV7-4-1*00`, `IGHV3-47*00`,
+etc.). Because it is difficult to distinguish a true allelic variant from sample preparation errors or hypermutations (for B cells), including those in hot spot positions, MiXCR uses a dedicated algorithm that looks at the presence of a given gene sequence across multiple different clones from the same organism to validate allelic variants with sufficient statistical significance. `mixcr findAlleles` finds V- and J-gene allelic variants in the given sample(s), creates a new [repseq.io](ref-repseqio-json-format.md) reference library and re-aligns clonotypes against it, replacing the original `*00` with the inferred alleles.
 
 ![](pics/findAlleles-light.svg#only-light)
 ![](pics/findAlleles-dark.svg#only-dark)
 
-Note that clontypes passed as input must be cut by and fully covered by the same [gene feature](mixcr-assemble.md#core-assembler-parameters). So, for example `.clns` files with [contigs](overview-analysis-overview.md#contig-assemblymixcr-assemblecontigsmd), must be assembled using [`assembleContigs`](mixcr-assembleContigs.md) with `--assemble-contigs-by` option.
-
-Also, all inputs must have the same align library, the same scoring of V and J genes and the same features to align.
-
-Allele inference algorithms applies different strategies to identify allelic variants with sufficient statistical significance. The algorithm for B-cells reliably discriminate between somatic hypermutations (including those in hot spot positions) and real allelic variants.
+Clonotypes passed as input must be assembled by the same [gene feature](mixcr-assemble.md#core-assembler-parameters). For example, `.clns` files with [contigs](overview-analysis-overview.md#contig-assemblymixcr-assemblecontigsmd) must be assembled using [`assembleContigs`](mixcr-assembleContigs.md) with the `--assemble-contigs-by` option. All input `.clns` files must have been generated using the same initial reference library, with the same scoring of V and J genes and the same features to align.
 
+Note that allelic inference requires a substantial number of clones for a given V/J gene to return a statistically significant result. If the data do not contain enough information to determine an allele for a certain gene, this gene will retain the original `*00` allele number.
 
 ## Command line options
 
@@ -66,7 +64,7 @@ For `.fasta` library will be written in FASTA format with gene name and reliable
 : Overrides default build SHM parameter values
 
 `-r, --report <path>`
-: [Report](./report-findAlleles.md) file (human readable version, see `-j / --json-report` for machine readable report).
+: [Report](./report-findAlleles.md) file (human-readable version, see `-j / --json-report` for machine-readable report).
 
 `-j, --json-report <path>`
 : JSON formatted [report](./report-findAlleles.md) file.
@@ -83,6 +81,9 @@ For `.fasta` library will be written in FASTA format with gene name and reliable
 `-nw, --no-warnings`
 : Suppress all warning messages.
 
+`--dont-remove-unused-genes`
+: Do not remove genes that were not found in the sample(s) from the new library.
+
 `--verbose`
 : Verbose messages.
 
@@ -116,43 +117,53 @@ mixcr findAlleles \
 Summary table produced with `--export-alleles-mutations` contain the following columns:
 
 `alleleName`
-: allele name in a resulting library; for novel allelic variants will contain count of mutations from known allele and number of mutations in CDR3
+: allele name in a resulting library; for novel allelic variants will contain count of mutations from known allele and number of mutations in CDR3. Some alleles will still have `*00` due to the lack of data for statistically significant identification.
 
 `geneName`
 : gene name; the same for heterozygous
 
 `type`
-: V or J
+: `Variable` or `Joining` gene segment.
+
+`status`
+: `DE_NOVO` - new allelic variant aligned to the known one with mismatches
+  `FOUND_KNOWN_VARIANT` - known allele from the database
+  `ALIGNED_ON_KNOWN_VARIANT` - not enough info to search all present alleles. But the one found is a known variant from the library
+  `NOT_CHANGED_AFTER_SEARCH` - the allele was originally identified correctly
+  `COULD_NOT_BE_ALIGNED_ON_KNOWN_VARIANT` - the search was done, but there was not enough info to identify the allele correctly
+  `NO_CLONES_TO_SEARCH` - not enough clones to perform the search
+  `REMOVED_BECAUSE_NO_TOP_HITS_IN_RESULT_FILES` - there is no clone where this gene is the top hit
+  `REMOVED_BECAUSE_NOT_REPRESENTED_IN_SOURCE_FILES` - this gene was not present in the original data
 
 `enoughInfo`
-: Is there was enough info to infer an allele or it's absence.
+: `true` or `false`; states whether there was enough info to infer an allele.
 
-`alleleMutationsReliableGeneFeatures`
+`reliableRegion`
 : gene features inside which allele was found (including CDR3 part that was used for search)
 
-`alleleMutationsReliableRanges`
-: ranges in genome of `alleleMutationsReliableGeneFeatures`
-
 `mutations`
-: allele mutations from germline
+: allele mutations relative to another known allelic variant
 
-`clonesCount`
-: clones count that was aligned to this allele
+`varianceOf`
+: refers to the identifier of a known allelic variant from which the current variant has mutated.
 
 `naivesCount`
-: count of clones with no mutations in V and J
+: count of clones with no hypermutations in V and J
 
 `lowerDiversityBound`
-: lower bound of diversity of clones
+: lower bound of diversity of clones: the number of combinations of a J gene for a V gene (and of a V gene for a J gene) with different CDR3 lengths.
+
+`clonesCount`
+: clones count that was aligned to this allele
 
 `totalClonesCountForGene`
-: total clones count of this allele and its zygotes (the same `geneName`)
+: total clones count of this allele and its zygotes (the same `geneName`), counted before realignment.
 
 `clonesCountWithNegativeScoreChange`
 : count of clones that align better on original library than on build one
 
 `filteredForAlleleSearchNaivesCount`
-: counts of clones with no mutations in V and J after `useClonesWithCountGreaterThen` filter
+: counts of clones with no mutations in V and J after `useClonesWithCountGreaterThen` filter.
 
 `filteredForAlleleSearchClonesCount`
 : counts of clones after `useClonesWithCountGreaterThen` filter
@@ -161,7 +172,7 @@ Summary table produced with `--export-alleles-mutations` contain the following c
 : count of clones that align better on original library than on build one after `useClonesWithCountGreaterThen` filter
 
 `scoreDelta`
-: stats of score change of clones (size, sum, min, max, avg, quadraticMean, stdDeviation)
+: stats of score change of clones after realignment (size, sum, min, max, avg, quadraticMean, stdDeviation)
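
As a rough illustration of how the columns above might be consumed, the sketch below filters alleles by `status`. It assumes the summary table is a tab-separated file with a header row that uses the column names listed above. The sample rows, gene names and the helper name `alleles_with_status` are made up for the example and do not come from a real run.

```python
import csv
import io

# Made-up rows for illustration only; a real table contains more columns.
SAMPLE_TSV = (
    "alleleName\tgeneName\ttype\tstatus\tenoughInfo\n"
    "GENE1*01\tGENE1\tVariable\tFOUND_KNOWN_VARIANT\ttrue\n"
    "GENE2-M1-CDR3-0\tGENE2\tVariable\tDE_NOVO\ttrue\n"
    "GENE3*00\tGENE3\tJoining\tNO_CLONES_TO_SEARCH\tfalse\n"
)


def alleles_with_status(tsv_text: str, status: str) -> list[str]:
    """Return the alleleName values of all rows whose status column matches."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["alleleName"] for row in reader if row["status"] == status]


print(alleles_with_status(SAMPLE_TSV, "DE_NOVO"))
```

For a real file, the same function could be applied to the text read from the path passed to `--export-alleles-mutations`.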
 
 
 ## Allele inference algorithm parameters
@@ -206,3 +217,4 @@ Below one can find parameters of inference algorithms that may be tuned.
 
 `-OsearchMutationsInCDR3=null`
 : If searchMutationsInCDR3 set to null there will be no search for mutations in CDR3
+
diff --git a/docs/mixcr/reference/mixcr-groupClones.md b/docs/mixcr/reference/mixcr-groupClones.md
@@ -1,9 +1,11 @@
 # `mixcr groupClones`
 
-Groups clones in .clna/.clns files by Cell tags. Grouped clones can be exported using [`mixcr exportCloneGroups`](./mixcr-export.md#clone-groups-by-cell). 
+Groups clones in .clna/.clns files by Cell tags. Grouped clones can be exported using [`mixcr exportCloneGroups`](./mixcr-export.md#clone-groups-by-cell). Each group represents a reliable set of clones (chains) present in all cells within the group. Some clones cannot be assigned to any group and will be labeled as `undefined`. Additionally, some clones may be labeled as `contamination` if they are evenly spread across multiple different cell groups. See the detailed explanation below.
+
 
 ```
-mixcr exportClonesOverlap 
+mixcr groupClones 
+    [-O <key=value>]
     [--report <path>] 
     [--json-report <path>] 
     [--use-local-temp]
@@ -41,3 +43,34 @@ mixcr exportClonesOverlap
 
 `-h, --help`
 : Show this help message and exit.
+
+`-O <key=value>`
+: Overrides for the clone grouping parameters.
+
+
+This function looks at how the set of clones in one cell corresponds to the sets of clones in other cells. Note that it does not rely on the read/UMI count of the clone, as abundance filters have already been applied during barcode correction and clonotype assembly.
+
+The process begins with a clone that has the highest count of **cells** associated with it. The algorithm then attempts to find all clones that can be connected to this base clone through cell IDs and form a group. To include a connected clone in a group, two conditions must be satisfied:
+
+- The cell IDs of the base clone should overlap with those of the connected clone by at least 80%.
+- The original cell IDs (before calculations) of the connected clone should overlap with the cell IDs of the base clone by at least 20%.
+
+If there are no connected clones, or if all connected clones meet the first requirement but not the second, a group consisting solely of the base clone is formed. This approach addresses situations where only one chain was expressed, and all cells are contaminated.
+
+Cell IDs excluded during the intersection should be saved for future rounds of calculation, as a single clone can be part of several groups. The process is repeated for each clone. The threshold for guaranteed overlap should be sufficient to filter out cross-contamination by clones represented randomly in many cells.
+
+Therefore, grouping results from merging similar cells into a group, and the crucial aspect is the clonotype content of the cells rather than the read count. `Undefined` means there was not enough information across the dataset to assign a certain clone to a group. This might still be a valid cell or perhaps two cells in one well, but we can't really tell for sure. There is also a possible `Contamination` value, that indicates that more than 80% of the cells in which this clone has been found have already been assigned to other groups with different sets of clones, suggesting random contamination of different cells.
+
+Example:
+
+Imagine we have the following cells:
+
+| Number of cells | Set of clones |
+|-----------------|---------------|
+| 10              | a, b, c       |
+| 5               | d, g          |
+| 5               | e, f          |
+| 1               | a, c, d, e    |
+
+Clones `a`, `b`, and `c` will be assigned to group 1 because we have 10 cells with this set of clones. Clones `d` and `g` will be assigned to group 2, and clones `e` and `f` to group 3. However, for the last cell, we can't assign `a` and `c` to group 1 because group 1 does not include clones `d` and `e`. Additionally, clone `d` has only been observed in a set with `g`, and clone `e` only with `f`. Therefore, all four clones from this last cell can't be assigned to any group and will be labeled as `Undefined`. It could be that this cell is a doublet of two cells: `a`, `c` and `d`, `e`, or maybe `a`, `e` and `d`, `c`, etc., but we can't be sure. This example only illustrates the logic behind the algorithm; the actual thresholds described above may not be met here.
+
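
The two overlap conditions can be sketched on the example table as follows. This is only an illustration under stated assumptions: "overlap by at least 80%" is read as the share of the base clone's cells that also contain the connected clone, and "at least 20%" as the share of the connected clone's original cells that also contain the base clone. The function names and cell labels are hypothetical, and the bookkeeping of excluded cell IDs between rounds is left out.

```python
def overlap_fraction(cells: set, other: set) -> float:
    """Fraction of `cells` that is also present in `other`."""
    return len(cells & other) / len(cells) if cells else 0.0


def can_join_group(base_cells, connected_cells, connected_original_cells,
                   base_threshold=0.8, connected_threshold=0.2):
    """Assumed reading of the two conditions for adding a clone to the base group."""
    return (overlap_fraction(base_cells, connected_cells) >= base_threshold
            and overlap_fraction(connected_original_cells, base_cells) >= connected_threshold)


# Cells from the example table: 10 x {a, b, c}, 5 x {d, g}, 5 x {e, f}, 1 x {a, c, d, e}.
abc = {f"abc{i}" for i in range(10)}
dg = {f"dg{i}" for i in range(5)}
ef = {f"ef{i}" for i in range(5)}
mixed = {"mixed0"}

cells_by_clone = {
    "a": abc | mixed, "b": abc, "c": abc | mixed,
    "d": dg | mixed, "g": dg,
    "e": ef | mixed, "f": ef,
}

# Clone `a` is taken as the base clone; no cells have been excluded yet,
# so each clone's current and original cell sets are identical here.
base = cells_by_clone["a"]
joined = sorted(
    clone for clone, cells in cells_by_clone.items()
    if clone != "a" and can_join_group(base, cells, cells)
)
print(joined)
```

Under these assumptions only `b` and `c` pass both checks against base clone `a`, while `d` and `e` share a single cell with it and fall far below the 80% threshold.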
diff --git a/docs/mixcr/reference/mixcr-mergeLibrary.md b/docs/mixcr/reference/mixcr-mergeLibrary.md
@@ -1,11 +1,11 @@
 # `mixcr mergeLibrary`
 
-Merge list of custom V/D/J/C gene segment libraries into one library. See [how to create custom library](../guides/create-custom-library.md).
+Merge a list of custom V/D/J/C gene segment libraries into one library. See [how to create custom library](../guides/create-custom-library.md).
 
 ## Command line options
 
 ```
-mixcr buildLibrary 
+mixcr mergeLibrary 
     [--force-overwrite]
     [--no-warnings]
     [--verbose]

diff --git a/docs/mixcr/reference/overview-built-in-presets.md b/docs/mixcr/reference/overview-built-in-presets.md
@@ -32,8 +32,10 @@ Bellow you one can find a variety of presets for different types of input data a
 --8<-- "reference/presets/_thermofisher.md_"
 --8<-- "reference/presets/_invivoscribe.md_"
 --8<-- "reference/presets/_takara.md_"
+--8<-- "reference/presets/_idt.md_"
 --8<-- "reference/presets/_bd.md_"
 --8<-- "reference/presets/_nanopore.md_"
+--8<-- "reference/presets/_pacbio.md_"
 --8<-- "reference/presets/_parsebio.md_"
 --8<-- "reference/presets/_singleron.md_"