Improve and optimize pgs calculation (#130)
* Update to pgs-calc v0.6.0

* Disable data export in pgs mode

* Fix phasing only mode and add missing test-data

* Add `pgsCategory` filter

* Fix wrong path in meta file and update yaml

* Fix copy and paste error in includeScores filename

* Update cloudgene.yaml file for pgs

* Improve error reporting in ancestry estimation

* Fix syntax error in exception handling

* Set population to mixed and hide input control

* Make ancestry estimation optional

* Remove prsweb test since we support only score collections

* Remove unused r2 filtering because is now done by Minimac

* Update pgsCategory filter

* Disable password encryption for pgs results

* Send email without password in pgs mode

* Update info about pgs in input validation

* Fix issue in zip file creation

* Fix issue in ancestry estimation

* Update status message

* Update pages

* Add polygenic score calculation to pages

* Add PGS documentation

* Update documentation and add link to testdata

---------

Co-authored-by: seppinho <[email protected]>
lukfor and seppinho authored Dec 18, 2023
1 parent b56d7e5 commit 4d0e7d8
Showing 38 changed files with 912 additions and 471 deletions.
16 changes: 16 additions & 0 deletions docs/pgs/faq.md
@@ -0,0 +1,16 @@
# Frequently Asked Questions

## Can I use the Polygenic Score Calculation extension without an email address?
Yes, the extension can also be used with a username without an email. However, without an email, notifications are not sent, and access to genotyped data may be limited.

## Extending the expiration date or resetting the download counter
Your data is available for 7 days. In case you need an extension, please let [us](/contact) know.

## How can I improve the download speed?
[aria2](https://aria2.github.io/) tries to utilize your maximum download bandwidth. Remember to raise the k parameter significantly (`-k`, `--min-split-size=SIZE`); otherwise you will hit the Michigan Imputation Server download limit for each file (thanks to Anthony Marcketta for pointing this out).
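
As a rough illustration of the advice above, the following sketch builds an `aria2c` command line with a raised `--min-split-size`. The value `100M`, the helper name `build_aria2_command` and the example URL are assumptions made for illustration only; use the private link shown for your own job and pick a split size that suits your files.

```python
import shlex


def build_aria2_command(url, min_split_size="100M"):
    """Build an aria2c command line with a raised --min-split-size (-k).

    The default of "100M" is an illustrative assumption, not a value
    recommended by the server documentation.
    """
    return ["aria2c", f"--min-split-size={min_split_size}", url]


# Placeholder URL; replace it with the private link of your result file.
command = build_aria2_command("https://example.org/path/to/results.zip")
print(shlex.join(command))
```

The command is built as a list so it can be passed to `subprocess.run` without shell quoting issues.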

## Can I download all results at once?
We provide wget commands for all results. Open the results tab; the last column in each row includes direct links to all files.

## Can I perform PGS calculation locally?
Yes. Michigan Imputation Server uses a standalone tool called pgs-calc. It reads the imputed dosages from VCF files and uses them to calculate scores. It supports imputed genotypes from Michigan Imputation Server or TOPMed Imputation Server out of the box, as well as score files from PGS Catalog or PRSWeb instances. In addition, custom score files containing chromosomal positions, both alleles and the effect size can be used. pgs-calc uses the chromosomal positions and alleles to find the corresponding dosages in genotype files, and it also provides tools to resolve rsIDs in score files using dbSNP. It can therefore be applied to genotype files whose variants are not annotated with rsIDs. Moreover, the standalone version provides options to improve coverage by using the provided proxy mapping file for Europeans or a custom population-specific mapping file. pgs-calc is available at https://github.com/lukfor/pgs-calc.
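
To make the idea concrete, here is a minimal sketch of the usual weighted-sum definition of a polygenic score: variants are matched by chromosome, position and effect allele, and each matched dosage is multiplied by its effect size. This is not pgs-calc's actual implementation; the function name, the dictionary layout and the example values are all assumptions made for illustration.

```python
def polygenic_score(dosages, weights):
    """Sum effect_weight * dosage over variants present in both inputs.

    Both dicts are keyed by (chromosome, position, effect_allele); this
    layout is a hypothetical one chosen for the example. Returns the score
    and the fraction of score variants found in the genotype data.
    """
    found = [key for key in weights if key in dosages]
    score = sum(weights[key] * dosages[key] for key in found)
    coverage = len(found) / len(weights) if weights else 0.0
    return score, coverage


# Made-up score file with three variants; only two are present in the sample.
weights = {("1", 1000, "A"): 0.5, ("2", 2000, "T"): -0.25, ("3", 3000, "G"): 1.0}
dosages = {("1", 1000, "A"): 2.0, ("2", 2000, "T"): 1.0}

score, coverage = polygenic_score(dosages, weights)
print(score, coverage)
```

Returning the coverage alongside the score makes it easy to spot scores for which many variants were missing from the genotype data.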
109 changes: 109 additions & 0 deletions docs/pgs/getting-started.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# Polygenic Score Calculation

We provide a user-friendly web interface for applying thousands of published polygenic risk scores to imputed genotypes efficiently.
By extending the popular Michigan Imputation Server, the module integrates seamlessly into the existing imputation workflow and enables users without prior knowledge of the field to take advantage of this method.
The graphical report collects all metadata about the scores in a single place and helps users understand and screen thousands of scores intuitively.

![pipeline.png](images/pipeline.png)

An extensive quality control pipeline runs automatically to detect and fix possible strand flips and to filter out missing SNPs, preventing systematic errors (e.g. lower scores for individuals with missing or incorrectly aligned genetic data).
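
The strand-flip check mentioned above can be sketched as follows. This is a simplified illustration of the general technique, not the server's actual quality control code: it handles single-nucleotide alleles only, treats swapped reference and alternate alleles as a mismatch, and the function name and return labels are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(sample, reference):
    """Compare a sample's (ref, alt) allele pair with the reference panel's pair.

    Returns "match", "strand_flip", "ambiguous" or "mismatch".
    """
    # A/T and C/G SNPs look identical on both strands, so a flip cannot be detected.
    if COMPLEMENT[sample[0]] == sample[1]:
        return "ambiguous"
    if sample == reference:
        return "match"
    # Complement both alleles and compare again to detect a strand flip.
    flipped = (COMPLEMENT[sample[0]], COMPLEMENT[sample[1]])
    if flipped == reference:
        return "strand_flip"
    return "mismatch"


print(classify_alleles(("A", "G"), ("A", "G")))
print(classify_alleles(("A", "G"), ("T", "C")))
print(classify_alleles(("A", "T"), ("A", "T")))
```

A variant classified as a strand flip can be corrected by complementing its alleles, whereas ambiguous and mismatching variants are typically excluded.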

## Getting started

To use the Polygenic Score Calculation extension on Michigan Imputation Server, you must first [register](https://imputationserver.sph.umich.edu/index.html#!pages/register) for an account.
An activation email will be sent to the provided address. Once your email address is verified, you can access the service at no cost.

**Please note that the extension can also be used with a username without an email. However, without an email, notifications are not sent, and access to genotyped data may be limited.**

No dataset at hand? No problem, download our example dataset to test the PGS extension: [50-samples.zip](https://imputationserver.sph.umich.edu/downloads/50-samples.zip).


When incorporating the Polygenic Score Calculation extension in your research, please cite the following papers:

> Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze S, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. [Next-generation genotype imputation service and methods](https://www.ncbi.nlm.nih.gov/pubmed/27571263). Nature Genetics 48, 1284–1287 (2016).

> Samuel A. Lambert, Laurent Gil, Simon Jupp, Scott C. Ritchie, Yu Xu, Annalisa Buniello, Aoife McMahon, Gad Abraham, Michael Chapman, Helen Parkinson, John Danesh, Jacqueline A. L. MacArthur and Michael Inouye. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics. doi: 10.1038/s41588-021-00783-5 (2021).

## Setting up your first Polygenic Score Calculation job

1. [Log in](https://imputationserver.sph.umich.edu/index.html#!pages/login) with your credentials and navigate to the **Run** tab to initiate a new Polygenic Score Calculation job.
2. Click **"Polygenic Score calculation"** to open the submission dialog.
3. The submission dialog allows you to specify job properties.

![](images/submit-job01.png)

The following options are available:


### Reference Panel

Our PGS extension offers genotype imputation from different reference panels. The most accurate and largest panel is **HRC (Version r1.1 2016)**. Please select one that fulfills your needs and supports the population of your input data:

- HRC (Version r1.1 2016)
- 1000 Genomes Phase 3 (Version 5)
- 1000 Genomes Phase 1 (Version 3)
- HapMap 2

More details about all available reference panels can be found [here](/pgs/reference-panels/).

### Upload VCF files from your computer

When using the file upload, data is uploaded from your local file system to Michigan Imputation Server. Clicking **Select Files** opens a file dialog where you can select your VCF files:

![](images/upload-data01.png)

Multiple files can be selected using the `ctrl`, `cmd` or `shift` keys, depending on your operating system.
After you have confirmed your choice, all selected files are listed in the submission dialog:

![](images/upload-data02.png)

Please make sure that all files fulfill the [requirements](/prepare-your-data).


!!! important
    Since version 1.7.2 URL-based uploads (sftp and http) are no longer supported. Please use direct file uploads instead.

### Build
Please select the build of your data. Currently the options **hg19** and **hg38** are supported. Michigan Imputation Server automatically updates the genome positions (liftOver) of your data. All reference panels are based on hg19 coordinates.

### Scores and Trait Category

Choose the precomputed Polygenic Score repository relevant to your study from the available options. Based on the selected repository, different trait categories appear and can be selected (e.g. Cancer scores):

![](images/pgs-repository.png)

More details about all available PGS repositories can be found [here](/pgs/scores/).

### Ancestry Estimation

You can enable ancestry estimation by selecting a reference population used to classify your uploaded samples. Currently, we support a worldwide panel based on HGDP.

## Start Polygenic Score Calculation

After agreeing to the *Terms of Service*, initiate the calculation by clicking on **Submit job**. The system will perform Input Validation and Quality Control immediately. If your data passes these steps, the job is added to the queue for processing.

![](images/queue01.png)

## Monitoring and Retrieving Results

- **Input Validation**: Verify the validity of your uploaded files and review basic statistics.

![](images/input-validation01.png)

- **Quality Control**: Examine the QC report and download statistics after the system filters variants based on various criteria.

![](images/quality-control02.png)

- **Polygenic Score Calculation**: Monitor the progress of the imputation and polygenic score calculation in real time for each chromosome.

![](images/imputation01.png)

## Downloading Results

Upon completion, you will be notified by email if you provided an email address during registration. A zip archive containing the results can be downloaded directly from the server.

![](images/job-results.png)

Click on the filename to download results directly in a web browser. For command line downloads, use the **share** symbol to obtain private links.

**Important**: All data is automatically deleted after 7 days. Download needed data within this timeframe. A reminder is sent 48 hours before data deletion.
Binary file added docs/pgs/images/imputation01.png
Binary file added docs/pgs/images/input-validation01.png
Binary file added docs/pgs/images/pgs-repository.png
Binary file added docs/pgs/images/pipeline.png
Binary file added docs/pgs/images/quality-control02.png
Binary file added docs/pgs/images/report-01.png
Binary file added docs/pgs/images/report-02.png
Binary file added docs/pgs/images/submit-job01.png
Binary file added docs/pgs/images/upload-data01.png
Binary file added docs/pgs/images/upload-data02.png
38 changes: 38 additions & 0 deletions docs/pgs/output-files.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Output Files

The Polygenic Score Calculation Results CSV file provides Polygenic Score (PGS) values for different samples and associated identifiers.
Users can leverage this CSV file to analyze and compare Polygenic Score values across different samples. The data facilitates the investigation of genetic associations and their impact on specific traits or conditions.

## CSV Format

The CSV file consists of a header row and data rows:

### Header Row

- **sample**: Represents the identifier for each sample.
- **PGS000001, PGS000002, PGS000003, ...**: Columns representing different Polygenic Score values associated with the respective identifiers.

### Data Rows

- Each row corresponds to a sample and provides the following information:
    - **sample**: Identifier for the sample.
    - **PGS000001, PGS000002, PGS000003, ...**: Polygenic Score values associated with the respective identifiers for the given sample.

### Example

Here's an example showing the header row followed by one data row:

```csv
sample, PGS000001, PGS000002, PGS000003, ...
sample1, -4.485780284301654, 4.119604924228042, 0.0, -4.485780284301654
```

- **sample1**: Sample identifier.
- **-4.485780284301654**: Polygenic Score value for `PGS000001`.
- **4.119604924228042**: Polygenic Score value for `PGS000002`.
- **0.0**: Polygenic Score value for `PGS000003`.

**Note:**

- Polygenic Score values are provided as floating-point numbers.
- A value of `0.0` can indicate that no Polygenic Score information was available for a particular identifier in a given sample.
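
As a starting point for downstream analysis, the sketch below parses a results file of this shape with Python's standard `csv` module. The function name `read_scores` and the returned structure are illustrative choices rather than part of any provided tooling; `skipinitialspace=True` is used so the parser works whether or not a space follows each comma.

```python
import csv
import io


def read_scores(handle):
    """Return {sample_id: {score_id: value}} from a results CSV."""
    reader = csv.reader(handle, skipinitialspace=True)
    header = next(reader)
    score_ids = header[1:]  # first column holds the sample identifier
    results = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        values = (float(value) for value in row[1:])
        results[row[0]] = dict(zip(score_ids, values))
    return results


# In-memory stand-in for a results file, based on the example row above.
example = io.StringIO(
    "sample, PGS000001, PGS000002, PGS000003\n"
    "sample1, -4.485780284301654, 4.119604924228042, 0.0\n"
)
scores = read_scores(example)
print(scores["sample1"]["PGS000002"])
```

For a real file, pass an open file handle instead, for example `with open("scores.csv", newline="") as handle: scores = read_scores(handle)`, where `scores.csv` stands for wherever you saved the results.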
11 changes: 11 additions & 0 deletions docs/pgs/pipeline.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Pipeline

![pipeline.png](images/pipeline.png)






## Ancestry estimation
We use LASER to perform principal component analysis (PCA) on the genotypes of each sample and to place the samples into a reference PCA space constructed from a set of reference individuals [14]. We built the reference coordinates from 938 samples of the Human Genome Diversity Project (HGDP) [15] and labeled them with the ancestry categories proposed by the GWAS Catalog [16], which are also used in PGS Catalog.
45 changes: 45 additions & 0 deletions docs/pgs/reference-panels.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
# Reference Panels for PGS Calculation

Our server offers PGS calculation on genotypes imputed with the following reference panels:


## HRC (Version r1.1 2016)

The HRC panel consists of 64,940 haplotypes of predominantly European ancestry.

| Property | Value |
| --- | --- |
| Number of Samples | 32,470 |
| Sites (chr1-22) | 39,635,008 |
| Chromosomes | 1-22, X |
| Website | [http://www.haplotype-reference-consortium.org](http://www.haplotype-reference-consortium.org); [HRC r1.1 Release Note](https://imputationserver.sph.umich.edu/start.html#!pages/hrc-r1.1) |

## 1000 Genomes Phase 3 (Version 5)

Phase 3 of the 1000 Genomes Project consists of 5,008 haplotypes from 26 populations across the world.

| Property | Value |
| --- | --- |
| Number of Samples | 2,504 |
| Sites (chr1-22) | 49,143,605 |
| Chromosomes | 1-22, X |
| Website | [http://www.internationalgenome.org](http://www.internationalgenome.org) |


## 1000 Genomes Phase 1 (Version 3)

| Property | Value |
| --- | --- |
| Number of Samples | 1,092 |
| Sites (chr1-22) | 28,975,367 |
| Chromosomes | 1-22, X |
| Website | [http://www.internationalgenome.org](http://www.internationalgenome.org) |

## HapMap 2

| Property | Value |
| --- | --- |
| Number of Samples | 60 |
| Sites (chr1-22) | 2,542,916 |
| Chromosomes | 1-22 |
| Website | [http://www.hapmap.org](http://www.hapmap.org) |
14 changes: 14 additions & 0 deletions docs/pgs/report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Interactive Report

The created report contains a list of all scores, where each score has a different color based on its coverage. The color green indicates that the coverage is very high and nearly all SNPs from the score were also found in the imputed dataset. The color red indicates that very few SNPs were found and the coverage is therefore low.

![report.png](images/report-01.png)

In addition, the report includes detailed metadata for each score such as the number of variants, the number of well-imputed genotypes and the population used to construct the score. A direct link to PGS Catalog, Cancer PRSWeb or ExPRSWeb is also available for further investigation (e.g. for getting information about the method that was used to construct the score). Further, the report displays the distribution of the scores of all uploaded samples and can be interactively explored. This allows users to detect samples with either a high or low risk immediately.

Moreover, the report gives an overview of all estimated ancestries from the uploaded genotypes and compares them with the populations of the GWAS that was used to create the score.

![report.png](images/report-02.png)


If an uploaded sample with an unsupported population is detected, a warning message is provided and the sample is excluded from the summary statistics.
21 changes: 21 additions & 0 deletions docs/pgs/scores.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# Scores

We currently support the following PGS repositories out of the box:

## PGS-Catalog

We use PGS Catalog as the source of scores for PGS Server (version of 19 Jan 2023). The PGS Catalog is an online database that collects and annotates published scores and currently provides access to over 3,900 scores encompassing more than 580 traits.

> Samuel A. Lambert, Laurent Gil, Simon Jupp, Scott C. Ritchie, Yu Xu, Annalisa Buniello, Aoife McMahon, Gad Abraham, Michael Chapman, Helen Parkinson, John Danesh, Jacqueline A. L. MacArthur and Michael Inouye. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics. doi: 10.1038/s41588-021-00783-5 (2021).

## Cancer-PRSweb

Collection of scores for major cancer traits.

> Fritsche LG, Patil S, Beesley LJ, VandeHaar P, Salvatore M, Ma Y, Peng RB, Taliun D, Zhou X, Mukherjee B: Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks. Am J Hum Genet 2020, 107(5):815-836.

## ExPRSweb

Collection of scores for common health-related exposures like body mass index or alcohol consumption.

> Ma Y, Patil S, Zhou X, Mukherjee B, Fritsche LG: ExPRSweb: An online repository with polygenic risk scores for common health-related exposures. Am J Hum Genet 2022, 109(10):1742-1760.
41 changes: 28 additions & 13 deletions files/imputationserver-pgs.yaml
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
id: imputationserver-pgs
name: Genotype Imputation (PGS Calc Integration)
description: This is the new Michigan Imputation Server Pipeline using <a href="https://github.com/statgen/Minimac4">Minimac4</a>. Documentation can be found <a href="http://imputationserver.readthedocs.io/en/latest/">here</a>.<br><br>If your input data is <b>GRCh37/hg19</b> please ensure chromosomes are encoded without prefix (e.g. <b>20</b>).<br>If your input data is <b>GRCh38hg38</b> please ensure chromosomes are encoded with prefix 'chr' (e.g. <b>chr20</b>).
name: Polygenic Score Calculation
description: "You can upload genotyped data and the application imputes your genotypes, performs ancestry estimation and finally calculates Polygenic Risk Scores.<br><br>No dataset at hand? No problem, download our example dataset: <a href=\"https://imputationserver.sph.umich.edu/downloads/50-samples.zip\" class=\"btn btn-sm btn-secondary\" style=\"color:#ffffff !important\"><i class=\"fa fa-file\"></i> 50-samples.zip</a><br><br>"


version: 1.8.0
website: https://imputationserver.readthedocs.io
website: https://imputationserver.readthedocs.io/en/latest/pgs/getting-started
category:

installation:
@@ -53,11 +55,13 @@ workflow:
generates: $local $outputimputation $logfile $hadooplogs
binaries: ${app_hdfs_folder}/bin

#if( $reference != "disabled")
- name: Ancestry Estimation
jar: imputationserver.jar
classname: genepi.imputationserver.steps.ancestry.TraceStep
binaries: ${app_hdfs_folder}/bin
references: ${app_hdfs_folder}/references
#end

- name: Data Compression and Encryption
jar: imputationserver.jar
@@ -95,6 +99,7 @@ workflow:
0.1: 0.1
0.2: 0.2
0.3: 0.3
visible: false

- id: phasing
description: Phasing
@@ -103,14 +108,13 @@ workflow:
values:
eagle: Eagle v2.4 (phased output)
no_phasing: No phasing
visible: false

- id: population
description: Population
type: list
values:
bind: refpanel
property: populations
category: RefPanel
value: mixed
type: text
visible: false

- id: mode
description: Mode
@@ -120,6 +124,7 @@ workflow:
qconly: Quality Control Only
imputation: Quality Control & Imputation
phasing: Quality Control & Phasing Only
visible: false

- id: aesEncryption
description: AES 256 encryption
@@ -129,7 +134,7 @@ workflow:
values:
true: yes
false: no
visible: true
visible: false

- id: meta
description: Generate Meta-imputation file
@@ -138,7 +143,7 @@ workflow:
values:
true: yes
false: no
visible: true
visible: false

- id: myseparator0
type: separator
@@ -154,14 +159,24 @@ workflow:
required: true
category: PGSPanel

- id: pgsCategory
description: Trait Category
type: list
values:
bind: pgsPanel
property: categories
category: PGSPanel

- id: reference
description: Reference Populations
description: Ancestry Estimation
type: list
required: true
value: HGDP_938_genotyped
value: disabled
values:
disabled: "Disabled"
HGDP_938_genotyped: Worldwide (HGDP)
HGDP_938_imputed: Worldwide (imputed HGDP)
#HGDP_938_imputed: Worldwide (imputed HGDP)
visible: true

- id: dim
description: Number of principal components to compute
21 changes: 15 additions & 6 deletions mkdocs.yml
@@ -3,12 +3,21 @@ theme: readthedocs

nav:
- Home: index.md
- Getting Started: getting-started.md
- Data Preparation: prepare-your-data.md
- Reference Panels: reference-panels.md
- Pipeline Overview: pipeline.md
- Security: data-sensitivity.md
- FAQ: faq.md
- Genotype Imputation:
- Getting Started: getting-started.md
- Data Preparation: prepare-your-data.md
- Reference Panels: reference-panels.md
- Pipeline Overview: pipeline.md
- Security: data-sensitivity.md
- FAQ: faq.md
- Polygenic Score Calculation:
- Getting Started: pgs/getting-started.md
- Interactive Report: pgs/report.md
- Output Files: pgs/output-files.md
- Reference Panels: pgs/reference-panels.md
- Available Scores: pgs/scores.md
- Pipeline Overview: pgs/pipeline.md
- FAQ: pgs/faq.md
- Developer Documentation:
- API: api.md
- Docker: docker.md