Code used for "k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean"

malemay/soybean_kmer_gwas

Code for the analysis of k-mer- and structural variant-based GWAS in soybean

Overview

This repository contains all the code needed to reproduce the analyses presented in the paper titled "k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean".

Readers should be aware that most of the code was reorganized and integrated into the Makefile only after the analyses were performed. As a result, anyone trying to run the analyses may encounter issues related to paths or software version incompatibilities. We encourage users who run into problems with this code to open a GitHub issue or contact the repo maintainer directly. Even so, we believe that the code in this repository and the associated Makefile remain useful for understanding the analyses that were performed.

Software dependencies

The following software should be installed to reproduce the analyses. Some of these programs may themselves have additional dependencies. It is assumed that all these programs are found in your $PATH for the analyses to run properly.
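Because the analyses assume that every dependency is reachable through $PATH, it can help to check this before launching the Makefile. The following is a minimal sketch rather than part of the repository: the function name `missing_programs` and the example program names are illustrative assumptions, and the list should be replaced with the actual dependencies of this project.

```python
import shutil


def missing_programs(programs):
    """Return the names in `programs` that cannot be found on $PATH."""
    return [name for name in programs if shutil.which(name) is None]


if __name__ == "__main__":
    # Hypothetical example names; substitute the dependency list of this repository.
    required = ["make", "Rscript"]
    missing = missing_programs(required)
    if missing:
        print("Not found in $PATH: " + ", ".join(missing))
    else:
        print("All listed programs were found in $PATH.")
```

This check only confirms that an executable with the given name exists; it does not verify software versions.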

Some programs needed for reproducing analyses were modified from existing software:

Data availability

Sequencing data

  • The Illumina data used in this project is available from the SRA using the BioProject accession numbers PRJNA257011, PRJNA289660, and PRJNA639876. This data should be placed under illumina_data/raw_fastq/ in compressed fastq format to reproduce the analyses.

High-quality assemblies for SV calling

  • The assemblies of Liu et al. (2020) are available on the Genome Warehouse through Accession Number PRJCA002030.

  • The assemblies of ZH13, W05 and Lee are available on SoyBase.

All these assemblies should be placed in external_data/genome_assemblies/.

Structural variants called from Oxford Nanopore data

The SVs identified from Oxford Nanopore data by Lemay et al. (2022) are available on figshare. These should be placed under external_data/nanopore_svs/.

SoySNP50K calls

SoySNP50K genotype calls are available from SoyBase. The file should be saved as external_data/soysnp50k_wm82.a2_41317.vcf.gz.

Reference data

The following datasets are available from the Web and should be added to the repository to reproduce the analyses:

  • The reference genome sequence and annotation of soybean cultivar Williams82, assembly version 4 can be downloaded from Phytozome. The files needed (Gmax_508_v4.0.fa, Gmax_508_Wm82.a4.v1.gene_exons.gff3) should be placed under refgenome/.

  • The soybean chloroplast and mitochondrion genome sequences can be downloaded from SoyBase. They should be concatenated to Gmax_508_v4.0.fa, and the resulting file should be saved as refgenome/Gmax_508_v4.0_mit_chlp.fasta.
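The concatenation step described above could be done along these lines. This is a sketch under stated assumptions, not the authors' procedure: the function `concatenate_fasta` is illustrative, and the organelle file names `refgenome/chloroplast.fasta` and `refgenome/mitochondrion.fasta` are hypothetical placeholders for whatever names the SoyBase downloads have.

```python
import os
import shutil
from pathlib import Path


def concatenate_fasta(inputs, output):
    """Write the contents of the files in `inputs`, in order, to `output`."""
    output = Path(output)
    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open("wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream the file so that a large genome is never held in memory.
                shutil.copyfileobj(src, out)
                # If the file did not end with a newline, add one so that the
                # next FASTA header starts on its own line.
                if src.tell() > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")


if __name__ == "__main__":
    inputs = [
        "refgenome/Gmax_508_v4.0.fa",
        "refgenome/chloroplast.fasta",    # hypothetical file name
        "refgenome/mitochondrion.fasta",  # hypothetical file name
    ]
    if all(Path(p).is_file() for p in inputs):
        concatenate_fasta(inputs, "refgenome/Gmax_508_v4.0_mit_chlp.fasta")
    else:
        print("Some input files are missing; nothing was written.")
```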

Data generated by the analysis

Some of the intermediate datasets generated as part of this analysis are available on figshare.

Other datasets

  • Reference Illumina adapters are distributed with BBDuk and should be saved as external_data/adapters.fa.

  • The I locus contig assembled by Tuteja and Vodkin (2008) is available from NCBI and should be saved as external_data/BAC77G7-a.fasta.

  • The signals discovered by Bandillo et al. (2017) and Bandillo et al. (2015) are available from their publications and were placed under reference_signals/bandillo2017_signals_curated.tsv and reference_signals/bandillo2015_table1_curated.csv, respectively.
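Since the analyses depend on many externally downloaded files being placed at specific locations, a quick check of the layout can save time before running anything. The sketch below collects the paths named in this section; the function `missing_inputs` and the constant `EXPECTED_INPUTS` are illustrative names that do not exist in the repository, and the check only tests for existence, not for file content.

```python
from pathlib import Path

# Paths named in the "Data availability" section, relative to the repository root.
EXPECTED_INPUTS = [
    "illumina_data/raw_fastq",
    "external_data/genome_assemblies",
    "external_data/nanopore_svs",
    "external_data/soysnp50k_wm82.a2_41317.vcf.gz",
    "refgenome/Gmax_508_v4.0.fa",
    "refgenome/Gmax_508_Wm82.a4.v1.gene_exons.gff3",
    "refgenome/Gmax_508_v4.0_mit_chlp.fasta",
    "external_data/adapters.fa",
    "external_data/BAC77G7-a.fasta",
    "reference_signals/bandillo2017_signals_curated.tsv",
    "reference_signals/bandillo2015_table1_curated.csv",
]


def missing_inputs(root=".", expected=EXPECTED_INPUTS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [path for path in expected if not (root / path).exists()]


if __name__ == "__main__":
    for path in missing_inputs():
        print("Missing: " + path)
```

Running the script from the repository root prints one line per missing file or directory and prints nothing when everything is in place.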

Querying the Makefile

Please refer to the corresponding section in our previous work for instructions on how to use the Makefile to understand the sequence of the analyses performed as part of this work.

Citation

If you use this software, please cite our publication:

Lemay, M.-A., de Ronne, M., Bélanger, R., & Belzile, F. (2023). k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean. The Plant Genome, 16, e20374. doi:10.1002/tpg2.20374
