Code for the analysis of k-mer- and structural variant-based GWAS in soybean

Overview

This repository contains all the code needed to reproduce the analyses presented in the paper titled "k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean".

As a disclaimer, readers should be aware that most of the code was reorganized and integrated into the Makefile only after analyses were performed. Therefore, those trying to run the analyses might run into issues related to paths or software version incompatibilities. We encourage users who encounter problems while trying to run this code to open a GitHub issue or contact the repo maintainer directly. We believe that the code in this repository and the associated Makefile should still be useful to help those interested in understanding the analyses that were performed.

Software dependencies

The following software should be installed to reproduce the analyses. Some of these programs may themselves have additional dependencies. It is assumed that all these programs are found in your $PATH for the analyses to run properly.

Some programs needed for reproducing analyses were modified from existing software:

We forked svmu and slightly modified it to improve memory usage and execution speed. This version can be installed from our fork by using the commit 378719b on branch malemay-fork.
The script scripts/addMissingPaddingGmax4.py was adapted from addMissingPaddingHg38.py to use the soybean reference genome instead of the human reference genome.
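
Since the analyses assume that every program is found in your $PATH, it may help to check this before running anything. The following is a minimal sketch of such a check, not part of the repository. The program names in `REQUIRED_PROGRAMS` are only illustrative assumptions (for example, `bbduk.sh` is assumed to be the executable name for BBDUK) and should be replaced with the actual dependencies.

```python
import shutil

# Illustrative subset only: these executable names are assumptions and
# should be replaced with the real list of dependencies.
REQUIRED_PROGRAMS = ["svmu", "bbduk.sh", "make"]


def missing_programs(programs):
    """Return the programs that cannot be found on the current PATH."""
    return [name for name in programs if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_programs(REQUIRED_PROGRAMS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed programs were found on PATH.")
```

This only confirms that an executable with the given name is reachable. It does not verify versions, which the disclaimer above identifies as another possible source of problems.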

Data availability

Sequencing data

The Illumina data used in this project is available from the SRA using the BioProject accession numbers PRJNA257011, PRJNA289660, and PRJNA639876. This data should be placed under illumina_data/raw_fastq/ in compressed fastq format to reproduce the analyses.

High-quality assemblies for SV calling

The assemblies of Liu et al. (2020) are available on the Genome Warehouse through Accession Number PRJCA002030.
The assemblies of ZH13, W05 and Lee are available on SoyBase.

All these assemblies should be placed in external_data/genome_assemblies/.

Structural variants called from Oxford Nanopore data

The SVs identified from Oxford Nanopore data by Lemay et al. (2022) are available on figshare. These should be placed under external_data/nanopore_svs/.

SoySNP50K calls

SoySNP50K genotype calls are available from SoyBase. These should be placed under external_data/soysnp50k_wm82.a2_41317.vcf.gz.

Reference data

The following datasets are available from the Web and should be added to the repository to reproduce the analyses:

The reference genome sequence and annotation of soybean cultivar Williams82, assembly version 4 can be downloaded from Phytozome. The files needed (Gmax_508_v4.0.fa, Gmax_508_Wm82.a4.v1.gene_exons.gff3) should be placed under refgenome/.
The soybean chloroplast and mitochondrion genome sequences can be downloaded from SoyBase. These sequences should be concatenated to Gmax_508_v4.0.fa and the resulting file saved as refgenome/Gmax_508_v4.0_mit_chlp.fasta.
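
As an illustration of the concatenation step, here is a minimal sketch that joins FASTA files one after another. The organelle file names (`mitochondrion.fasta`, `chloroplast.fasta`) and the order in which they are appended are assumptions, since neither is specified here. Only the nuclear genome file name and the output name come from this README.

```python
def concatenate_fasta(inputs, output, chunk_size=1 << 20):
    """Write the input FASTA files one after another into `output`."""
    with open(output, "wb") as out:
        for path in inputs:
            last = b""
            with open(path, "rb") as handle:
                # Copy in chunks so a large genome is never held in memory.
                while True:
                    chunk = handle.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last = chunk[-1:]
            # Add a newline if the file lacked a final one, so the next
            # header does not end up glued to the previous sequence.
            if last and last != b"\n":
                out.write(b"\n")


if __name__ == "__main__":
    concatenate_fasta(
        [
            "refgenome/Gmax_508_v4.0.fa",
            "refgenome/mitochondrion.fasta",  # hypothetical file name
            "refgenome/chloroplast.fasta",    # hypothetical file name
        ],
        "refgenome/Gmax_508_v4.0_mit_chlp.fasta",
    )
```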

Data generated by the analysis

Some of the intermediate datasets generated as part of this analysis are available on figshare.

Other datasets

Reference Illumina adapters are distributed with BBDUK and should be placed under external_data/adapters.fa.
The I locus contig assembled by Tuteja and Vodkin (2008) is available from NCBI and should be placed under external_data/BAC77G7-a.fasta.
The signals discovered by Bandillo et al. (2017) and Bandillo et al. (2015) are available from their publications and were placed under reference_signals/bandillo2017_signals_curated.tsv and reference_signals/bandillo2015_table1_curated.csv, respectively.
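
Because the data come from several sources and must end up at specific locations, a quick check of the layout can save time before launching the analyses. The sketch below is not part of the repository. It simply tests whether the paths named in this section exist relative to the repository root, which is assumed to be the current directory.

```python
from pathlib import Path

# Paths taken from the "Data availability" section of this README.
EXPECTED_PATHS = [
    "illumina_data/raw_fastq",
    "external_data/genome_assemblies",
    "external_data/nanopore_svs",
    "external_data/soysnp50k_wm82.a2_41317.vcf.gz",
    "refgenome/Gmax_508_v4.0.fa",
    "refgenome/Gmax_508_Wm82.a4.v1.gene_exons.gff3",
    "refgenome/Gmax_508_v4.0_mit_chlp.fasta",
    "external_data/adapters.fa",
    "external_data/BAC77G7-a.fasta",
    "reference_signals/bandillo2017_signals_curated.tsv",
    "reference_signals/bandillo2015_table1_curated.csv",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the root of the repository.
    for path in missing_paths("."):
        print("Missing: " + path)
```

The check only looks at whether each path exists. It does not verify that directories contain the expected files or that the files are complete.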

Querying the Makefile

Please refer to the corresponding section in our previous work for instructions on how to use the Makefile to understand the sequence of the analyses performed as part of this work.

Citation

If you use this software, please cite our publication:

Lemay, M.-A., de Ronne, M., Bélanger, R., & Belzile, F. (2023). k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean. The Plant Genome, 16, e20374. doi:10.1002/tpg2.20374
