This repository contains all the code needed to reproduce the analyses presented in the paper titled "k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean".
As a disclaimer, readers should be aware that most of the code was reorganized and integrated into the Makefile only after analyses were performed. Therefore, those trying to run the analyses might run into issues related to paths or software version incompatibilities. We encourage users who encounter problems while trying to run this code to open a GitHub issue or contact the repo maintainer directly. We believe that the code in this repository and the associated Makefile should still be useful to help those interested in understanding the analyses that were performed.
The following software should be installed to reproduce the analyses. Some of
these programs may themselves have additional dependencies. The analyses assume
that all of these programs can be found in your $PATH.
- AsmVar
- BBDuk
- bamaddrg
- BayesTyperTools
- bwa
- bcftools
- gwask R package
- htslib
- katcher and associated binaries
- KMC
- k-mer GWAS binaries
- kmers_ld
- LAST
- MAFFT
- manta
- MUMmer
- paragraph
- Platypus
- PLINK
- R programming language
- samtools
- smoove
- SOAPdenovo2
- SPAdes
- SvABA
- SVanalyzer
- svmutools R package
- TASSEL command-line tools
- TeX Live
- vcftools
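
A quick way to check the $PATH requirement is sketched below in Python. The executable names in PROGRAMS are assumptions and cover only a few of the tools listed above, since several packages install binaries whose names differ from the package name. Adjust the list to match your installation.

```python
import shutil

# Assumed executable names for a subset of the required software.
# These are illustrative; extend or correct them for your own setup.
PROGRAMS = ["bwa", "samtools", "bcftools", "vcftools", "mafft", "Rscript"]


def missing_programs(names):
    """Return the executable names that cannot be found in $PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_programs(PROGRAMS)
    if missing:
        print("Not found in $PATH: " + ", ".join(missing))
    else:
        print("All listed programs were found in $PATH.")
```

This only confirms that an executable with the given name exists. It does not check versions, so version incompatibilities can still occur.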
Some programs needed for reproducing analyses were modified from existing software:
- We forked svmu and slightly modified it to improve memory usage and execution speed. This version can be installed from our fork by using the commit 378719b on branch malemay-fork.
- The script scripts/addMissingPaddingGmax4.py was adapted from addMissingPaddingHg38.py to use the soybean reference genome instead of the human reference genome.
- The Illumina data used in this project is available from the SRA using the BioProject accession numbers PRJNA257011, PRJNA289660, and PRJNA639876. This data should be placed under illumina_data/raw_fastq/ in compressed fastq format to reproduce the analyses.
- The assemblies of Liu et al. (2020) are available on the Genome Warehouse through Accession Number PRJCA002030.
- The assemblies of ZH13, W05 and Lee are available on SoyBase.

All these assemblies should be placed in external_data/genome_assemblies/.
The SVs identified from Oxford Nanopore data by Lemay et al. (2022) are available on figshare. These should be placed under external_data/nanopore_svs/.
SoySNP50K genotype calls are available from SoyBase. The file should be saved as external_data/soysnp50k_wm82.a2_41317.vcf.gz.
The following datasets are available from the Web and should be added to the repository to reproduce the analyses:
- The reference genome sequence and annotation of soybean cultivar Williams82, assembly version 4 can be downloaded from Phytozome. The files needed (Gmax_508_v4.0.fa, Gmax_508_Wm82.a4.v1.gene_exons.gff3) should be placed under refgenome/.
- The soybean chloroplast and mitochondrion genome sequences can be downloaded from SoyBase. They should also be concatenated to Gmax_508_v4.0.fa and placed under the name refgenome/Gmax_508_v4.0_mit_chlp.fasta.
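
The concatenation of the organelle sequences to the reference genome could be done along the following lines. This is a minimal sketch, not the procedure used by the authors: the organelle file names in the usage comment are hypothetical, and the order in which the sequences are appended is an assumption.

```python
def concatenate_fasta(inputs, output, chunk_size=1 << 20):
    """Write the input FASTA files one after another into a single output file.

    Files are streamed in chunks so that a whole genome is never held in memory.
    A newline is added after any input that does not end with one, so that the
    next header line always starts on its own line.
    """
    with open(output, "wb") as out:
        for path in inputs:
            last = b"\n"  # an empty input needs no extra newline
            with open(path, "rb") as src:
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")


# Example usage (the two organelle file names are hypothetical placeholders):
# concatenate_fasta(
#     ["refgenome/Gmax_508_v4.0.fa",
#      "refgenome/mitochondrion.fasta",
#      "refgenome/chloroplast.fasta"],
#     "refgenome/Gmax_508_v4.0_mit_chlp.fasta",
# )
```

The inputs must be uncompressed FASTA files for a plain byte-level concatenation like this one to be valid.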
Some of the intermediate datasets generated as part of this analysis are available on figshare.
- Reference Illumina adapters are distributed with BBDuk and should be placed under external_data/adapters.fa.
- The I locus contig assembled by Tuteja and Vodkin (2008) is available from NCBI and should be placed under external_data/BAC77G7-a.fasta.
- The signals discovered by Bandillo et al. (2017) and Bandillo et al. (2015) are available from their publications and were placed under reference_signals/bandillo2017_signals_curated.tsv and reference_signals/bandillo2015_table1_curated.csv, respectively.
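
Before running the Makefile, it may help to confirm that the external inputs are where the analyses expect them. The sketch below simply checks that each path mentioned above exists relative to the repository root. The function name and the idea of running it from the root are assumptions for illustration, and the check does not validate file contents.

```python
from pathlib import Path

# Paths taken from the data placement instructions in this README.
EXPECTED_INPUTS = [
    "illumina_data/raw_fastq",
    "external_data/genome_assemblies",
    "external_data/nanopore_svs",
    "external_data/soysnp50k_wm82.a2_41317.vcf.gz",
    "refgenome/Gmax_508_v4.0.fa",
    "refgenome/Gmax_508_Wm82.a4.v1.gene_exons.gff3",
    "refgenome/Gmax_508_v4.0_mit_chlp.fasta",
    "external_data/adapters.fa",
    "external_data/BAC77G7-a.fasta",
    "reference_signals/bandillo2017_signals_curated.tsv",
    "reference_signals/bandillo2015_table1_curated.csv",
]


def missing_inputs(root, expected=EXPECTED_INPUTS):
    """Return the expected paths that do not exist under the given root."""
    root = Path(root)
    return [path for path in expected if not (root / path).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the root of the repository.
    for path in missing_inputs("."):
        print("Missing input: " + path)
```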
Please refer to the corresponding section in our previous work for instructions on how to use the Makefile to understand the sequence of the analyses performed as part of this work.
If you use this software, please cite our publication:
Lemay, M.-A., de Ronne, M., Bélanger, R., & Belzile, F. (2023). k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean. The Plant Genome, 16, e20374. doi:10.1002/tpg2.20374