Hackathon goals

Tests

We have set up a variety of tests:

"Spiked references": create 4 modified versions of the M. tuberculosis reference genome with increasing numbers of SNPs added (distributed uniformly). Use real (empirical) Illumina data from the same type strain: in theory, a pipeline should call precisely the SNPs we put into the reference. In practice, one also needs to handle mutations that have occurred in the strain since the reference was assembled.
Sanity test: Walker cluster data, a dataset of ~50 samples from a well-studied outbreak. The dataset contains two clusters, plus non-cluster samples.
Resistotyping: the Walker cluster dataset also has standard DST data for comparison with in silico predictions.
Single-sample with "gold-standard" truth [yadda yadda there is no truth etc]: the F11 sample from A. Earl's Pilon paper has a high-quality reference genome, plus a manually curated whole-genome alignment and associated variant calls, ranging across the mutation spectrum from SNPs, through indels, to large SVs and regions where only roughly described events occur. Goal: measure the proportion of the genome that is accessible, and the false negative and false positive rates across the mutation spectrum.
Species identification: a dataset of ~300 mycobacterial samples with HAIN assay truth for species.
If we have time, we also have replicates to test.
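
As an illustration of the "spiked references" idea, here is a minimal sketch of adding SNPs at uniformly random positions to a reference sequence while recording the truth set. The function name `spike_snps`, the fixed seed, and the toy sequence are assumptions for illustration only, not the scripts actually used for this repo.

```python
import random


def spike_snps(sequence, n_snps, seed=0):
    """Return (mutated_sequence, truth) for a hypothetical spiked reference.

    truth is a sorted list of (position_1_based, original_base, spiked_base).
    Positions are drawn uniformly without replacement from unambiguous bases.
    """
    rng = random.Random(seed)  # fixed seed so the spiked reference is reproducible
    candidates = [i for i, base in enumerate(sequence) if base.upper() in "ACGT"]
    if n_snps > len(candidates):
        raise ValueError("more SNPs requested than mutable positions")

    positions = sorted(rng.sample(candidates, n_snps))
    mutated = list(sequence)
    truth = []
    for pos in positions:
        original = sequence[pos].upper()
        # Pick any base other than the original one
        spiked = rng.choice([b for b in "ACGT" if b != original])
        mutated[pos] = spiked
        truth.append((pos + 1, original, spiked))
    return "".join(mutated), truth


toy_reference = "ACGT" * 25  # stand-in for a real genome sequence
spiked_reference, truth = spike_snps(toy_reference, n_snps=10)
```

Note that when reads from the unmodified strain are mapped to the spiked reference, a caller would be expected to report the spiked base as REF and the original base as ALT at each recorded position.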

There are some extra deliverables, beyond the tests listed above:

Automated analysis of callsets against the above datasets/tests
Maybe reorganise this repo so there is one directory per test, plus a summary markup file per test?
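
The automated analysis could look something like the following sketch, which compares a callset against a truth set and reports counts per variant class. It assumes both sets have already been reduced to `(chrom, pos, ref, alt)` tuples in the same normalised representation; the function names, the class definitions, and the choice to report false positives as a fraction of all calls are assumptions, not an agreed metric.

```python
from collections import Counter


def variant_class(ref, alt):
    """Crude classification by allele length (an assumed, simplified scheme)."""
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) == len(alt):
        return "mnp"
    return "indel"


def compare_calls(truth, calls):
    """Summarise exact-match agreement between truth and calls per variant class."""
    truth, calls = set(truth), set(calls)
    groups = (
        ("tp", truth & calls),  # called and in truth
        ("fn", truth - calls),  # in truth but missed
        ("fp", calls - truth),  # called but not in truth
    )
    counts = {}
    for label, variants in groups:
        for _chrom, _pos, ref, alt in variants:
            counts.setdefault(variant_class(ref, alt), Counter())[label] += 1

    summary = {}
    for cls, c in counts.items():
        tp, fn, fp = c["tp"], c["fn"], c["fp"]
        summary[cls] = {
            "tp": tp,
            "fn": fn,
            "fp": fp,
            # Fraction of truth variants that were missed
            "fn_rate": fn / (tp + fn) if tp + fn else None,
            # Fraction of calls that are not in the truth set
            "fp_fraction": fp / (tp + fp) if tp + fp else None,
        }
    return summary


example_truth = [("chr", 10, "A", "G"), ("chr", 20, "C", "T"), ("chr", 30, "AT", "A")]
example_calls = [("chr", 10, "A", "G"), ("chr", 40, "G", "C"), ("chr", 30, "AT", "A")]
example_summary = compare_calls(example_truth, example_calls)
```

Exact tuple matching is deliberately strict: the same indel written two different ways would count as one false negative plus one false positive, so normalising representation beforehand matters.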
