For lack of a better place for this, our collaboration with Be The Match will require the following (a sketch of the transform and merge steps appears after the list):
- Download BAM files from s3, transform to ADAM Avro+Parquet, and upload to s3 (transform_alignments)
- Download ADAM Avro+Parquet alignments for multiple samples from s3, update record groups to prevent collisions, merge into a single multi-sample ADAM Avro+Parquet alignments data set, and upload to s3 (merge_alignments)
- Report BAM file sizes, single-sample ADAM Avro+Parquet alignments file sizes, and the merged ADAM Avro+Parquet alignments file size
- Download VCF files from s3, transform to ADAM Avro+Parquet variants and genotypes, and upload to s3 (transform_variants, transform_genotypes)
- Download ADAM Avro+Parquet variants for multiple samples from s3, merge into a single sites-only ADAM Avro+Parquet variants data set, and upload to s3 (merge_variants)
- Download ADAM Avro+Parquet genotypes for multiple samples from s3, merge into a single multi-sample ADAM Avro+Parquet genotypes data set, and upload to s3 (merge_genotypes)
- Report VCF file sizes, single-sample ADAM Avro+Parquet variants and genotypes file sizes, and merged ADAM Avro+Parquet variants and genotypes file sizes
- Notebook with queries comparing access performance of native files read via s3 vs. transformed ADAM Avro+Parquet files read via s3
- Documentation on how to run these workflows
- Short manuscript on the transformation process, storage requirements, and access performance
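As a rough illustration of the transform and merge steps, here is a minimal Scala sketch against the ADAM API, the kind of thing you would run from spark-shell or a Zeppelin paragraph. The bucket and sample paths are hypothetical, this is not the project's actual scripts, and method names such as `union` may differ across ADAM versions:

```scala
import org.apache.spark.SparkContext
import org.bdgenomics.adam.rdd.ADAMContext._

def transformAndMerge(sc: SparkContext): Unit = {
  // transform_alignments: read BAM, write ADAM Avro+Parquet
  val sample1 = sc.loadAlignments("s3a://bucket/sample1.bam")
  sample1.saveAsParquet("s3a://bucket/sample1.alignments.adam")

  val sample2 = sc.loadAlignments("s3a://bucket/sample2.bam")
  sample2.saveAsParquet("s3a://bucket/sample2.alignments.adam")

  // merge_alignments: reload the per-sample Parquet alignments and union
  // them into one multi-sample data set. Record group names would need to
  // be made unique per sample before this step to prevent collisions.
  val merged = sc.loadAlignments("s3a://bucket/sample1.alignments.adam")
    .union(sc.loadAlignments("s3a://bucket/sample2.alignments.adam"))
  merged.saveAsParquet("s3a://bucket/merged.alignments.adam")

  // transform_variants/transform_genotypes and merge_variants/merge_genotypes
  // follow the same pattern via sc.loadVariants and sc.loadGenotypes.
}
```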
There hasn't been an ask for realigning reads, re-calling variants, annotating variants with SnpEff, or joint genotyping yet, but there could be in the near future.
In a meeting this afternoon, they decided to use Apache Zeppelin on Amazon EMR for this use case.
With some clicking around, we got ADAM installed on Zeppelin using its Maven Central coordinates. We still need to do a bit more digging to figure out where to set the Kryo Spark configuration parameters, and to create a separate EMR step for Conductor (we used s3-dist-cp).
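For reference, these are the two Spark properties ADAM's documentation calls for. The sketch below sets them on a SparkConf for illustration; on EMR they would more likely go in the Zeppelin Spark interpreter settings or a spark-defaults configuration classification rather than in notebook code:

```scala
import org.apache.spark.SparkConf

// Kryo serialization settings ADAM expects; shown programmatically here,
// but typically set in the Zeppelin interpreter properties or in an EMR
// spark-defaults configuration.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
```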
All the transformations to ADAM Avro+Parquet have been run on EMR clusters, downloading from s3 to HDFS with Conductor and uploading from HDFS to s3 with s3-dist-cp, using the bash scripts at https://github.com/heuermh/hook.
Notebooks have been implemented in Zeppelin and RStudio on EMR.
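As an illustration of the kind of comparison query the notebooks run, a Zeppelin paragraph might time the same count against both representations. The paths and the timing helper are hypothetical, a minimal sketch rather than the notebooks' actual contents:

```scala
import org.apache.spark.SparkContext
import org.bdgenomics.adam.rdd.ADAMContext._

// Hypothetical timing harness for comparing access performance of a
// native BAM vs. its transformed Avro+Parquet representation on s3.
def time[A](label: String)(f: => A): A = {
  val start = System.nanoTime()
  val result = f
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

def compare(sc: SparkContext): Unit = {
  val bamCount = time("count, BAM via s3") {
    sc.loadAlignments("s3a://bucket/sample1.bam").rdd.count()
  }
  val parquetCount = time("count, Avro+Parquet via s3") {
    sc.loadAlignments("s3a://bucket/sample1.alignments.adam").rdd.count()
  }
  // Sanity check: both representations should hold the same records.
  println(s"records: bam=$bamCount parquet=$parquetCount")
}
```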
The conversation about merging samples into larger data sets has not happened yet.
fnothaft pushed a commit to fnothaft/workflows that referenced this issue on Sep 7, 2017.