
The newer version of spark,adam,sparklingwater for "Genomic Analysis Using ADAM, Spark and Deep Learning" to the people who want to reproduce the test #2

@car2008 car2008 commented Aug 26, 2016

I have some advice on "Genomic Analysis Using ADAM, Spark and Deep Learning" for people who want to reproduce the test using newer versions of the tools:

car2008 commented Aug 26, 2016

Hi @nfergu, I have some advice on "Genomic Analysis Using ADAM, Spark and Deep Learning" for people who want to reproduce the test. I am posting all the changes here and hope they are helpful to others.
First, in the pom.xml file:

  • Spark version 1.6.1, replacing 1.2.0
  • ADAM version 0.19.0, replacing 0.16.0
  • Sparkling Water version 1.6.5, replacing 1.2.5
  • H2O version 3.8.2.6, replacing 3.0.0.8 (only the version number needs to change; H2O does not need to be installed separately once Sparkling Water is installed)

<dependency>
        <groupId>org.bdgenomics.adam</groupId>
        <artifactId>adam-core</artifactId>
        <version>${adam.version}</version>
</dependency>
<dependency>
         <groupId>org.bdgenomics.adam</groupId>
         <artifactId>adam-apis</artifactId>
         <version>${adam.version}</version>
</dependency>

is modified to

<dependency>
         <groupId>org.bdgenomics.adam</groupId>
         <artifactId>adam-core_2.10</artifactId>
         <version>${adam.version}</version>
</dependency>
<dependency>
         <groupId>org.bdgenomics.adam</groupId>
         <artifactId>adam-apis_2.10</artifactId>
         <version>${adam.version}</version>
</dependency>

Then, in the code:

val header = StructType(Array(StructField("Region", StringType)) ++
      sortedVariantsBySampleId.first()._2.map(variant => {StructField(variant.variantId.toString, IntegerType)}))

is modified to

val header = DataTypes.createStructType(Array(DataTypes.createStructField("Region", DataTypes.StringType, false)) ++
      sortedVariantsBySampleId.first()._2.map(variant => {DataTypes.createStructField(variant.variantId.toString, DataTypes.IntegerType, false)}))

// Create the SchemaRDD from the header and rows and convert the SchemaRDD into a H2O dataframe
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val schemaRDD = sqlContext.applySchema(rowRDD, header)
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._
    val dataFrame = h2oContext.toDataFrame(schemaRDD)

is modified to

// Create the SchemaRDD from the header and rows and convert the SchemaRDD into a H2O dataframe
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    //val dataFrame=sqlContext.createDataFrame(rowRDD, header)
    val schemaRDD = sqlContext.applySchema(rowRDD, header)
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._ 
    val dataFrame1 =h2oContext.asH2OFrame(schemaRDD)
    val dataFrame=H2OFrameSupport.allStringVecToCategorical(dataFrame1)
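
The commented-out createDataFrame line above points at a possible alternative: applySchema has been deprecated since Spark 1.3, and createDataFrame takes the same two arguments. A minimal sketch of that substitution, using only Spark 1.6 (no H2O); the object name, the column name "variant1" and the two sample rows are made up for illustration and stand in for the real header and rowRDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.DataTypes

// Hypothetical standalone example; not part of the original project.
object CreateDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("createDataFrame-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Stand-in for the real header: one string column plus one integer column.
    val header = DataTypes.createStructType(Array(
      DataTypes.createStructField("Region", DataTypes.StringType, false),
      DataTypes.createStructField("variant1", DataTypes.IntegerType, false)))

    // Stand-in for the real rowRDD (illustrative values only).
    val rowRDD = sc.parallelize(Seq(Row("GBR", 0), Row("ASW", 1)))

    // Same arguments as sqlContext.applySchema(rowRDD, header), without the deprecation warning.
    val schemaRDD = sqlContext.createDataFrame(rowRDD, header)
    schemaRDD.printSchema()

    sc.stop()
  }
}
```

Whether the resulting DataFrame converts through h2oContext.asH2OFrame exactly as before has not been checked here; the applySchema version is the one reported as tested.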

// Split the dataframe into 50% training, 30% test, and 20% validation data
    val frameSplitter = new FrameSplitter(dataFrame, Array(.5, .3), Array("training", "test", "validation").map(Key.make), null)

is modified to

// Split the dataframe into 50% training, 30% test, and 20% validation data
    val frameSplitter = new FrameSplitter(dataFrame, Array(.5, .3), Array("training", "test", "validation").map(Key.make[Frame](_)), null)

// Set the parameters for our deep learning model.
    val deepLearningParameters = new DeepLearningParameters()
    deepLearningParameters._train = training
    deepLearningParameters._valid = validation

is modified to

// Set the parameters for our deep learning model.
    val deepLearningParameters = new DeepLearningParameters()
    deepLearningParameters._train = training._key
    deepLearningParameters._valid = validation._key

// Score the model against the entire dataset (training, test, and validation data)
    // This causes the confusion matrix to be printed
    deepLearningModel.score(dataFrame)('predict)

is modified to

// Score the model against the entire dataset (training, test, and validation data)
    // This causes the confusion matrix to be printed
    deepLearningModel.score(dataFrame)

Finally, add these imports:

import org.apache.spark.sql.types.DataTypes
import hex._
import water.fvec._
import water.support._
import _root_.hex.Distribution.Family
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.tree.gbm.GBMModel
import _root_.hex.{Model, ModelMetricsBinomial}

OK, that's all. I have tested it successfully. Any further advice is welcome. Thank you again!

@car2008 car2008 changed the title The newer version of spark,adam,sparklingwater for Genomic Analysis Using ADAM, Spark and Deep Learning to the people who want to reproduce the test The newer version of spark,adam,sparklingwater for "Genomic Analysis Using ADAM, Spark and Deep Learning" to the people who want to reproduce the test Aug 26, 2016