This repository has been archived by the owner on Nov 28, 2020. It is now read-only.

BAMBDGDataSource genomic intervals predicate pushdowns using BAI

Marek Wiewiórka edited this page Jul 25, 2018 · 4 revisions
spark-shell --master=local[4] \
--driver-memory=8g \
--jars /Users/marek/git/forks/bdg-sequila/target/scala-2.11/bdg-sequila-assembly-0.4.1-SNAPSHOT.jar
import org.apache.spark.sql.SequilaSession
import org.biodatageeks.utils.{SequilaRegister, UDFRegister}

val ss = SequilaSession(spark)
/*inject bdg-granges strategy*/
SequilaRegister.register(ss)

ss.sql("""
CREATE TABLE reads_exome USING org.biodatageeks.datasources.BAM.BAMDataSource OPTIONS(path '/Users/marek/Downloads/data/NA12878.ga2.exome.maq.recal.bam')""")

// Baseline: run the interval query with BAI-based predicate pushdown disabled
spark.time {
  ss.sqlContext.setConf("spark.biodatageeks.bam.predicatePushdown", "false")
  ss.sql("SELECT count(*) FROM reads_exome WHERE contigName='chr1' AND start=20138").show
}

18/07/25 12:57:44 WARN BAMRelation: GRanges: chr1:20138-20138, false
+--------+
|count(1)|
+--------+
|      20|
+--------+

Time taken: 186045 ms


// Same query with BAI-based predicate pushdown enabled
spark.time {
  ss.sqlContext.setConf("spark.biodatageeks.bam.predicatePushdown", "true")
  ss.sql("SELECT count(*) FROM reads_exome WHERE contigName='chr1' AND start=20138").show
}
18/07/25 13:01:40 WARN BAMRelation: GRanges: chr1:20138-20138, true
18/07/25 13:01:40 WARN BAMRelation: Interval query detected and predicate pushdown enabled, trying to do predicate pushdown using intervals chr1:20138-20138
Using Java builtin Inflater
+--------+
|count(1)|
+--------+
|      20|
+--------+

Time taken: 732 ms
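
To put the two timings side by side, the ratio can be worked out directly in the same spark-shell session. The snippet below is only an illustrative sketch: the value names and the `speedup` helper are hypothetical (not part of SeQuiLa), and the numbers are copied from the two "Time taken" lines above. Each query was run once, so the result (roughly 254 times faster with pushdown enabled) should be read as indicative, not as a benchmark.

```scala
// Wall-clock times copied from the two runs above, in milliseconds.
val withoutPushdownMs = 186045L // spark.biodatageeks.bam.predicatePushdown = false
val withPushdownMs = 732L       // spark.biodatageeks.bam.predicatePushdown = true

// Hypothetical helper: ratio of baseline time to optimized time.
def speedup(baselineMs: Long, optimizedMs: Long): Double =
  baselineMs.toDouble / optimizedMs.toDouble

val ratio = speedup(withoutPushdownMs, withPushdownMs)
println(f"Speedup with BAI predicate pushdown: $ratio%.1fx")
```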



