
out of memory error #102

Open
lachlancoin opened this issue Mar 12, 2019 · 8 comments

@lachlancoin

I am testing the BAM input option and getting the following error:

java.lang.OutOfMemoryError: Java heap space
java.util.Arrays.copyOf(Arrays.java:3236)
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
com.google.api.client.util.ByteStreams.copy(ByteStreams.java:55)
com.google.api.client.util.IOUtils.copy(IOUtils.java:94)
com.google.api.client.util.IOUtils.copy(IOUtils.java:63)
com.google.api.client.http.HttpResponse.download(HttpResponse.java:421)
com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:585)
com.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:464)
com.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:461)
com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
com.google.cloud.RetryHelper.run(RetryHelper.java:76)
com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:461)
com.google.cloud.storage.Blob.getContent(Blob.java:478)
com.google.allenday.nanostream.gcs.GetDataFromFastQFile.processElement(GetDataFromFastQFile.java:37)
com.google.allenday.nanostream.gcs.GetDataFromFastQFile$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:609)
com.google.allenday.nanostream.gcs.ParseGCloudNotification.processElement(ParseGCloudNotification.java:16)
com.google.allenday.nanostream.gcs.ParseGCloudNotification$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)

@obsh (Collaborator) commented Mar 12, 2019

I wonder what the BAM file size was in your case?
At the moment the whole content of the uploaded file is fetched into process memory.
As a straightforward workaround you can give the Dataflow workers more memory by specifying a larger machine type with the following option:
--workerMachineType=n1-highmem-4
n1-highmem-4 has 26 GB of RAM, while the default Dataflow worker machine for streaming mode is n1-standard-4 with 15 GB.
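The stack trace above shows why the heap fills up: StorageImpl.readAllBytes buffers the entire object in a ByteArrayOutputStream before GetDataFromFastQFile ever sees it, so memory use grows with object size. A longer-term fix would be to stream the object in bounded chunks (for example via Blob.reader(), which returns a streaming ReadChannel) instead of Blob.getContent(). A minimal, self-contained sketch of the bounded-memory copy pattern (the class name ChunkedCopy is illustrative, not part of the project):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopy {

    // Copies a stream in fixed-size chunks, so at most one 64 KiB buffer
    // is resident regardless of object size -- unlike readAllBytes(),
    // which grows a ByteArrayOutputStream to hold the whole object.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1_000_000]; // stand-in for a large BAM object
        long copied = copy(new ByteArrayInputStream(data), new ByteArrayOutputStream());
        System.out.println(copied + " bytes copied"); // 1000000 bytes copied
    }
}
```

With streaming in place, worker memory no longer needs to scale with the uploaded file, and the larger machine type becomes an optimization rather than a requirement.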

@lachlancoin (Author)

so using smaller BAMs solved the memory issue, but the pipeline still seems to go down the bwa-mem route. I also tried uploading SAM files, with no more luck. I couldn't figure out how the code base distinguishes a fastq input from a BAM/SAM input; the pipeline seems to be the same in either case

@obsh (Collaborator) commented Mar 13, 2019

Currently this feature lives in a separate git branch, bam_files; from the stack trace it looks like you are running code compiled from the master branch

@lachlancoin (Author)

Oh yes, sorry. Using that branch I get the following error many times:

java.lang.StringIndexOutOfBoundsException: String index out of range: 23
java.lang.String.substring(String.java:1963)
com.google.allenday.nanostream.pubsub.GCSSourceData.fromGCloudNotification(GCSSourceData.java:49)
com.google.allenday.nanostream.gcs.ParseGCloudNotification.processElement(ParseGCloudNotification.java:16)

It seems to be something to do with the location of the SAM files?

@lachlancoin (Author)

the object location is objectId=Uploads/ICUNEW/out.sam, which shouldn't cause any issues. It looks like a problem with a trailing /, but I can't find one in this case

@obsh (Collaborator) commented Mar 13, 2019

I believe I've made an off-by-one error; I'll fix it now
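The thread doesn't show the code inside GCSSourceData.fromGCloudNotification, but the reported message is consistent with a substring end index of length + 1: "Uploads/ICUNEW/out.sam" is 22 characters, and the exception complains about index 23. A hypothetical reconstruction of a bug of that shape, with the fix (the class and method names here are illustrative, not the project's actual code):

```java
public class ObjectIdParsing {

    // Buggy variant: the end index should be objectId.length(), not
    // length() + 1. For the 22-character "Uploads/ICUNEW/out.sam" this
    // throws StringIndexOutOfBoundsException (on Java 8 the message is
    // "String index out of range: 23"; newer JDKs word it differently).
    static String fileNameBuggy(String objectId) {
        return objectId.substring(objectId.lastIndexOf('/') + 1, objectId.length() + 1);
    }

    // Fixed variant: omit the end index, which defaults to the end of the
    // string, so no trailing slash is required for the call to succeed.
    static String fileNameFixed(String objectId) {
        return objectId.substring(objectId.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(fileNameFixed("Uploads/ICUNEW/out.sam")); // out.sam
    }
}
```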

@lachlancoin (Author) commented Mar 13, 2019 via email

@obsh obsh added the tracked label Sep 27, 2019