Question: a way to extract read-only k-mers #9

bmansfeld · 2020-06-26T19:17:40Z

Hi again Dr. Rhie,
I'm having a great time using merqury to asses quality and completeness of different versions of our assembly. It's an awesome tool and it is allowing us to make more informed decisions about haplotig purging and other parameter tweaks. So thanks again!

As I was examining our spectra I noticed our assembly is missing some heterozygosity ie a black (read-only) peak at 1-copy.

I was wondering if there is a way to pull out those kmers specifically and then the reads they came from. My thought is that we could then assemble at least part of the missing heterozygosity using the short reads then add that to our haplotigs to get a more complete assembly.

Can you suggest a way to extract those kmers and their sequence for such a purpose?

Thanks in advance,
Ben

arangrhie · 2020-06-26T19:38:19Z

Hi Ben,

Glad to hear that merqury is being useful!

1. Collect all k-mers in your assembly

meryl count k=$k asm1.fasta output asm1.meryl
# If you have 2 haplotype assemblies, do the next as well
meryl count k=$k asm2.fasta output asm2.meryl
meryl union-sum asm1.meryl asm2.meryl output asm1_and_asm2.meryl

2. Subtract them from the read set

meryl difference read.meryl asm1.meryl output read.0.meryl
# or
meryl difference read.meryl asm1_and_asm2.meryl output read.0.meryl

read.0.meryl contains k-mer counts in the reads, not observed in your assemblies.
You can double check if the histogram matches what you saw in the spectra-cn or spectra-asm plot. meryl histogram generates the histogram.

3. Extract reads containing the read-only mers

meryl-lookup -include is designed for this purpose.

# For paired-end reads (which I assume you are planning to apply):
meryl-lookup -include -sequence my_R1.fastq.gz -mers read.0.meryl -sequence2 R2.fastq.gz -r2 read_only_R2.fastq.gz | pigz -c > read_only_R1.fastq.gz

# For single-end reads (useful if you have long-reads):
meryl-lookup -include -sequence my_read.fastq.gz -mers read.0.meryl | pigz -c > read_only.fastq.gz

Hope this helps!
Arang

bmansfeld · 2020-06-26T19:54:25Z

Awesome! thanks so much for the quick and thoroughly laid out answer!

arangrhie · 2020-06-26T20:04:34Z

Forgot to add the -c in pigz. Fixed in my above comment. Feel free to come back in case there are any other errors.
Will leave this issue open so I remember to document this somewhere.

bmansfeld · 2020-07-03T19:35:26Z

Arang, I realize that this is out of the scope of Merqury issues, but I'm having some memory issues running the read extraction steps.

Meryl-lookup was using something like 260G of RAM (out of 400+ available) but crashed with a cannot allocate memory error:

-- Loading kmers from 'read.0.meryl' into lookup table.

 p       prefixes             bits gigabytes (allowed: 472 GB)
-- -------------- ---------------- ---------
22        4194304     137555580373    16.014
23        8388608     134303832626    15.635
24       16777216     131320520335    15.288
25       33554432     128874078956    15.003
26       67108864     127501379401    14.843 (smallest)
27      134217728     128276163494    14.933
28      268435456     133345914883    15.524
29      536870912     147005600864    17.114
30     1073741824     177845156029    20.704
31     2147483648     243044449562    28.294
32     4294967296     376963219831    43.884
33     8589934592     648320943572    75.474
34    17179869184    1194556574257   139.065
35    34359738368    2290548018830   266.655 (used)
36    68719476736    4486051091179   522.245
37   137438953472    8880577419080  1033.835
38   274877906944   17673150258085  2057.425
39   549755813888   35261816119298  4105.016
-- -------------- ---------------- ---------

For 3520183203 distinct 20-mers (with 35 bits used for indexing and 5 bits for tags):
  266.655 GB memory
  256.000 GB memory for index (34359738368 elements 64 bits wide)
    2.049 GB memory for tags  (3520183203 elements 5 bits wide)
    8.606 GB memory for data  (3520183203 elements 21 bits wide)

Will load 3520183203 kmers.  Skipping 0 (too low) and 0 (too high) kmers.
Allocating space for 3520183203 suffixes of 5 bits each -> 17600916015 bits (2.049 GB) in blocks of 32.000 MB
                     3520183203 values   of 21 bits each -> 73923847263 bits (8.606 GB) in blocks of 32.000 MB
Loaded 3520183203 kmers.  Skipped 0 (too low) and 0 (too high) kmers.
-- Opening sequences in 'R1.uniq.fastq.gz'.
ERROR:  Failed to open input file 'R1.uniq.fastq.gz': Cannot allocate memory

I'm using meryl from here: https://github.com/marbl/meryl

Any advice?
Thanks in advance!
Ben

bmansfeld · 2020-07-06T19:09:39Z

Arang, I've just upped the memory requested to 512G and seems to be running well.
I'll update if something still fails.
Thanks in anycase
Ben

arangrhie · 2020-07-07T14:36:43Z

Hi Ben,

Sorry for the delayed reply. You can set -memory option in meryl-lookup.
For your example, you could use -memory 20 which will limit to use at maximum 20G.

Arang

bmansfeld · 2020-07-07T14:41:40Z

Not a delayed response at all - Nothing to apologize for.
Thanks I found that -memory option and raised it to match my requested memory from the scheduler and all seems to run well now but rather slow.
I wonder if we are wasting time extracting a lot of "error-kmer" reads (ie those w/ less multiplicity than the 1-copy mers)?
I'm looking in to first clipping the data base using meryl greater-than N and then running lookup.
What do you think? is this a smart strategy or am I going to loose a lot of data that might be useful for the purpose of assembling those reads?
Thanks in advance,
Ben

arangrhie · 2020-07-07T14:52:35Z

I think that's a fair strategy. I'd have done the same thing.
From the histogram (meryl histogram read.0.meryl), I'll pick the multiplicity N (first col) which has the minimum count (2nd col).
All k-mers below this N are likely erroneous anyway.

You could also limit the high-copy k-mers, with meryl less-than, to only include the single copy k-mers to avoid reads from repetitive regions not assembled.

Arang

bmansfeld · 2020-07-07T14:56:46Z

Awesome thanks!

kneubehl · 2021-08-25T14:56:50Z

Hi there,

Has the syntax for meryl-lookup changed? I updated to v1.3 and ran the same commands as I have in the past that worked fine but now it seems parts to the command aren't being recognized. I have attached a screenshot of the error. Any help would be appreciated. Thanks.

arangrhie · 2021-09-15T12:58:14Z

Hello @kneubehl , it seems like your meryl is not updated, or used from the previous path. Could you double check this is the path where v1.3 was installed?

kneubehl · 2021-09-15T19:00:55Z

I checked the version of both meryl and meryl-lookup and they are v1.3. Meryl-1.3 is located in /usr/local/bin/ in path. Regards, Alex Kneubehl PhD Candidate Laboratory of Dr. Job Lopez Vector Biology and Bacterial Pathogens Lab Department of Pediatrics National School of Tropical Medicine at Baylor College of Medicine Twitter: @AlexKneubehl

…

________________________________ From: Arang Rhie ***@***.***> Sent: Wednesday, September 15, 2021 7:58 AM To: marbl/merqury ***@***.***> Cc: Kneubehl, Alexander Robert ***@***.***>; Mention ***@***.***> Subject: Re: [marbl/merqury] Question: a way to extract read-only k-mers (#9) ***CAUTION:*** This email is not from a BCM Source. Only click links or open attachments you know are safe.

________________________________ Hello @kneubehl<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_kneubehl&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=1uCrRAd4twH1SiJW__K4hyA_zjt6hoGJEPLTrig080I&e=> , it seems like your meryl is not updated, or used from the previous path. Could you double check this is the path where v1.3 was installed? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_marbl_merqury_issues_9-23issuecomment-2D919995260&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=Le9LdB8fp51INPoZi9EH0mbMceeK9vmoWvIzuLRPOF8&e=>, or unsubscribe<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AOBL37DUOH76MPJZ3ZVK5K3UCCJ7DANCNFSM4OJTDNIA&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=AWNlUsmBzOQIczPYBm4Woz_lIxhdDVWweC1_R54G4Wg&e=>. Triage notifications on the go with GitHub Mobile for iOS<https://urldefense.proofpoint.com/v2/url?u=https-3A__apps.apple.com_app_apple-2Dstore_id1477376905-3Fct-3Dnotification-2Demail-26mt-3D8-26pt-3D524675&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=q5Xj_oQVXLG45JyYm8EpRajrgIxOcEllXPfImVJin4Y&e=> or Android<https://urldefense.proofpoint.com/v2/url?u=https-3A__play.google.com_store_apps_details-3Fid-3Dcom.github.android-26referrer-3Dutm-5Fcampaign-253Dnotification-2Demail-2526utm-5Fmedium-253Demail-2526utm-5Fsource-253Dgithub&d=DwMCaQ&c=ZQs-KZ8oxEw0p81sqgiaRA&r=99MiTEeG3M-IttXHprvlxw&m=bdS1Nkhnzdc_2d2o6VTBypZXmqvqP0ungGdUenQpcj0&s=_wd1MVrVilyIjJXohbw57iX2LkbqwMUFXL1b8sBP1gw&e=>.

arangrhie · 2021-09-15T19:08:26Z

Alex, this is the help message for meryl-lookup in v1.3:

usage: ./meryl-lookup <report-type> \
         -sequence <input1.fasta> [<input2.fasta>] \
         -output   <output1>      [<output2>] \
         -mers     <input1.meryl> [<input2.meryl>] [...] [-estimate] \
         -labels   <input1name>   [<input2name>]   [...]

  Compare kmers in input sequences against kmers in input meryl databases.

  Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or
  compressed with gzip, xz, or bzip2.

  Report types:

  -bed:
     Generate a BED format file showing the location of kmers in
     any input database on each sequence in 'input1.fasta'.

  -wig-count:
     Generate a WIGGLE format file showing the multiplicity of the
     kmer starting at each position in the sequence, if it exists in
     an input kmer database.

  -wig-depth:
     Generate a WIGGLE format file showing the number of kmers in
     any input database that cover each position in the sequence.

  -existence:
     Generate a tab-delimited line for each input sequence with the
     number of kmers in the sequence, in the database and common to both.

  -include:
  -exclude:
     Copy sequences from 'input1.fasta' (and 'input2.fasta') to the
     corresponding output file if the sequence has at least one kmer
     present (include) or no kmers present (exclude) in 'input1.meryl'.

Run `./meryl-lookup <report-type> -help` for details on each method.

No report-type (-bed, -wig-count, -wig-depth, -existence, -include, -exclude) supplied.

How did you install meryl?

arangrhie · 2021-09-15T19:29:01Z

Also please use this script for excluding reads having kmers from a given kmer db.
The syntax is:

meryl-lookup -exclude -sequence $read1 $read2 -mers $meryl -output ${out_1} ${out_2}

for paired-end (short) reads

or

meryl-lookup -exclude -sequence $read1 -mers $meryl -output ${out_1}

for single-end (long) reads.

arangrhie added the documentation Improvements or additions to documentation label Jun 26, 2020

dluecke mentioned this issue May 28, 2024

Out Of Memory job failures with --force-rs xfengnefx/hifiasm-meta#32

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: a way to extract read-only k-mers #9

Question: a way to extract read-only k-mers #9

bmansfeld commented Jun 26, 2020

arangrhie commented Jun 26, 2020 •

edited

Loading

bmansfeld commented Jun 26, 2020

arangrhie commented Jun 26, 2020

bmansfeld commented Jul 3, 2020 •

edited

Loading

bmansfeld commented Jul 6, 2020

arangrhie commented Jul 7, 2020

bmansfeld commented Jul 7, 2020

arangrhie commented Jul 7, 2020

bmansfeld commented Jul 7, 2020

kneubehl commented Aug 25, 2021 •

edited

Loading

arangrhie commented Sep 15, 2021

kneubehl commented Sep 15, 2021 via email

arangrhie commented Sep 15, 2021 •

edited

Loading

arangrhie commented Sep 15, 2021

Question: a way to extract read-only k-mers #9

Question: a way to extract read-only k-mers #9

Comments

bmansfeld commented Jun 26, 2020

arangrhie commented Jun 26, 2020 • edited Loading

1. Collect all k-mers in your assembly

2. Subtract them from the read set

3. Extract reads containing the read-only mers

bmansfeld commented Jun 26, 2020

arangrhie commented Jun 26, 2020

bmansfeld commented Jul 3, 2020 • edited Loading

bmansfeld commented Jul 6, 2020

arangrhie commented Jul 7, 2020

bmansfeld commented Jul 7, 2020

arangrhie commented Jul 7, 2020

bmansfeld commented Jul 7, 2020

kneubehl commented Aug 25, 2021 • edited Loading

arangrhie commented Sep 15, 2021

kneubehl commented Sep 15, 2021 via email

arangrhie commented Sep 15, 2021 • edited Loading

arangrhie commented Sep 15, 2021

arangrhie commented Jun 26, 2020 •

edited

Loading

bmansfeld commented Jul 3, 2020 •

edited

Loading

kneubehl commented Aug 25, 2021 •

edited

Loading

arangrhie commented Sep 15, 2021 •

edited

Loading