deconstructSigs and hg38 #27

Closed
vladsavelyev opened this issue May 3, 2018 · 19 comments

@vladsavelyev
Contributor

Hi Sigve,

Thanks for the new version and especially hg38 support! It worked fantastically; however, when I tried to enable mutational_signatures, I ran into this issue:

2018-05-03 01:09:56 [INFO] Identifying weighted contributions of known mutational signatures using deconstructSigs
2018-05-03 01:09:56 [INFO] deconstructSigs normalization method ('tri.counts.method'): genome
Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  :
  solving row 14999: 'allow.nonnarrowing' is FALSE and the supplied start (180930244) is > refwidth + 1
Calls: <Anonymous> ... .extractFromBSgenomeSingleSequences -> solveUserSEW -> .Call2 -> .Call
Execution halted

Not sure, but this might be coming from deconstructSigs not fully supporting hg38? There is a fork that has this fix: temizna/deconstructSigs@596cec6

Maybe as a quick fix, PCGR could pull the fixed deconstructSigs branch rather than the main one. No rush for us at UMCCR, though: we currently generate mutational signatures separately. But it would be nice to have everything in PCGR in the future!

Vlad
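
The `supplied start (...) is > refwidth + 1` message is raised when a requested coordinate lies beyond the end of the sequence being queried, which is what one would expect if hg38 positions are looked up in an hg19 genome object. A minimal Python sketch of that bounds check follows. The chromosome (chr5) is an assumption, since the log does not name it, and `CHROM_LENGTHS` and `out_of_range` are hypothetical names, not part of PCGR or deconstructSigs.

```python
# Chromosome 5 lengths in base pairs for the two assemblies.
# Only chr5 is listed, purely for illustration.
CHROM_LENGTHS = {
    "grch37": {"chr5": 180_915_260},
    "grch38": {"chr5": 181_538_259},
}


def out_of_range(variants, build):
    """Return the (chrom, pos) pairs whose position exceeds the chromosome length."""
    lengths = CHROM_LENGTHS[build]
    return [(chrom, pos) for chrom, pos in variants
            if chrom in lengths and pos > lengths[chrom]]


# Position taken from the error message; assigning it to chr5 is an assumption.
variants = [("chr5", 180_930_244)]

print(out_of_range(variants, "grch37"))  # position is past the end of GRCh37 chr5
print(out_of_range(variants, "grch38"))  # position fits within GRCh38 chr5
```

Under that assumption, the position is valid on GRCh38 but falls past the end of the GRCh37 chromosome, which would match a lookup against the wrong genome build.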

@vladsavelyev
Contributor Author

I'm worried that the bsg object is not used in the signature_contributions_single_sample function: https://github.com/sigven/pcgr/blob/master/src/R/pcgrr/R/mutational_signatures.R#L13

And it is actually not passed here: https://github.com/sigven/pcgr/blob/master/src/R/pcgrr/R/mutational_signatures.R#L80-L85

@sigven
Owner

sigven commented May 3, 2018

Hi Vlad,
Thank you so much for reporting this! An obvious mistake in my code. I'll get back to you for testing a fixed version.

Sigve

@sigven
Owner

sigven commented May 3, 2018

Btw: would you be able to share your calls (VCF)? As a sanity check for my fix.

@vladsavelyev
Contributor Author

Hey, sure, attaching the bundle with the VCF, CNV calls, and a toml. Note that it will not pass validation, so my toml has validation disabled.

COLO829__COLO829T-somatic.zip

@vladsavelyev
Contributor Author

And while we are at it, attaching one more. This one crashes with the following error:

2018-05-03 15:11:11 - pcgr-summarise - INFO - Converting VCF to TSV with https://github.com/sigven/vcf2tsv
Traceback (most recent call last):
  File "/pcgr/vcf2tsv.py", line 259, in <module>
    if __name__=="__main__": __main__()
  File "/pcgr/vcf2tsv.py", line 23, in __main__
    vcf2tsv(args.query_vcf, args.out_tsv, args.skip_info_data, args.skip_genotype_data, args.keep_rejected_calls, args.compress, args.print_data_type_header)
  File "/pcgr/vcf2tsv.py", line 220, in vcf2tsv
    out.write('\t'.join(line_elements) + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 6803: ordinal not in range(128)

Again, this one does not pass validation, so something might be wrong with the VCF. But I can't reproduce the error by running vcf2tsv manually, either with the input VCF or with the resulting .pass.vcf.gz.

COLO829__COLO829T-normal.zip
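
The traceback is what Python 3 raises when text containing a non-ASCII character is written to a stream whose encoding is ASCII. The stream encoding is often taken from the environment's locale, so one possible explanation (an assumption, not confirmed in this thread) for the failure not reproducing in a manual run is a different locale. The sketch below shows the failure and one way to avoid it by setting the encoding explicitly; it is not the actual vcf2tsv fix, and the `chr1` field is a made-up placeholder.

```python
import io

name = "Café-au-lait_macules_with_pulmonary_stenosis"

# Encoding to ASCII fails on 'é' (U+00E9), the character named in the traceback.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Writing through a text stream with an explicit UTF-8 encoding does not
# depend on the locale, so the same line is written without error.
buffer = io.BytesIO()
out = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")
out.write("\t".join(["chr1", name]) + "\n")  # "chr1" is a placeholder field
out.flush()
written = buffer.getvalue()

print(ascii_ok)
print(written.decode("utf-8"), end="")
```

The same idea applies to a real output file: passing `encoding="utf-8"` to `open()` avoids relying on whatever default the environment provides.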

@sigven
Owner

sigven commented May 3, 2018

Hi again Vlad,

I did not see any CNV calls in your zip, but I believe I have now fixed the bugs you encountered in 0.6.1: deconstructSigs for hg38, and an ASCII-encoding issue in the vcf2tsv transformation (the lack of proper encoding caused a mistake for "Café-au-lait_macules_with_pulmonary_stenosis" :-)).

Output files for SNVs/InDels in COLO829 can be downloaded here.

A couple of comments on your run and config (not sure how carefully this was configured, but anyway):

  • the target size should refer to the coding region, implying that it will be similar for WES/WGS and limited to approx. 35-40 Mb (ideally one should get hold of the size of the callable coding target, correcting for uneven coverage across the regions that have been sequenced)
  • although the VCF works without validation, vcf2tsv found an additional bug in one of your pre-existing INFO tags (clinvar_sig, which is a mix of String/Flag); take a look at the log file (COLO829.pcgr_acmg.log)
  • from the allelic distribution plot (and the signatures) in your report, this clearly seems like a tumor-only run; in that case it should be pre-filtered or run with tumor_only=true

@vladsavelyev
Contributor Author

vladsavelyev commented May 4, 2018

Sigve, thanks for the reply, I really appreciate the quick fixes. Café-au-lait_macules_with_pulmonary_stenosis is hilarious. There should be a special place in hell for people creating this kind of identifier 😆

There were two zip files that I attached above, in two different comments. COLO829__COLO829T-somatic.zip contains the VCF that triggers the mutational signatures error from the original issue. It has tumor somatic calls and includes a CNV file. The next one, COLO829__COLO829T-normal.zip, which reproduces the encoding issue, has the germline calls; that's why it looks like a tumor-only run.

You are absolutely right about the target size. In fact, PCGR warned me about that and stopped running, so I changed the value back to 36 for the rerun (nice error handling btw!). For some reason I accidentally sent you the bad version of the toml. Anyway, thanks for pointing it out!

Looking forward to the new commits, will pull and rerun my samples!

@vladsavelyev
Contributor Author

vladsavelyev commented May 4, 2018

Speaking about germline, should we be running this instead? https://github.com/sigven/pcgr_predispose

@akhtar4ever

Hi Sigve,

I turned off the input validation and ran the tool, and I am facing a somewhat similar issue. The exact error is below:

2018-05-04 07:10:41 [INFO] deconstructSigs normalization method ('tri.counts.method'): default
Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord, :
solving row 15428: 'allow.nonnarrowing' is FALSE and the supplied start (180947483) is > refwidth + 1
Calls: ... .extractFromBSgenomeSingleSequences -> solveUserSEW -> .Call2 -> .Call
Execution halted

I turned off input validation because it rejected the VCF file with the error below:
2018-05-04 07:22:41 - pcgr-validate-input - INFO - Validating VCF file with EBIvariation/vcf-validator
2018-05-04 07:22:42 - pcgr-validate-input - ERROR -
2018-05-04 07:22:42 - pcgr-validate-input - ERROR - According to the VCF specification, the VCF file (/workdir/input_vcf/TOtB_5724_index13_AGTCAA_L001-L002_001.tnscope_filtered_GRCh38.vcf) is NOT valid:
ERROR: Line 254: Format is not a colon-separated list of alphanumeric strings
ERROR: Line 255: Format is not a colon-separated list of alphanumeric strings
This continued until the last line of the VCF file.

Any help/suggestion would be appreciated.

Thanks

@vladsavelyev
Contributor Author

Sigve,

On this again:

did not see any CNV calls in your zip

I attached two different zip files above. Sorry for the confusion. The first one, COLO829__COLO829T-somatic.zip, was to illustrate the original issue with the mutational signatures. It has tumor somatic calls and includes a CNV file.

The next zip file - COLO829__COLO829T-normal.zip - is another one, and represents an unrelated encoding issue. That file has the germline calls only and lacks the CNV file.

brainstorm added a commit to brainstorm/pcgr-deploy that referenced this issue May 7, 2018
@sigven sigven added the bug label May 7, 2018
@sigven
Owner

sigven commented May 9, 2018

Hi Vlad and @akhtar4ever,

The mutational signatures bug (grch38) that you were both observing should now have been fixed (0.6.2).

Akhtar; your other error (the VCF validation error) has been filed two times previously (issues #28, #22). I am afraid the MuTect2 output VCF does not adhere to the VCF restrictions, so in this case you should turn VCF validation off.

Vlad: you are correct regarding the interpretation of germline calls; for this I am developing the predisposition report engine. Note that this is somewhat preliminary, a work in progress, with a number of enhancements coming. Either way, your test case results (COLO829), with somatic and predisposition reports, can be downloaded here

@sigven sigven closed this as completed May 9, 2018
@akhtar4ever

Thanks Sigve,

Thanks for the fixes.
I am currently using the latest released code, 0.6.2, but I am facing the issue below:

2018-05-11 07:26:59 [INFO] Identifying weighted contributions of known mutational signatures using deconstructSigs
2018-05-11 07:26:59 [INFO] deconstructSigs normalization method ('tri.counts.method'): default
Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord, :
solving row 405: 'allow.nonnarrowing' is FALSE and the supplied start (139391184) is > refwidth + 1
Calls: ... .extractFromBSgenomeSingleSequences -> solveUserSEW -> .Call2 -> .Call
Execution halted

Any help/suggestion would be appreciated.

Thanks
Akhtar

@sigven
Owner

sigven commented May 11, 2018

Ah, I see. I need your input (VCF + the config file) to do error-checking. Would you be able to share that with me?

BTW: Have you specified the correct build (GRCh37/GRCh38) for your input?

@sigven sigven reopened this May 11, 2018
@akhtar4ever

Hi Sigve,

I have specified the build; the command used is:
python /home/cwtools/pcgr-0.6.2/pcgr.py --input_vcf test_new.vcf /home/cwtools/pcgr-0.6.2 /home/cwtools/for_upload/test_new grch38 /home/cwtools/pcgr-0.6.2/pcgr.toml test_new --force_overwrite
The config file is the default toml file, in which I have only changed validation to false.

@sigven
Owner

sigven commented May 11, 2018

Hi Akhtar,
Thanks. I would have to look at your VCF file ('test_new.vcf') to figure out why it fails. Could you share that file with me?

@akhtar4ever

Hi Sigve,

Apologies for the delay.

Attached the vcf and config file.

Thanks
Akhtar

@akhtar4ever

Hi Sigve,

The files were not getting uploaded, so zipped them.

for_pcgr_share.zip
Please find the zip file attached; it contains the VCF and toml file I used.

Let me know if anything else is required.

Thanks
Aktar

@sigven
Owner

sigven commented May 14, 2018

Hi Akhtar,

I asked previously whether you had specified the correct build (GRCh37/GRCh38) for your input, that is, whether the build you specify matches your input data. Your test case (test_new.vcf) is not GRCh38, and this is why it fails with the command you specified above (the huge number of warnings you receive during VEP annotation should also give you a hint that the assembly you specified is wrong).

I used GRCh37, and it ran with no errors. You can see from line 4 in your VCF that test_new.vcf is hg19/GRCh37:
##reference=file:///b/LF1/seq/data_external/bwahg19/hg19_basicChromosomes.fa

Although PCGR in principle could make an assembly check for the input, I believe the users should carry the main responsibility for making sure that there is a correspondence between the build/assembly that is specified and the actual build/assembly of the input files. It's also highly appreciated if this matter is interrogated before any issues are filed.

kind regards,
Sigve
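
An assembly check of the kind mentioned above could, under some assumptions, be sketched in Python by reading the `##reference` header line and comparing it with the build given on the command line. This is not something PCGR is stated to do; `guess_build_from_header` and `build_matches` are hypothetical names, and the keyword patterns are an assumption that will miss headers with an uninformative or absent reference line.

```python
import re


def guess_build_from_header(header_lines):
    """Guess the assembly from the ##reference header line, or return None."""
    for line in header_lines:
        if line.startswith("##reference="):
            ref = line.split("=", 1)[1].lower()
            if re.search(r"hg19|grch37|b37", ref):
                return "grch37"
            if re.search(r"hg38|grch38", ref):
                return "grch38"
    return None


def build_matches(header_lines, declared_build):
    """Return False only when the header clearly points to a different build."""
    guessed = guess_build_from_header(header_lines)
    return guessed is None or guessed == declared_build.lower()


# The reference line quoted above from test_new.vcf.
header = [
    "##reference=file:///b/LF1/seq/data_external/bwahg19/hg19_basicChromosomes.fa",
]

print(guess_build_from_header(header))
print(build_matches(header, "grch38"))
```

For the header quoted in this thread, the guess is grch37, so a run declared as grch38 would be flagged as a mismatch before annotation starts.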

@sigven sigven closed this as completed May 14, 2018
@akhtar4ever

Thanks a lot Sigve for your findings.
Will make a note of it.

Appreciate your efforts and quick response.

Thanks
Aktar
