Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: beagle AP field #224

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

fix: beagle AP field #224

wants to merge 12 commits into from

Conversation

nicholema
Copy link
Contributor

@nicholema nicholema commented Jul 23, 2024

This PR ensures that mergeSTR will output format fields from hipstr-beagle. Specifically, it adds support for the AP1, AP2, and DS fields.

Note that only hipstr input is supported for now. We did this because the imputation panels all seem to be produced by hipstr, anyway.

Checklist

  • I've checked to ensure there aren't already other open pull requests for the same update/change
  • I've prefixed the title of my PR according to the conventional commits specification. If your PR fixes a bug, please prefix the PR with fix: . Otherwise, if it introduces a new feature, please prefix it with feat: . If it introduces a breaking change, please add an exclamation before the colon, like feat!: . If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
  • At the top of the PR, I've listed any open issues that this PR will resolve. For example, "resolves #0" if this PR resolves issue #0
  • I've explained my changes in a manner that will make it possible for both users and maintainers of TRTools to understand them
  • I've added tests for any new functionality. Or, if this PR fixes a bug, I've added test(s) that replicate it
  • All directories with large test files are listed in the "exclude" section of our pyproject.toml so that they do not appear in our PyPI distribution. All new files are also smaller than 0.5 MB.
  • I've updated the relevant REAMDEs with any new usage information and checked that the newly built documentation is formatted properly
  • All functions, modules, classes etc. still conform to numpy docstring standards
  • (if applicable) I've updated the pyproject.toml file with any changes I've made to TRTools's dependencies, and I've run poetry lock --no-update to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions
  • In the body of this PR, I've included a short address to the reviewer highlighting one or two items that might deserve their focus

@nicholema nicholema self-assigned this Jul 23, 2024
@nicholema nicholema requested review from aryarm and gymreklab July 23, 2024 19:18
@nicholema nicholema removed their assignment Jul 23, 2024
Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be a good idea to add a check at the end of the another test that will ensure the AP1, AP2, and DS flags were properly added to the merged VCF.

To do that, you can probably just reread the file with TRHarmonizer and then assert that the contents of the fields file are the same as what you would expect like this

def test_hipstr_output(args, mrgvcfdir):
fname1 = os.path.join(mrgvcfdir, "test_file_hipstr1.vcf.gz")
fname2 = os.path.join(mrgvcfdir, "test_file_hipstr2.vcf.gz")
args.vcftype = "hipstr"
args.vcfs = fname1 + "," + fname2
assert main(args) == 0
assert_same_vcf(args.out + '.vcf', mrgvcfdir + "/hipstr_merged.vcf")

Copy link
Contributor

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    trtools/testsupport/sample_vcfs/mergeSTR_vcfs/hipstr_imputed_merged.vcf

    Consider using git-lfs to manage large files.

@github-actions github-actions bot added the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Jul 25, 2024
@github-actions github-actions bot removed the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Jul 25, 2024
@gymreklab
Copy link
Contributor

Thanks for adding this PR!

Upon reviewing, I realized things might be slightly more complicated than we originally thought and we need to take the following into account:

  • Beagle is not specific to HipSTR (or any caller) in theory. You could potentially have Beagle imputed calls from ExpansionHunter or any other caller. Therefore I don't think we should add "AP1", "AP2", "DS" only for HipSTR, but instead have a check for whether the VCFs being meregd are from Beagle (using the IsBeagleVCF() function from TR Harmonizer) and if so can add those fields. Even then I think AP1/AP2 should not be treated as required since those are not added to Beagle VCFs by default, only if ap=true. By the way we could also add the GP field here. I am not sure if DS is always output, we should check

  • The other issue is that for AP/GP fields (need to check on DS): they probably only make sense to include in the merged file if the set of ref/alt alleles in all files being merged is identical. we would need to add a check for that. they in theory should be, but I recall we have seen some recent cases where the alleles seemed to differ across multiple VCFs imputed with the same ref panel.

All of these are addressable and we can discuss further how to integrate the required changes.

Copy link
Contributor

@gymreklab gymreklab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments left in the PR Conversation

@aryarm
Copy link
Member

aryarm commented Jul 26, 2024

thanks for explaining all of that, @gymreklab !

so, for those cases where the ref/alt fields aren't identical between the two or more VCFs, maybe we can do something like this?

  1. if the set of ref/alt alleles are the same but simply reordered, then we can correspondingly reorder the AP/GP/DS fields
  2. for any ref/alt alleles that aren't present in one of the VCFs, we can simply assign AP/GP/DS values of 0 in the merged output

are there any other situations I'm not thinking of? I suppose there might also be cases where the alleles are the same but are represented slightly differently (ex: not left-aligned)?

@gymreklab
Copy link
Contributor

It could make sense to do a first pass where we just refuse to merge those fields if the ref/alt aren't identical and in the same order and output a warning for those loci. My impression was that in most cases ref/alt should be exactly the same for different VCFs imputed from the same reference panel. If that is the case we might be able to get away with avoiding the more complex cases. However if those complex cases happen for more than a handful of loci we might have to.

For the case where the alleles are identical but reordered: this should be handle-able. But I don't think we currently merge any other fields like this where we would need to reorder so would need to add some logic for that, making sure order is correct.

For the case where the alleles don't match: in theory we could assign 0. But that might result in weird batch effects. I doubt those probabilities are truly 0 - if those alleles had been considered by Beagle they would probably have non-zero values. Another option would be to put a missing value there which I would prefer. In theory none of those cases should affect our main application of computing dosages. But again since this case is complex and in theory shouldn't happen I am hesitant to handle it for now.

@github-actions github-actions bot added the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Aug 1, 2024
Copy link
Contributor

github-actions bot commented Aug 1, 2024

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    trtools/mergeSTR/tests/mydir/test_GangSTRRightFile0/test.vcf, trtools/mergeSTR/tests/mydir/test_MissingFieldWarnings0/test.vcf, trtools/mergeSTR/tests/mydir/test_PopSTRRightFile0/test.vcf

    Consider using git-lfs to manage large files.

@github-actions github-actions bot removed the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Aug 1, 2024
Copy link
Contributor

@gymreklab gymreklab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is almost there. In addition to comments on individual lines we should:

  • add tests for CheckIdenticalAlleleOrder
  • Add back the deleted test file hipstr_merged.vcf causing tests to fail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add this file back it looks like maybe it was deleted by accident

trtools/mergeSTR/mergeSTR.py Outdated Show resolved Hide resolved
trtools/mergeSTR/mergeSTR.py Outdated Show resolved Hide resolved
trtools/mergeSTR/tests/test_mergeSTR.py Outdated Show resolved Hide resolved
Copy link
Contributor

@gymreklab gymreklab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to update the test for CheckIdenticalAlleles to directly call that function

assert ("Conflicting alt alleles found at" in capsys.readouterr().err)

#check if identical allele order
def test_CheckIdenticalAlleleOrder(args,mrgvcfdir,capsys):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test should call the function CheckIdenticalAlleleOrder directly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants