Create new repeat definition file #49
Comments
Hello, TRGT is flexible about the repeat definitions passed as input. The BED files provided with the code simply cover the most common pathogenic TRs of interest in hg38, but the program should work with other references and custom panels as well. The program takes the reads that map to those loci and realigns them to the flanking regions around each entry in the BED input. To use different repeats, you just need a BED file with the genome coordinates and a fourth field defining
If you are interested in annotated loci in other genomes, we recommend the following resource, which has various relevant loci in multiple human assemblies and other mammalian organisms: https://strchive.org/
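As a rough illustration of the BED layout described above, here is a minimal Python sketch that formats one repeat definition line. The fourth-field layout used here (`ID=...;MOTIFS=...;STRUC=...`) is an assumption modeled on the catalogs distributed with TRGT, since the description above is cut off. The function name, locus ID, and coordinates are hypothetical. Please check the TRGT documentation for the authoritative format, especially for multi-motif loci.

```python
def make_trgt_bed_line(chrom, start, end, locus_id, motifs, struc=None):
    """Format one repeat definition as a 4-column, tab-separated BED line.

    Assumption: the fourth column uses 'ID=...;MOTIFS=...;STRUC=...'.
    This mirrors the catalogs shipped with TRGT but is not taken from
    this thread; verify it against the TRGT documentation.
    """
    # BED intervals are 0-based and half-open: [start, end)
    if start < 0 or end <= start:
        raise ValueError("need 0 <= start < end for a BED interval")
    if not motifs:
        raise ValueError("at least one motif is required")
    if struc is None:
        # Simplest guess at a structure: each motif repeated n times, in order.
        struc = "".join(f"({m})n" for m in motifs)
    info = f"ID={locus_id};MOTIFS={','.join(motifs)};STRUC={struc}"
    return "\t".join([chrom, str(start), str(end), info])


# Hypothetical two-motif locus; the coordinates are placeholders, not a real repeat.
line = make_trgt_bed_line("chr1", 1000, 1060, "example_locus", ["CAG", "CCG"])
print(line)
```

This prints a tab-separated line whose fourth column is `ID=example_locus;MOTIFS=CAG,CCG;STRUC=(CAG)n(CCG)n`. Passing `struc` explicitly would let you describe interruptions between motifs, assuming the tool accepts them in that field.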
Hi, I want to build a repeat definition file for a multi-motif repeat region. How should I write the structure field?
Thanks for reporting, this shouldn't have happened and we'll look into the problem (we should be discontinuing the
Thank you!
Hi,
Hello, I want to know the differences between "pathogenic_repeats.hg38.bed" and "repeat_catalog.hg38.bed" that you provided. It seems that even the larger one covers far fewer regions than human_GRCh38_no_alt_analysis_set.trf.bed. Are these two BED files together what you meant by "simply the most common pathogenic TRs of interest in hg38"? I want to call TR variants but I do not have a specific target. Which one should I use? Or should I use both? Thanks in advance.
Thanks for the question. If you are interested in a general analysis of tandem repeats, it might be best to profile both known pathogenic repeats and all or most repeats across the genome. We are in the process of updating the repeat catalogs for TRGT. However, for pathogenic repeats, I'd recommend using the latest set from STRchive (you can download it in the TRGT format from here: https://strchive.org/database/index.html). For a genome-wide repeat catalog, you could use either adotto_repeats.hg38.bed.gz or variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz, although the latter catalog will most likely change in the near future. Did I answer your questions? Please let me know if there is anything else you'd like to know.
@egor-dolzhenko Great! I will try it following your advice. However, the STRchive website does not seem to be available right now, so I will try again later. Thanks again!
@egor-dolzhenko Hello, me again. Today I checked several of the TR definition files you mentioned and got even more confused. I selected several coordinates (mostly from the head of each BED file) and manually inspected the corresponding sequences in the reference I used (GCA_000001405.15_GRCh38_no_alt_analysis_set.fa), but they did not match. In fact, I can hardly find a pair of coordinates that is exactly the same across different TR definition files. For example, pathogenic_repeats.hg38.bed differs slightly in start and end position from hg38.STRchive-disease-loci.TRGT.bed for a gene/disease that both definition files contain; I found a similar issue reported before in #28. variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz almost matches the reference, but not exactly, because its end coordinates do not line up well. adotto_repeats.hg38.bed.gz seems to give regions that include the TR sequence rather than always being exactly a TR region (similar to human_GRCh38_no_alt_analysis_set.trf.bed?). repeat_catalog.hg38.bed seemed to be the best one in my random checks, but I think the total region it covers is too small for genome-wide analysis. Did I do something wrong? I would also like to know what happens if I use an inappropriate TR definition file. Will it generate more false negatives? How does TRGT use these definition files: is an approximate coordinate that overlaps the TR enough? I am sorry to ask so much. I am just starting out in this field and have a lot to learn, so I would really appreciate it if you could point out my misunderstandings. Thank you!
Sorry about that. You can access STRchive's data directly here: https://github.com/dashnowlab/STRchive.
Thanks for the comment, @Checunmily. There are multiple ongoing projects that aim to define better repeat catalogs and are starting to tackle these representational issues (for example see this paper and this one). One challenge is that some repeat regions can be defined in multiple equally good ways. Also, some people choose to include the flanking sequences surrounding the repeats in their repeat definitions and some don't. These and other issues create disagreements between repeat catalogs, so it is usually a good idea to pick one genome-wide catalog for all your analyses. If you'd like, please feel free to pick one or two discrepant repeat definitions that we could review together. If we find that one definition is clearly better than the other, we could pass this feedback to the developers of these catalogs. As for TRGT itself, it will work with different repeat definitions, especially if they capture the entire repeat region / variation (take a look at the "analysis of variation clusters" section in this paper for more information about this).
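For anyone who wants to put numbers on such discrepancies before picking definitions to review, a small Python sketch along these lines may help. It is not part of TRGT; the function names, the toy sequence, and the coordinates are all hypothetical. It only relies on the standard BED convention that intervals are 0-based and half-open, which is worth keeping in mind when comparing against a viewer that shows 1-based positions.

```python
def bed_slice(ref_seq, start, end):
    """Return the reference bases covered by a BED interval [start, end)."""
    # BED start is 0-based and end is exclusive, matching Python slicing.
    return ref_seq[start:end]


def compare_definitions(a, b):
    """Compare two (chrom, start, end) definitions of the same locus.

    Returns None if they are on different chromosomes, otherwise the
    boundary shifts of b relative to a, the overlap in bp, and the
    Jaccard index (overlap divided by union).
    """
    if a[0] != b[0]:
        return None
    overlap = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    span = max(a[2], b[2]) - min(a[1], b[1])
    return {
        "start_shift": b[1] - a[1],
        "end_shift": b[2] - a[2],
        "overlap_bp": overlap,
        "jaccard": overlap / span if overlap else 0.0,
    }


# Toy reference: 4 flanking bases, five CAG copies, 4 flanking bases.
toy_ref = "ACGT" + "CAG" * 5 + "TTGA"
tight = ("chrT", 4, 19)   # repeat only
padded = ("chrT", 2, 21)  # same repeat plus 2 bp of flank on each side

print(bed_slice(toy_ref, tight[1], tight[2]))
print(bed_slice(toy_ref, padded[1], padded[2]))
print(compare_definitions(tight, padded))
```

In this toy case both definitions describe the same repeat and differ only in how much flank they include, which is one of the sources of disagreement mentioned above.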
@Checunmily
That will be a great help for me, @egor-dolzhenko. Indeed, I should read some papers first. I will contact you if I find more details in the future. Thank you so much!
This sounds great, @Checunmily!
Hi,
Excellent tool. I checked the repeat files you provided and found that some of our repeat regions are not included. Can we create our own repeat definition file for this tool, and if so, how?
best,
Lai.