Create new repeat definition file #49

Open
BrendaLee1 opened this issue Nov 23, 2024 · 13 comments

@BrendaLee1

Hi,
Excellent tool! I checked the repeat file you provided and found that some of our repeat regions are not included in it. Can this tool be used with our own repeat definition file, and if so, how do we create one?

best,
Lai.

@pbsena
Contributor

pbsena commented Nov 25, 2024

Hello,

TRGT is flexible about the repeat definitions passed as input. The BED files provided with the code simply cover the most common pathogenic TRs of interest in hg38, but the program should work with other references and custom panels as well. It takes the reads that map to each locus and realigns them to the flanking regions around the corresponding entry in the BED input.

To use different repeats, you just need a BED file with the genome coordinates and a fourth field defining ID and MOTIFS. The examples in the BED files provided in the repeats directory also contain the STRUC field, but it is not necessary. So this, for instance, would be sufficient:

chr1	57367043	57367119	ID=DAB1;MOTIFS=AAAAT,GAAAT
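
For a custom panel with many loci, entries in this shape can be generated rather than typed by hand. The following is only a minimal sketch: `format_trgt_entry` is a hypothetical helper, not part of TRGT, and it assumes the fourth column is exactly `ID=...;MOTIFS=...` with comma-separated motifs, as in the example line above.

```python
def format_trgt_entry(chrom, start, end, locus_id, motifs):
    """Return one tab-separated BED line with ID and MOTIFS in the fourth column.

    Hypothetical helper for illustration; the field layout is taken from the
    example entry in this thread.
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    if not motifs:
        raise ValueError("at least one motif is required")
    info = f"ID={locus_id};MOTIFS={','.join(motifs)}"
    return "\t".join([chrom, str(start), str(end), info])


# Rebuild the DAB1 example entry from its parts.
line = format_trgt_entry("chr1", 57367043, 57367119, "DAB1", ["AAAAT", "GAAAT"])
print(line)
```

A multi-motif locus is expressed simply by passing more than one motif in the list.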

If you are interested in annotated loci in other genomes, we recommend the following resource, which has various relevant loci in multiple human assemblies and other mammalian organisms: https://strchive.org/

@BrendaLee1
Author

BrendaLee1 commented Nov 26, 2024

Hi,
I tried the repeat definition format as you suggested, but got the error below:
Locus Processing: Error at BED line 60365: STRUC field missing, version 1.4.1.

I want to build a repeat definition file for multi-motif repeat regions. How should I write the STRUC field?

@pbsena
Contributor

pbsena commented Nov 26, 2024

Thanks for reporting, this shouldn't have happened and we'll look into the problem (we should be discontinuing the STRUC field). In the meantime, could you add STRUC=<LOCUS> to the end of your definition? Here LOCUS should be the name given in the ID field. In the above example it should look like this:

chr1	57367043	57367119	ID=DAB1;MOTIFS=AAAAT,GAAAT;STRUC=<DAB1>
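
For a file with many entries, this workaround can be applied programmatically. The sketch below is an illustration only: `add_struc_field` is a hypothetical helper, not part of TRGT, and it assumes tab-separated columns with a fourth column of `KEY=VALUE` pairs separated by semicolons.

```python
def add_struc_field(bed_line):
    """Append STRUC=<ID> to the fourth column if no STRUC field is present.

    Hypothetical helper; follows the workaround described in this thread.
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least four tab-separated columns")
    # Parse the KEY=VALUE pairs of the fourth column.
    info = dict(
        item.split("=", 1) for item in fields[3].split(";") if "=" in item
    )
    if "STRUC" not in info:
        if "ID" not in info:
            raise ValueError("fourth column has no ID field")
        fields[3] = fields[3].rstrip(";") + f";STRUC=<{info['ID']}>"
    return "\t".join(fields)


entries = ["chr1\t57367043\t57367119\tID=DAB1;MOTIFS=AAAAT,GAAAT"]
patched = [add_struc_field(entry) for entry in entries]
print(patched[0])
```

Entries that already carry a STRUC field are returned unchanged, so running the helper twice is harmless.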

Thank you!

@BrendaLee1
Author

Hi,
Thank you for your advice, it works.

@Checunmily

TRGT is flexible on the repeat definition passed as input, and the BED files provided in the code are simply the most common pathogenic TRs of interest in hg38,

Hello, I want to know the differences between "pathogenic_repeats.hg38.bed" and "repeat_catalog.hg38.bed" that you provided. Even the larger one seems to cover far fewer regions than human_GRCh38_no_alt_analysis_set.trf.bed. Are these two BED files together what you mean by "simply the most common pathogenic TRs of interest in hg38"? I want to call TR variants but I do not have a specific target. Which one should I use, or should I use both?

Thanks in advance.

@egor-dolzhenko
Collaborator

Thanks for the question. If you are interested in general analysis of tandem repeats, it might be best to profile both known pathogenic repeats and also all/most repeats across the genome.

We are in the process of updating repeat catalogs for TRGT. However, for pathogenic repeats, I'd recommend using the latest set from STRchive (you can download them from here in the TRGT format: https://strchive.org/database/index.html). For a genome-wide repeat catalog, you could use either adotto_repeats.hg38.bed.gz or variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz, although the latter catalog will most likely change in the near future.

Did I answer your questions? Please let me know if there is anything else you'd like to know.

@Checunmily

@egor-dolzhenko Great! I will try it following your advice. But I think STRchive website is not available now, maybe I can try it later. Thanks again!

@Checunmily

@egor-dolzhenko Hello, me again. Today I checked several of the TR definition files you mentioned and got even more confused. I picked several coordinates (mostly from the head of each BED file) and manually inspected the corresponding sequences in the reference I used (GCA_000001405.15_GRCh38_no_alt_analysis_set.fa), but they did not match.

In fact, I can hardly find a pair of coordinates that are exactly the same across different TR definition files. For example, for a gene/disease present in both files, pathogenic_repeats.hg38.bed differs slightly in start and end position from hg38.STRchive-disease-loci.TRGT.bed; I found a similar earlier issue, #28. variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz almost matches the reference, but not exactly, because its end coordinates are off. adotto_repeats.hg38.bed.gz seems to give regions that include the TR sequence rather than always being exactly the TR region (similar to human_GRCh38_no_alt_analysis_set.trf.bed?). repeat_catalog.hg38.bed seemed to be the best one in my random checks, but I think the total region it covers is too small for genome-wide analysis.

Did I do something wrong? I would also like to know what happens if I use an inappropriate TR definition file. Will it produce more false negatives? And how does TRGT use these definition files: is it enough to give an approximate coordinate that overlaps the TR?

I am sorry to ask so much. I am just starting out in this field and have a lot to learn, so I would really appreciate it if you could point out my misunderstandings. Thank you!

@egor-dolzhenko
Collaborator

@egor-dolzhenko Great! I will try it following your advice. But I think STRchive website is not available now, maybe I can try it later. Thanks again!

Sorry about that. You can access STRchive's data directly here: https://github.com/dashnowlab/STRchive.

@egor-dolzhenko
Collaborator

Thanks for the comment @Checunmily. There are multiple ongoing projects that aim to define better repeat catalogs and are starting to tackle these representational issues (for example see this paper and this one). One challenge is that some repeat regions can be defined in multiple equally good ways. Also, some people choose to include flanking sequences surrounding the repeats in their repeat definitions and some don't. These and other issues create disagreements between repeat catalogs, so it is usually a good idea to pick one genome-wide catalog for all your analyses. If you'd like, please feel free to pick one or two discrepant repeat definitions that we could review together. If we find that one definition is clearly better than the other, we could pass this feedback to the developers of these catalogs. As for TRGT itself, it will work with different repeat definitions, especially if they capture the entire repeat region / variation (take a look at the "analysis of variation clusters" section in this paper for more information about this).
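
When reviewing a discrepant pair of definitions, it can help to print the reference bases each interval covers and the difference between their boundaries. The snippet below is a rough sketch on a toy sequence, not a TRGT feature: `bed_slice` and `boundary_offsets` are hypothetical helpers, and the only assumption about the format is the standard BED convention of a 0-based start with an exclusive end.

```python
def bed_slice(reference, start, end):
    """Return the bases for a BED interval (0-based start, end exclusive)."""
    return reference[start:end]


def boundary_offsets(a, b):
    """Return (start_diff, end_diff) of interval b relative to interval a.

    Returns None when the two intervals do not overlap at all.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= b_start or b_end <= a_start:
        return None
    return (b_start - a_start, b_end - a_end)


# Toy reference: left flank, four copies of CAG, right flank.
reference = "TTGA" + "CAG" * 4 + "GGCT"
tight = (4, 16)   # covers exactly the repeat
padded = (2, 18)  # same repeat plus two flanking bases on each side

print(bed_slice(reference, *tight))     # CAGCAGCAGCAG
print(bed_slice(reference, *padded))    # GACAGCAGCAGCAGGG
print(boundary_offsets(tight, padded))  # (-2, 2)
```

Both toy intervals contain the whole repeat; they differ only in how much flanking sequence is included, which is one of the representational differences described above.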

@hdashnow
Collaborator

hdashnow commented Nov 28, 2024

@egor-dolzhenko Great! I will try it following your advice. But I think STRchive website is not available now, maybe I can try it later. Thanks again!

@Checunmily
Thanks so much for reporting this! It looks like a bug was pushed to the STRchive main branch last night and broke the site. It should be all fixed now!
Keep an eye out for the release of a fully updated STRchive website in the coming weeks. We're adding lots of new features and will update the TRGT files with newly discovered loci.

@Checunmily

That will be a great help for me, @egor-dolzhenko. Indeed, I should read some papers first. I will contact you if I find more details in the future. Thank you so much!

@egor-dolzhenko
Collaborator

This sounds great, @Checunmily!
