Create new repeat definition file #49

BrendaLee1 · 2024-11-23T13:50:51Z

Hi,
Excellent tool! I checked the repeat file you provided and found that some of our repeat regions are not included in it. I wonder whether we can create our own repeat definition file for this tool, and how?

best,
Lai.

pbsena · 2024-11-25T12:17:47Z

Hello,

TRGT is flexible about the repeat definitions passed as input. The BED files provided in the repository are simply the most common pathogenic TRs of interest in hg38, but the program should work with other references and custom panels as well. The program takes the reads that map to each locus and realigns them to the flanking regions around each entry in the BED input.

To use different repeats, you just need a BED file with the genome coordinates and a fourth field defining ID and MOTIFS. The examples in the BED files provided in the repeats directory also contain the STRUC field, but it is not necessary. So this, for instance, would be sufficient:

chr1	57367043	57367119	ID=DAB1;MOTIFS=AAAAT,GAAAT
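
To make the layout concrete, here is a minimal sketch of how such a line could be read, assuming only what is shown above: four tab-separated columns, with `KEY=VALUE` pairs separated by `;` in the fourth. The function name `parse_definition` is hypothetical and is not part of TRGT.

```python
def parse_definition(line):
    """Split one tab-separated repeat definition line into its parts."""
    chrom, start, end, info = line.rstrip("\n").split("\t")
    # The fourth column holds KEY=VALUE pairs separated by semicolons.
    fields = dict(pair.split("=", 1) for pair in info.split(";") if pair)
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "id": fields["ID"],
        # MOTIFS is a comma-separated list of repeat units.
        "motifs": fields["MOTIFS"].split(","),
    }


example = "chr1\t57367043\t57367119\tID=DAB1;MOTIFS=AAAAT,GAAAT"
record = parse_definition(example)
print(record["id"], record["motifs"], record["end"] - record["start"])
```

A helper like this can be handy for sanity-checking a custom panel (for example, confirming every line has an ID and at least one motif) before passing it to the program.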

If you are interested in annotated loci in other genomes, we recommend the following resource, which has various relevant loci in multiple human assemblies and other mammalian organisms: https://strchive.org/

BrendaLee1 · 2024-11-26T11:16:54Z

Hi,
I tried the repeat definition format as you suggested, but got the error below:
Locus Processing: Error at BED line 60365: STRUC field missing, version 1.4.1.

I want to build a repeat definition file for multi-motif repeat regions. How should I write the structure field?

pbsena · 2024-11-26T12:02:23Z

Thanks for reporting, this shouldn't have happened and we'll look into the problem (we should be discontinuing the STRUC field). In the meantime, could you add STRUC=<LOCUS> to the end of your definition? Here LOCUS should be the name given in the ID field. In the above example it should look like this:

chr1	57367043	57367119	ID=DAB1;MOTIFS=AAAAT,GAAAT;STRUC=<DAB1>
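
For a file with many entries, the workaround could be applied with a small script. This is only a sketch under the assumptions stated in this thread (tab-separated BED, `KEY=VALUE` pairs in the fourth column, `STRUC=<ID>` appended at the end); `add_struc` is a hypothetical helper, not a TRGT function.

```python
def add_struc(line):
    """Append STRUC=<ID> to a definition line that lacks a STRUC field."""
    line = line.rstrip("\n")
    columns = line.split("\t")
    info = columns[3]
    fields = dict(pair.split("=", 1) for pair in info.split(";") if pair)
    if "STRUC" in fields:
        return line  # already present, leave the line untouched
    # Use the value of the ID field as the locus name inside <...>.
    columns[3] = info.rstrip(";") + ";STRUC=<" + fields["ID"] + ">"
    return "\t".join(columns)


before = "chr1\t57367043\t57367119\tID=DAB1;MOTIFS=AAAAT,GAAAT"
after = add_struc(before)
print(after)
```

Running the same function over every line of a custom BED file and writing the results to a new file should leave lines that already carry STRUC unchanged.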

Thank you!

BrendaLee1 · 2024-11-26T12:30:16Z

Hi,
Thank you for your advice, it works.

Checunmily · 2024-11-27T09:46:44Z

TRGT is flexible on the repeat definition passed as input, and the BED files provided in the code are simply the most common pathogenic TRs of interest in hg38,

Hello, I would like to know the differences between "pathogenic_repeats.hg38.bed" and "repeat_catalog.hg38.bed" that you provide. It seems that even the larger one covers far fewer regions than human_GRCh38_no_alt_analysis_set.trf.bed. Are these two BED files together what you mean by "simply the most common pathogenic TRs of interest in hg38"? I want to call TR variants but I do not have a specific target. Which one should I use, or should I use both?

Thanks in advance.

egor-dolzhenko · 2024-11-27T19:55:57Z

Thanks for the question. If you are interested in general analysis of tandem repeats, it might be best to profile both known pathogenic repeats and also all/most repeats across the genome.

We are in the process of updating repeat catalogs for TRGT. However, for pathogenic repeats, I'd recommend using the latest set from STRchive (you can download them from here in the TRGT format: https://strchive.org/database/index.html). For a genome-wide repeat catalog, you could use either adotto_repeats.hg38.bed.gz or variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz, although the latter catalog will most likely change in the near future.

Did I answer your questions? Please let me know if there is anything else you'd like to know.

Checunmily · 2024-11-28T02:00:34Z

@egor-dolzhenko Great! I will try it following your advice. But I think the STRchive website is not available right now; maybe I can try again later. Thanks again!

Checunmily · 2024-11-28T10:05:15Z

@egor-dolzhenko Hello, me again. Today I checked several of the TR definition files you mentioned and got much more confused. I picked several coordinates (mostly from the head of each BED file) and manually inspected the TR sequences in the reference I used (GCA_000001405.15_GRCh38_no_alt_analysis_set.fa), but they did not match.

In fact, I can hardly find a pair of coordinates that are exactly the same across different TR definition files. For example, pathogenic_repeats.hg38.bed differs slightly in start and end position from hg38.STRchive-disease-loci.TRGT.bed for a gene/disease that both definition files contain; I found a similar issue reported before in #28. variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz almost matches the reference, but not exactly, because its end coordinates do not line up well. adotto_repeats.hg38.bed.gz seems to give a region that includes the TR sequence rather than always being exactly the TR region (similar to human_GRCh38_no_alt_analysis_set.trf.bed?). repeat_catalog.hg38.bed seems to be the best one in my random checks, but I think it covers too few regions for genome-wide analysis.

Did I do something wrong? I would also like to know what happens if I use an inappropriate TR definition file. Will it produce more false negatives? How does TRGT use these definition files: is an approximate coordinate that overlaps the TR enough?

I am sorry to ask so much. I am just starting out in this field and have a lot to learn, so I would really appreciate it if you could point out my misunderstandings. Thank you!

egor-dolzhenko · 2024-11-28T15:36:07Z

@egor-dolzhenko Great! I will try it following your advice. But I think STRchive website is not available now, maybe I can try it later. Thanks again!

Sorry about that. You can access STRchive's data directly here: https://github.com/dashnowlab/STRchive.

egor-dolzhenko · 2024-11-28T16:07:37Z

Thanks for the comment @Checunmily. There are multiple ongoing projects that aim to define better repeat catalogs and are starting to tackle these representational issues (for example see this paper and this one). One challenge is that some repeat regions can be defined in multiple equally good ways. Also some people choose to include flanking sequences surrounding the repeats into their repeat definitions and some don't. These and other issues create disagreements between repeat catalogs and so it is usually a good idea to pick one genome-wide catalog for all your analyses. If you'd like, please feel free to pick one or two discrepant repeat definitions that we could review together. If we find that one definition is clearly better than the other, we could pass this feedback to the developers of these catalogs. As for TRGT itself, it will work with different repeat definitions, especially if they capture the entire repeat region / variation (take a look at the "analysis of variation clusters" section in this paper for more information about this).
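
To shortlist discrepant definitions for such a review, something along these lines might help. It is a rough sketch, not part of TRGT: it assumes both catalogs are tab-separated BED files with at least four columns, treats any two entries that overlap on the same chromosome as describing the same repeat, and uses a simple all-against-all comparison that is only practical for small panels. The function names are hypothetical, and the second example line is an invented, shifted copy of the DAB1 entry used purely for illustration.

```python
def load_intervals(lines):
    """Read (chrom, start, end, info) tuples from BED-formatted lines."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, info = line.rstrip("\n").split("\t")[:4]
        records.append((chrom, int(start), int(end), info))
    return records


def find_discrepancies(catalog_a, catalog_b):
    """Yield overlapping entries whose start or end coordinates differ."""
    for chrom_a, start_a, end_a, info_a in catalog_a:
        for chrom_b, start_b, end_b, info_b in catalog_b:
            overlaps = chrom_a == chrom_b and start_a < end_b and start_b < end_a
            if overlaps and (start_a, end_a) != (start_b, end_b):
                # Report how far catalog B's boundaries are from catalog A's.
                yield info_a, info_b, start_b - start_a, end_b - end_a


catalog_a = load_intervals(
    ["chr1\t57367043\t57367119\tID=DAB1;MOTIFS=AAAAT,GAAAT"]
)
# Invented entry: same locus with shifted boundaries, for illustration only.
catalog_b = load_intervals(
    ["chr1\t57367040\t57367125\tID=DAB1_shifted;MOTIFS=AAAAT,GAAAT"]
)
diffs = list(find_discrepancies(catalog_a, catalog_b))
for info_a, info_b, start_shift, end_shift in diffs:
    print(info_a, info_b, start_shift, end_shift)
```

For real catalogs, the lines could come from the open BED files (via `gzip.open(path, "rt")` for the `.bed.gz` ones). Large start or end shifts often just reflect whether flanking sequence was included in the definition.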

hdashnow · 2024-11-28T16:13:42Z

@egor-dolzhenko Great! I will try it following your advice. But I think STRchive website is not available now, maybe I can try it later. Thanks again!

@Checunmily
Thanks so much for reporting this! It looks like a bug was pushed to the STRchive main branch last night and broke the site. It should be all fixed now!
Keep an eye out for the release of a fully updated STRchive website in the coming weeks. We're adding lots of new features and will update the TRGT files with newly discovered loci.

Checunmily · 2024-11-29T02:13:13Z

That will be a great help for me, @egor-dolzhenko. Indeed, I should read some papers first. I will contact you if I find more details in the future. Thank you so much!

egor-dolzhenko · 2024-11-29T15:40:17Z

This sounds great, @Checunmily!
