BED files produced are not tabulated #11

Closed
jordiabante opened this issue May 28, 2018 · 7 comments

Comments

@jordiabante
Collaborator

BED files produced by informME are currently not tab-delimited. This complicates post-processing with commonly used tools, such as those in the UCSC toolbox. Fixing this would mean modifying the fprintf commands in the MATLAB scripts.
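To see whether a given file is affected, a quick check like the following may help. It is only a sketch: the function name `count_untabbed_lines` is hypothetical, and it assumes that no field contains embedded spaces (true for plain bedGraph columns, not necessarily for BED name fields). Lines starting with `track`, `browser`, or `#` are treated as headers and skipped.

```shell
# Hypothetical helper: count data lines whose columns are not purely tab-separated.
# A line is flagged when splitting on tabs yields a different number of fields
# than awk's default whitespace splitting (NF).
count_untabbed_lines() {
    awk '!/^(track|browser|#)/ && split($0, parts, "\t") != NF { n++ }
         END { print n + 0 }' "$1"
}
```

A result of 0 suggests the data lines are already tab-delimited under the assumption above.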

@GarrettJenkinson
Owner

I agree we should modify things to simplify post-processing. Do we know what the "most correct" delimiter is under the bedGraph specification? As far as I know, it is a UCSC-defined file format, and they seem to allow any whitespace as a delimiter between the columns:

http://genome.ucsc.edu/goldenPath/help/bedgraph.html

And in their examples, they use a single space rather than a tab. The discussion is more thorough for their BED file specification:

http://genome.ucsc.edu/FAQ/FAQformat#format1

BED fields in custom tracks can be whitespace-delimited or tab-delimited. Only some variations of BED types, such as bedDetail, require a tab character delimitation for the detail columns.

Likewise, in their examples there, they use a mix of single spaces, multiple spaces, and tabs to delimit the columns.

So I am fairly surprised that even the UCSC tools would not properly handle their own file formats. Do you have examples of this? We may want to contact them about this issue.

I know the Broad institute's IGV browser (that is not associated with the UCSC's standards AFAIK) demands tabs in bed files:

https://software.broadinstitute.org/software/igv/BED

A BED file (.bed) is a tab-delimited text file that defines a feature track. It can have any file extension, but .bed is recommended.

but I am not sure what they expect or handle for bedGraph, since they simply point to the UCSC page, which is not specific. IGV does, however, demand that bedGraph files have a ".bedgraph" file extension:

https://software.broadinstitute.org/software/igv/bedgraph

Welcome to the conflicting world of bioinformatics, where there is no equivalent of the IEEE enforcing standards as they do for general-purpose computing (e.g., floating point). Even well-entrenched tools like samtools do not seem to handle the BED file format properly:

samtools/tabix#4

@jordiabante
Collaborator Author

About the whitespace: I am fairly certain that the BED files being produced right now have more than a single space between columns. In fact, I would say that each line has a different number of spaces. However, even with single spaces, the UCSC tools that I have tried do not work. Here is a small example that I ran on MARCC.

I went ahead and tried the UCSC tool bedSort to see what actually happens. If you go to

/scratch/groups/afeinbe2/shared/data/toy/output

you will see the following files:

jordi_bed_test.ws.bed: a single-space delimited BED file with no header.
jordi_bed_test.tab.bed: a tab delimited BED file with no header.

And look what happened in each case:

jordi@marcc:output$bedSort jordi_bed_test.ws.bed jordi_bed_test.ws.bed.sort
Expecting tab in bed line chr1 0 149 0.734712
jordi@marcc:output$bedSort jordi_bed_test.tab.bed jordi_bed_test.tab.bed.sort
jordi@marcc:output$

So the command failed on the single-space BED file but worked fine on the tab-delimited one (unsurprisingly, it also failed on the original BED file). As you can see, bedSort expects a tab-delimited BED file in this case. In addition to this example, I also found that the bedGraphToBigWig command had issues with whitespace when I was preparing the BW files for John.
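Until the MATLAB output itself is changed, an existing whitespace-delimited file can be converted with awk. The following is a minimal sketch, not something that was run as part of this test: the function name `to_tab_delimited` is hypothetical, and it assumes that no field contains embedded spaces and that `track`, `browser`, and `#` lines should be passed through unchanged.

```shell
# Hypothetical helper: rewrite a whitespace-delimited BED/bedGraph file with tabs.
to_tab_delimited() {
    awk 'BEGIN { OFS = "\t" }
         /^(track|browser|#)/ { print; next }   # keep header and comment lines as they are
         NF == 0 { next }                       # drop blank lines
         { $1 = $1; print }                     # reassigning a field rebuilds the line using OFS
        ' "$1"
}
```

For example, `to_tab_delimited input.bed > input.tab.bed` (file names are placeholders) writes a tab-delimited copy without touching the original.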

@GarrettJenkinson
Owner

Yes, I agree that informME can at present produce a variable number of whitespace characters between columns. My point is that the specification is whitespace-delimited, which should mean that any combination of whitespace (other than newlines) is valid, similar to how awk splits out columns regardless of the amount of whitespace between them.
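The awk behavior referred to here can be illustrated with a small sketch. The function name `field_count_and_score` is hypothetical; it relies only on awk's default field splitting, which treats any run of spaces and tabs as one separator.

```shell
# Hypothetical helper: print the number of fields and the 4th column of each line,
# using awk's default field separator (any run of spaces and/or tabs).
field_count_and_score() {
    awk '{ print NF, $4 }' "$1"
}
```

A line delimited by single spaces, by tabs, or by an irregular mix of both yields the same field count and the same fourth column.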

But given that tabs seem to be favored in some tools, let's proceed with tabs. I find it especially egregious that using the example from the UCSC website with single spaces will not work with their own tools. Changing informME to be tab-delimited will simplify our users' lives in downstream tools, so let's do it.

@jordiabante
Collaborator Author

jordiabante commented May 28, 2018

Ok great!

So the files affected by this change are:

  • makeBedsForMethAnalysis.m
  • diffMethAnalysisToBed.m

Am I missing anything here? Do you want me to create a branch and take care of it?

@GarrettJenkinson
Owner

Sure. Though why would bed2bw.sh need updating? Right now it handles the columns via awk, which should handle any whitespace. I don't think we want to make it more restrictive so that it only works on tabs, since it is a useful utility for converting any proper bedGraph file (such as those from current versions of informME).

@jordiabante
Collaborator Author

Oh yeah, good point. Then let's leave it as it is and verify there's no issue with the new format. That's probably best for backwards compatibility, as you say. Okay, I will start working on it!

@GarrettJenkinson
Owner

Closed by #12.
