BED files produced are not tabulated #11
Comments
I agree we should modify things to simplify post-processing. Do we know what the "most correct" format is under the bedGraph specification? As far as I know, it is a UCSC-defined file format, and they seem to allow any whitespace as a delimiter between columns: http://genome.ucsc.edu/goldenPath/help/bedgraph.html In their examples, they use a single space rather than a tab. The discussion is more thorough in their BED file specification: http://genome.ucsc.edu/FAQ/FAQformat#format1
Likewise, in their examples there, they use a mix of single spaces, multiple spaces, and tabs to delimit the columns. So I am fairly surprised that even the UCSC tools would not properly handle their own file formats. Do you have examples of this? We may want to contact them about this issue. I know the Broad Institute's IGV browser (which is not bound by the UCSC standards, AFAIK) demands tabs in BED files: https://software.broadinstitute.org/software/igv/BED
but I am not sure what they expect or handle for bedGraph, since they simply point to the UCSC page, which is not specific. IGV does, however, demand that bedGraph files have a ".bedgraph" file extension: https://software.broadinstitute.org/software/igv/bedgraph Welcome to the conflicting world of bioinformatics, where there is no equivalent of the IEEE enforcing standards as it does for general-purpose computing (e.g., floating point). Even well-entrenched tools like samtools do not seem to handle the BED file format properly:
About the whitespace: I am fairly certain that the BED files currently being produced have more than a single whitespace character between columns. In fact, I would say that each line has a different number of them. However, even with single spaces, the UCSC tools that I have tried do not work.
Here is a small example that I ran on MARCC. I went ahead and tried the UCSC tool bedSort to see what actually happens. If you go to /scratch/groups/afeinbe2/shared/data/toy/output you will see the following files:
jordi_bed_test.ws.bed: a single-space delimited BED file with no header.
And look what happened in each case:
jordi@marcc:output$ bedSort jordi_bed_test.ws.bed jordi_bed_test.ws.bed.sort
So the command did not work with the single-space BED file, but it worked just fine with the tabulated BED file (logically, it did not work with the original BED file either). As you can see, the command expects a tabulated BED file in this case. In addition to this example, I also found that the bedGraphToBigWig command had issues with whitespace when preparing the BW files for John.
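To check quickly which lines of a file would trip a tab-only tool, something like the following Python sketch could help. It is not part of informME or the UCSC toolbox; the function name `non_tab_lines` and the `min_fields` parameter are made up for illustration, and it assumes header lines start with "#", "track", or "browser".

```python
def non_tab_lines(lines, min_fields=3):
    """Return 1-based numbers of data lines lacking tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\r\n")
        # Skip blank lines, comments, and UCSC "track"/"browser" headers.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        # A tab-delimited BED line has at least chrom, start, and end.
        if len(stripped.split("\t")) < min_fields:
            bad.append(number)
    return bad


example = [
    "chr1 100 200\n",       # single spaces
    "chr1\t100\t200\n",     # tabs
    "chr1   300    400\n",  # variable-width spaces
]
print(non_tab_lines(example))  # [1, 3]
```

On a real file this would be called as `non_tab_lines(open(path))`; an empty list means every data line already has tab-separated columns.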
Yes, I agree that informME at present can produce a variable number of whitespace characters. What I am saying is that the specification is whitespace-delimited, which should mean that all combinations of whitespace (other than newlines) are valid, similar to how awk splits columns separated by any amount of whitespace. But given that tabs seem to be favored by some tools, let's proceed with tabs. I find it especially egregious that the example from the UCSC website, which uses single spaces, does not work with their own tools. Making informME output tab-delimited will simplify our users' lives with downstream tools, so let's do it.
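For files that have already been written with variable whitespace, a small conversion pass is one way to get tab-delimited output without rerunning informME. The Python below is only a sketch under assumptions: `to_tab_delimited` is a hypothetical helper, not existing informME code, and it leaves lines starting with "#", "track", or "browser" untouched because those header lines legitimately contain spaces.

```python
def to_tab_delimited(line):
    """Rewrite one whitespace-delimited BED/bedGraph line with tabs."""
    stripped = line.rstrip("\r\n")
    # Leave blank lines, comments, and track/browser headers as they are.
    if not stripped.strip() or stripped.startswith(("#", "track", "browser")):
        return stripped
    # str.split() with no argument splits on any run of whitespace.
    return "\t".join(stripped.split())


print(repr(to_tab_delimited("chr1   100 \t 200  0.5\n")))
# 'chr1\t100\t200\t0.5'
```

Note that this collapses every run of whitespace, so it is only appropriate when no column value contains a space.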
Ok great! So the files affected by this change are:
Am I missing anything here? Do you want me to create a branch and take care of it?
Sure. Though why would bed2bw.sh need updating? Right now it handles the columns via awk, which should handle any whitespace. I don't think I would want to make it more restrictive so that it only works on tabs, since it is a useful utility for converting any proper bedGraph file (such as those from current versions of informME).
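The backwards-compatibility argument can be illustrated with a reader that, like awk's default field splitting, accepts any run of whitespace: it parses both the current space-delimited output and the proposed tab-delimited output identically. This is a Python sketch for illustration, not the contents of bed2bw.sh; `parse_bedgraph_line` is a hypothetical name, and it assumes the four standard bedGraph columns.

```python
def parse_bedgraph_line(line):
    """Parse a bedGraph data line, splitting on any run of whitespace."""
    # Extra columns, if any, are ignored in this sketch.
    chrom, start, end, value = line.split()[:4]
    return chrom, int(start), int(end), float(value)


old_style = "chr1   1000    2000   0.25\n"   # variable-width spaces
new_style = "chr1\t1000\t2000\t0.25\n"       # tabs
print(parse_bedgraph_line(old_style) == parse_bedgraph_line(new_style))  # True
```

A reader that split strictly on "\t" would instead reject the old-style line, which is the restriction being argued against here.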
Oh yeah, good point. Then let's leave it as it is and verify that there is no issue with the new format. That's probably best for backwards compatibility, as you say. Okay, I will start working on it!
Closed by #12
The BED files that informME produces are currently not tab-delimited. This somewhat complicates post-processing with commonly used tools, such as those in the UCSC toolbox. Fixing this would mean modifying the fprintf commands in the MATLAB scripts.