cannot parse dicom_ontology.owl #14

Open
tgbugs opened this issue Oct 17, 2018 · 5 comments

@tgbugs

tgbugs commented Oct 17, 2018

@khelm I was trying to parse dicom_ontology.owl with my usual suite of ttl parsers (rapper among others gives pretty good debug info) and noticed that there are multiple cases where the descriptions are malformed. I think all that needs to be changed is to add an additional cleaning step and proper escaping rules after

definition = definitionGroup.group(1) # has quotes already
(I don't have access to the source data files or I would try it myself). I think running it through rdflib.Literal or json.dumps may be sufficient.

Some examples of issues.

  1. Strange <200b> char (U+200B, zero-width space) at the end of every definition (I have to open it in vim to see this).
  2. Internal double quotes are not escaped.
  3. The backslash char \ is not escaped so parsers try to interpret things like -1\-1 as an escape sequence and fail.
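A minimal sketch of the cleaning step suggested above, covering all three issues. It assumes the captured definition is a Python string that already carries one pair of surrounding double quotes (as the `# has quotes already` comment suggests); `clean_definition` is a hypothetical helper, not part of the generating code. It uses `json.dumps` for the escaping, on the assumption that its escapes for backslashes, double quotes, and control characters are also accepted inside a Turtle string literal.

```python
import json

ZERO_WIDTH_SPACE = "\u200b"


def clean_definition(raw: str) -> str:
    """Return raw as a double-quoted, escaped literal (hypothetical helper)."""
    # Issue 1: drop the zero-width space, wherever it appears.
    text = raw.replace(ZERO_WIDTH_SPACE, "").strip()
    # Assumption: the captured group is wrapped in one pair of double quotes.
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    # Issues 2 and 3: json.dumps escapes double quotes and backslashes and
    # adds the surrounding quotes; ensure_ascii=False keeps UTF-8 intact.
    return json.dumps(text, ensure_ascii=False)


example = '"Values like -1\\-1 and "quoted" text\u200b"'
print(clean_definition(example))
```

The helper expects raw, unescaped text; running it on a definition that is already escaped would double the backslashes.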

An incomplete set of fixes with the examples (as a patch).
ontdiff.txt

@khelm
Contributor

khelm commented Oct 18, 2018

Hi @tgbugs -

  1. 200b is a zero-width non-printing character. Is that actually causing the ttl parser to crash or just an oddity? It should be easy enough to filter out.
  2. The DICOM standard has a couple of issues in the Descriptions and Attribute Names. The first is that some Attribute Names have apostrophes in them ("Physician's Name"), so using double quotes for the string keeps the quote order from getting out of sync. But, as you found, the Descriptions also have unescaped double quotes in them, and backslashes as well. I will look into separate code to escape those characters.

Also, I have uploaded a similar python dict file that includes the tag value and the definition/notes text from the DICOM XML docbook. Try running your units-detecting code on that file and see how it goes. I kept the utf-8 encoding so that things like the mu and degrees symbols were still intact. This is not true in the current owl file in which I substituted u's for mu's.

@tgbugs
Author

tgbugs commented Oct 18, 2018

  1. I don't think the zero width is what was causing the parsing error (and it is a simple s/<200b>//g fix).
  2. Ok, sounds good. In the meantime I may have a way to automatically fix those issues using obo:IAO_0000115 and ; as delimiters.
  3. Great, I will take a look.
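One possible reading of the delimiter idea in item 2, sketched under loud assumptions: each definition sits on a single line of the form `obo:IAO_0000115 "<text>" ;`, and the text between the first quote and the last quote before the closing `;` is raw and unescaped. `repair_line` and `DEFINITION_LINE` are hypothetical names; this is not a record of the fix that was actually applied.

```python
import json
import re

# Assumption: one definition per line, shaped like
#   obo:IAO_0000115 "<text>" ;
DEFINITION_LINE = re.compile(r'^(\s*obo:IAO_0000115\s+)"(.*)"(\s*;\s*)$')


def repair_line(line: str) -> str:
    """Re-escape the definition on one line; return other lines unchanged."""
    # Strip U+200B first: it is not whitespace, so it would block the match
    # if it sits between the closing quote and the semicolon.
    cleaned = line.replace("\u200b", "")
    match = DEFINITION_LINE.match(cleaned)
    if match is None:
        return line
    prefix, body, suffix = match.groups()
    # The greedy (.*) runs to the last quote before the final ';', so inner
    # quotes and semicolons stay in the body and get escaped here.
    return prefix + json.dumps(body, ensure_ascii=False) + suffix


line = '    obo:IAO_0000115 "Range -1\\-1 for "signed" data\u200b" ;\n'
print(repair_line(line), end="")
```

Because the match is anchored on the final `" ;` of the line, a `;` inside a definition does not end it early, but a definition that spans several lines would be skipped.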

@mick-d

mick-d commented Apr 12, 2019

Hi, I can confirm that I could not parse the OWL file either, after trying several different tools (e.g. OWLGrEd, WebVOWL).

@khelm
Contributor

khelm commented Apr 16, 2019

Thanks @mick-d, we're working on it. @tgbugs what is the solution using IAO 0000115? Did you have to do something to get OWL to recognize ";" as a delimiter? Are there any instances of that character in the definitions?

@tgbugs
Author

tgbugs commented Apr 16, 2019

I reviewed my .bash_history file to see what I did, and unfortunately it looks like I made all the changes using vim's ex mode (:%s/a/b/) so I don't have a record of what I did. I didn't make any changes to the generating code.
