Triggering Run Basic Checks #14

Open
GeoDirk opened this issue Aug 24, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@GeoDirk

GeoDirk commented Aug 24, 2022

For our purposes, when parsing through the USFM tokens, we keep coming across projects with malformed verse tags:

  • \v : empty verse tags
  • \v 4Tonga : places where the verse tag and the verse text run together
  • \v 4- : places where they accidentally forgot to finish off a verse range
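As a rough illustration of how the three cases above could be flagged, here is a minimal Python sketch. It assumes the check runs over raw USFM text rather than the plugin's token objects, and it uses a simplified idea of a valid verse number (`4`, `4a`, `4-6`). The function name `check_verse_tags` and the problem labels are hypothetical, not part of any Paratext API.

```python
import re

# A \v marker, optional blanks, then the run of characters that should be
# the verse number. The lookahead skips \va, \vp and closing markers.
VERSE_TAG = re.compile(r"\\v(?![a-z*])[ \t]*([^\s\\]*)")

# Simplified assumption of a well-formed number: 4, 4a, 4-6, 4a-6b.
VALID_NUMBER = re.compile(r"\d+[a-z]?(?:-\d+[a-z]?)?")
UNFINISHED_RANGE = re.compile(r"\d+[a-z]?-")


def check_verse_tags(usfm):
    """Return (line_number, number_text, problem) for each suspicious verse tag."""
    problems = []
    for line_no, line in enumerate(usfm.splitlines(), start=1):
        for match in VERSE_TAG.finditer(line):
            number = match.group(1)
            if VALID_NUMBER.fullmatch(number):
                continue  # well-formed, nothing to report
            if number == "":
                problem = "empty"
            elif UNFINISHED_RANGE.fullmatch(number):
                problem = "unfinished-range"
            elif number[0].isdigit():
                problem = "run-together"
            else:
                problem = "no-number"
            problems.append((line_no, number, problem))
    return problems
```

Calling `check_verse_tags` on a chapter's text returns one tuple per bad tag, which could feed the kind of user-facing alert described here. It is not a substitute for the checks built into Paratext.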

I'm sure that we will find more of these types of USFM errors as we go along. We can usually detect them in our plugin and report them back to the user. Ideally, however, we could trigger the "Run Basic Checks" function, make the user clean up the mess, and get back a report on what is bad. The basic checks would have caught everything we've encountered thus far. This would be a new enhancement to the API.

Alternatively, you probably have a standard library that can look at the USFM and run the checks. Is anything like that available in your public libraries?

@FoolRunning FoolRunning added the enhancement New feature or request label Aug 24, 2022
@FoolRunning
Contributor

Unfortunately, all the checking code is currently in the Paratext executable (which is not public).

@tombogle
Collaborator

I wonder if maybe something could be added to the API to indicate that you want the USFM tokens, but only if/when the checks have been run and passed cleanly.

@FoolRunning
Contributor

@GeoDirk, I'm not sure how clean you need your data to be, or whether this would work for what you need, but you could try getting the USX first using strict=true to make sure that the data is clean before reading in the tokens.

@GeoDirk
Author

GeoDirk commented Aug 25, 2022

Would using USX with strict=true skip verse data? If so, I would rather not go that route and would stick with the alerts I have now from parsing the USFM.

Basically I need to produce the equivalent of:

01001001 In the beginning, God created the heavens and the earth.
01001002 ...

That is why bad verse tags are problematic. Obtaining the verse text without the extra attributes has been surprisingly easy by parsing through your USFM tokens.
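To make the target format concrete, here is a small Python sketch of how such lines might be built. It assumes the 8-digit prefix is a two-digit book number, three-digit chapter, and three-digit verse (inferred only from the `01001001` example above), and the helper names `verse_id` and `format_verse_line` are hypothetical.

```python
def verse_id(book, chapter, verse):
    """Build an 8-digit id, assumed layout: BB (book) CCC (chapter) VVV (verse)."""
    if not (1 <= book <= 99 and 1 <= chapter <= 999 and 1 <= verse <= 999):
        raise ValueError(f"reference out of range: {book} {chapter}:{verse}")
    return f"{book:02d}{chapter:03d}{verse:03d}"


def format_verse_line(book, chapter, verse, text):
    """Return one output line: the id, a space, then whitespace-normalized text."""
    clean_text = " ".join(text.split())
    return f"{verse_id(book, chapter, verse)} {clean_text}"
```

Because `verse_id` needs a single integer verse, a tag such as `\v 4-` or `\v 4Tonga` cannot be mapped to an id without guessing, which is where the precision problem comes from.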

We send this data off for NLP processing to look for alignments, which is why we need the precision.

@FoolRunning
Contributor

FoolRunning commented Aug 25, 2022

Yeah, it probably won't work to use strict=true, since that will validate a bunch of other stuff you probably don't want validated (it's designed to get the data to a pristine state, e.g. for upload to DBL).


3 participants