Handle whitespace in base64 strings in fathom extract #254

Open
danielhertenstein opened this issue Jul 21, 2020 · 0 comments
Labels
python Requires work in primarily the Python language

Comments

@danielhertenstein
Collaborator

Sample EN_B59a of the fathom-form-autofill corpus contains two base64 strings that were already present in the page when it was frozen, and both contain whitespace. fathom extract finds base64 strings with a regular expression that does not allow whitespace, so it treats a string as finished at the first whitespace character. @erikrose found a link saying that whitespace is allowed in quoted base64 strings but is ignored when decoding. We should add support for this quoted case to fathom extract.

We do not expect whitespace in any of the strings created by freeze-dry (the tool we use to freeze/save samples), so this should only affect some of the pages that already contained base64 strings before freezing. Because the problem is rare and shouldn't prevent feature vectors from being created for a sample, this is low priority.
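For illustration, here is a rough sketch of how the quoted case might be handled. It is not the regular expression fathom extract currently uses; the pattern, the `decode_quoted_base64` helper, and the sample markup are all hypothetical. The idea is to allow whitespace in the payload only when the data URL is wrapped in quotes, then strip that whitespace before decoding.

```python
import base64
import re

# Hypothetical pattern (an assumption, not fathom extract's real regex):
# match a quoted base64 data URL and allow whitespace inside the payload.
QUOTED_BASE64_URL = re.compile(
    r"""(?P<quote>["'])"""                  # opening quote
    r"""data:(?P<mime>[\w/+.-]+);base64,"""  # MIME type and base64 marker
    r"""(?P<payload>[A-Za-z0-9+/=\s]+?)"""   # payload, whitespace permitted
    r"""(?P=quote)"""                        # matching closing quote
)


def decode_quoted_base64(html):
    """Yield (mime_type, decoded_bytes) for each quoted base64 data URL."""
    for match in QUOTED_BASE64_URL.finditer(html):
        # Whitespace carries no meaning here, so drop it before decoding.
        payload = re.sub(r"\s+", "", match.group("payload"))
        yield match.group("mime"), base64.b64decode(payload)


# Made-up sample: the payload is "hello world" split by a newline,
# spaces, and a tab.
sample = '<img src="data:image/png;base64,aGVs\n  bG8g d29y\tbGQ=">'
results = list(decode_quoted_base64(sample))
```

Requiring the closing quote keeps the unquoted case unchanged, where whitespace would still end the string.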

@danielhertenstein danielhertenstein added the python Requires work in primarily the Python language label Jul 21, 2020