Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-intuitive messytables behavior #90

Open
ThrawnCA opened this issue Dec 3, 2019 · 6 comments
Open

Non-intuitive messytables behavior #90

ThrawnCA opened this issue Dec 3, 2019 · 6 comments

Comments

@ThrawnCA
Copy link
Collaborator

ThrawnCA commented Dec 3, 2019

In testing XLoader and the Data Dictionary recently, I noticed that after setting just one column type to 'timestamp' on a new resource and re-submitting it, the whole Data Dictionary was populated with appropriate types. Checking the Datastore log, I realised that this was because our non-ISO dates (dd/mm/yyyy) had caused a direct PostgreSQL COPY to fail, so it fell back to using messytables - which meant that it guessed all of our types.

Although this behaviour is actually rather useful, it seems rather inconsistent that you get full type-guessing by providing invalid input, yet no such guessing if your data is entirely valid. It's also unfortunate that it presumably replaces any Data Dictionary overrides you may have entered manually.

So, there are two issues:

  • If there are any existing Data Dictionary overrides, then messytables probably should not be permitted to overwrite them.

  • How easily could the use of messytables be made configurable? It occurs to me that it would be useful to have at least a setting to use messytables on first creation of a resource, while attempting a fast COPY for further updates. I'm sure that there are also a lot of potential XLoader users (possibly even my own workplace) who would prefer the maximum-backwards-compatibility option of always using messytables, despite the more modest speed gains compared to DataPusher - and as a side benefit, this wouldn't result in the log warning when PostgreSQL COPY fails (which can confuse non-technical users).

@ThrawnCA
Copy link
Collaborator Author

ThrawnCA commented Dec 3, 2019

Looks like #36 already addressed the first point.

@davidread
Copy link
Contributor

Good points. This is all an excellent area to improve, it's not something me or my client are actively pursuing, so I'm happy for you to do this.

Definitely when messytables loads it should respect the Data Dictionary.

On detecting types in general use, I'm not keen to rely too much on the messytables load because it does a tiresome conversion to chunks of JSON and then INSERT statements - requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

However we might still use messytables' type detection ability. I think it scans the first chunk of the CSV and makes a simple decision - I can't remember what the basis is. We could try setting the column types on that basis (the datastore_create) and try the COPY, and if that fails revert to string columns and redo the COPY?

A better alternative might be to still load initially as strings, then use messytables or some other heuristic to guess types, and then see if postgres itself can successfully cast a column to that type. If it can, we set the Data Dictionary for next time. And if next time it fails, then maybe it reverts to strings, COPY and then try the casting thing again.

Dunno what's best, but maybe that sparks some ideas, or a basis for experimentation?

@ThrawnCA
Copy link
Collaborator Author

ThrawnCA commented Dec 3, 2019

Well, we have a bunch of data that uses dd/MM/yyyy dates, and will continue to do so for the foreseeable future, which our clients want to have parsed and presented via the API as dates. So continuing to use messytables for those will probably be necessary for us.

(Incidentally, it's further counter-intuitive that messytables will set the Data Dictionary for those fields to 'timestamp', but then they'll fail PostgreSQL COPY because it can't recognise them as timestamps.)

I really think that if nothing else, it would be good to have a "compatibility mode" to just keep using messytables and behaving more like the DataPusher. You're right that it's potentially less performant than a direct COPY, but sometimes it's what people would prefer, especially since it means that they don't have to change their processes. And if someone knows that their data will fail COPY and will fall back every time, then it's actually faster to just use messytables immediately.

requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

While that is true, it may not matter much in cases where CKAN is in a load-balanced (and possibly auto-scaled) cluster (which ours is).

@ThrawnCA
Copy link
Collaborator Author

ThrawnCA commented Dec 3, 2019

If it can, we set the Data Dictionary for next time.

Hmm. This sounds like a potential enhancement to the new paster migrate_types command, attempting to optimise the Data Dictionary for the capabilities of COPY and messytables.

I can think of several possible scenarios:

  • Data Dictionary overrides exist. Test them with a dry-run COPY.
    • COPY succeeds. Do nothing.
    • COPY fails. Flag resource for messytables-only loads? Revert to strings?
  • Data Dictionary overrides do not exist. Guess potential overrides using a dry-run messytables.
    • COPY succeeds with all overrides in place. Apply overrides permanently.
    • One or more overrides make COPY fail. Populate all overrides from messytables and flag for messytables-only loads in future? Keep using strings? Populate only the successful overrides?
  • Resource is already flagged for messytables-only loads. Re-test COPY anyway?

This would likely be quite slow, but it's an interesting idea.

ThrawnCA referenced this issue in qld-gov-au/ckanext-xloader Dec 4, 2019
…es if it is set, #90

- This will be slower on resources that don't need type-guessing, but will behave much more like the DataPusher, which simplifies the transition.
@ThrawnCA
Copy link
Collaborator Author

ThrawnCA commented Oct 4, 2022

On detecting types in general use, I'm not keen to rely too much on the messytables load because it does a tiresome conversion to chunks of JSON and then INSERT statements - requires a lot of CPU and it runs on the same machine as CKAN, which was not necessarily the case with DataPusher's separate queue.

It's actually still not necessarily the case. You totally can run XLoader on a separate machine; we've been doing that on www.data.qld.gov.au for years now.

@davidread
Copy link
Contributor

@ThrawnCA I'm afraid I'm no longer involved, so I'll leave this with you and the CKAN community - best wishes

JVickery-TBS pushed a commit to JVickery-TBS/ckanext-xloader that referenced this issue May 16, 2024
[QOLSVC-5123] skip rows that are completely blank instead of erroring out
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants