Checker performance options #392

Open
jpmckinney opened this issue Jul 10, 2023 · 2 comments
Comments

@jpmckinney
Member

One idea is to check the original packages. This would mean using a new check table that links to collection_file (instead of release_check and record_check linking to release and record).

However:

  1. It might be more difficult to analyze errors by OCID (see Error summary section of this notebook).
  2. It might consume too much memory. Some packages are extremely large.
jpmckinney added the steps and performance labels and removed the steps label on Jul 10, 2023
@jpmckinney
Member Author

jpmckinney commented Jul 10, 2023

Blocked by open-contracting/lib-cove-ocds#56 re: item 2 above.

@jpmckinney
Member Author

jpmckinney commented Aug 23, 2023

> It might consume too much memory. Some packages are extremely large.

Indeed: Colombia files, for example, are a few GBs.

> Blocked by open-contracting/lib-cove-ocds#56 re: item 2 above.

In lib-cove-ocds, there's the option to read the file from disk. In that case, ijson can parse iteratively. (Would need to parse twice – once for package data and once for each release, like in file_worker.py.)

In kingfisher-process, we'd also have to read the file from disk – not from the DB, as I don't think it's possible to stream jsonb (or bytea) out of PostgreSQL. Edit: Kingfisher Process can read one release at a time from the DB.
