[PROJECT IDEA] Flattened document-link #26

Open
stevieflow opened this issue Feb 21, 2021 · 2 comments

Comments

@stevieflow

stevieflow commented Feb 21, 2021

Rationale

There are thousands of document-links available via IATI data. Each carries a range of metadata, both on the document-link itself and on its parent iati-activity, which people might find useful to access for a variety of reasons.

Proposal

A service (perhaps built off Classic + iati-flattener: iati-data-access/iati-flattener#1?) to output a list of document-link items. This would pull in relevant data from the specific document-link element as well as from activity-level elements such as reporting-org (name, ref, type), recipient-country, sector(s) and activity-status.

Users could query this list and get it in spreadsheet, JSON and XML formats.
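
To make the proposal concrete, here is a minimal sketch of what "flattening" could look like: one output row per document-link, with the parent activity's fields repeated on each row. It assumes IATI 2.x-style XML and only looks at document-link elements that are direct children of iati-activity. The sample data, the column names and the functions `flatten_document_links` and `to_csv` are all made up for illustration; this is not how Classic or iati-flattener actually works.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up sample data for illustration only (not real IATI records).
SAMPLE_XML = """<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-1</iati-identifier>
    <reporting-org ref="XM-EXAMPLE" type="21"><narrative>Example Org</narrative></reporting-org>
    <activity-status code="2"/>
    <recipient-country code="KE"/>
    <sector code="11110"/>
    <sector code="12220"/>
    <document-link url="https://example.org/report.pdf" format="application/pdf">
      <title><narrative>Annual report</narrative></title>
      <category code="A01"/>
    </document-link>
    <document-link url="https://example.org/budget.xlsx" format="application/vnd.ms-excel">
      <title><narrative>Budget</narrative></title>
      <category code="A05"/>
    </document-link>
  </iati-activity>
</iati-activities>"""


def _codes(parent, tag):
    """Join the code attributes of repeated children, e.g. several sectors."""
    return ";".join(el.get("code", "") for el in parent.findall(tag))


def flatten_document_links(xml_text):
    """Return one dict per document-link, with parent activity fields repeated."""
    rows = []
    for activity in ET.fromstring(xml_text).iter("iati-activity"):
        org = activity.find("reporting-org")
        shared = {
            "iati_identifier": (activity.findtext("iati-identifier") or "").strip(),
            "reporting_org_ref": org.get("ref", "") if org is not None else "",
            "reporting_org_type": org.get("type", "") if org is not None else "",
            "reporting_org_name": (activity.findtext("reporting-org/narrative") or "").strip(),
            "recipient_country_codes": _codes(activity, "recipient-country"),
            "sector_codes": _codes(activity, "sector"),
            "activity_status": _codes(activity, "activity-status"),
        }
        # Only activity-level document-links are handled in this sketch.
        for link in activity.findall("document-link"):
            rows.append({
                **shared,
                "url": link.get("url", ""),
                "format": link.get("format", ""),
                "title": (link.findtext("title/narrative") or "").strip(),
                "category_codes": _codes(link, "category"),
            })
    return rows


def to_csv(rows):
    """Serialise the flattened rows as CSV text (spreadsheet-friendly)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten_document_links(SAMPLE_XML)
print(to_csv(rows))
print(json.dumps(rows, indent=2))
```

Repeated elements such as sector are joined with ";" here so that each document-link stays on a single row; a real service might prefer separate columns or separate rows.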

@notshi

notshi commented Apr 1, 2021

Hi @stevieflow, there are 1,966,749 document-links published across all activities, of which 392,533 are unique URLs, so this is a massive query.

This started from me wanting to find out whether this was possible via dquery, so I hope it's ok to share!

Getting a list of flattened document-links is expensive but doable.
Here are the first 100 results for flattened document-link only.

However, including iati-activity level elements slowed the query down to the point where I don't think it is practical. In fact, we ran some tests and could not download all the data, so it would probably be better to run code over all the data and bypass a database entirely.
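
As a rough idea of what "running code over all the data" might look like, here is a sketch that streams XML with `iterparse` and counts document-links and unique URLs without a database. It assumes the data is available as a set of IATI XML files or streams; the function name `count_document_links` and the sample bytes are made up for illustration, and this is not the code we actually ran.

```python
import io
import xml.etree.ElementTree as ET


def count_document_links(xml_streams):
    """Return (total document-links, number of unique urls) across all streams."""
    total = 0
    unique_urls = set()
    for stream in xml_streams:
        # Handle each activity as soon as its closing tag has been read.
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag != "iati-activity":
                continue
            # iter() also finds document-links nested deeper in the activity.
            for link in element.iter("document-link"):
                total += 1
                url = link.get("url")
                if url:
                    unique_urls.add(url)
            # Drop the activity's children so memory use stays low.
            element.clear()
    return total, len(unique_urls)


# Made-up sample data; in practice these would be opened files.
file_a = (b'<iati-activities><iati-activity>'
          b'<document-link url="https://example.org/a.pdf"/>'
          b'<document-link url="https://example.org/b.pdf"/>'
          b'</iati-activity></iati-activities>')
file_b = (b'<iati-activities><iati-activity>'
          b'<document-link url="https://example.org/a.pdf"/>'
          b'</iati-activity></iati-activities>')

print(count_document_links([io.BytesIO(file_a), io.BytesIO(file_b)]))
```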

Here are the first 100 results including iati-activity elements.
It might be better to use 'Download XSON' to get the data in various formats without running the query in the browser, as running it could slow your browser down!

We found that this query doesn't scale: downloads time out as there's too much data to comb through.
Simplifying the query, however, makes it feasible and puts much less stress on the database.

Here are the first 10,000 results for a much simpler document-link url query.
Changing the offset should allow you to page through the results.
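
Paging by offset could be scripted roughly like this. It is only a sketch: `fetch_page` is a hypothetical stand-in for whatever actually runs the query with a given limit and offset (no real dquery call is shown), and here it is faked with an in-memory list.

```python
def iter_pages(fetch_page, limit=10000, start=0):
    """Yield rows page by page until a short or empty page is returned."""
    offset = start
    while True:
        rows = fetch_page(limit, offset)
        if not rows:
            break
        yield from rows
        if len(rows) < limit:
            # A short page means there is nothing left to fetch.
            break
        offset += limit


# Fake data source standing in for the real query (assumption for the demo).
ALL_ROWS = [f"https://example.org/doc-{n}.pdf" for n in range(25)]


def fake_fetch_page(limit, offset):
    return ALL_ROWS[offset:offset + limit]


urls = list(iter_pages(fake_fetch_page, limit=10))
print(len(urls))
```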

There are probably much more efficient ways to write the query code, but this was a first pass :)

@notshi

notshi commented Apr 1, 2021

So we looked at optimising the query and this seems to work.

Here are the first 100,000 activities.
You'll need to run that query 10 times, changing the offset each time, to get all the data, as this looks through all 1 million activities in IATI.
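
For reference, the ten offsets work out like this, taking the rough figures above (about 1 million activities, 100,000 per page) as assumptions; `page_offsets` is just an illustrative helper name.

```python
def page_offsets(total_rows, page_size):
    """List the offset to use for each page needed to cover total_rows."""
    return list(range(0, total_rows, page_size))


offsets = page_offsets(1_000_000, 100_000)
print(len(offsets), offsets[0], offsets[-1])  # 10 0 900000
```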

Again, you don't have to run the query, just click on 'Download XSON' to download the data in various formats.
It seems the time to download varies depending on the data of that particular page.
