# How to warm the in‐memory cache
You need an up-to-date list of all datasets in a CSV file, `all_datasets.csv`.
The file should contain the list of URLs for every published dataset, which we are going to pass to the `curl` command. Successfully loading a dataset page saves that page's data in the cache.
For `curl` to load the list of URLs, the file `all_datasets.csv` must follow a specific format that `curl` can understand: each line should have the form `url = "http://example.com"`.
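For instance, the first few lines of `all_datasets.csv` might look like this (the dataset identifiers here are made up for illustration):

```
url = "https://gigadb.org/dataset/102001"
url = "https://gigadb.org/dataset/102000"
url = "https://gigadb.org/dataset/101999"
```

Despite the `.csv` extension, these lines are in `curl`'s config-file syntax — the format that `--config` expects — rather than a conventional comma-separated file.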
To speed up processing of the list of URLs, you can tell `curl` to parallelise the requests. This requires a recent version of `curl` that is not available on our EC2 instances, so you will have to run `curl` on your local machine.
Log in to the bastion server over SSH as the `centos` user, then export the database connection details as environment variables:

```
$ export $(cat db-env | xargs -L 1)
```
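That `export` idiom works when `db-env` contains one `KEY=VALUE` pair per line (with no spaces in the values). A minimal sketch with placeholder values — the real `db-env` on the bastion holds the actual credentials:

```shell
# Placeholder db-env with the variables the psql command below expects.
# All values here are illustrative only, not real connection details.
cat > /tmp/db-env <<'EOF'
PGUSER=gigadb_user
PGDATABASE=gigadb
PGHOST=localhost
PGPORT=5432
EOF

# `xargs -L 1` echoes each line, and `export` then turns each resulting
# KEY=VALUE word into an exported environment variable.
export $(cat /tmp/db-env | xargs -L 1)
echo "$PGUSER:$PGPORT"   # prints gigadb_user:5432
```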
Then create on the server the file listing all the URLs to load into the cache:

```
$ psql -U $PGUSER -d $PGDATABASE -h $PGHOST -p $PGPORT -af /home/centos/all_dataset_urls.sql > /tmp/all_datasets.csv
```
After logging out of the bastion server, download the generated file `/tmp/all_datasets.csv` to your local machine with `scp`:

```
my-mac$ scp -i <key-goes-here> username@host:/tmp/all_datasets.csv ~/Downloads/
```
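Before launching the long-running `curl` job, it can be worth a quick sanity check that every line of the downloaded file is in the `url = "..."` form `curl` expects. This check is a suggestion of mine rather than part of the original runbook; it is demonstrated here on a small sample file, but you would point it at `~/Downloads/all_datasets.csv`:

```shell
# Build a small sample in the expected curl config format
# (illustrative dataset identifiers).
cat > /tmp/sample_datasets.csv <<'EOF'
url = "https://gigadb.org/dataset/102001"
url = "https://gigadb.org/dataset/102000"
EOF

# Flag any line that does not match the `url = "..."` pattern.
if grep -vqE '^url = "https?://[^"]+"$' /tmp/sample_datasets.csv; then
  echo "malformed lines found:"
  grep -vnE '^url = "https?://[^"]+"$' /tmp/sample_datasets.csv
else
  echo "all lines OK"   # prints this for the sample above
fi
```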
On your local machine, run `curl` against the list of URLs:

```
my-mac$ time curl --silent --parallel --parallel-immediate --parallel-max 5 --config ~/Downloads/all_datasets.csv
```
In my experience, the command takes ~70 minutes to run on an M1 Pro MacBook Pro.
The file `/home/centos/all_dataset_urls.sql`, already present on the server, contains the SQL query that retrieves the list of published datasets in the format understood by `curl`, exported as CSV:

```sql
\copy (select 'url = "https://gigadb.org/dataset/' || identifier || '"' as url from dataset where upload_status = 'Published' order by identifier desc) TO STDOUT (format CSV, quote *);
```
As an alternative to the first part of the instructions, you can obtain the same CSV file by running the `select` part of the above query in the DBeaver or pgAdmin desktop apps and using their export functionality, but be careful not to include headers and quotes in the resulting file.
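If a desktop app's export does slip in a header row and CSV quoting, a quick post-processing step can recover the format `curl` needs. This is a sketch under assumed conditions — the input below mimics a quoted export with a header, and both file names are examples, not part of the original instructions:

```shell
# Mimic a quoted CSV export with a header row (illustrative content).
cat > /tmp/dbeaver_export.csv <<'EOF'
url
"url = ""https://gigadb.org/dataset/102001"""
"url = ""https://gigadb.org/dataset/102000"""
EOF

# Drop the header, strip the outer quotes, and collapse doubled quotes.
tail -n +2 /tmp/dbeaver_export.csv \
  | sed -e 's/^"//' -e 's/"$//' -e 's/""/"/g' \
  > /tmp/all_datasets_clean.csv

cat /tmp/all_datasets_clean.csv
# url = "https://gigadb.org/dataset/102001"
# url = "https://gigadb.org/dataset/102000"
```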
For the second part of the instructions, keep the value of `--parallel-max` low: too many concurrent requests would negatively impact the experience of real users.