How to warm the in-memory cache


Abstract

You need an up-to-date list of all published datasets in a CSV file, all_datasets.csv. The file contains one URL per published dataset; these URLs are passed to the curl command, and each dataset page that loads successfully has its data saved in the cache. For curl to load the list of URLs, all_datasets.csv must follow the config-file format that curl understands: each line has the form url = "http://example.com". To speed up the processing of the list, you can tell curl to parallelise it. This requires a recent version of curl (7.66.0 or newer) which is not available on our EC2 instances, so you will have to run curl on your local machine.
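
For example, the first few lines of all_datasets.csv look like this (the dataset identifiers shown are illustrative):

url = "https://gigadb.org/dataset/102999"
url = "https://gigadb.org/dataset/102998"
url = "https://gigadb.org/dataset/102997"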

Instructions

1) Make an up-to-date list of URLs to load in cache in the correct format

Log in to the bastion server over SSH as the centos user, then export the database connection details as environment variables by running

$ export $(cat db-env | xargs -L 1)
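
The db-env file is expected to contain one KEY=value pair per line, covering at least the connection variables used in the next command (the values below are placeholders):

PGUSER=gigadb
PGDATABASE=gigadb
PGHOST=db.example.internal
PGPORT=5432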

Then create on the server the file listing all the URLs to load into the cache:

$ psql -U $PGUSER -d $PGDATABASE -h $PGHOST -p $PGPORT -af /home/centos/all_dataset_urls.sql  > /tmp/all_datasets.csv
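
You can sanity-check the generated file before downloading it, for example:

$ wc -l /tmp/all_datasets.csv
$ head -n 3 /tmp/all_datasets.csv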

After logging out of the bastion server, download the generated file /tmp/all_datasets.csv to your local machine with scp:

my-mac$ scp -i <key-goes-here> username@host:/tmp/all_datasets.csv ~/Downloads/

2) Warm the cache
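
The --parallel family of options requires curl 7.66.0 or newer, so first check the version installed locally:

my-mac$ curl --version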

On your local machine, run curl with the downloaded list of URLs:

my-mac$ time curl --silent --parallel --parallel-immediate --parallel-max 5 --config ~/Downloads/all_datasets.csv

In my experience, the command takes ~70 minutes to run on an M1 Pro MacBook Pro.

Additional information

The file /home/centos/all_dataset_urls.sql, already present on the server, contains the SQL query that retrieves the list of published datasets in a format understood by curl, exported as CSV:

\copy (select 'url = "https://gigadb.org/dataset/' || identifier || '"' as url from dataset where upload_status = 'Published' order by identifier desc) TO STDOUT (format CSV, quote *);

As an alternative to the first part of the instructions, you can obtain the same CSV file by running the SELECT part of the above query in the DBeaver or pgAdmin desktop apps and using their export functionality, but be careful not to include headers or quotes in the resulting file. The SELECT is shown below for reference.
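
For reference, this is the standalone SELECT to run in DBeaver or pgAdmin (it is the same query embedded in the \copy command above):

select 'url = "https://gigadb.org/dataset/' || identifier || '"' as url
from dataset
where upload_status = 'Published'
order by identifier desc;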

In the second part of the instructions, keep the value of --parallel-max low; a high value would negatively impact the experience of real users.