This is the source repository for the site https://www.gbif.org/analytics.
GBIF captures various metrics to enable monitoring of data trends.
The development is done in an open manner, to enable others to verify procedures, contribute, or fork the project for their own purposes. The results are visible at https://www.gbif.org/analytics/global and show global and country-specific charts illustrating the changes observed in the GBIF index since 2007.
Please note that all samples of the index have been reprocessed with consistent quality control and against the same taxonomic backbone to enable comparisons over time. This is the first time this analysis has been possible, thanks to the adoption of the Hadoop environment at GBIF, which enables large-scale analysis. In total, approximately 32 billion records (to January 2021) are analysed for these reports.
The project is divided into several parts:
- Hive and Sqoop scripts that import historical data from archived MySQL database dumps
- Hive scripts that snapshotted data from the message-based real-time indexing system which served GBIF between late 2013 and Q3 2019
- Hive scripts that snapshot recent data from the latest GBIF infrastructure (the real-time indexing system currently serving GBIF)
- Hive scripts that process all data with the same quality control and taxonomic backbone
- Hive scripts that digest the data into specific views suitable for download from Hadoop and further processing
- R and Python scripts that process the data into views per country
- R and Python scripts that produce the static charts for each country
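As a rough orientation, these stages are driven by the `build.sh` script described further down. The sketch below only illustrates the order, using the phase flags that appear later in this README; the comments are an interpretation, not taken from the script itself.

```
./build.sh -interpretSnapshots   # reprocess every snapshot with the same quality control and backbone
./build.sh -summarizeSnapshots   # digest the data into Hive views suitable for download
./build.sh -downloadCsvs         # pull the master CSVs out of Hadoop
./build.sh -processCsvs          # derive the per-country/region CSVs and GeoTIFFs
./build.sh -makeFigures          # render the maps and figures for the website and reports
```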
The following steps are required to set up a new environment. It is probably easiest to use the Docker image.
- Install the yum packages `R`, `cairo` and `cairo-devel`
- Run `Rscript R/install-packages.R` (it may be necessary to set the `R_LIBS_USER` environment variable; see the sketch below)
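  A minimal sketch, assuming a user-writable library path (`~/R/library` is just an example location):

  ```
  # Assumption: ~/R/library is an example path; adjust to your environment.
  export R_LIBS_USER="$HOME/R/library"
  mkdir -p "$R_LIBS_USER"
  Rscript R/install-packages.R
  ```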
- This will only work on a Cloudera Manager managed gateway such as `c5gateway-vh`, on which you should be able to `sudo -i -u hdfs` and find the code in `/home/hdfs/analytics/` (do a `git pull`).
- Make sure the Hadoop libraries and binaries (e.g. `hive`) are on your path; a quick check is sketched below.
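  A minimal sanity check (the list of binaries is an assumption; add whatever else your run needs):

  ```
  # Confirm the tools used later in this README resolve on the gateway.
  for cmd in hive hadoop Rscript python; do
    command -v "$cmd" >/dev/null || echo "missing from PATH: $cmd"
  done
  ```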
- The snapshot name will be the date as `YYYYMMDD`, e.g. `20140923`.
- Create a new "raw" table from the HDFS table using `hive/import/hdfs/create_new_snapshot.sh`. Pass in the snapshot database, snapshot name, source Hive database and source Hive table, e.g. `cd hive/import/hdfs; ./create_new_snapshot.sh snapshot $(date +%Y%m%d) prod_h occurrence`
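  After the script has run, a rough check that the raw table appeared (the `snapshot` database name comes from the example above; the wildcard pattern is an assumption, since the exact table name depends on the script):

  ```
  SNAPSHOT=$(date +%Y%m%d)
  # Assumed check: look for a table containing the snapshot date in the snapshot database.
  hive -e "USE snapshot; SHOW TABLES LIKE '*${SNAPSHOT}*';"
  ```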
- Tell Matt he can run the backup script, which exports these snapshots to external storage.
- Add the new snapshot name to the `hive/normalize/build_raw_scripts.sh` script, in the `hdfs_v1_snapshots` array (see the sketch below). If the HDFS schema has changed, you'll have to add a new array called e.g. `hdfs_v2_snapshots` and add logic to process that array at the bottom of the script (another loop).
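  A sketch of the edit (the dates shown are placeholders, not the script's actual contents):

  ```
  # In hive/normalize/build_raw_scripts.sh: keep the existing entries and
  # append the new YYYYMMDD snapshot at the end of the array.
  hdfs_v1_snapshots=(20190916 20200101 20210705 20211001)
  ```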
- Add the new snapshot name to `hive/normalize/create_occurrence_tables.sh` in the same way as above.
- Add the new snapshot name to `hive/process/build_prepare_script.sh` in the same way as above.
- Replace the last element of `temporalFacetSnapshots` in `R/graph/utils.R` with your new snapshot, following the formatting in use, e.g. `2015-01-19`.
- Make sure the EPSG version used in the latest occurrence project `pom.xml` is the same as the one the script `hive/normalize/create_tmp_interp_tables.sh` fetches. Do that by checking the `pom.xml` (hopefully still at https://github.com/gbif/occurrence/blob/master/pom.xml) for the `geotools.version`. That version should match what is in the shell script (at the time of writing the `geotools.version` was 20.5 and the script line was `curl -L 'http://download.osgeo.org/webdav/geotools/org/geotools/gt-epsg-hsql/20.5/gt-epsg-hsql-20.5.jar' > /tmp/gt-epsg-hsql.jar`). A quick comparison is sketched below.
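  A rough way to compare the two versions (the raw GitHub URL is an assumption derived from the link above):

  ```
  # geotools.version declared in the occurrence project pom.xml
  curl -sL https://raw.githubusercontent.com/gbif/occurrence/master/pom.xml | grep -m1 'geotools.version'
  # version baked into the EPSG download line of the analytics script
  grep 'gt-epsg-hsql' hive/normalize/create_tmp_interp_tables.sh
  ```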
- Set up additional geocode services (e.g. using UAT or Dev, or duplicates running in prod). There need to be as many backend connections available as there will be tasks running in YARN.
- From the root (analytics) directory you can now run the `build.sh` script, which does all the HBase and Hive table building, builds all the master CSV files (which are in turn processed down to per-country/region CSVs and GeoTIFFs), and then generates the maps and figures needed for the website and the country reports. Note that this will take up to 48 hours and is unfortunately error prone, so all steps can also be run individually (an example is sketched below). In any case it is probably best to run all parts of this script on a machine in the secretariat, ideally in a `screen` session. To run it all do:

  ```
  screen -L -S analytics
  ./build.sh -interpretSnapshots -summarizeSnapshots -downloadCsvs -processCsvs -makeFigures
  ```

  (Detach from the screen with `^A d`, reattach with `screen -x`.)
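  If a phase fails, the steps can be run individually; assuming the flags can be passed one at a time, for example:

  ```
  # Re-run just the figure and map generation after fixing whatever broke.
  ./build.sh -makeFigures
  ```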
- rsync the CSVs, GeoTIFFs, figures and maps to `[email protected]:/var/www/html/analytics-files/` and check the output (this server is also used for gbif-dev.org). Note that `-n` makes these dry runs; drop it to do the actual transfer once the file list looks right:

  ```
  rsync -avn report/ [email protected]:/var/www/html/analytics-files/
  rsync -avn registry-report/ [email protected]:/var/www/html/analytics-files/registry/
  ```
- Check the download statistics are up to date (Nagios should be alerting if not, but see https://api.gbif.org/v1/occurrence/download/statistics/downloadedRecordsByDataset?fromDate=2023-03). If they are not, update them with https://github.com/gbif/registry/blob/master/populate_downloaded_records_statistics.sh
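  A quick manual spot check against the same endpoint (adjust `fromDate` to a recent month):

  ```
  # Recent months should appear in the response if the statistics are current.
  curl -Ss 'https://api.gbif.org/v1/occurrence/download/statistics/downloadedRecordsByDataset?fromDate=2023-03'
  ```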
- Generate the country reports. Check you are using the correct APIs! (Normally prod, but UAT analytics assets.) Instructions are in the country-reports project.
- rsync the reports to `[email protected]:/var/www/html/analytics-files/`:

  ```
  rsync -av country-report/ [email protected]:/var/www/html/analytics-files/country/
  ```
- Check https://www.gbif.org/analytics, write an email to [email protected] giving a heads-up on the new data, and accept the many accolades due your outstanding achievement in the field of excellence!
- Archive the new analytics. The old analytics files have been used several times by the communications team:

  ```
  cd /var/www/html/
  tar -cvJf /mnt/auto/analytics/archives/gbif_analytics_$(date +%Y-%m-01).tar.xz --exclude favicon.ico --exclude '*.pdf' analytics-files/[a-z]*
  # or at the start of the year, when the country reports have been generated:
  tar -cvJf /mnt/auto/analytics/archives/gbif_analytics_$(date +%Y-%m-01).tar.xz --exclude favicon.ico analytics-files/[a-z]*
  ```

  Then upload this file to Box.
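  Before uploading, it may be worth confirming the archive is readable, e.g.:

  ```
  # List the first entries of the archive as a sanity check.
  tar -tJf /mnt/auto/analytics/archives/gbif_analytics_$(date +%Y-%m-01).tar.xz | head
  ```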
- Copy only the CSVs and GeoTIFFs to the public web archive:

  ```
  rsync -rtv /var/www/html/analytics-files/[a-z]* /mnt/auto/analytics/files/$(date +%Y-%m-01) --exclude figure --exclude map --exclude '*.pdf' --exclude favicon.ico
  cd /var/www/html/analytics-files
  ln -s /mnt/auto/analytics/files/$(date +%Y-%m-01) .
  ```
- Verify the display of this at https://analytics-files.gbif.org/
The work presented here is not new and builds on ideas already published. In particular, the work of Javier Otegui, Arturo H. Ariño, María A. Encinas and Francisco Pando (https://doi.org/10.1371/journal.pone.0055144) was used as inspiration during the first development iteration, and Javier Otegui kindly provided a crash course in R to kickstart the development.