Data code


The data folder contains the code needed to extract data from FamilyMan on the Analytical Platform and output the FCSQ CSVs.

This is currently a work in progress. Domestic Violence has been completed, subject to a dual run on the AP once the databases have been updated. Adoption has been started.

Overview

  • The run_data.R script contains the code to run all the other scripts.
  • There are then up to 3 scripts for each case type: a main script, a script containing the SQL, and a script for any checks.
  • There is also currently a functions.R script whilst the processing is a work in progress. These functions will eventually be moved to the fcsrap package.
  • The derived_tables folder contains SQL code for replicating the extraction from the database, using the derived tables instead of the originals.
  • The lookups and temp folders are both for storing temporary files in the local area, either after they have been imported from the S3 bucket or before they are exported to it.

run_data

The code in run_data can be run to process all the data at once. There are some aspects which may need updating from quarter to quarter.

is_test

run_data.R starts by defining is_test. When is_test <- FALSE and the rest of the code is run, updated versions of the CSVs will be saved to the S3 bucket. When is_test <- TRUE and the rest of the code is run, updated versions of the CSVs will be created locally but not exported to the S3 bucket. This enables changes to be made and tested without overwriting the CSVs with incorrect data.
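A minimal sketch of how the flag gates the export (the data, file names and S3 path here are illustrative, not the real ones; csv_folder is defined in the Variables section below):

```r
is_test <- TRUE  # set to FALSE for a real publication run

dv_csv <- data.frame(year = 2021, count = 10)  # placeholder for processed data

readr::write_csv(dv_csv, "temp/dv.csv")  # always saved locally
if (!is_test) {
  # only exported to the S3 bucket on a real run
  s3tools::write_df_to_csv_in_s3(dv_csv, paste0(csv_folder, "/dv.csv"))
}
```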

Packages

The packages needed in any of the other scripts are loaded in run_data.R:

  • glue - for adding variables into the SQL text
  • s3tools - for accessing the S3 bucket
  • dplyr, dtplyr and data.table - for processing data
  • lubridate - for working with dates
  • readr - for reading in the lookups from the S3 bucket
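The corresponding library() calls in run_data.R would look like:

```r
library(glue)        # adding variables into the SQL text
library(s3tools)     # accessing the S3 bucket
library(dplyr)       # processing data
library(dtplyr)      # dplyr syntax with data.table underneath
library(data.table)  # processing data
library(lubridate)   # working with dates
library(readr)       # reading in the lookups from the S3 bucket
```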

Variables

The next section of run_data.R defines the variables used in the rest of the code. Some of these will need to be updated every quarter, whilst others should generally remain the same but may change occasionally. Illustrative definitions are sketched after the list.

  • snapshot_date - specifies the FamilyMan snapshot to be used when extracting data using Athena. When running for FCSQ, this will generally be the most recent date available.
  • exclude_year and exclude_quarter - the quarter that snapshot_date falls in will contain incomplete data, so it needs to be excluded.
  • year_cut_off - specifies the year before the data should start, e.g. if year_cut_off <- '2010' then the data extracted will start in 2011.
  • pub_year and pub_quarter - specify the year and quarter of publication, generally the quarter before the snapshot date. These are only used in naming the CSVs.
  • database - specifies the database to extract the data from
  • path_to_project - indicates the folder where all of the scripts are saved, used to shorten the path names needed elsewhere
  • csv_folder - specifies the folder in the S3 bucket where the CSVs should be saved to, used to shorten the path names needed elsewhere
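As an illustration only (every value below is made up), the definitions might look like:

```r
# Illustrative values - update each quarter as needed
snapshot_date   <- "2021-04-05"   # FamilyMan snapshot used by Athena
exclude_year    <- 2021           # the quarter containing the snapshot
exclude_quarter <- 2              #   holds incomplete data, so exclude it
year_cut_off    <- "2010"         # extracted data starts the following year
pub_year        <- 2021           # year and quarter of publication,
pub_quarter     <- 1              #   used only in naming the CSVs
database        <- "familyman_db" # database to extract the data from
path_to_project <- "~/fcsq/data"  # folder where the scripts are saved
csv_folder      <- "alpha-bucket/fcsq_csvs"  # S3 folder for the CSVs
```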

Lookups

Any lookups needed in the rest of the code are imported in run_data.R and saved in the lookups folder. Currently the lookups are the court lookup and the birth country lookup.
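A hedged example of importing a lookup (the bucket path and file names are assumed):

```r
# Illustrative path - the real bucket and file names may differ
court_lookup <- s3tools::read_using(
  FUN = readr::read_csv,
  s3_path = "alpha-family-data/lookups/court_lookup.csv"
)
readr::write_csv(court_lookup, "lookups/court_lookup.csv")  # local copy
```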

Scripts

The remainder of run_data.R contains code that calls the other scripts within the data folder.
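For example, the calls might look like this (adoption.R is an assumed name; dv.R is mentioned below):

```r
# Illustrative - one call per case type script
source(file.path(path_to_project, "dv.R"))
source(file.path(path_to_project, "adoption.R"))
```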

sql

Each case type area has a file saved as name_sql.R. Within this file, there are SQL queries which can be used to extract data from the Analytical Platform. The SQL is saved in a separate file from the other processing so that the queries can be called elsewhere, e.g. in a checks file.

The SQL is used to query the database using Athena. Athena uses Presto SQL, and the commands available within it are fairly limited; any additional processing is therefore generally carried out in R. To learn more about SQL, see W3 Schools or the DASD SQL training.

Each SQL query is saved as a text variable. The data is then extracted by calling this within dbtools::read_sql(). The glue() function allows other SQL queries, or other variables, to be called within an SQL query.
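For instance, a sketch of the pattern (the table, column names and the return_df_as argument are assumptions, not the real query):

```r
# Illustrative query - the real table and column names differ
dv_apps_sql <- glue("
  SELECT case_number, receipt_date
  FROM {database}.applications
  WHERE mojap_snapshot_date = DATE '{snapshot_date}'
")

dv_apps <- dbtools::read_sql(dv_apps_sql, return_df_as = "data.table")
```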

Derived tables

The SQL code saved in the data folder extracts the data from tables which are directly replicated from FamilyMan in Aramis. However, derived tables are also being developed to enable easier access to the data if it is shared more widely with researchers. It should be possible to recreate the Family Court Statistics data using these tables as an alternative source. The derived_tables folder will contain SQL code that replicates the data extraction of the other SQL queries, but using the derived tables.

There are some current issues that have been identified:

  • Domestic Violence is referred to as Domestic Abuse
  • Where one case has applications in different case types, for FCSQ we count the case in each of the case types separately. In the derived tables, each case is allocated only one case type. So, the case type column in the derived tables may not extract the same data as would be extracted for FCSQ.

Processing the data in R

The initial data is extracted from the FamilyMan database using SQL in Athena and saved in data.table format. The data is at event level (applications or orders), so part of the processing is to derive case starts and cases disposed. There is then further processing to create the necessary CSVs before these are exported to the Family S3 bucket.

Each case type area will have the calculations enclosed within a function once complete. This is to avoid the memory limit of R being reached. Currently, only the dv code has been completed as a function. Where there are CSVs containing data on more than one case type, the temp folder is used to store the partial CSVs created for each area. It will then be possible to create a separate script for combining these into the full CSV.
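A sketch of the intended pattern, with the body of the function entirely illustrative (the real calculations are in dv.R; dv_apps_sql is the assumed query variable from the sql section above):

```r
process_dv <- function() {
  # objects created here exist only inside the function,
  # so their memory is freed once it returns
  dv_apps <- dbtools::read_sql(dv_apps_sql, return_df_as = "data.table")
  dv_starts <- dv_apps %>%
    dtplyr::lazy_dt() %>%
    distinct(case_number, .keep_all = TRUE) %>%  # one row per case
    as.data.table()
  readr::write_csv(dv_starts, "temp/dv_starts.csv")  # partial CSV to temp
}
```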

Packages

Most of the processing is carried out using dtplyr. This uses the same function names as dplyr but with data.table carrying out the processing in the background. This was chosen to combine the readability of dplyr with the speed of data.table.

To begin using a data variable, the first function to apply is dtplyr::lazy_dt(). This can then be followed by a series of dplyr functions. To complete the calculation, the final function should be as.data.table() so that the output is again a data.table. Examples can be seen in dv.R and below.

Example:

```r
new_data <- old_data %>%
  dtplyr::lazy_dt() %>%
  filter(year > 2020) %>%
  as.data.table()
```
