Alzheimer’s Disease (AD) Biomarker Collection Package.
With sharp increases in AD cases, deaths, and costs straining the health care system and caregivers, several major AD data sources have been established to support research. For example, the BIOCARD study is a longitudinal, observational study initiated in 1995 and designed to identify biomarkers associated with progression from cognitively normal to mild cognitive impairment or dementia; the ADNI study is a multicenter observational study launched in 2004 to collect clinical, imaging, genetic, and biospecimen biomarkers from cohorts in different clinical states at baseline; and the NACC UDS data reflect total enrollment since 2005 across 34 AD Centers and include subjects with a range of cognitive statuses.
The ADMerge package provides a function, ad_merge(), that merges data from various AD data sources to create an analysis dataset. This package establishes AD data standards and data dictionaries that define the formats and organizational structures of AD data across multiple data sources. R functions are provided for data analysts to integrate data from multiple data sources and create their analysis datasets.
Use the following code to install the ADMerge package:
library(devtools)
install_github("Thewhey-Brian/ADMerge")
library(ADMerge)
For details on installing an R package directly from GitHub, see https://rdrr.io/cran/remotes/man/install_github.html.
To collect biomarkers for AD, local access to all biomarker files is required.
To help navigate the large amount of data in the ADNI dataset, we provide ADNI_Tools, which generates a reference dictionary of files, so that one can access detailed information about all files without downloading them.
Before merging biomarkers across different files, it is crucial to review the file structure using the function get_src_table().
src_table = get_src_table(path_to_biomarker_files)
The data structure table src_table will be one of the inputs to the main merging function.
Inputs:
- path: The path to the directory containing the data files.
- FILE_pattern: A regular expression pattern that specifies the file types to include in the source table. The default is ".xlsx|.xls|.csv".
- ID_pattern: A regular expression pattern that specifies the potential ID variables in the data files. The default is "ID".
- DATE_pattern: A regular expression pattern that specifies the potential DATE variables in the data files. The default is "DATE|VISITNO".
- IS_overlap_list: A list of logical values that specifies, when merging, whether overlapping between time windows is allowed (TRUE) or not (FALSE). The length of the list must be equal to the number of files being read. The default is NULL.
- WINDOW_list: A list of numeric time windows for matching the DATE variables. The length of the list must be equal to the number of files being read. The default is NULL.
- ID_usr_list: A list of user-specified ID variable names. If provided, the function will try to match the variable names to the potential ID variables in the data files. The default is NULL.
- DATE_usr_list: A list of user-specified DATE variable names. If provided, the function will try to match the variable names to the potential DATE variables in the data files. The default is NULL.
- file: A path to a file where the source table will be saved as a CSV file.
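To make the default patterns concrete, the following base R sketch shows how regular expressions like these select files and candidate ID and DATE columns. It is only an illustration: the file names and column names are hypothetical, and the use of grepl() and grep() here is an assumption, not necessarily how get_src_table() performs the matching internally.

```r
# Hypothetical file names in a biomarker directory (illustration only).
files <- c("CSF_file.csv", "IMAGE_file.csv", "notes.txt", "DIAGNOSIS_file.xlsx")

# Default patterns as documented in the input list above.
FILE_pattern <- ".xlsx|.xls|.csv"
ID_pattern   <- "ID"
DATE_pattern <- "DATE|VISITNO"

# Keep only the files whose names match the file pattern.
data_files <- files[grepl(FILE_pattern, files)]

# Hypothetical column names of one data file.
vars <- c("Phase", "ID", "RID", "SITEID", "USERDATE", "USERDATE2", "EXAMDATE", "ABETA")

# Columns whose names match the ID and DATE patterns.
id_candidates   <- grep(ID_pattern, vars, value = TRUE)
date_candidates <- grep(DATE_pattern, vars, value = TRUE)

print(data_files)
print(id_candidates)
print(date_candidates)
```

Note that in a regular expression an unescaped "." matches any character, and matching with grep() is case-sensitive by default, so a column named "subject_id" would not be picked up by the pattern "ID" in this sketch.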
Outputs: A table with the following structure:
file | VARS_in_file | ID_in_file | DATE_in_file | ID_for_merge | DATE_for_merge | IS_overlap | WINDOW |
---|---|---|---|---|---|---|---|
CSF_file.csv | Phase; ID; RID; SITEID; ... | ID; RID; SITEID | USERDATE; USERDATE2; EXAMDATE; | ID | EXAMDATE | FALSE | 366 |
IMAGE_file.csv | Phase; ID; RID; SITEID; ... | ID; RID; SITEID | USERDATE; USERDATE2; SCANDATE; | ID | SCANDATE | FALSE | 366 |
DIAGNOSIS_file.csv | Phase; ID; RID; SITEID; ... | ID; RID; SITEID | USERDATE; USERDATE2; | ID | USERDATE | FALSE | 366 |
There are two ways to modify the src_table generated by get_src_table():
- Run get_src_table() again with any of ID_usr_list, DATE_usr_list, IS_overlap_list, or WINDOW_list specified. Note: The length of each list must be equal to the number of files in src_table.
- Run get_src_table() again with file specified. This will save src_table as a CSV file to the local directory. One can edit this CSV file locally and pass it to the merging function later.
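The CSV round trip described in the second option can be sketched in base R as follows. The data frame below is a hand-built stand-in that mimics a few columns of the example table above; a real src_table would come from get_src_table(), and the specific edits (a 180-day window, RID as the merge ID) are hypothetical.

```r
# Stand-in for a source table (assumption: a real one comes from get_src_table()).
src_table <- data.frame(
  file           = c("CSF_file.csv", "IMAGE_file.csv", "DIAGNOSIS_file.csv"),
  ID_for_merge   = c("ID", "ID", "ID"),
  DATE_for_merge = c("EXAMDATE", "SCANDATE", "USERDATE"),
  IS_overlap     = c(FALSE, FALSE, FALSE),
  WINDOW         = c(366, 366, 366),
  stringsAsFactors = FALSE
)

# Save the table to a CSV file (a temporary file is used here).
csv_path <- tempfile(fileext = ".csv")
write.csv(src_table, csv_path, row.names = FALSE)

# Read it back, adjust the settings for individual files, and save again.
edited <- read.csv(csv_path, stringsAsFactors = FALSE)
edited$WINDOW[edited$file == "IMAGE_file.csv"] <- 180      # hypothetical narrower window
edited$ID_for_merge[edited$file == "CSF_file.csv"] <- "RID" # hypothetical alternative ID
write.csv(edited, csv_path, row.names = FALSE)

print(edited)
```

The edited table can then be supplied to the merging function, either as a data frame or through the path to the CSV file.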
The merging is performed by the function ad_merge().
ad_data = ad_merge("path_to_biomarker_files", DATE_type = "Date", dict_src = src_table)
Inputs:
- path: The path to the directory containing the data files.
- DATE_type: The type of DATE used in the data, either "Date" (e.g. 2017-1-16) or "Number" (e.g. 3 or m48 ...).
- dict_src: A data frame containing structural information about the input data files. The default is NULL. Fill in if src_table is modified and stored in the R environment.
- dict_src_path: The path to the src_table. The default is NULL. Fill in if src_table is modified locally through its CSV file.
- timeline_file: The name of the file containing the timeline for the data. Can be any value in the file column of src_table.
- timeline_path: The path to the timeline file. The default is NULL. This is an alternative to specifying timeline_file.
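The idea of matching records to a timeline within a time window can be illustrated with a small base R sketch. This is not the ad_merge() implementation: it assumes that the window is a maximum absolute gap in days between the timeline date and the record date, and that the closest record within the window is kept. Neither assumption is confirmed by this document. The sketch also does not model the IS_overlap setting, and all data values are made up.

```r
# Hypothetical timeline (e.g., diagnosis visits) and biomarker records.
timeline <- data.frame(
  ID       = c(1, 1, 2),
  USERDATE = as.Date(c("2017-01-16", "2019-11-01", "2017-06-30"))
)
csf <- data.frame(
  ID       = c(1, 1, 2),
  EXAMDATE = as.Date(c("2017-03-01", "2019-12-15", "2015-01-10")),
  ABETA    = c(850, 790, 910)
)
window_days <- 366  # assumed meaning: maximum absolute gap in days

# Pair every timeline visit with every record of the same subject.
pairs <- merge(timeline, csf, by = "ID")
pairs$gap <- abs(as.numeric(pairs$EXAMDATE - pairs$USERDATE))

# Keep pairs inside the window, then the closest record for each visit.
pairs <- pairs[pairs$gap <= window_days, ]
pairs <- pairs[order(pairs$ID, pairs$USERDATE, pairs$gap), ]
matched <- pairs[!duplicated(pairs[c("ID", "USERDATE")]), ]

# Left join so that visits without a match keep NA biomarker values.
result <- merge(timeline,
                matched[c("ID", "USERDATE", "EXAMDATE", "ABETA")],
                by = c("ID", "USERDATE"), all.x = TRUE)
print(result)
```

In this toy data, both visits of subject 1 find a record within the window, while the only record of subject 2 lies outside it, so that visit is kept with missing biomarker values.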
Outputs:
- analysis_data: The merged dataset with all the relevant biomarker information.
- dict_src: The src_table used for this merging.
S3 methods summary() and plot() are provided to summarize the merged analysis data.
summary(ad_data)
plot(ad_data, distn = "SCF_m1", group = "SEX")
There are several crucial inputs for the plotting function:
- distn: The name of the variable whose distribution is plotted.
- group: The name of the variable used to group and color the plot.
- baseline: A boolean indicating whether to include only the baseline data in the plot.