Merge pull request #6 from ACED-IDP/feature/data-model

Adds meta-data
ACED-IDP · Dec 15, 2023 · 86cb6c5
2 parents 71e5912 + 0b9b33d
commit 86cb6c5

Showing 7 changed files with 265 additions and 33 deletions.
diff --git a/docs/data-model/integration.md b/docs/data-model/integration.md
@@ -0,0 +1,112 @@
+
+# Integrating your data
+
+Converting tabular data (CSV, TSV, spreadsheet, database table) into FHIR (Fast Healthcare Interoperability Resources) involves several steps to map the data in the spreadsheet to FHIR's resource structure. Here is what you need to know to get started:
+
+As you upload files, you can tag them with identifiers, which by default creates a minimal, skeleton graph.
+
+You can retrieve that data using the gen3_util command line tool, and update the metadata to create a more complete graph representing your study.
+
+You may choose to work with the data in its "native" JSON format, or convert it to a tabular format for integration. The system will convert tabular data back to JSON for submission.
+
+The process of integrating your data into the graph involves several steps:
+
+* Step 1: Identify Data and FHIR Resources
+    * Inventory tabular data: Review the spreadsheet to understand the types of data it contains (e.g., patient demographics, lab results, medications).
+    * Understand FHIR Resources: Familiarize yourself with FHIR resources relevant to the data in your spreadsheet (e.g., Patient, Observation, Specimen, etc.).
+
+* Step 2: Mapping Spreadsheet Columns to FHIR Fields
+    * Analyze Columns: Map each column in the spreadsheet to the corresponding fields in FHIR resources. For instance, a column called biopsy_anatomical_location with the value "Prostate needle biopsies" would map to both Specimen.collection.method and Specimen.collection.bodySite.
+    * Handle Relationships: Identify how different pieces of data relate to each other and how they map to FHIR resource relationships (e.g., linking patients to their observations).
+
+* Step 3: Data Transformation and Structure
+    * Prepare Data: Ensure data consistency and format alignment. Dates, codes, and identifiers should comply with FHIR standards.
+    * Normalize Data: Split the spreadsheet data into FHIR-compliant resources.
+
+* Step 4: Utilize provided FHIR Tooling or Libraries
+    * FHIR Tooling: Use gen3_util [TODO]() and associated libraries to support data conversion and validation.
+    * Validation: Use gen3_util [TODO]() to validate the transformed data against FHIR specifications to ensure compliance and accuracy.
+
+* Step 5: Import into FHIR-Compatible System
+    * Load Data: Use gen3_util [TODO]() to load the transformed data into the ACED system.
+    * Testing and Verification: Use gen3_util [TODO]() to ensure your data appears correctly in the portal and analysis tools.
+
+* Step 6: Iterate and Refine
+    * Review and Refine: Check for any discrepancies or issues during the import process. Refine the conversion process as needed.
+    * Feedback Loop: Gather feedback from users or stakeholders to improve the mapping and conversion process.
+
+
+## Ontologies
+
+Ontologies within FHIR serve as a formal representation of concepts, their relationships, and properties within the healthcare domain. They provide a shared vocabulary and framework that enable consistent interpretation and exchange of healthcare data among different systems and entities.
+
+FHIR utilizes ontologies in various ways:
+
+* Terminology Binding: Ontologies help define and bind standardized terminologies to FHIR resources. This ensures that data elements, such as diagnoses or procedures, are uniformly understood across different studies or submissions.
+
+* Code Systems: FHIR employs standardized code systems (like SNOMED CT, LOINC, or RxNorm) within its resources. These code systems are essentially ontologies that define concepts and relationships, allowing for precise identification and categorization of medical information.
+
+* Mapping and Alignment: Ontologies assist in mapping data between different standards and formats. They facilitate the alignment of disparate data representations by providing a common reference point, making it easier to convert and interpret information accurately across systems.
+
+* Semantic Interoperability: By using ontologies, FHIR promotes semantic interoperability. This means that not only can systems exchange data but also understand the meaning behind the exchanged information, enhancing communication and reducing ambiguity in healthcare data exchange.
+
+* Consistency and Reusability: Ontologies establish a consistent and reusable framework for defining healthcare concepts. This consistency aids in data integration, analytics, and the development of applications or systems that can leverage shared knowledge.
+
+In essence, ontologies in FHIR serve as the backbone for standardization, enabling effective communication and interpretation of healthcare data among various stakeholders, systems, and applications.
+
+### Example: SNOMED CT
+
+The [Specimen resource in FHIR](https://hl7.org/fhir/specimen.html) represents a sample or specimen collected during a healthcare event and contains details about its origin, type, and processing.
+
+Mapping a [SNOMED body part](https://bioportal.bioontology.org/ontologies/SNOMEDCT?p=classes&conceptid=442083009) to a FHIR Specimen involves linking the anatomical or body site specified in SNOMED CT to the relevant information within a FHIR Specimen resource.
+
+<img src="/images/snomed-bodypart.png" width="100%">
+
+The mapping process typically involves several steps:
+
+* Identification of SNOMED CT Body Part: SNOMED CT contains a comprehensive hierarchy of anatomical structures and body parts. This could include specific codes representing organs, tissues, or body sites.
+
+* Mapping to FHIR Specimen: In FHIR, the Specimen resource includes fields like specimen type, collection details, container, and possibly body site information.
+
+* Matching Concepts: The SNOMED CT code representing the body part or anatomical site needs to be correlated with the relevant field(s) in the FHIR Specimen resource. For instance, the FHIR Specimen resource has a field called "collection.bodySite" that can be used to capture the anatomical location from which the specimen was obtained.
+
+## Identifiers
+
+Identifiers in FHIR references typically include the following components: [see](https://hl7.org/fhir/datatypes.html#Identifier)
+
+> A string, typically numeric or alphanumeric, that is associated with a single object or entity within a given system. Typically, identifiers are used to connect content in resources to external content available in other frameworks or protocols.
+
+System: Indicates the system or namespace to which the identifier belongs. By default the namespace is `http://aced-idp.org/<project-id>`.
+
+Value: The actual value of the identifier within the specified system. For instance, a lab controlled subject identifier or a specimen identifier.
+
+
+
+## References
+
+By using identifiers in references, FHIR ensures that data can be accurately linked, retrieved, and interpreted across different systems and contexts within the healthcare domain, promoting interoperability and consistency in data exchange. [see](https://hl7.org/fhir/references.html)
+
+> Many of the defined elements in a resource are references to other resources. Using these references, the resources combine to build a web of information about healthcare.
+
+
+## Key resources
+
+### ResearchStudy
+> A scientific study of nature that sometimes includes processes involved in health and disease. [see](https://hl7.org/fhir/researchstudy.html)
+
+### ResearchSubject
+> A ResearchSubject is a participant or object which is the recipient of investigative activities in a research study. [see](https://hl7.org/fhir/researchsubject.html)
+
+
+### Patient 
+> Demographics and other administrative information about an individual or animal receiving care or other health-related services. [see](https://hl7.org/fhir/patient.html)
+
+### Specimen
+
+> A sample to be used for analysis. [see](https://hl7.org/fhir/specimen.html)
+
+### DocumentReference
+> A reference to a document of any kind for any purpose. [see](https://hl7.org/fhir/documentreference.html)
+
+
+See the <a href="/workflows/metadata/">metadata workflow section</a> for more information on how to create and upload metadata.
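
Note on `docs/data-model/integration.md`: the column mapping in Steps 2 and 3, together with the Identifiers and References sections, can be sketched in Python. This is a minimal illustration under stated assumptions, not how `gen3_util` performs the conversion. The column names, the `row_to_specimen` helper, and the UUID-based `id` scheme are hypothetical. The SNOMED CT code shown is assumed to denote the prostate and should be verified in a SNOMED CT browser. The `bodySite` shape follows FHIR R4 (a CodeableConcept); in FHIR R5 it is a CodeableReference, so the coding would nest under a `concept` key.

```python
import json
import uuid

SNOMED_SYSTEM = "http://snomed.info/sct"  # canonical URI for SNOMED CT codings


def row_to_specimen(row: dict, project_id: str) -> dict:
    """Map one tabular row to a minimal, R4-style FHIR Specimen dict."""
    # Default identifier namespace described in the Identifiers section.
    system = f"http://aced-idp.org/{project_id}"
    # Hypothetical id scheme: a deterministic UUID derived from system + value.
    specimen_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{system}|{row['specimen_id']}"))
    patient_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{system}|{row['patient_id']}"))
    return {
        "resourceType": "Specimen",
        "id": specimen_id,
        # Identifier: the study-controlled value within the project namespace.
        "identifier": [
            {"use": "official", "system": system, "value": row["specimen_id"]}
        ],
        # Reference: the child (Specimen) points to its parent (Patient).
        "subject": {"reference": f"Patient/{patient_id}"},
        "collection": {
            "method": {"text": row["collection_method"]},
            "bodySite": {
                "coding": [
                    {
                        "system": SNOMED_SYSTEM,
                        "code": row["body_site_code"],
                        "display": row["body_site_display"],
                    }
                ]
            },
        },
    }


# Hypothetical row, after splitting a free-text column into method and body site.
row = {
    "specimen_id": "specimen-001",
    "patient_id": "patient-001",
    "collection_method": "Needle biopsy",
    "body_site_code": "41216001",  # assumed SNOMED CT code; verify before use
    "body_site_display": "Prostatic structure",
}
specimen = row_to_specimen(row, "aced-myproject")
ndjson_line = json.dumps(specimen)  # one resource per line, as in Specimen.ndjson
print(ndjson_line)
```

Deriving ids deterministically means re-running the conversion on the same row yields the same resource, which keeps references stable between submissions.
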
diff --git a/docs/data-model/introduction.md b/docs/data-model/introduction.md
@@ -0,0 +1,35 @@
+
+# FHIR for Research Analysts
+Given all of the intricacies of healthcare and experimental data, we use Fast Healthcare Interoperability Resources (FHIR) as a data model to ensure informaticians and analysts can concentrate on science, not data structures. This document introduces the model for research analysts and describes how an analyst can shape and query FHIR resources.
+
+## What is FHIR?
+
+Healthcare information is abundant but diverse and often siloed. FHIR is a standard for exchanging it, which lets research analysts navigate, aggregate, and interpret health data from different sources. This guide aims to give research analysts the knowledge and tools needed to use interoperable healthcare data for analysis and research in the context of ACED collaborations.
+
+## Graph Model
+
+FHIR has certain aspects that can align with graph-like structures or facilitate graph-based analysis:
+
+Resource Relationships: FHIR resources often have relationships with other resources. For instance, a Patient resource can be associated with multiple Observation resources, which in turn might be linked to Condition or Procedure resources. These relationships create a network-like structure, similar to a graph.
+
+References and Linkages: FHIR resources utilize references to establish connections between related entities. These references can be leveraged to create graph-like representations when modeling relationships between patients, specimens, observations, etc.
+
+### Example
+
+The following "file focused" example illustrates how ACED uses FHIR resources to represent a DocumentReference's ancestors within a study.
+
+Examine [resource](https://www.hl7.org/fhir/resource.html) definitions [here](http://www.hl7.org/fhir/resource.html):
+
+* Details on [uploaded files](https://aced-idp.github.io/workflows/upload/) are captured as [DocumentReference](http://www.hl7.org/fhir/documentreference.html)
+
+* DocumentReference.[subject](https://www.hl7.org/fhir/documentreference-definitions.html#DocumentReference.subject) indicates who or what the document is about:  
+  * Can simply point to the [ResearchStudy](https://hl7.org/fhir/researchstudy.html), to indicate the file is part of the study
+  * Can point to [Patient](https://hl7.org/fhir/patient.html), or [Specimen](https://hl7.org/fhir/specimen.html), to indicate the file is based on them
+* An [Observation](https://hl7.org/fhir/observation.html) can point to any entity    
+* A [Task](https://hl7.org/fhir/task.html), or [DiagnosticReport](https://hl7.org/fhir/diagnosticreport.html)  can provide [provenance](https://en.wikipedia.org/wiki/Provenance#Data_provenance) on how the file was created 
+
+Each resource has at least one study controlled [official](https://hl7.org/fhir/codesystem-identifier-use.html#identifier-use-official) [Identifier](https://hl7.org/fhir/datatypes.html#Identifier).  Child resources have [Reference](http://www.hl7.org/fhir/references.html) fields to point to their parent.
+
+
+<img src="/images/fhir-graph-model.png" width="100%">
+
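
Note on `docs/data-model/introduction.md`: the graph model described in that file, where child resources hold Reference fields pointing to their parent, can be walked with a few lines of Python. This is a sketch under simplifying assumptions: resources are plain dicts held in memory, only the `subject` reference is followed, and the `ancestors` helper and the sample ids are hypothetical.

```python
def ancestors(resource: dict, index: dict) -> list:
    """Follow `subject` references upward and return the chain of reference strings."""
    chain = []
    seen = set()  # guards against accidental reference cycles
    current = resource
    while current is not None:
        ref = current.get("subject", {}).get("reference")
        if ref is None or ref in seen:
            break
        chain.append(ref)
        seen.add(ref)
        current = index.get(ref)  # None when the target is not loaded
    return chain


# Hypothetical resources: a file derived from a specimen taken from a patient.
patient = {"resourceType": "Patient", "id": "p1"}
specimen = {"resourceType": "Specimen", "id": "s1", "subject": {"reference": "Patient/p1"}}
document = {
    "resourceType": "DocumentReference",
    "id": "d1",
    "subject": {"reference": "Specimen/s1"},
}

# Index resources by "ResourceType/id", the relative form used in references.
index = {f"{r['resourceType']}/{r['id']}": r for r in (patient, specimen, document)}

chain = ancestors(document, index)
print(chain)
```

A fuller traversal would also follow other reference fields, such as those on Observation or Task, but the lookup pattern stays the same.
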
diff --git a/docs/images/fhir-graph-model.png b/docs/images/fhir-graph-model.png
diff --git a/docs/images/snomed-bodypart.png b/docs/images/snomed-bodypart.png
diff --git a/docs/workflows/metadata.md b/docs/workflows/metadata.md
@@ -0,0 +1,67 @@
+# Creating and Uploading Metadata
+
+### Create Metadata
+
+Create basic, minimal metadata for the project:
+
+```sh
+gen3_util meta create /tmp/$PROJECT_ID
+
+ls -1 /tmp/$PROJECT_ID
+DocumentReference.ndjson
+Observation.ndjson
+Patient.ndjson
+ResearchStudy.ndjson
+ResearchSubject.ndjson
+Specimen.ndjson
+Task.ndjson
+```
+
+### Retrieve existing metadata
+Retrieve the existing metadata from the portal.
+
+```sh
+
+gen3_util meta cp
+
+TODO
+```
+
+### Integrate your data
+
+Convert the FHIR data to tabular form.
+
+```sh
+TODO
+```
+
+Convert the tabular data to FHIR.
+
+```sh
+TODO
+```
+
+Validate the data.
+
+```sh
+$ gen3_util meta validate --help
+Usage: gen3_util meta validate [OPTIONS] DIRECTORY
+
+  Validate FHIR data in DIRECTORY.
+
+```
+
+
+
+### Publish the Metadata
+
+```text
+# copy the metadata to the bucket and publish the metadata to the portal
+gen3_util meta publish /tmp/$PROJECT_ID
+```
+
+## View the Files
+
+This final step uploads the metadata associated with the project and makes the files visible on the [Explorer page](https://aced-idp.org/explorer).
+
+<a href="https://aced-idp.org/explorer">![Gen3 File Explorer](./explorer.png)</a>
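
Note on `docs/workflows/metadata.md`: the `.ndjson` files listed in that file hold one JSON resource per line. Before running `gen3_util meta validate`, a quick look at their contents can be sketched as follows. This is an illustration only: it is not a substitute for validation, it is not the tabular conversion left as TODO in that file, and the `summarize_ndjson` helper and the sample records are hypothetical.

```python
import io
import json


def summarize_ndjson(lines) -> list:
    """Return one summary row per non-empty NDJSON line."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        resource = json.loads(line)
        identifiers = resource.get("identifier", [])
        rows.append(
            {
                "line": line_number,
                "resourceType": resource.get("resourceType"),
                "id": resource.get("id"),
                # First identifier value, or None when the resource has none.
                "identifier": identifiers[0].get("value") if identifiers else None,
            }
        )
    return rows


# In-memory stand-in for a file such as Patient.ndjson.
sample = io.StringIO(
    '{"resourceType": "Patient", "id": "p1", "identifier": '
    '[{"system": "http://aced-idp.org/aced-myproject", "value": "patient-001"}]}\n'
    '{"resourceType": "Patient", "id": "p2"}\n'
)
rows = summarize_ndjson(sample)
for row in rows:
    print(row)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` object.
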
diff --git a/docs/workflows/upload.md b/docs/workflows/upload.md
@@ -13,21 +13,43 @@ This page will guide you through both steps using the `gen3_util` tool.
 
 ## Creating and Uploading Manifest
 
-First, set the `PROJECT_ID` environmental variable to that of your program and project (e.g. `aced-myproject`):
+Think of a 'manifest' for file uploads as a detailed packing list for items being sent in a shipment. 
+It's like a comprehensive index or catalog that describes all the files being uploaded, providing essential information about each file in a structured format.
+For each file in the manifest, the following information is collected:
+
+* Identification and Details
+  * `file_name` - the relative full path name of the file [required]
+  * `size` - the size of the file in bytes [filled automatically]
+  * `md5` - the md5 checksum of the file [filled automatically]
+  * `object_id` - the object id of the file in indexd [system generated] 
+  * `mime_type` - the mime type of the file [filled automatically] 
+  * `remote_path` - the path to the file when downloaded from the bucket [optional]
+  * meta-data - any identifiers that are associated with the file
+    * `project_id` - the project id of the file [required]
+    * `specimen_id` - the specimen id of the file [optional]
+    * `patient_id` - the patient id of the file [optional]
+    * `task_id` - the task id of the file [optional]
+    * `observation_id` - the observation id of the file [optional]
+
+
+
+Only a single manifest can exist for a project at any one time.
+
+Set the `PROJECT_ID` environment variable to that of your program and project (e.g. `aced-myproject`):
 
 ```sh
 export PROJECT_ID=aced-myproject
 ```
 
-### Upload a Single File
+### Add a single file to the manifest
 
 To upload a single file to the manifest, run:
 
 ```sh
 gen3_util files manifest put example-file.txt
 ```
 
-### Upload Multiple Files
+### Add multiple files to the manifest
 
 To upload multiple files to the manifest, use the `find` and `xargs` commands to send files the `gen3_util`. 
 
@@ -39,53 +61,45 @@ find example-directory -type f  | xargs -P 0 -I PATH gen3_util files manifest pu
 
 Note that we use `xargs` `-P 0` argument to run commands in parallel, greatly reducing the amount of time to add many files to a manifest.
 
-### Verify the Manifest
+### Verify the manifest
 
 ```sh
 gen3_util files manifest ls | grep file_name
 ```
-If incorrect, then delete the manifest `gen3_util files manifest rm` and then re-add all files using the `gen3_util files manifest put` command shown previously.
 
-### Upload the Manifest
+### Removing files from the manifest
 
 ```sh
-gen3_util files manifest upload
+gen3_util files manifest rm --object_id xxxx
 ```
 
-This command will upload all files defined in the newly create manifest to the S3 storage endpoints associated with the project.
-
-## Creating and Uploading Metadata
 
-### Create Metadata
-
-Create basic, minimal metadata for the project:
+### Upload the manifest
 
 ```sh
-gen3_util meta create /tmp/$PROJECT_ID
+gen3_util files manifest upload
 ```
 
-### Optional: Edit the Metadata
+This command will upload all files defined in the newly created manifest to the S3 storage endpoints associated with the project.
 
-```sh
-ls -1 /tmp/$PROJECT_ID
-DocumentReference.ndjson
-Observation.ndjson
-Patient.ndjson
-ResearchStudy.ndjson
-ResearchSubject.ndjson
-Specimen.ndjson
-Task.ndjson
-```
+By default, minimal metadata is created for each file, based on the project id and other identifiers. You can override this behavior with the `--meta_data` flag.
 
-### Publish the Metadata
+```commandline
+gen3_util files manifest upload --help
+Usage: gen3_util files manifest upload [OPTIONS]
 
-```text
-# copy the metadata to the bucket and publish the metadata to the portal
-gen3_util meta publish /tmp/$PROJECT_ID
-```
+  Upload to index and project bucket.  Uses local manifest, or manifest_path.
 
-## View the Files
+Options:
+  --project_id TEXT             Gen3 program-project authorization
+  --restricted_project_id TEXT  Gen3 program-project, additional authorization
+  --upload-path TEXT            gen3-client upload path  [default: .]
+  --duplicate_check             Update files records  [default: False]
+  --manifest_path TEXT          Provide your own manifest file.
+  --meta_data                   Generate and submit metadata.  [default: True]
+  --wait                        Wait for metadata completion.  [default: False]
+
+```
 
-This final step uploads the metadata associated with the project and makes the files visible on the [Explorer page](https://aced-idp.org/explorer).
+See the <a href="/workflows/metadata/">metadata workflow section</a> for more information on how to create and upload metadata.
 
-<a href="https://aced-idp.org/explorer">![Gen3 File Explorer](./explorer.png)</a>
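
Note on `docs/workflows/upload.md`: the manifest fields marked "filled automatically" (`size`, `md5`, `mime_type`) can all be derived from the file itself. The sketch below shows one way to compute them with the Python standard library. It illustrates what the fields mean and is not `gen3_util`'s implementation; the `describe_file` helper is hypothetical.

```python
import hashlib
import mimetypes
import os
import tempfile


def describe_file(path: str) -> dict:
    """Compute the automatically derivable manifest fields for one file."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            md5.update(chunk)
    mime_type, _encoding = mimetypes.guess_type(path)  # guessed from the extension
    return {
        "file_name": path,
        "size": os.path.getsize(path),  # size in bytes
        "md5": md5.hexdigest(),
        "mime_type": mime_type,
    }


# Demonstrate on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example-file.txt")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    entry = describe_file(path)

print(entry["size"], entry["md5"], entry["mime_type"])
```

Because the MIME type is guessed from the file extension, files with unusual or missing extensions yield `None` and may need the type set by hand.
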
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -16,6 +16,10 @@ nav:
   - workflows/creating-project.md
   - workflows/upload.md
   - workflows/download.md
+  - workflows/metadata.md
+- Meta-Data:
+    - data-model/introduction.md
+    - data-model/integration.md
 - Status Monitor ↗: https://aced-idp.github.io/status-monitor
 
 plugins: