update documentation some more

monarch-initiative · Apr 19, 2024 · 72bcd3a
1 parent 4f30a65
commit 72bcd3a
Showing 7 changed files with 218 additions and 204 deletions.
diff --git a/docs/Ingests/index.md b/docs/Ingests/index.md
@@ -0,0 +1,12 @@
+<sub>
+(For CLI usage, see the [CLI commands](./CLI.md) page.)
+</sub>  
+
+Koza is designed to process and transform existing data into a target csv/json/jsonl format.  
+
+This process is internally known as an **ingest**. Ingests are defined by:  
+
+1. [Source config yaml](./source_config.md): Ingest configuration, including:
+    -  metadata, formats, required columns, any SSSOM files, etc. 
+1. [Map config yaml](./mapping.md): (Optional) configures creation of mapping dictionary  
+1. [Transform code](./transform.md): a Python script, with specific transform instructions 
diff --git a/docs/Ingests/mapping.md b/docs/Ingests/mapping.md
@@ -0,0 +1,62 @@
+
+Mapping with Koza is optional, but can be done in two ways:  
+
+- Automated mapping with SSSOM files  
+- Manual mapping with a map config yaml
+
+### SSSOM Mapping
+
+Koza supports mapping with SSSOM (Simple Standard for Sharing Ontological Mappings) files.  
+In the `sssom_config` section of your source config, add the path to the SSSOM file, the desired target prefixes,  
+and any prefixes you want to use to filter the SSSOM file.  
+Koza will then build a mapping lookup table and use it to  
+attempt to map any values in the source file to an ID with a target prefix.
+
+```yaml
+sssom_config:
+    sssom_file: './path/to/your_mapping_file.sssom.tsv'
+    filter_prefixes: 
+        - 'SOMEPREFIX'
+        - 'OTHERPREFIX'
+    target_prefixes: 
+        - 'OTHERPREFIX'
+    use_match:
+        - 'exact'
+```
+
+**Note:** Currently, only the `exact` match type is supported (`narrow` and `broad` match types will be added in the future).
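
To make the lookup idea concrete, here is a minimal standalone sketch of how an exact-match table could be derived from an SSSOM-style TSV. It is not Koza's implementation: `prefix_of` and `build_lookup` are hypothetical helpers, the sample rows are invented, and the filtering rule (keep a row only when both its subject and object prefixes are listed in `filter_prefixes`) is an assumption made for illustration.

```python
import csv
import io

# Tiny in-memory SSSOM-style table (invented rows, standard SSSOM column names).
SSSOM_TSV = (
    "subject_id\tpredicate_id\tobject_id\n"
    "SOMEPREFIX:1\tskos:exactMatch\tOTHERPREFIX:100\n"
    "SOMEPREFIX:2\tskos:broadMatch\tOTHERPREFIX:200\n"
    "UNRELATED:3\tskos:exactMatch\tOTHERPREFIX:300\n"
)


def prefix_of(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def build_lookup(handle, filter_prefixes, target_prefixes):
    """Map IDs to their exact-match counterpart that carries a target prefix."""
    lookup = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["predicate_id"] != "skos:exactMatch":
            continue  # only exact matches, mirroring `use_match: exact`
        subject, obj = row["subject_id"], row["object_id"]
        # Assumed rule: both sides must use a prefix from filter_prefixes.
        if prefix_of(subject) not in filter_prefixes:
            continue
        if prefix_of(obj) not in filter_prefixes:
            continue
        # Record the mapping in whichever direction ends at a target prefix.
        if prefix_of(obj) in target_prefixes:
            lookup[subject] = obj
        if prefix_of(subject) in target_prefixes:
            lookup[obj] = subject
    return lookup


lookup = build_lookup(
    io.StringIO(SSSOM_TSV),
    filter_prefixes={"SOMEPREFIX", "OTHERPREFIX"},
    target_prefixes={"OTHERPREFIX"},
)
print(lookup)
```

With the sample rows, only the first one survives: the second is a broad match and the third uses a prefix that is not in the filter list.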
+
+### Manual Mapping / Additional Data
+
+The map config yaml allows you to include data from other sources in your ingests,  
+which may have different columns or formats.  
+
+If you don't have an SSSOM file, or you want to manually map some values, you can use a map config yaml.  
+You can then add this map to your source config yaml in the `depends_on` property.  
+
+Koza will then create a nested dictionary with the specified key and values.  
+For example, the following map config yaml maps values from the `STRING` column to the `entrez` and `NCBI taxid` columns.
+
+```yaml
+# koza/examples/maps/entrez-2-string.yaml
+name: ...
+files: ...
+
+columns:
+- 'NCBI taxid'
+- 'entrez'
+- 'STRING'
+
+key: 'STRING'
+
+values:
+- 'entrez'
+- 'NCBI taxid'
+```
+
+
+The mapping dict will be available in your transform script via the `koza_app` object's `get_map()` method (see the [Transform Code](./transform.md) page).
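
As a rough picture of the nested dictionary described above, the following sketch builds one from a small tab-separated sample using the same `key` and `values` as the example config. It only illustrates the resulting shape; `build_map` is a hypothetical helper, the sample data is invented, and Koza's own loading code may differ.

```python
import csv
import io

# Stand-in for the map file; the IDs below are made up for illustration.
MAP_TSV = (
    "NCBI taxid\tentrez\tSTRING\n"
    "9606\t1234\t9606.ENSP0001\n"
    "9606\t5678\t9606.ENSP0002\n"
)


def build_map(handle, key, values, delimiter="\t"):
    """Return {row[key]: {value_column: row[value_column], ...}} for every row."""
    return {
        row[key]: {column: row[column] for column in values}
        for row in csv.DictReader(handle, delimiter=delimiter)
    }


mapping = build_map(
    io.StringIO(MAP_TSV),
    key="STRING",
    values=["entrez", "NCBI taxid"],
)

# Outer key: the STRING ID. Inner keys: the configured value columns.
print(mapping["9606.ENSP0001"]["entrez"])
```

Each value of the `key` column becomes an outer key, and each listed `values` column becomes an inner key, so a lookup takes the form `mapping[string_id]['entrez']`.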
+
+---
+
+**Next Steps: [Transform Code](./transform.md)**
diff --git a/docs/Ingests/source_config.md b/docs/Ingests/source_config.md
@@ -0,0 +1,74 @@
+This YAML file sets properties for the ingest of a single file type from within a Source.
+
+!!! tip "Paths are relative to the directory from which you execute Koza."
+
+## Source Configuration Properties
+
+| **Required properties**              |                                                                                                     |
+| ------------------------------------ | --------------------------------------------------------------------------------------------------- |
+| `name`                               | Name of the source                                                                                  |
+| `files`                              | List of files to process                                                                            |
+|                                      |                                                                                                     |
+| **Optional properties**              |                                                                                                     |
+| `file_archive`                       | Path to a file archive containing the file(s) to process <br/> Supported archive formats: zip, gzip |
+| `format`                             | Format of the data file(s) (CSV or JSON)                                                            |
+| `sssom_config`                       | Configures usage of SSSOM mapping files                                                             |
+| `depends_on`                         | List of map config files to use                                                                     |
+| `metadata`                           | Metadata for the source, either a list of properties,<br/>or path to a `metadata.yaml`              |
+| `transform_code`                     | Path to a python file to transform the data                                                         |
+| `transform_mode`                     | How to process the transform file                                                                   |
+| `global_table`                       | Path to a global translation table file                                                             |
+| `local_table`                        | Path to a local translation table file                                                              |
+| `field_type_map`                     | Dict of field names and their type (using the FieldType enum)                                       |
+| `filters`                            | List of filters to apply                                                                            |
+| `json_path`                          | Path within JSON object containing data to process                                                  |
+| `required_properties`                | List of properties that must be present in output (JSON only)                                       |
+|                                      |                                                                                                     |
+| **Optional CSV Specific Properties** |                                                                                                     |
+| `columns`                            | List of columns to include in output (CSV only)                                                     |
+| `delimiter`                          | Delimiter for csv files                                                                             |
+| `header`                             | Header row index for csv files                                                                      |
+| `header_delimiter`                   | Delimiter for header in csv files                                                                   |
+| `header_prefix`                      | Prefix for header in csv files                                                                      |
+| `comment_char`                       | Comment character for csv files                                                                     |
+| `skip_blank_lines`                   | Skip blank lines in csv files                                                                       |
+
+## Metadata Properties
+
+Metadata is optional, and can be defined either as a list of properties and values or as a path to a `metadata.yaml` file
+(for example, `metadata: "./path/to/metadata.yaml"`).  
+Remember that the path is relative to the directory from which you execute Koza.
+
+| **Metadata Properties** |                                               |
+| ----------------------- | --------------------------------------------- |
+| id                      | TBD                                           |
+| name                    | If empty, uses source name                    |
+| ingest_title            | Title of source of data, maps to biolink name |
+| ingest_url              | URL to source of data, maps to biolink iri    |
+| description             | Description of ingest                         |
+| source                  | Source of data being transformed              |
+| provided_by             | TBD                                           |
+| rights                  | TBD                                           |
+
+### Composing Configuration from Multiple Yaml Files
+
+Koza's custom YAML Loader supports importing/including other yaml files with an `!include` tag.
+
+For example, if you had a file named `standard-columns.yaml`:
+
+```yaml
+- "column_1"
+- "column_2"
+- "column_3"
+- "column_4": "int"
+```
+
+Then, in any ingest that should use these columns, you can simply `!include` them:
+
+```yaml
+columns: !include "./path/to/standard-columns.yaml"
+```
+
+---
+
+**Next Steps: [Mapping and Additional Data](./mapping.md)**
diff --git a/docs/Ingests/transform.md b/docs/Ingests/transform.md
@@ -0,0 +1,59 @@
+This Python script is where you'll define the specific steps of your data transformation. 
+Koza will load this script and execute it for each row of data in your source file,  
+applying any filters and mapping as defined in your source config yaml,  
+and outputting the transformed data to the target csv/json/jsonl file.
+
+When Koza is called, either from the command line or as a library using `transform_source()`,  
+it creates a `KozaApp` object for the specified ingest.  
+This KozaApp will be your entry point to Koza:
+
+```python
+from koza.cli_runner import get_koza_app
+koza_app = get_koza_app('your-source-name')
+```
+
+The KozaApp object has the following methods:
+
+| Method | Description |
+| --- | --- |
+| `get_row()` | Returns the next row of data from the source file |
+| `get_map(map_name)` | Returns the mapping dict for the specified map |
+| `get_global_table()` | Returns the global translation table |
+| `get_local_table()` | Returns the local translation table |
+| `write()` | Writes the given transformed entities to the target file |
+
+??? tldr "Example Python Transform Script"
+
+    ```python
+    # other imports, e.g. uuid, pydantic, etc.
+    import uuid
+    from biolink.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction
+
+    # Koza imports
+    from koza.cli_runner import get_koza_app
+
+    # This is the name of the ingest you want to run
+    source_name = 'map-protein-links-detailed'
+    koza_app = get_koza_app(source_name)
+        
+    # If your ingest depends_on a mapping file, you can access it like this:
+    map_name = 'entrez-2-string'
+    koza_map = koza_app.get_map(map_name)
+
+    # This grabs the first/next row from the source data
+    # Koza will reload this script and return the next row until it reaches EOF or row-limit
+    while (row := koza_app.get_row()) is not None:
+        # Now you can lay out your actual transformations, and define your output:
+
+        gene_a = Gene(id='NCBIGene:' + koza_map[row['protein1']]['entrez'])
+        gene_b = Gene(id='NCBIGene:' + koza_map[row['protein2']]['entrez'])
+
+        pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
+            id="uuid:" + str(uuid.uuid1()),
+            subject=gene_a.id,
+            object=gene_b.id,
+            predicate="biolink:interacts_with"
+        )
+
+        # Finally, write the transformed row to the target file
+        koza_app.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)
+    ```
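
The example indexes `koza_map` directly, so a protein ID that is missing from the map would raise a `KeyError`. Below is a minimal sketch of a guarded lookup that skips such rows instead. It uses plain dicts in place of Koza's objects and assumes the map returned by `get_map()` behaves like a nested `dict`; `to_gene_id` is a hypothetical helper, not part of Koza.

```python
def to_gene_id(koza_map, protein_id, prefix="NCBIGene:"):
    """Return a prefixed gene ID, or None when the protein has no mapping."""
    entry = koza_map.get(protein_id)
    if entry is None or not entry.get("entrez"):
        return None
    return prefix + entry["entrez"]


# Plain dicts standing in for the mapping and the source rows (invented IDs).
koza_map = {
    "9606.ENSP0001": {"entrez": "1234"},
    "9606.ENSP0002": {"entrez": "5678"},
}
rows = [
    {"protein1": "9606.ENSP0001", "protein2": "9606.ENSP0002"},
    {"protein1": "9606.ENSP0001", "protein2": "9606.ENSP9999"},  # unmapped
]

pairs = []
for row in rows:
    gene_a = to_gene_id(koza_map, row["protein1"])
    gene_b = to_gene_id(koza_map, row["protein2"])
    if gene_a is None or gene_b is None:
        continue  # skip rows that cannot be fully mapped
    pairs.append((gene_a, gene_b))

print(pairs)
```

Whether to skip, log, or fail on unmapped IDs is a design choice for each ingest; skipping silently is shown here only because it is the shortest to illustrate.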