update config doc, allow using all extracted files
glass-ships committed Apr 22, 2024
1 parent 8bfd87f commit aafb61e
Showing 2 changed files with 34 additions and 35 deletions.
57 changes: 29 additions & 28 deletions docs/Ingests/source_config.md
@@ -4,34 +4,35 @@ This YAML file sets properties for the ingest of a single file type from a withi

## Source Configuration Properties

-| **Required properties** | |
-| ------------------------------------ | --------------------------------------------------------------------------------------------------- |
-| `name` | Name of the data ingest, as `<data source>_<type_of_ingest>`, <br/>ex. `hpoa_gene_to_disease` |
-| `files` | List of files to process |
-| | |
-| **Optional properties** | |
-| `file_archive` | Path to a file archive containing the file(s) to process <br/> Supported archive formats: zip, gzip |
-| `format` | Format of the data file(s) (CSV or JSON) |
-| `sssom_config` | Configures usage of SSSOM mapping files |
-| `depends_on` | List of map config files to use |
-| `metadata` | Metadata for the source, either a list of properties,<br/>or path to a `metadata.yaml` |
-| `transform_code` | Path to a python file to transform the data |
-| `transform_mode` | How to process the transform file |
-| `global_table` | Path to a global translation table file |
-| `local_table` | Path to a local translation table file |
-| `field_type_map` | Dict of field names and their type (using the FieldType enum) |
-| `filters` | List of filters to apply |
-| `json_path` | Path within JSON object containing data to process |
-| `required_properties` | List of properties that must be present in output (JSON only) |
-| | |
-| **Optional CSV-Specific Properties** | |
-| `columns` | List of columns to include in output (CSV only) |
-| `delimiter` | Delimiter for csv files |
-| `header` | Header row index for csv files |
-| `header_delimiter` | Delimiter for header in csv files |
-| `header_prefix` | Prefix for header in csv files |
-| `comment_char` | Comment character for csv files |
-| `skip_blank_lines` | Skip blank lines in csv files |
+| **Required properties** | |
+| --------------------------- | --------------------------------------------------------------------------------------------------- |
+| `name` | Name of the data ingest, as `<data source>_<type_of_ingest>`, <br/>ex. `hpoa_gene_to_disease` |
+| `files` | List of files to process |
+| | |
+| **Optional properties** | |
+| `file_archive` | Path to a file archive containing the file(s) to process <br/> Supported archive formats: zip, gzip |
+| `format` | Format of the data file(s) (CSV or JSON) |
+| `sssom_config` | Configures usage of SSSOM mapping files |
+| `depends_on` | List of map config files to use |
+| `metadata` | Metadata for the source, either a list of properties,<br/>or path to a `metadata.yaml` |
+| `transform_code` | Path to a python file to transform the data |
+| `transform_mode` | How to process the transform file |
+| `global_table` | Path to a global translation table file |
+| `local_table` | Path to a local translation table file |
+| `field_type_map` | Dict of field names and their type (using the FieldType enum) |
+| `filters` | List of filters to apply |
+| `json_path` | Path within JSON object containing data to process |
+| `required_properties` | List of properties that must be present in output (JSON only) |
+| | |
+| **CSV-Specific Properties** | |
+| `delimiter` | Delimiter for csv files (**Required for CSV format**) |
+| **Optional CSV Properties** | |
+| `columns` | List of columns to include in output (CSV only) |
+| `header` | Header row index for csv files |
+| `header_delimiter` | Delimiter for header in csv files |
+| `header_prefix` | Prefix for header in csv files |
+| `comment_char` | Comment character for csv files |
+| `skip_blank_lines` | Skip blank lines in csv files |

## Metadata Properties

12 changes: 5 additions & 7 deletions src/koza/model/config/source_config.py
@@ -148,7 +148,7 @@ class SourceConfig:
     required_properties: List[str] (optional) - list of properties which must be in json data files
     metadata: DatasetDescription (optional) - metadata for the source
     delimiter: str (optional) - delimiter for csv files
-    header: int (optional) - header row index
+    header: int (optional) - header row index (required if format is csv and header is not none)
     header_delimiter: str (optional) - delimiter for header in csv files
     header_prefix: str (optional) - prefix for header in csv files
     comment_char: str (optional) - comment character for csv files
@@ -192,12 +192,10 @@ def extract_archive(self):
                 archive.extractall(archive_path)
         else:
             raise ValueError("Error extracting archive. Supported archive types: .tar.gz, .zip")
-        files = [os.path.join(archive_path, file) for file in self.files]
-        # Possibly replace with this code if we want to extract all files? (not sure)
-        # if self.files:
-        # files = [os.path.join(archive_path, file) for file in self.files]
-        # else:
-        # files = [os.path.join(archive_path, file) for file in os.listdir(archive_path)]
+        if self.files:
+            files = [os.path.join(archive_path, file) for file in self.files]
+        else:
+            files = [os.path.join(archive_path, file) for file in os.listdir(archive_path)]
         return files
 
     def __post_init__(self):

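The fallback introduced in `extract_archive` can be illustrated outside the class. The following is a minimal standalone sketch, not the project's code: the helper name `select_extracted_files` is hypothetical, and the `sorted()` call is an addition made here so the result is deterministic (`os.listdir` returns entries in arbitrary order). Note that `os.listdir` is not recursive and also returns subdirectory names, so an archive with nested folders would yield directory paths rather than the files inside them.

```python
import os
import tempfile
import zipfile
from typing import List, Optional


def select_extracted_files(archive_path: str, files: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper mirroring the fallback in extract_archive.

    If `files` is a non-empty list, only those names are joined to the
    extraction directory. Otherwise every top-level entry found in the
    directory is used.
    """
    if files:
        names = files
    else:
        # Assumption for this sketch: sort so the order is stable.
        names = sorted(os.listdir(archive_path))
    return [os.path.join(archive_path, name) for name in names]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build a small zip archive with two members.
        archive = os.path.join(tmp, "data.zip")
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("a.tsv", "x\ty\n")
            zf.writestr("b.tsv", "x\ty\n")

        # Extract next to the archive, as the diff does with archive_path.
        out_dir = os.path.join(tmp, "data")
        with zipfile.ZipFile(archive, "r") as zf:
            zf.extractall(out_dir)

        print(select_extracted_files(out_dir))             # all extracted entries
        print(select_extracted_files(out_dir, ["b.tsv"]))  # only the listed file
```

An empty list is treated the same as no list at all, because the check is a truthiness test (`if files`) rather than a comparison against `None`.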