Release v23.6.1 · Sage-Bionetworks/schematic

Release notes

New Features and Enhancements

Update and insert (upsert) rows in Synapse tables. This feature allows piece-wise updates to a table in Synapse: a user only needs a csv manifest containing new or changed data/metadata. Given a manifest csv file and a dataset folder on Synapse, schematic finds the metadata table associated with that dataset folder. For each row in the manifest file, schematic checks whether the row is already present in the Synapse metadata table. If the row is present, schematic updates it with the values from the corresponding manifest row. If the row is not present, schematic inserts it as a new row in the Synapse metadata table. Instructions for using the upsert feature via the schematic CLI are here. Note: this feature works differently from the existing table replace option (the default table manipulation option) in schematic. A table replace substitutes the full content of an existing table with the content of a manifest csv file, which allows removing rows from the existing table. The upsert feature does not remove existing rows from the table. This feature does not affect users who only work with csv manifest files and do not store metadata in Synapse tables.
Added a parameter controlling whether to execute the validation rules that are part of the Great Expectations (GX) suite. GX is great, but some rules take a while to load and execute. This is undesirable in certain situations (e.g. a large number of data records that need to be validated in real time). A user can now turn off GX validation rules.
Standardized validation error format: previously, different types of data validation errors could differ in 'look and feel' as well as in structure. Validation error format and structure are now standardized, which allows users and client apps to process them reliably.
New REST API endpoints:
- Retrieving validation rules associated with an attribute in a data model schema: if a schema attribute has a validation rule specifying its type (e.g. int, string, etc.), this endpoint allows retrieving the validation rule and determining the type of the attribute via the schematic REST API. The endpoint retrieves any other validation rules associated with an attribute as well.
- Retrieving the display name associated with an attribute in a data model: aside from machine-friendly labels, attributes in data-model schemas have human-friendly names (aka display names); this endpoint allows retrieving the display name of an attribute given its label.
- Checking if an entity is within an asset view (aka fileview in Synapse): this is useful when a user is uncertain whether a dataset has been deleted; users can provide the dataset ID and schematic checks whether a dataset with this ID is present.

These endpoints can be accessed by running the schematic REST API locally or in a cloud deployment, using schematic v23.6.1 or later.
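As a rough illustration of the difference between the upsert and replace options described under the first feature above, here is a minimal Python sketch. It does not use schematic or Synapse: the in-memory row lists, the helper names `upsert_table` and `replace_table`, and the choice of `Id` as the matching key are all assumptions made for the example, not schematic's actual implementation.

```python
from typing import Dict, List

Row = Dict[str, str]


def replace_table(table: List[Row], manifest: List[Row]) -> List[Row]:
    """Replace: the manifest becomes the full table content.

    Rows that exist only in the old table are dropped.
    """
    return [dict(row) for row in manifest]


def upsert_table(table: List[Row], manifest: List[Row], key: str = "Id") -> List[Row]:
    """Upsert: update rows whose key matches, insert the rest, delete nothing.

    `key` is a hypothetical matching column chosen for this sketch.
    """
    # Copy existing rows so the input table is left untouched.
    merged = {row[key]: dict(row) for row in table}
    for row in manifest:
        if row[key] in merged:
            merged[row[key]].update(row)  # row already present: update values
        else:
            merged[row[key]] = dict(row)  # row not present: insert it
    return list(merged.values())


table = [
    {"Id": "a1", "Diagnosis": "healthy"},
    {"Id": "b2", "Diagnosis": "unknown"},
]
manifest = [
    {"Id": "b2", "Diagnosis": "cancer"},   # changed row
    {"Id": "c3", "Diagnosis": "healthy"},  # new row
]

upserted = upsert_table(table, manifest)
replaced = replace_table(table, manifest)
```

In this sketch the upserted result keeps row `a1`, updates `b2`, and adds `c3`, whereas the replaced result contains only the two manifest rows.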

Updated REST API web server: previously schematic used the default Flask web server. That was suitable for development, but unreliable for production deployments. The new schematic REST API server (uWSGI) remedies security and performance issues.

Performance improvements:

Loading manifest (or other) csv files now takes advantage of multiple processors, which speeds up loading of large files if the user's machine has multiple cores (the more cores, the larger the speed-up).
REST API calls are profiled and benchmarked against a standard set of inputs (e.g. data models, csv manifests, etc.).
Validation rules are benchmarked against a standard set of inputs (e.g. data models, csv manifests, etc.).

These benchmarks allow us to detect when the performance of a feature degrades (or improves) due to an update; they will also allow us to maintain performance guarantees in the future.
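The idea of benchmarking against a fixed set of inputs to catch regressions can be sketched as follows. This is a generic Python illustration and not schematic's benchmarking code: the helper names `benchmark` and `is_regression`, the 20% tolerance, and the stand-in workload are assumptions made for the example.

```python
import statistics
import time
from typing import Callable


def benchmark(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time, in seconds, of calling func."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def is_regression(measured: float, baseline: float, tolerance: float = 0.2) -> bool:
    """Flag a run that is slower than the baseline by more than `tolerance`.

    `tolerance` is a fraction of the baseline (0.2 means 20% slower).
    """
    return measured > baseline * (1.0 + tolerance)


def validate_standard_inputs() -> int:
    # Stand-in workload; a real benchmark would call the operation under
    # test (e.g. a validation rule) on a fixed, versioned set of inputs.
    return sum(i * i for i in range(10_000))


elapsed = benchmark(validate_standard_inputs)
```

Comparing `elapsed` with a stored baseline from an earlier release through `is_regression` is one simple way to turn such timings into a pass/fail check.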

Bug fixes:

Data template formatting: catching edge cases and ensuring column headers are aligned with column values; ensuring conditional formatting works as expected in both Excel and Google Sheets templates.
Ensuring properties of attributes in the data-model schema are properly loaded in schematic: the same property can be reused in multiple attributes (e.g. if the property represents the same concept: name, diagnosis); previously, a property would only be added to one schema attribute. This allows setting up data models for Relational Databases (RDB) where different tables may have columns with the same name (e.g. both the Patient and Biospecimen tables can have a column 'name').
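The property-reuse fix can be pictured with a small Python sketch. It is not schematic's loader: the `(property, attribute)` pair representation and both function names are assumptions, and the "single owner" version is only one plausible way the earlier behaviour could have arisen.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (property label, attribute that includes the property) -- hypothetical format
Edge = Tuple[str, str]


def load_single_owner(edges: List[Edge]) -> Dict[str, List[str]]:
    """Illustrative earlier behaviour: each property is kept on one attribute only."""
    owner: Dict[str, str] = {}
    for prop, attribute in edges:
        owner.setdefault(prop, attribute)  # later attributes are ignored
    result: Dict[str, List[str]] = defaultdict(list)
    for prop, attribute in owner.items():
        result[attribute].append(prop)
    return dict(result)


def load_all_owners(edges: List[Edge]) -> Dict[str, List[str]]:
    """Illustrative fixed behaviour: a property is added to every attribute that uses it."""
    result: Dict[str, List[str]] = defaultdict(list)
    for prop, attribute in edges:
        result[attribute].append(prop)
    return dict(result)


edges = [
    ("name", "Patient"),
    ("diagnosis", "Patient"),
    ("name", "Biospecimen"),
]

before = load_single_owner(edges)
after = load_all_owners(edges)
```

Here `before` has no entry for `Biospecimen`, while `after` lists `name` under both `Patient` and `Biospecimen`, which is what an RDB-style model with same-named columns needs.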

Security fixes:

Updated dependencies and hardened handling of access tokens, among other security and reliability fixes, allowing schematic to be deployed in secure production environments that handle PHI data.

Technical debt:

Code doesn't escape the 2nd law of thermodynamics. We put energy into refactoring the handling of validation rules and interactions with Synapse (so that adding features and avoiding bugs is easier later); catching errors and exceptions more robustly and specifically (so that users and clients know what's causing a problem and can handle, report, or fix it more effectively); and improving coverage of automated testing (so that we reduce the likelihood of letting bugs into released versions of schematic).

For more details on specific changes, please refer to the changelog below.

What's Changed

Skip api tests when rule combination tests are run by @GiaJordan in #1068
Added workflow to deploy schematic docker container in Github container registry by @linglp in #1062
Remove schematic support for Python v3.7 and v3.8 by @GiaJordan in #1090
Refactor table operations structure in asset store by @GiaJordan in #1069
Added input_token as a parameter for /manifest/get endpoint to fix credential issues when getting an existing manifest on AWS by @linglp in #1080
Fixed getProjectManifests function in synapse storage by @linglp in #1084
Develop api node display names by @mialy-defelice in #1094
Create API endpoint for get_node_validation_rules by @mialy-defelice in #1095
Update schematic dependencies by @GiaJordan in #1092
Raise errors for wrong schema errors by @GiaJordan in #1073
Set default of "table_manipulation" as "replace" in API endpoint when users enter None and updated tests by @linglp in #1115
Update synapseClient dependency and api for manifest table uploads by @GiaJordan in #1101
set pyopenssl = "^23.0.0" by @andrewelamb in #1125
added date GE rule by @andrewelamb in #1103
Implement table upsert feature by using schematic-db by @GiaJordan in #1081
Add use_schema_label parameter to manifest submission endpoint, separate manifest submission and table upsert tests by @GiaJordan in #1129
Delete GE checkpoint after completion of GE validation by @GiaJordan in #1136
Remove try: catch: block from manifest submission command function by @GiaJordan in #1130
Save all properties that are Included in the domain of a Class by @mialy-defelice in #1134
Display exceptions raised during validation with Great expectations, allow exclusion of upper bound OR lower bound for inRange rule by @GiaJordan in #1131
Update Documentation - python/package versions and POCs by @GiaJordan in #1139
Increase buffer size to a higher limit to deal with long token by @linglp in #1144
lock schematic-db to version 0.0.6 by @GiaJordan in #1145
use try: finally: to delete checkpoint even if running the checkpoint fails or errors out by @GiaJordan in #1155
Allowed CORS on given routes instead of all routes by @linglp in #1168
Added restrict rules param to manifest/validate by @linglp in #1178
Bug Fix: remedy negation of table manipulation specification by @GiaJordan in #1186
Added an endpoint to check entity type on Synapse and an endpoint to check if an entity is in the asset view by @linglp in #1078
add restrict rules control to manifest validate by @linglp in #1189
Added a parameter to control if GE gets used when using manifest/validate endpoint by @linglp in #1177
Propagate logger level entered in from command line to other schematic submodules by @GiaJordan in #1180
Add timing of validation operations to DEBUG log by @GiaJordan in #1181
Standardize validation error format and type by @GiaJordan in #1183
Added function to calculate and clear cache by @linglp in #1190
Develop add file only manifest submission option by @mialy-defelice in #1175
Create example data model for single rule benchmarking by @GiaJordan in #1193
Develop api tests for benchmarking single rule validation performance by @GiaJordan in #1184
Fixed column headers problem when generating an Excel spreadsheet for getting an existing manifest by @linglp in #1164
Do not index attribute visualization dataframe when converting to csv for component viz endpoint by @GiaJordan in #1196
Remove restrictions on rule number and allowed combinations by @GiaJordan in #1203
Start checking mypy and black by @andrewelamb in #1204
Add IsNA rule to validation suite | FDS-81 FDS-232 by @GiaJordan in #1200
Add API endpoint to visualize attributes for a specific component by @GiaJordan in #1195
Added functionality to download a manifest by using the manifest id by @linglp in #1192
Introduce workflow for API tests by @GiaJordan in #1208
Remove requirement for .synapseConfig file to use upsert feature by @GiaJordan in #1207
Parallelize operations in load_df by @GiaJordan in #1100
Id Column - fix bug preventing table updates by @GiaJordan in #1214
Id Column - fix bug where there would be two columns with uuid values by @GiaJordan in #1215
Merge AWS deployment branch to develop by @linglp in #1142
Id Column - fix bug preventing upserts by @GiaJordan in #1216
Downloaded manifests to temporary folders on AWS by @linglp in #1210
FDS-293 Fix column mismatch for Excel File Based Manifest Generation by @mialy-defelice in #1213
Feature fds 361 pylint by @andrewelamb in #1212
Changed workflow to run poetry version 1.2.0 by @linglp in #1222
remove breakpoint from command by @GiaJordan in #1224
Restrict typing-extensions package to versions before v4.6.0 by @GiaJordan in #1228
Make aws group dependencies optional and restrict typing_extensions package version by @GiaJordan in #1227
Change Uuid column name to Id by @GiaJordan in #1211
Avoided publishing minor releases to Docker Hub by @linglp in #1231
Remove lock on schematic_db version - FDS-287 by @GiaJordan in #1232
feat: added nginx to encrypt the communication between ALB and API by @linglp in #1202
Installed schematic in docker file by @linglp in #1230
fix: Fixed unit test in test_manifest by @linglp in #1233
Fix formatting issues when pulling data from Synapse to Excel Manifest by @mialy-defelice in #1234
fix: fixed tests in test_api.py by @linglp in #1225
Added test manifest by @linglp in #1238
[Bug fix]: Tried install pdoc again in github action workflow pdoc.yml by @linglp in #1239
Schematic Release v23.6.1 by @linglp in #1236

Full Changelog: v23.1.1...v23.6.1
