Better sheet utils #289

Merged: 110 commits merged into master on Oct 31, 2023
Conversation

@netsettler (Collaborator) commented Oct 20, 2023

  • New module bundle_utils.py that is intended for schema-respecting worksheets ("metadata bundle"). There are various modular bits of functionality here, but the main entry point is:

    • load_items to load data from a given table set, doing certain notational canonicalizations, and checking that things are in the appropriate format (see the sketch after this list).
  • In common.py, new hint types:

    • CsvReader
    • JsonSchema
    • Regexp
  • In lang_utils.py:

    • New argument just_are= to there_are to get verb conjugation without the details (see the lang_utils sketch after this list).

    • Add "while" to "which" and "that" as clause handlers in the string pluralizer (e.g., so that "error while parsing x" pluralizes as "errors while parsing x")

  • In misc_utils.py, miscellaneous new functionality:

    • New class AbstractVirtualApp: a base type that an actual VirtualApp satisfies, and that can also be used to make mocks when the thing being called expects an AbstractVirtualApp instead of a VirtualApp.

    • New function to_snake_case that assumes its argument is either a CamelCase string or snake_case string, and returns the snake_case form.

    • New function is_uuid (migrated from Fourfront)

    • New function pad_to

    • New class JsonLinesReader

  • In qa_checkers.py:

    • Change the VERSION_IS_BETA_PATTERN to recognize alpha or beta patterns. A rename would probably be better, but would also be incompatible. As far as I know, this is used only to avoid fussing when you haven't made a changelog entry for a beta (or now also an alpha).
  • New module sheet_utils.py for loading workbooks in a variety of formats, but without schema interpretation.

    A lot of this is implementation classes for each of the kinds of files, but the main entry point is intended to be load_table_set if you are not working with schemas. For schema-related support, see bundle_utils.py.

  • New module validation_utils.py with these facilities:

    • New class SchemaManager for managing a set of schemas so that programs asking for a schema by name download it only once and then use a cache. There are also facilities here for populating a dictionary with all schemas in a table set (the kind of thing returned by load_table_set in sheet_utils.py) in order to pre-process it as a metadata bundle for checking purposes.

    • New functions:

      • validate_data_against_schemas to validate that table sets (workbooks, or the equivalent) have rows in each tab conforming to the schema for that tab (see the sketch after this list).

      • summary_of_data_validation_errors to summarize the errors obtained from validate_data_against_schemas.
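
For orientation, here is a minimal sketch of how the main entry points described above are meant to fit together. The call forms and the filename are illustrative, inferred from this description rather than from documented signatures:

```python
# Illustrative only: exact signatures may differ from the real API.
from dcicutils.sheet_utils import load_table_set
from dcicutils.bundle_utils import load_items
from dcicutils.validation_utils import (
    validate_data_against_schemas,
    summary_of_data_validation_errors,
)

# Schema-free: each tab becomes a list of row dicts keyed by its header row.
table_set = load_table_set("my_metadata.xlsx")  # hypothetical filename

# Schema-respecting: the same data, with notational canonicalization and
# format checking applied.
items = load_items("my_metadata.xlsx")

# Check that each tab's rows conform to the schema for that tab, then
# reduce the resulting errors to a human-readable summary.
errors = validate_data_against_schemas(table_set)
print(summary_of_data_validation_errors(errors))
```

And a similarly hedged sketch of the lang_utils additions; the outputs in the comments are what the description implies, not captured output:

```python
from dcicutils.lang_utils import there_are, string_pluralize

# just_are=True asks there_are for the conjugated summary without
# enumerating the items themselves.
there_are(["bad header", "bad value"], kind="error", just_are=True)
# e.g., "There are 2 errors."

# "while" now joins "which" and "that" as a clause handler, so the noun
# before the clause is what gets pluralized.
string_pluralize("error while parsing x")  # "errors while parsing x"
```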

netsettler and others added 30 commits August 14, 2023 07:21
… not the workbook level artifact. Better handling of init args.
… ItemManager.load to take a tab_name argument so that CSV files can perhaps infer a type name.
@coveralls commented Oct 23, 2023

Pull Request Test Coverage Report for Build 6710266183

  • 866 of 967 (89.56%) changed or added relevant lines in 7 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.9%) to 78.585%

Changes Missing Coverage:

  File                            Covered Lines   Changed/Added Lines   %
  dcicutils/misc_utils.py         37              40                    92.5%
  dcicutils/validation_utils.py   110             122                   90.16%
  dcicutils/sheet_utils.py        350             391                   89.51%
  dcicutils/bundle_utils.py       351             396                   88.64%

Files with Coverage Reduction:

  File                            New Missed Lines   %
  dcicutils/lang_utils.py         2                  97.83%

Totals:

  Change from base Build 6422359319: +0.9%
  Covered Lines: 9332
  Relevant Lines: 11875

💛 - Coveralls

@willronchetti (Member) left a comment:

Some small comments I think should be answered before approval, but generally looks great.

Comment on lines 399 to 405
```python
copy_args['Tagging'] = tags
if self.kms_key_id:
    copy_args['ServerSideEncryption'] = 'aws:kms'
    copy_args['SSEKMSKeyId'] = self.kms_key_id
response = self.s3.copy_object(
    **copy_args, CopySource=copy_source
)
```
Member:

These deletions in glacier utils should be reverted I think, as KMS Key args are definitely needed. Not sure why these deletions are here.

netsettler (Author):

I merged David's branch.

Contributor:

Hm, not sure what this would have been about - I didn't make any changes related to glacier_utils.

```json
],
"type": "object",
"properties": {
    "static_headers": {
```
Member:

Same as with User, a lot of this schema stuff can probably be eliminated in favor of just testing representative cases.

netsettler (Author):

These are the result of actual calls to get_schema and are intended to be a stable way to test features. I don't see a useful way to be both stable and to test against real production data. I think it's a very bad idea to remove it because then you're not doing the only thing this is intended to do, which is test a non-contrived example.

"type": "string",
"lookup": 130,
"default": "US/Eastern",
"enum": [
Member:

Generally you can probably eliminate most of these fields aside from what's used?

```python
import pytest
import re

# from collections import namedtuple
```
Member:

Some commented-out imports you may want to clean up.

netsettler (Author):

I'll do that.

```python
    return value


def expand_string_escape_sequences(text: str) -> str:
```
Member:

Might be useful to docstring this since you're doing some fairly specific transformations

netsettler (Author):

Yeah, more doc strings are needed. This is a good one to make sure I do.
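
For instance, something along these lines (the transformation named here is assumed for illustration, not taken from the actual implementation):

```python
def expand_string_escape_sequences(text: str) -> str:
    r"""
    Expands backslash escape sequences in text, e.g., rewriting the
    two-character sequence '\t' as an actual tab character.

    (Hypothetical docstring: the real one should enumerate exactly the
    escapes the implementation handles.)
    """
    ...
```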

Comment on lines 139 to 141
```python
# Doug thinks we might want (metaclass=ABCMeta) here to make this an abstract base class.
# I am less certain but open to discussion. Among other things, as implemented now,
# the __init__ method here needs to run and the documentation says that ABC's won't appear
```
Member:

The main benefit of doing so, I believe, is that subclasses will throw errors on import if they do not implement the spec. Might be worth doing, but not strictly speaking necessary.

netsettler (Author):

It also requires a certain kind of hygiene that I didn't really want to enforce because it's more theoretical than practical.
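
For reference, a minimal sketch of the behavior under discussion, with illustrative class names: an ABC with an unimplemented @abstractmethod fails at instantiation time rather than at import or definition time.

```python
from abc import ABC, abstractmethod

class AbstractApp(ABC):  # illustrative stand-in, not the PR's actual class
    @abstractmethod
    def get(self, url: str):
        ...

class IncompleteApp(AbstractApp):  # defining this raises no error
    pass

IncompleteApp()  # TypeError: Can't instantiate abstract class IncompleteApp
```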

Comment on lines +350 to +359
```python
@classmethod
def _all_rows(cls, sheet: Worksheet):
    row_max = sheet.max_row
    for row in range(2, row_max + 1):  # row 1 is the header row
        yield row

@classmethod
def _all_cols(cls, sheet: Worksheet):
    col_max = sheet.max_column
    for col in range(1, col_max + 1):  # openpyxl columns are 1-indexed
        yield col
```
Member:

Both of these start past the initial index - you should mention why, and what the expected structure is. Functions with names like "all" here imply all to me, not necessarily "all minus headers"; in the case of columns I'm not really sure what you're cutting off - it all depends on the expected structure.

netsettler (Author):

There is always a header in this format, so this is all of the data columns and rows. If there is not a header, none of this will work. I could rename this to _all_data_rows but it will not eliminate the risk, which is not checkable and must simply be documented. We don't have a submission protocol for non-headered files. TableSets, when I document them, must have headers, just as dictionaries cannot have keyless entries.

netsettler (Author):

Well, in the case of using this tool for items, we can check that the headings are things in the schema. I don't think it does that now, but it now could. Data is unlikely to accidentally match. For tablesets, the lower-level abstraction, there is no such reference, so if you do:

```
1 2 3
4 5 6
7 8 9
```

you'll just get

```python
[
    {'1': '4', '2': '5', '3': '6'},
    {'1': '7', '2': '8', '3': '9'}
]
```

and the effect won't be data loss, "just" data misuse. The first row isn't skipped, it's used as a header.

netsettler (Author):

At least this is likely to cause an error downstream rather than just quietly losing a data item. :)
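
A sketch of the heading check floated above (a hypothetical helper, not something this PR implements): given a schema, unknown headings can be rejected before rows are interpreted.

```python
from typing import Dict, List

def check_headings_against_schema(headings: List[str], schema: Dict) -> None:
    """Raise if any heading is not a declared property of the JSON schema."""
    known = set(schema.get("properties", {}))
    unknown = [heading for heading in headings if heading not in known]
    if unknown:
        raise ValueError(f"Unknown heading(s): {', '.join(unknown)}")
```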

"type": data_type,
"item" if identifying_value else "unidentified": identifying_value if identifying_value else True,
"index": data_item_index,
"missing_properties": schema_validation_error.validator_value})
Contributor:

I discovered this is actually not quite right; the list of missing properties can be gotten like this (I can make this change later if you want) ... "missing_properties": list(set(schema_validation_error.validator_value) - set(schema_validation_error.instance))
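
In sketch form (names as in the snippet above): jsonschema's "required" ValidationError carries the schema's full required list in .validator_value and the offending dict in .instance, so the set difference is exactly the missing properties.

```python
# schema_validation_error: a jsonschema.exceptions.ValidationError from a
# "required" failure.
missing_properties = list(
    set(schema_validation_error.validator_value)  # all required property names
    - set(schema_validation_error.instance)       # keys actually present
)
```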

@netsettler netsettler changed the base branch from master to pyyaml-version-6-which-is-also-python311-and-sheet-utils October 23, 2023 17:04
@netsettler netsettler changed the base branch from pyyaml-version-6-which-is-also-python311-and-sheet-utils to python_3_11_with_sheet_utils October 23, 2023 19:28
@netsettler netsettler changed the base branch from python_3_11_with_sheet_utils to master October 23, 2023 19:34
netsettler and others added 3 commits October 31, 2023 13:06
…_fixes

Repairs to sheet_utils changes, addressing C4-1111 (and C4-1116)
@netsettler netsettler merged commit 2b92bc2 into master Oct 31, 2023
4 checks passed