Better sheet utils #289

Merged: 110 commits merged into master on Oct 31, 2023
Conversation

@netsettler (Collaborator) commented Oct 20, 2023

  • New module bundle_utils.py that is intended for schema-respecting worksheets ("metadata bundle"). There are various modular bits of functionality here, but the main entry point is:

    • load_items to load data from a given table set, doing certain notational canonicalizations, and checking that things are in the appropriate format (see the sketch after this list).
  • In common.py, new hint types:

    • CsvReader
    • JsonSchema
    • Regexp
  • In lang_utils.py:

    • New argument just_are= to there_are to get verb conjugation without the details (see the lang_utils sketch after this list).

    • Add "while" to "which" and "that" as clause handlers in the string pluralizer (e.g., so that "error while parsing x" pluralizes as "errors while parsing x")

  • In misc_utils.py, miscellaneous new functionality:

    • New class AbstractVirtualApp: a base type that an actual VirtualApp satisfies, and that can also be used to make mocks when the thing being called expects an AbstractVirtualApp instead of a VirtualApp.

    • New function to_snake_case that assumes its argument is either a CamelCase string or snake_case string, and returns the snake_case form.

    • New function is_uuid (migrated from Fourfront)

    • New function pad_to

    • New class JsonLinesReader

  • In qa_checkers.py:

    • Change the VERSION_IS_BETA_PATTERN to recognize alpha or beta patterns. A rename would probably be better, but would also be incompatible. As far as I know, this is used only to avoid fussing when you haven't made a changelog entry for a beta (or now also an alpha).
  • New module sheet_utils.py for loading workbooks in a variety of formats, but without schema interpretation.

    A lot of this is implementation classes for each of the kinds of files, but the main entry point is intended to be load_table_set if you are not working with schemas. For schema-related support, see bundle_utils.py.

  • New module validation_utils.py with these facilities:

    • New class SchemaManager for managing a set of schemas so that programs asking for a schema by name download it only once and then use a cache. There are also facilities here for populating a dictionary with all schemas in a table set (the kind of thing returned by load_table_set in sheet_utils.py) in order to pre-process it as a metadata bundle for checking purposes.

    • New functions:

      • validate_data_against_schemas to validate that table sets (workbooks, or the equivalent) have rows in each tab conforming to the schema for that tab (see the sketch after this list).

      • summary_of_data_validation_errors to summarize the errors obtained from validate_data_against_schemas.
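
For orientation, here is a minimal sketch of how the main entry points described above are meant to fit together. The call forms and the filename are illustrative, inferred from this description rather than from documented signatures:

```python
# Illustrative only: exact signatures may differ from the real API.
from dcicutils.sheet_utils import load_table_set
from dcicutils.bundle_utils import load_items
from dcicutils.validation_utils import (
    validate_data_against_schemas,
    summary_of_data_validation_errors,
)

# Schema-free: each tab becomes a list of row dicts keyed by its header row.
table_set = load_table_set("my_metadata.xlsx")  # hypothetical filename

# Schema-respecting: the same data, with notational canonicalization and
# format checking applied.
items = load_items("my_metadata.xlsx")

# Check that each tab's rows conform to the schema for that tab, then
# reduce the resulting errors to a human-readable summary.
errors = validate_data_against_schemas(table_set)
print(summary_of_data_validation_errors(errors))
```

And a similarly hedged sketch of the lang_utils additions; the outputs in the comments are what the description implies, not captured output:

```python
from dcicutils.lang_utils import there_are, string_pluralize

# just_are=True asks there_are for the conjugated summary without
# enumerating the items themselves.
there_are(["bad header", "bad value"], kind="error", just_are=True)
# e.g., "There are 2 errors."

# "while" now joins "which" and "that" as a clause handler, so the noun
# before the clause is what gets pluralized.
string_pluralize("error while parsing x")  # "errors while parsing x"
```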

netsettler and others added 30 commits August 14, 2023 07:21
… not the workbook level artifact. Better handling of init args.
… ItemManager.load to take a tab_name argument so that CSV files can perhaps infer a type name.
@coveralls commented Oct 23, 2023

Pull Request Test Coverage Report for Build 6710266183

  • 866 of 967 (89.56%) changed or added relevant lines in 7 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.9%) to 78.585%

Changes Missing Coverage:

  File                            Covered Lines   Changed/Added Lines   %
  dcicutils/misc_utils.py         37              40                    92.5%
  dcicutils/validation_utils.py   110             122                   90.16%
  dcicutils/sheet_utils.py        350             391                   89.51%
  dcicutils/bundle_utils.py       351             396                   88.64%

Files with Coverage Reduction:

  File                            New Missed Lines   %
  dcicutils/lang_utils.py         2                  97.83%

Totals:

  Change from base Build 6422359319: +0.9%
  Covered Lines: 9332
  Relevant Lines: 11875

💛 - Coveralls

@willronchetti (Member) left a comment:

Some small comments I think should be answered before approval, but generally looks great.

Comment on lines 399 to 405
```python
copy_args['Tagging'] = tags
if self.kms_key_id:
    copy_args['ServerSideEncryption'] = 'aws:kms'
    copy_args['SSEKMSKeyId'] = self.kms_key_id
response = self.s3.copy_object(
    **copy_args, CopySource=copy_source
)
```
Member:

These deletions in glacier utils should be reverted I think, as KMS Key args are definitely needed. Not sure why these deletions are here.

netsettler (Author):

I merged David's branch.

Contributor:

Hm, not sure what this would have been about - I didn't make any changes related to glacier_utils.

```json
],
"type": "object",
"properties": {
    "static_headers": {
```
Member:

Same as with User, a lot of this schema stuff can probably be eliminated in favor of just testing representative cases.

netsettler (Author):

These are the result of actual calls to get_schema and are intended to be a stable way to test features. I don't see a useful way to be both stable and to test against real production data. I think it's a very bad idea to remove it because then you're not doing the only thing this is intended to do, which is test a non-contrived example.

"type": "string",
"lookup": 130,
"default": "US/Eastern",
"enum": [
Member:

Generally you can probably eliminate most of these fields aside from what's used?

```python
import pytest
import re

# from collections import namedtuple
```
Member:

Some commented-out imports you may want to clean up.

netsettler (Author):

I'll do that.

```python
    return value


def expand_string_escape_sequences(text: str) -> str:
```
Member:

Might be useful to docstring this since you're doing some fairly specific transformations

netsettler (Author):

Yeah, more doc strings are needed. This is a good one to make sure I do.
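
For instance, something along these lines (the transformation named here is assumed for illustration, not taken from the actual implementation):

```python
def expand_string_escape_sequences(text: str) -> str:
    r"""
    Expands backslash escape sequences in text, e.g., rewriting the
    two-character sequence '\t' as an actual tab character.

    (Hypothetical docstring: the real one should enumerate exactly the
    escapes the implementation handles.)
    """
    ...
```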

Comment on lines 139 to 141
```python
# Doug thinks we might want (metaclass=ABCMeta) here to make this an abstract base class.
# I am less certain but open to discussion. Among other things, as implemented now,
# the __init__ method here needs to run and the documentation says that ABC's won't appear
```
Member:

The main benefit of doing so, I believe, is that subclasses will throw errors on import if they do not implement the spec. Might be worth doing, but not strictly speaking necessary.

netsettler (Author):

It also requires a certain kind of hygiene that I didn't really want to enforce because it's more theoretical than practical.
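
For reference, a minimal sketch of the behavior under discussion, with illustrative class names: an ABC with an unimplemented @abstractmethod fails at instantiation time rather than at import or definition time.

```python
from abc import ABC, abstractmethod

class AbstractApp(ABC):  # illustrative stand-in, not the PR's actual class
    @abstractmethod
    def get(self, url: str):
        ...

class IncompleteApp(AbstractApp):  # defining this raises no error
    pass

IncompleteApp()  # TypeError: Can't instantiate abstract class IncompleteApp
```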

Comment on lines +350 to +359
```python
@classmethod
def _all_rows(cls, sheet: Worksheet):
    row_max = sheet.max_row
    for row in range(2, row_max + 1):  # row 1 is the header row
        yield row

@classmethod
def _all_cols(cls, sheet: Worksheet):
    col_max = sheet.max_column
    for col in range(1, col_max + 1):  # openpyxl columns are 1-indexed
        yield col
```
Member:

Both of these start past the initial index - you should mention why, and what the expected structure is. Functions with names like "all" here imply all to me, not necessarily "all minus headers"; in the case of columns I'm not really sure what you're cutting off - it all depends on the expected structure.

netsettler (Author):

There is always a header in this format, so this is all of the data columns and rows. If there is not a header, none of this will work. I could rename this to _all_data_rows but it will not eliminate the risk, which is not checkable and must simply be documented. We don't have a submission protocol for non-headered files. TableSets, when I document them, must have headers, just as dictionaries cannot have keyless entries.

netsettler (Author):

Well, in the case of using this tool for items, we can check that the headings are things in the schema. I don't think it does that now, but it now could. Data is unlikely to accidentally match. For tablesets, the lower-level abstraction, there is no such reference, so if you do:

```
1 2 3
4 5 6
7 8 9
```

you'll just get

```python
[
    {'1': '4', '2': '5', '3': '6'},
    {'1': '7', '2': '8', '3': '9'}
]
```

and the effect won't be data loss, "just" data misuse. The first row isn't skipped, it's used as a header.

netsettler (Author):

At least this is likely to cause an error downstream rather than just quietly losing a data item. :)
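
A sketch of the heading check floated above (a hypothetical helper, not something this PR implements): given a schema, unknown headings can be rejected before rows are interpreted.

```python
from typing import Dict, List

def check_headings_against_schema(headings: List[str], schema: Dict) -> None:
    """Raise if any heading is not a declared property of the JSON schema."""
    known = set(schema.get("properties", {}))
    unknown = [heading for heading in headings if heading not in known]
    if unknown:
        raise ValueError(f"Unknown heading(s): {', '.join(unknown)}")
```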

"type": data_type,
"item" if identifying_value else "unidentified": identifying_value if identifying_value else True,
"index": data_item_index,
"missing_properties": schema_validation_error.validator_value})
Contributor:

I discovered this is actually not quite right; the list of missing properties can be gotten like this (I can make this change later if you want) ... "missing_properties": list(set(schema_validation_error.validator_value) - set(schema_validation_error.instance))
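
In sketch form (names as in the snippet above): jsonschema's "required" ValidationError carries the schema's full required list in .validator_value and the offending dict in .instance, so the set difference is exactly the missing properties.

```python
# schema_validation_error: a jsonschema.exceptions.ValidationError from a
# "required" failure.
missing_properties = list(
    set(schema_validation_error.validator_value)  # all required property names
    - set(schema_validation_error.instance)       # keys actually present
)
```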

@netsettler netsettler changed the base branch from master to pyyaml-version-6-which-is-also-python311-and-sheet-utils October 23, 2023 17:04
@netsettler netsettler changed the base branch from pyyaml-version-6-which-is-also-python311-and-sheet-utils to python_3_11_with_sheet_utils October 23, 2023 19:28
@netsettler netsettler changed the base branch from python_3_11_with_sheet_utils to master October 23, 2023 19:34
netsettler and others added 3 commits October 31, 2023 13:06
…_fixes

Repairs to sheet_utils changes, addressing C4-1111 (and C4-1116)
@netsettler netsettler merged commit 2b92bc2 into master Oct 31, 2023
4 checks passed