Better sheet utils #289
Conversation
…er_agent, for example
… not the workbook level artifact. Better handling of init args.
… ItemManager.load to take a tab_name argument so that CSV files can perhaps infer a type name.
Co-authored-by: drio18 <[email protected]>
Pull Request Test Coverage Report for Build 6710266183
💛 - Coveralls
Some small comments I think should be answered before approval, but generally this looks great.
dcicutils/glacier_utils.py
Outdated
copy_args['Tagging'] = tags
if self.kms_key_id:
    copy_args['ServerSideEncryption'] = 'aws:kms'
    copy_args['SSEKMSKeyId'] = self.kms_key_id
response = self.s3.copy_object(
    **copy_args, CopySource=copy_source
)
I think these deletions in glacier_utils should be reverted, as the KMS key args are definitely needed. Not sure why these deletions are here.
I merged David's branch.
Hm, not sure what this would have been about; I didn't make any changes related to glacier_utils.
],
"type": "object",
"properties": {
    "static_headers": {
Same as with User, a lot of this schema stuff can probably be eliminated in favor of just testing representative cases.
These are the results of actual calls to get_schema and are intended to be a stable way to test features. I don't see a useful way to be both stable and to test against real production data. I think it's a very bad idea to remove this, because then you're not doing the only thing it is intended to do, which is to test a non-contrived example.
"type": "string", | ||
"lookup": 130, | ||
"default": "US/Eastern", | ||
"enum": [ |
Generally you can probably eliminate most of these fields aside from what's used?
test/test_bundle_utils.py
Outdated
import pytest
import re

# from collections import namedtuple
Some commented-out imports you may want to clean up.
I'll do that.
    return value


def expand_string_escape_sequences(text: str) -> str:
Might be useful to docstring this, since you're doing some fairly specific transformations.
Yeah, more docstrings are needed. This is a good one to make sure I do.
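A sketch of the kind of docstring being requested here; the function name is from the PR, but the described behavior is inferred from that name and is an assumption, not taken from the diff:

def expand_string_escape_sequences(text: str) -> str:
    r"""
    Expand backslash escape sequences appearing in text.

    For example (assumed behavior), the two-character sequence
    backslash-t would be replaced by a literal tab character.
    """
    ...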
dcicutils/sheet_utils.py
Outdated
# Doug thinks we might want (metaclass=ABCMeta) here to make this an abstract base class.
# I am less certain but open to discussion. Among other things, as implemented now,
# the __init__ method here needs to run and the documentation says that ABC's won't appear
The main benefit of doing so is that subclasses will, I believe, throw errors on import if they do not implement the spec. Might be worth doing, but not strictly speaking necessary.
It also requires a certain kind of hygiene that I didn't really want to enforce because it's more theoretical than practical.
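For reference, a minimal sketch of the tradeoff being discussed (class names here are hypothetical): with ABCMeta, a subclass that fails to implement an abstract method does raise TypeError, though at instantiation time rather than import time.

from abc import ABC, abstractmethod

class AbstractTableReader(ABC):
    @abstractmethod
    def load_content(self):
        ...

class IncompleteReader(AbstractTableReader):  # defining/importing this is fine
    pass

IncompleteReader()  # TypeError: Can't instantiate abstract class IncompleteReader ...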
def _all_rows(cls, sheet: Worksheet):
    row_max = sheet.max_row
    for row in range(2, row_max + 1):
        yield row

@classmethod
def _all_cols(cls, sheet: Worksheet):
    col_max = sheet.max_column
    for col in range(1, col_max + 1):
        yield col
Both of these start past an initial index; you should mention why and what the expected structure is. Functions with names like "all" imply all to me, not necessarily "all minus headers", and in the case of columns I'm not really sure what you're cutting off; it all depends on the expected structure.
There is always a header in this format, so this is all of the data columns and rows. If there is not a header, none of this will work. I could rename this to _all_data_rows, but that will not eliminate the risk, which is not checkable and must simply be documented. We don't have a submission protocol for non-headered files. TableSets, when I document them, must have headers, just as dictionaries cannot have keyless entries.
Well, in the case of using this tool for items, we can check that the headings are things in the schema. I don't think it does that now, but it could. Data is unlikely to accidentally match. For table sets, the lower-level abstraction, there is no such reference, so if you do:
1 2 3
4 5 6
7 8 9
you'll just get
[
    {'1': '4', '2': '5', '3': '6'},
    {'1': '7', '2': '8', '3': '9'}
]
and the effect won't be data loss, "just" data misuse. The first row isn't skipped; it's used as a header.
At least this is likely to cause an error downstream rather than just quietly losing a data item. :)
"type": data_type, | ||
"item" if identifying_value else "unidentified": identifying_value if identifying_value else True, | ||
"index": data_item_index, | ||
"missing_properties": schema_validation_error.validator_value}) |
I discovered this is actually not quite right; the list of missing properties can be gotten like this (I can make this change later if you want):
"missing_properties": list(set(schema_validation_error.validator_value) - set(schema_validation_error.instance))
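A minimal sketch of the point above, assuming the errors come from the jsonschema library (whose ValidationError exposes the validator_value and instance attributes used here): for a failed "required" check, validator_value is the schema's entire "required" list, so the set difference against the instance's keys is what yields the properties actually missing.

from jsonschema import ValidationError, validate

schema = {"type": "object", "required": ["uuid", "name", "email"]}

try:
    validate({"name": "Alice"}, schema)
except ValidationError as e:
    print(e.validator_value)                         # ['uuid', 'name', 'email']
    print(set(e.validator_value) - set(e.instance))  # {'uuid', 'email'} (order may vary)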
…oined_list and .disjoined_list.
…_fixes Repairs to sheet_utils changes, addressing C4-1111 (and C4-1116)
New module bundle_utils.py, intended for schema-respecting worksheets ("metadata bundles"). There are various modular bits of functionality here, but the main entry point is load_items, which loads data from a given table set, performs certain notational canonicalizations, and checks that things are in the appropriate format.
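A hypothetical usage sketch of this entry point; the input filename is invented, and calling conventions beyond the filename argument are assumptions for illustration, not confirmed by this PR:

from dcicutils.bundle_utils import load_items

# Load a metadata bundle from a workbook; each tab is interpreted
# against the schema for the item type the tab name suggests.
items = load_items("my_metadata_bundle.xlsx")  # hypothetical input file
print(items)  # e.g., {'User': [{...}, ...], 'Institution': [{...}, ...]}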
In common.py, new hint types: CsvReader, JsonSchema, Regexp.
In lang_utils.py:
- New argument just_are= to there_are, to get verb conjugation without the details.
- Add "while" to "which" and "that" as clause handlers in the string pluralizer (e.g., so that "error while parsing x" pluralizes as "errors while parsing x").
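An illustrative sketch of these lang_utils additions; the exact output strings are approximations implied by the description, not verified against the library:

from dcicutils.lang_utils import string_pluralize, there_are

# "while" now acts as a clause handler, so only the head noun pluralizes.
print(string_pluralize("error while parsing x"))  # "errors while parsing x"

# there_are conjugates the verb to match the number of items.
print(there_are(["a.csv", "b.csv"], kind="file"))  # e.g., "There are 2 files: a.csv, b.csv"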
In misc_utils.py, miscellaneous new functionality:
- New class AbstractVirtualApp that is either an actual VirtualApp or can be used to make mocks if the thing being called expects an AbstractVirtualApp instead of a VirtualApp.
- New function to_snake_case that assumes its argument is either a CamelCase string or a snake_case string and returns the snake_case form (see the sketch after this list).
- New function is_uuid (migrated from Fourfront).
- New function pad_to.
- New class JsonLinesReader.
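A quick illustration of to_snake_case as described above; the outputs shown are what the description implies rather than independently verified:

from dcicutils.misc_utils import to_snake_case

print(to_snake_case("MetadataBundle"))   # 'metadata_bundle'
print(to_snake_case("metadata_bundle"))  # 'metadata_bundle' (already snake_case)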
In qa_checkers.py:
- Adjusted VERSION_IS_BETA_PATTERN to recognize alpha or beta patterns. Probably a rename would be better, but also incompatible. As far as I know, this is used only to not fuss if you haven't made a changelog entry for a beta (or now also an alpha).
New module sheet_utils.py for loading workbooks in a variety of formats, but without schema interpretation. A lot of this is implementation classes for each of the kinds of files, but the main entry point is intended to be load_table_set if you are not working with schemas. For schema-related support, see bundle_utils.py.
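A hypothetical usage sketch of load_table_set; the input filename is invented, and the return shape shown in the comment is an assumption based on the review discussion above:

from dcicutils.sheet_utils import load_table_set

# Load raw tab data with no schema interpretation. Per the review thread
# about _all_rows, the first row of each tab is treated as a header that
# names the columns.
table_set = load_table_set("my_workbook.xlsx")  # hypothetical input file
# e.g., {'Sheet1': [{'name': 'foo', 'size': '3'}, ...], 'Sheet2': [...]}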
New module validation_utils.py with these facilities:
- New class SchemaManager for managing a set of schemas, so that programs asking for a schema by name download it only once and then use a cache. There are also facilities here for populating a dictionary with all schemas in a table set (the kind of thing returned by load_table_set in sheet_utils.py) in order to pre-process it as a metadata bundle for checking purposes.
- New functions:
  - validate_data_against_schemas to validate that table sets (workbooks, or the equivalent) have rows in each tab conforming to the schema for that tab.
  - summary_of_data_validation_errors to summarize the errors obtained from validate_data_against_schemas.
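A hypothetical end-to-end sketch tying the new modules together; the input filename, exact call signatures, and return values are assumptions for illustration:

from dcicutils.sheet_utils import load_table_set
from dcicutils.validation_utils import (
    summary_of_data_validation_errors,
    validate_data_against_schemas,
)

# Schemas referenced by tab name are fetched through SchemaManager,
# so each one is downloaded at most once and then served from cache.
table_set = load_table_set("my_workbook.xlsx")  # hypothetical input file
errors = validate_data_against_schemas(table_set)
if errors:
    print(summary_of_data_validation_errors(errors))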