Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

instrument NADE commands using metric spans #1844

Merged
merged 30 commits into from
Nov 22, 2024

Conversation

sfc-gh-mchok
Copy link
Collaborator

@sfc-gh-mchok sfc-gh-mchok commented Nov 8, 2024

Pre-review checklist

  • I've confirmed that instructions included in README.md are still correct after my changes in the codebase.
  • I've added or updated automated unit tests to verify correctness of my new code.
  • I've added or updated integration tests to verify correctness of my new code.
  • I've confirmed that my changes are working by executing CLI's commands manually on MacOS.
  • I've confirmed that my changes are working by executing CLI's commands manually on Windows.
  • I've confirmed that my changes are up-to-date with the target branch.
  • I've described my changes in the release notes.
  • I've described my changes in the section below.

Changes description

  • Added metric spans to telemetry payload being sent
  • Instrumented native app commands with metric spans API
  • Added integ tests to verify the proper spans are being sent in telemetry for example commands (see tests_integration/nativeapp/test_metrics.py for examples of traces being sent)
  • Refactored test files to fit with convention, combining metrics counters + spans tests

An example trace of snow app run looks like the following (truncated for brevity):

{
        'message': {
            'command_execution_time': 30.307388332905248,
            'counters': {
                'features.global.pdf_templates': 1,
                'features.global.templates_processor': 1,
                'features.app.snowpark_processor': 0,
                'features.app.post_deploy_scripts': 1,
                'features.app.event_sharing': 0,
                'features.app.event_sharing_warning': 0,
                'features.app.event_sharing_error': 0
            },
            'spans': {
                'completed_spans': [{
                    'id': 'a45d527defc44719a0b0d8e8ab9d02b5',
                    'name': 'action.app.deploy',
                    'start_time': 0.6444029589183629,
                    'parent': None,
                    'parent_id': None,
                    'execution_time': 27.891165500041097,
                    'error': None,
                    'span_count_in_subtree': 10,
                    'span_depth': 1,
                    'trimmed': False
                }, {
                    'id': '8dd033f84f4f481b8542fff24a759ccf',
                    'name': 'action.app_pkg.deploy',
                    'start_time': 0.644967291969806,
                    'parent': 'action.app.deploy',
                    'parent_id': 'a45d527defc44719a0b0d8e8ab9d02b5',
                    'execution_time': 11.048065917100757,
                    'error': None,
                    'span_count_in_subtree': 7,
                    'span_depth': 2,
                    'trimmed': False
                }, {
                    'id': '088ae6c6ac0947e9ac8796fe83e60dd9',
                    'name': 'bundle',
                    'start_time': 0.6450087090488523,
                    'parent': 'action.app_pkg.deploy',
                    'parent_id': '8dd033f84f4f481b8542fff24a759ccf',
                    'execution_time': 0.0043289999011904,
                    'error': None,
                    'span_count_in_subtree': 1,
                    'span_depth': 3,
                    'trimmed': False
                }, {
                    'id': 'ae49cdc6fe944579a057fa4850d92c1d',
                    'name': 'artifact_processors',
                    'start_time': 0.6495900419540703,
                    'parent': 'action.app_pkg.deploy',
                    'parent_id': '8dd033f84f4f481b8542fff24a759ccf',
                    'execution_time': 0.00230620801448822,
                    'error': None,
                    'span_count_in_subtree': 2,
                    'span_depth': 3,
                    'trimmed': False
                }, {
                    'id': 'b83c2257cedd41a38a89bc79b60f0214',
                    'name': 'templates_processor',
                    'start_time': 0.6496683340519667,
                    'parent': 'artifact_processors',
                    'parent_id': 'ae49cdc6fe944579a057fa4850d92c1d',
                    'execution_time': 0.0022297080140560865,
                    'error': None,
                    'span_count_in_subtree': 1,
                    'span_depth': 4,
                    'trimmed': False
                }, {
                    'id': '399ba61144824d25861401573f9447c4',
                    'name': 'sync_deploy_root_with_stage',
                    'start_time': 2.6214112089946866,
                    'parent': 'action.app_pkg.deploy',
                    'parent_id': '8dd033f84f4f481b8542fff24a759ccf',
                    'execution_time': 5.49098225007765,
                    'error': None,
                    'span_count_in_subtree': 1,
                    'span_depth': 3,
                    'trimmed': False
                }, {
                    'id': '483b5fe50c1c4e6ea4e81598a0bf926d',
                    'name': 'post_deploy_hooks',
                    'start_time': 8.112657208926976,
                    'parent': 'action.app_pkg.deploy',
                    'parent_id': '8dd033f84f4f481b8542fff24a759ccf',
                    'execution_time': 3.1565516660921276,
                    'error': None,
                    'span_count_in_subtree': 1,
                    'span_depth': 3,
                    'trimmed': False
                }, {
                    'id': 'cd065a3c3148457d82d45f9b62d3ec62',
                    'name': 'validate_setup_script',
                    'start_time': 11.27043924992904,
                    'parent': 'action.app_pkg.deploy',
                    'parent_id': '8dd033f84f4f481b8542fff24a759ccf',
                    'execution_time': 0.422550291987136,
                    'error': None,
                    'span_count_in_subtree': 1,
                    'span_depth': 3,
                    'trimmed': False
                }, {
                    'id': '99829e4b459b48fb92e111c951f7f58c',
                    'name': 'update_app_object',
                    'start_time': 11.693080459022895,
                    'parent': 'action.app.deploy',
                    'parent_id': 'a45d527defc44719a0b0d8e8ab9d02b5',
                    'execution_time': 16.84247687482275,
                    'error': None,
                    'span_count_in_subtree': 2,
                    'span_depth': 2,
                    'trimmed': False
                }, {
                    'id': '7f79381f00f442bc8d499e7950ca9ac0',
                    'name': 'post_deploy_hooks',
                    'start_time': 22.447485291864723,
                    'parent': 'update_app_object',
                    'parent_id': '99829e4b459b48fb92e111c951f7f58c',
                    'execution_time': 6.088065165793523,
                    'error': None,
                    'span_count_in_subtree': 1,
                    'span_depth': 3,
                    'trimmed': False
                }, {
                    'id': '71938e6500ac427db621280c229bdf4e',
                    'name': 'get_snowsight_url_for_app',
                    'start_time': 28.535641624825075,
                    'parent': None,
                    'parent_id': None,
                    'execution_time': 1.8380501249339432,
                    'error': None,
                    'span_count_in_subtree': 1,
                    'span_depth': 1,
                    'trimmed': False
                }],
                'num_spans_past_depth_limit': 0,
                'num_spans_past_total_limit': 0
            }
        }

@sfc-gh-mchok sfc-gh-mchok self-assigned this Nov 8, 2024
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file renamed from test_feature_metrics since it was going to be too large with the spans, and also converted the tests to use the factories

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the test directory mirrors the layout of the source directory so that tests are easy to locate. Breaking down the files like this because of some arbitrary size constraint makes things easier to manage, and makes it easier for us to review the changes because we can no longer see the diff. Please follow the convention instead.

Copy link
Contributor

@sfc-gh-fcampbell sfc-gh-fcampbell Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't follow this comment, did you mean to say it makes it harder if we split the test file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, I merged the new integ tests and also the old unit tests to test_metrics.py in their respective folders, and also adjusted the codeowners file accordingly, let me know if this works for you!

@sfc-gh-mchok sfc-gh-mchok marked this pull request as ready for review November 11, 2024 22:21
@sfc-gh-mchok sfc-gh-mchok requested review from a team as code owners November 11, 2024 22:21
[
(
"app deploy",
lambda: ProjectV11Factory(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the best I could come up with to have parametrization with factories, if anyone has a better approach I'd be interested

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we need parametrized tests at all? I don't think I can read this anymore. I think overall we make way, way too frequent use of parametrized tests in our codebase, but that's a different story.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the arguments to the factories are the same, you can parametrize just on the class and instantiate it in the test

@pytest.mark.parametrize("factory_class", [FooFactory, BarFactory])
def test_factory(factory_class):
    obj = factory_class(arg=123)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up getting rid of all the parametrization and separating the tests out since the factories didn't have the same arguments, I think it's more clear and separated this way, let me know what you think

@sfc-gh-bdufour
Copy link
Contributor

description + pre-review checklist aren't optional, please fill them in

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the test directory mirrors the layout of the source directory so that tests are easy to locate. Breaking down the files like this because of some arbitrary size constraint makes things easier to manage, and makes it easier for us to review the changes because we can no longer see the diff. Please follow the convention instead.

@@ -88,7 +88,7 @@ def compile_artifacts(self):
if not self._should_invoke_processors():
return

with cc.phase("Invoking artifact processors"):
with cc.phase("Invoking artifact processors", span_name="compile_artifacts"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no compilation happening. How about simply "artifact_processors"? We should probably also have individual processors as sub-spans.

@@ -213,6 +214,26 @@ def get_cli_context() -> _CliGlobalContextAccess:
return _CliGlobalContextAccess(get_cli_context_manager())


def start_cli_metrics_span(span_name: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "start" makes sense here, since this both starts and stops the span at the appropriate time. Let's workshop the name of this + the start_span method it calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I feel @span would suffice

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed both the method name and the decorator name to just span, let me know if this works for you

@@ -420,6 +421,7 @@ def get_objects_owned_by_application(self) -> List[ApplicationOwnedObject]:
).fetchall()
return [{"name": row[1], "type": row[2]} for row in results]

@start_cli_metrics_span("create_or_upgrade_app")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spans shouldn't be implementation-specific, but rather represent a logic step in the overall operation being performed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted the names and locations of all the spans to better encapsulate this, let me know if they work for you!

@@ -915,6 +916,7 @@ def validate_setup_script(
if validation_result["status"] == "FAIL":
raise SetupScriptFailedValidation()

@start_cli_metrics_span("get_validation_result")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, let's not name things according to the implementation. That can change over time.

@@ -68,6 +68,7 @@ def phase(
self,
enter_message: str,
exit_message: Optional[str] = None,
span_name: Optional[str] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is truly needed, since we can use multiple context managers on the same line. Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about that feature, I adjusted these instances, thanks for the reminder

@@ -304,6 +308,7 @@ def validation_item_to_str(item: dict[str, str | int]):
return s


@start_cli_metrics_span("drop_generic_object")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

[
(
"app deploy",
lambda: ProjectV11Factory(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we need parametrized tests at all? I don't think I can read this anymore. I think overall we make way, way too frequent use of parametrized tests in our codebase, but that's a different story.

… un-parametrize tests, merge test files to keep with covention
@sfc-gh-mchok
Copy link
Collaborator Author

description + pre-review checklist aren't optional, please fill them in

My mistake, I usually do it at the same time as when I make the PR but I started it as a draft this time and forgot to do it when I set it as ready for review, should be populated now!

def mock_time_monotonic():
with mock.patch("time.monotonic", side_effect=count()) as mocked_time_monotonic:
yield mocked_time_monotonic
def test_metrics_no_counters():
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All counter tests moved from test_metrics_counters.py to consolidate the files and keep to the convention


@pytest.mark.integration
@mock.patch("snowflake.connector.telemetry.TelemetryClient.try_add_log_to_batch")
def test_feature_counters_v2_post_deploy_not_available_in_bundle(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counter tests moved from test_feature_metrics.py to consolidate them in one file and keep to the convention, and converted to use factories

@@ -649,6 +650,7 @@ def resolve_without_follow(path: Path) -> Path:
return Path(os.path.abspath(path))


@span("build_initial_bundle")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just bundle ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my head bundle is the higher level container that does both this step (populating the deploy root) and then applying all the processors on it, but if we're fine just having the first step be called bundle then that's good with me

@@ -355,6 +355,7 @@ def post_deploy_hooks(self) -> list[PostDeployHook] | None:
model = self._entity_model
return model.meta and model.meta.post_deploy

@span("deploy_app")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should consider some kind of standardized scheme for actions + entities, maybe something like action.app.deploy ?

@@ -544,6 +546,7 @@ def _bundle(self):
compiler.compile_artifacts()
return bundle_map

@span("deploy_app_package")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would rather do this on the action for consistency

@@ -68,7 +68,11 @@ def _format_message(self, message: str, output: Output) -> Text:
return text

@contextmanager
def phase(self, enter_message: str, exit_message: Optional[str] = None):
def phase(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're no longer touching this, I'd revert the file entirely

@@ -76,6 +76,7 @@ def _get_stage_paths_to_sync(
return stage_paths


@span("sync_remote_and_local_files")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one of the rare cases where I'd use the actual function name, since it's exactly the logical concept we're timing.

spans_to_report,
# start_time can be the same, so we want to sort by something more
# deterministic, so we use a counter to see which spans are started first
key=lambda span: span.creation_key,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why we need this global counter. My reasoning is this: assume we have a stable sort. Then, given a list of spans, we can first sort by start time, then by end time. If two start times are the same, it's unlikely that they'll have the same end time. Even if they do, that means we have two spans that are identical, and the stable sort will keep them in the same order anyway. What am I missing?

Copy link
Collaborator Author

@sfc-gh-mchok sfc-gh-mchok Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem isn't that we don't have a stable sort (I believe python's builtins are stable), it's that we reorder (i.e. break the creation ordering of) the spans in the first sort pass where we take spans breadth first to ensure we conform to the limits in this code block

For example, given the following span tree where the creation order is [span1, span2, span3]:

├── span1
│   └── span2
└── span3

Then the first pass will reorder the spans to become breadth-first [span1, span3, span2] as the spans_to_report, since we won't know which spans will be reported until we go through them

CMIIW but I believe this means we'll need some sort of persistent auxiliary structure to keep track of the spans in creation order, to re-sort after the first pass, unless we can say we trust the start/end_times being reported

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: requirement to have spans sorted by start_time during reporting was waived, the new requirement is to only do one sort, and keep the spans sorted by (depth, start_time, end_time), in that order for tiebreaking

@@ -99,6 +100,9 @@ class for holding metrics span data and encapsulating related operations
# denotes whether direct children were trimmed from telemetry payload
TRIMMED_KEY: ClassVar[str] = "trimmed"

# counter for sorting by creation order
_CREATION_COUNTER: ClassVar[count] = count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global counters that never get reset are generally not a good idea

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented here

src/snowflake/cli/api/metrics.py Show resolved Hide resolved
}


def assert_spans_within_limits(spans: Dict) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move all helpers to the top, before any test cases. It's really hard to find them in between test cases.

tests_integration/nativeapp/test_metrics.py Show resolved Hide resolved
for span in self._completed_spans
if span.span_depth <= self.SPAN_DEPTH_LIMIT
),
key=lambda span: (span.span_depth, span.start_time, span.execution_time),
Copy link
Collaborator Author

@sfc-gh-mchok sfc-gh-mchok Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, only one sort, however the last tiebreaker is execution_time here instead of end_time since they are logically equivalent and we only track execution time explicitly, not end time

sfc-gh-bdufour
sfc-gh-bdufour previously approved these changes Nov 21, 2024
"""
Decorator to start a command metrics span that encompasses a whole function

Must be used instead of directly calling @get_cli_context().metrics.start_span(span_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's span now, not start_span

@sfc-gh-mchok sfc-gh-mchok merged commit ca7c5ac into main Nov 22, 2024
19 checks passed
@sfc-gh-mchok sfc-gh-mchok deleted the mchok-metric-spans-instrument-nade-cmds branch November 22, 2024 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants