Feature request: Add no-op support for collector lambda layer #1181

jerrytfleung · 2024-03-06T19:28:20Z

Is your feature request related to a problem? Please describe.
If a component's Config.Validate() returns an error, the collector Lambda layer cannot start in AWS Lambda. As a result, the user's Lambda function is broken.

Describe the solution you'd like
Depending on the component, an invalid component configuration may not need to fail the whole collector Lambda layer. We could let that component run as a no-op instead.

Describe alternatives you've considered
I tried removing all config validation logic from the component and moving it to the Start function, so that an invalid config just prints a message instead of failing. However, the opentelemetry-collector-contrib code reviewer would like to check whether there is another way to go.
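
As a rough illustration of that alternative (validation deferred to start, with the component degrading instead of failing), the pattern could look like the sketch below. It is written in Python only to keep it short; the collector itself is Go, and `SettingsConfig`, `SettingsExtension`, and the `endpoint`/`key` fields are hypothetical names, not the actual component's API.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("noop-sketch")


@dataclass
class SettingsConfig:
    """Hypothetical component config; the field names are illustrative only."""
    endpoint: str = ""
    key: str = ""

    def problems(self) -> list:
        """Return a list of validation problems instead of raising."""
        found = []
        if ":" not in self.endpoint:
            found.append("endpoint must be host:port")
        if not self.key:
            found.append("key must not be empty")
        return found


class SettingsExtension:
    """Hypothetical component that degrades to a no-op on invalid config."""

    def __init__(self, config: SettingsConfig):
        self.config = config
        self.active = False

    def start(self) -> None:
        problems = self.config.problems()
        if problems:
            # Log loudly and stay inactive rather than aborting startup.
            logger.error("invalid config, running as no-op: %s", "; ".join(problems))
            return
        self.active = True

    def refresh(self) -> str:
        # Every operation short-circuits while the component is a no-op.
        if not self.active:
            return "skipped"
        return f"fetched settings from {self.config.endpoint}"
```

The trade-off is visible in `start`: the only signal of a misconfiguration is a log line, which is the concern raised later in this thread.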

Additional context
PR review comment
The component PR


serkan-ozal · 2024-09-01T21:02:31Z

I am not sure that switching to noop mode when the configuration is invalid is the correct approach. It might be confusing for users and, as far as I know, it doesn't align with how OTEL configurations are handled.

Instead of noop, default values could be used for invalid configs, failing fast if there is no default value for the invalid config.

WDYT @tylerbenson?
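
A minimal sketch of the "fall back to defaults, otherwise fail fast" idea, assuming a flat key/value config. The setting names, `DEFAULTS`, `VALIDATORS`, and `InvalidConfigError` are all hypothetical and do not reflect how the collector actually resolves configuration.

```python
import logging

logger = logging.getLogger("defaults-sketch")

# Hypothetical settings: only these have a safe default to fall back to.
DEFAULTS = {"timeout_seconds": 5, "batch_size": 100}

# Hypothetical per-setting validators.
VALIDATORS = {
    "timeout_seconds": lambda v: isinstance(v, int) and v > 0,
    "batch_size": lambda v: isinstance(v, int) and v > 0,
    "endpoint": lambda v: isinstance(v, str) and v.startswith("https://"),
}


class InvalidConfigError(ValueError):
    """Raised when an invalid setting has no default to fall back to."""


def resolve(raw: dict) -> dict:
    """Keep valid values, substitute defaults where possible, else fail fast."""
    resolved = {}
    for name, check in VALIDATORS.items():
        value = raw.get(name)
        if check(value):
            resolved[name] = value
        elif name in DEFAULTS:
            logger.warning("invalid %s=%r, using default %r", name, value, DEFAULTS[name])
            resolved[name] = DEFAULTS[name]
        else:
            raise InvalidConfigError(f"invalid {name}={value!r} and no default exists")
    return resolved
```

Here an invalid `timeout_seconds` is replaced by its default with a warning, while an invalid `endpoint` aborts startup because nothing sensible can be substituted for it.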

cheempz · 2024-09-03T17:58:19Z

Adding some more context: it's reasonable for otelcol outside of Lambda to fail fast on invalid config, since the only consequence is that the collector doesn't run, and it doesn't bring down the entire host. But in Lambda, the otelcol extension failing means the entire Lambda runtime crashes, kind of like crashing the entire VM because otelcol didn't start. To me this is a pretty terrible user experience.

tylerbenson · 2024-09-04T16:46:12Z

I can see both arguments here, though I'm leaning towards fail fast being the better option. Might be worth discussing in the SIG meeting.

Lambda versions are generally immutable, so it's nice to know immediately if you configured something wrong. If a deployment is urgent, the rollback can be as easy as removing the collector layer and redeploying.

serkan-ozal · 2024-09-04T17:05:22Z

> But in Lambda, the otelcol extension failing means the entire Lambda runtime crashes, kind of like crashing the entire VM because otelcol didn't start.

BTW, I am really not sure whether the entire Lambda environment crashes if/when an extension fails gracefully (by calling the /extension/init/error endpoint).

serkan-ozal · 2024-09-04T17:06:13Z

Also, AWS Lambda encourages extensions to fail fast: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html#runtimes-extensions-init-error
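
For reference, reporting an init failure through that endpoint might look roughly like the sketch below. The path prefix and header names follow my reading of the linked Extensions API documentation and should be checked against it. The helper names and the `Extension.ConfigInvalid` error type are illustrative, and none of this is taken from the collector extension's actual code.

```python
import json
import os
import urllib.request

API_VERSION = "2020-01-01"


def build_init_error(runtime_api, extension_id, error_type, message):
    """Describe the POST that reports an extension init failure to Lambda."""
    return {
        "url": f"http://{runtime_api}/{API_VERSION}/extension/init/error",
        "headers": {
            # Identifier returned when the extension registered itself.
            "Lambda-Extension-Identifier": extension_id,
            "Lambda-Extension-Function-Error-Type": error_type,
        },
        "body": json.dumps({"errorMessage": message, "errorType": error_type}).encode(),
    }


def report_init_error(extension_id, error_type, message):
    """Send the report; only meaningful inside a Lambda execution environment."""
    spec = build_init_error(
        os.environ["AWS_LAMBDA_RUNTIME_API"], extension_id, error_type, message
    )
    request = urllib.request.Request(
        spec["url"], data=spec["body"], headers=spec["headers"], method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

After reporting the error, the extension process is expected to exit.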

cheempz · 2024-09-04T20:21:44Z

That's good to know re: the /extension/init/error endpoint; it seems the otelcol extension is already using it. From a quick test of an otelcol extension with a misconfigured pipeline, it doesn't result in a crash but in an Extension.InitError:

Test Event Name
(unsaved) test event

Response
{
  "errorType": "Extension.InitError",
  "errorMessage": "RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Error: exit code 0"
}

Function Logs
TELEMETRY	Name: collector	State: Subscribed	Types: [Platform]
{"level":"warn","ts":1725481081.1169627,"logger":"lifecycle.manager","msg":"Failed to start the extension","error":"invalid configuration: service::pipelines::logs: references receiver \"telemetryapi\" which is not configured"}
EXTENSION	Name: collector	State: InitError	Events: [INVOKE, SHUTDOWN]
INIT_REPORT Init Duration: 439.08 ms	Phase: init	Status: error	Error Type: Extension.InitError
TELEMETRY	Name: collector	State: Already subscribed	Types: [Platform]
{"level":"warn","ts":1725481086.9609792,"logger":"lifecycle.manager","msg":"Failed to start the extension","error":"unable to start, otelcol state is Closed"}
EXTENSION	Name: collector	State: InitError	Events: [INVOKE, SHUTDOWN]
INIT_REPORT Init Duration: 5971.95 ms	Phase: invoke	Status: error	Error Type: Extension.InitError
START RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Version: $LATEST
RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Error: exit code 0
Extension.InitError
END RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7
REPORT RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7	Duration: 6031.77 ms	Billed Duration: 6032 ms	Memory Size: 128 MB	Max Memory Used: 75 MB

Request ID
32e01a48-1b65-4559-9ba6-ec4620f689d7

Still, the end result is that the application is unavailable, and I do think that's pretty disruptive even given the recourse available. It goes against the expectation that observability tools strive to cause as little disruption to the application as possible.

serkan-ozal · 2024-09-04T20:36:06Z

I still prefer failing fast when something is not configured properly. Pre-prod environments are there to catch such cases before they happen in production. If the problem is silently ignored (even though there are error logs), I am pretty sure that most people and companies will not notice it until they find out, some time later, that they have missing traces.

I agree that both of the approaches have their own pros and cons, but IMO, being aware of the issues earlier is more important than suppressing them.
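
One way a pre-prod check along these lines might be sketched: invoke the function once after deployment and inspect the response payload for the `Extension.InitError` shown in the test output earlier in this thread. The function name is hypothetical, and how the payload is obtained (console, CLI, SDK) is left out on purpose.

```python
import json


def extension_init_failed(payload: str) -> bool:
    """Return True if an invoke response reports an extension init failure."""
    try:
        body = json.loads(payload)
    except json.JSONDecodeError:
        # Not JSON at all, so it cannot be the error document we look for.
        return False
    return isinstance(body, dict) and body.get("errorType") == "Extension.InitError"


# Response payload copied from the test invocation quoted in this thread.
sample = (
    '{"errorType": "Extension.InitError", '
    '"errorMessage": "RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Error: exit code 0"}'
)
if extension_init_failed(sample):
    print("collector extension failed to initialize; block the rollout")
```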

jerrytfleung added the enhancement New feature or request label Mar 6, 2024

jerrytfleung changed the title ~~Add no-op support for collector lambda layer~~ Feature request: Add no-op support for collector lambda layer Mar 6, 2024

jerrytfleung mentioned this issue Mar 6, 2024

[extension/solarwindsapmsettingsextension] Added part one of the concrete implementation of solarwindsapmsettingsextension open-telemetry/opentelemetry-collector-contrib#30788

Merged
