Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws_credentials_http: Add support for EKS Pod Identities #9013

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

iandrewt
Copy link
Contributor

@iandrewt iandrewt commented Jun 26, 2024

This PR initially includes two commits, as I had to patch flb_utils for IPv6 clusters to work.

This patch rewrites how the HTTP credentials provider works to allow both ECS and EKS identities to work. It is based on the aws-sdk-go-v2 implementation.

It validates that the endpoint is correct if the transport is HTTP, but does not support DNS resolution, however based on how the pod identity agent works today, DNS should not be needed. If the transport is HTTPS, which will not be the case when using EKS Pod Identities today, any endpoint is allowed. This is in line with how the AWS SDK works.

Similarly to the SDK, it also reads the authentication token environment variables, with the file taking precedence over the raw token variable.

This has been tested against an EKS 1.30 cluster with AL2023 nodes, but it would be great if someone else could test this too rather than just taking my word for it 😄. I did test with an ECS cluster as well to ensure that did not break as a result of this change, but I don't generally use ECS, so if someone else could validate this that would be great.

Fixes #8550


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • [N/A] Example configuration file for the change - installed on EKS with the fluent-bit helm chart with an image override, with an S3 destination, with no annotation on the service account, a pod identity configured, and the node role not having permission to use S3 (I don't think I set the IMDS hop limit to 1 in my testing)
  • Debug log output from testing the change
[2024/06/26 12:50:33] [ info] [output:s3:s3.0] Using upload size 100000000 bytes
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] creating profile (null) provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized AWS Profile Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] Not initializing EKS provider because AWS_ROLE_ARN was not set
[2024/06/26 12:50:33] [debug] [aws_credentials] Configuring HTTP provider with http://[fd00:ec2::23]/v1/credentials
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized HTTP Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] Sync called on the http provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Sync called on the EC2 provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Init called on the env provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Init called on the profile provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Reading shared config file.
[2024/06/26 12:50:33] [debug] [aws_credentials] Shared config file /root/.aws/config does not exist
[2024/06/26 12:50:33] [debug] [aws_credentials] Reading shared credentials file.
[2024/06/26 12:50:33] [debug] [aws_credentials] Shared credentials file /root/.aws/credentials does not exist
[2024/06/26 12:50:33] [debug] [aws_credentials] Init called on the http provider
[2024/06/26 12:50:33] [debug] [aws_credentials] loading auth token file
[2024/06/26 12:50:33] [debug] [http_client] not using http_proxy for header
[2024/06/26 12:50:34] [debug] [aws_credentials] upstream_set called on the http provider
[2024/06/26 12:50:34] [debug] [aws_credentials] upstream_set called on the EC2 provider
  • Attached Valgrind output that shows no leaks or memory corruption was found
    Not sure how I'd test this in-cluster, but the tests for the credential provider should be evidence enough. If more test cases are needed I'm happy to write them
valgrind --leak-check=full ./bin/flb-it-aws_credentials_http
==726645== Memcheck, a memory error detector
==726645== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==726645== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==726645== Command: ./bin/flb-it-aws_credentials_http
==726645== 
Test test_http_provider...                      [2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
[2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_http_provider_auth_token...           [2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
[2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_local_http_provider...                [2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
[2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_http_provider_error_case...           [2024/06/26 22:52:50] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/06/26 22:52:50] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_http_provider_malformed_response...   [2024/06/26 22:52:51] [error] [aws_credentials] Could not parse credentials response - invalid JSON.
[2024/06/26 22:52:51] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/06/26 22:52:51] [error] [aws_credentials] Could not parse credentials response - invalid JSON.
[2024/06/26 22:52:51] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/06/26 22:52:51] [error] [aws_credentials] Could not parse credentials response - invalid JSON.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
SUCCESS: All unit tests have passed.
==726645== 
==726645== HEAP SUMMARY:
==726645==     in use at exit: 0 bytes in 0 blocks
==726645==   total heap usage: 8,637 allocs, 8,637 frees, 895,672 bytes allocated
==726645== 
==726645== All heap blocks were freed -- no leaks are possible
==726645== 
==726645== For lists of detected and suppressed errors, rerun with: -s
==726645== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1400

Backporting

  • [N/A] Backport to latest stable release.
    I'm fine with this being in a minor release instead of a patch, given the testing it should receive

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

iandrewt added 2 commits June 26, 2024 22:29
This patch rewrites how the HTTP credentials provider works to allow
both ECS and EKS identities to work. It is based on the aws-sdk-go-v2
implementation.

It validates that the endpoint is correct if the transport is HTTP, but
does not support DNS resolution, however based on how the pod identity
agent works today, DNS should not be needed. If the transport is HTTPS,
which will not be the case when using EKS Pod Identities, any endpoint
is allowed. This is in line with how the AWS SDK works.

Similarly to the SDK, it also reads the authentication token environment
variables, with the file taking precedence over the raw token variable.

This has been tested against an EKS 1.30 cluster with AL2023 nodes.

Signed-off-by: Andrew Titmuss <[email protected]>
This patch fixes the handling of IPv6 literals, which would fail to
parse after not finding balanced `[]` characters.

This happened because it searched for the first `:` before copying the
host portion, so (assuming it didn't return NULL), the endpoint
`[fd00:ec2::23]` would become `fd00`.

I've added some tests around this case, but I wouldn't describe the
tests as in-depth.

This patch is needed for EKS Pod Identities to work on IPv6 clusters.

Signed-off-by: Andrew Titmuss <[email protected]>
@patrick-stephens
Copy link
Contributor

I suspect we probably need some docs updates for this as well, or at least a simple example of how to use it? Can we link a docs PR?

@iandrewt
Copy link
Contributor Author

I found this while searching for docs to update, but I can't see it referenced anywhere on docs.fluentbit.io. Do you want a link added to this page somewhere as well?

https://github.com/fluent/fluent-bit-docs/blob/43c4fe134611da471e706b0edb2f9acd7cdfdbc3/administration/aws-credentials.md

@patrick-stephens
Copy link
Contributor

I found this while searching for docs to update, but I can't see it referenced anywhere on docs.fluentbit.io. Do you want a link added to this page somewhere as well?

https://github.com/fluent/fluent-bit-docs/blob/43c4fe134611da471e706b0edb2f9acd7cdfdbc3/administration/aws-credentials.md

I think it was linked from the S3 or similar pages, I'm sure I found it before. It really should be a docs page in the index probably too.

@iandrewt
Copy link
Contributor Author

Added the AWS Credentials docs to the index: fluent/fluent-bit-docs#1400

@iandrewt
Copy link
Contributor Author

Looks like @PettitWesley has been working on the same feature for the aws/aws-for-fluent-bit distro: PettitWesley#32

Happy if you want to wait for that PR instead, it seems to handle some things differently to how I implemented it. For instance, my implementation reads the auth token environment variables once at startup (file path, not the contents) rather than on each refresh, which is in line with how I saw it implemented in the AWS SDKs. However, mine might not handle tokens with \r\n correctly (this should be an error case)

In any case, I definitely learned a lot about how the credential chain works by implementing this.

iandrewt added 2 commits June 29, 2024 11:51
This error handler is for JSON APIs that respond with a Code and Message
field in their error responses.

It is originally from PettitWesley/fluent-bit@7a2c1a8

Signed-off-by: Andrew Titmuss <[email protected]>
This patch fixes the following:
1. ensures malformed tokens (containing `\r\n`) are treated as invalid
2. fixes allowing any https host, which previously went through the same
   checks as http hosts
3. fixes a memory leak when provider initialisation fails
4. logs errors from the HTTP agent
5. rewrites the tests using the cases from PettitWesley/fluent-bit@195db30

Signed-off-by: Andrew Titmuss <[email protected]>
@iandrewt
Copy link
Contributor Author

I've updated my PR with some improved error handling and rewritten tests that validate the unhappy path better, after looking at how my PR and @PettitWesley's branch differs. Running valgrind against the test cases reports no memory leaks, including fixing one that wasn't caught by the original tests I had written.

Comment on lines +158 to +159
// Loopback
// IPv4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fluent Bit code style is to always use /* */ comments

Copy link
Contributor

@PettitWesley PettitWesley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes that you linked that I made for the AWS Distro were also made on the master branch and tested thoroughly. Apologies that I forgot to put up a PR against this repo. Here is the PR: https://github.com/fluent/fluent-bit/pull/9206/files

This PR looks very very similar to my changes, which a few renamings and slight differences.

I strongly prefer the change I made in flb_utils to use SDS strings as opposed to this PR which modifies the existing flb_utils C string code- that must be thoroughly tested to avoid negative impact on other parts of this project.

If you continue with this work, please rebase your changes on top of my commits so we both get credit. I am unable to test this change right now, so help with testing is appreciated.

@iandrewt
Copy link
Contributor Author

That's all good, I'm fine with your changes. As I had mentioned earlier, I had made my own implementation of the feature before finding your branch, then updated mine to account the behavioural differences between the two.

Happy for this to be closed in favour of #9206 - I'll give it a whirl in my cluster too.

Copy link
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Dec 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support AWS pod identity
3 participants