Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MIGRATION ISSUE: s3: Can no longer use Accept-Encoding: gzip on GetObject #2848

Closed
2 tasks done
ncw opened this issue Oct 22, 2024 · 5 comments
Closed
2 tasks done

MIGRATION ISSUE: s3: Can no longer use Accept-Encoding: gzip on GetObject #2848

ncw opened this issue Oct 22, 2024 · 5 comments
Assignees
Labels
p3 This is a minor priority issue response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. v1-v2-inconsistency v1-v2-inconsistency Behavior has changed from v1 to v2, or feature is missing altogether

Comments

@ncw
Copy link

ncw commented Oct 22, 2024

Pre-Migration Checklist

Go Version Used

go1.23.2

Describe the Migration Issue

In SDK v1 it was possible to specify Accept-Encoding: gzip on a GET of an s3 object.

This is necessary because without it both Cloudflare R2 and the S3 compatible interface to Google Cloud Storage will transparently decompress files stored with Content-Encoding: gzip and without Cache-Control: no-transform and rclone wants to be able to download the files as uploaded otherwise the MD5SUM in the ETags don't match.

This is explained in the google cloud storage docs. Cloudflare R2 appears to work the same way but I can't find the docs for that. as this pending commit shows.

The SDK v1 let users add Accept-Encoding: gzip however the SDK v2 seems to overwrite it with Accept-Encoding: identity whatever I try.

This had lead to this rclone issue rclone/rclone#8137

Code Comparison

V1 code snippet to add Accept-Encoding: gzip

	req := s3.GetObjectInput{
		Bucket:    &bucket,
		Key:       &bucketPath,
		VersionId: o.versionID,
	}
	if o.fs.opt.RequesterPays {
		req.RequestPayer = aws.String(s3.RequestPayerRequester)
	}
        //...
	httpReq, resp := o.fs.c.GetObjectRequest(&req)
	fs.FixRangeOption(options, o.bytes)

	// Override the automatic decompression in the transport to
	// download compressed files as-is
	if o.fs.opt.UseAcceptEncodingGzip.Value {
		httpReq.HTTPRequest.Header.Set("Accept-Encoding", "gzip")
	}

        // ...

	err = o.fs.pacer.Call(func() (bool, error) {
		var err error
		httpReq.HTTPRequest = httpReq.HTTPRequest.WithContext(ctx)
		err = httpReq.Send()
		return o.fs.shouldRetry(ctx, err)
	})

and the converted to the SDKv2 code

	req := s3.GetObjectInput{
		Bucket:    &bucket,
		Key:       &bucketPath,
		VersionId: o.versionID,
	}
	if o.fs.opt.RequesterPays {
		req.RequestPayer = types.RequestPayerRequester
	}
        //...
	fs.FixRangeOption(options, o.bytes)

	var APIOptions []func(*middleware.Stack) error

	// Override the automatic decompression in the transport to
	// download compressed files as-is
	if o.fs.opt.UseAcceptEncodingGzip.Value {
                // NB I can change this to any other header and it gets added, but `Accept-Encoding: gzip` is ignored
		APIOptions = append(APIOptions, smithyhttp.SetHeaderValue("Accept-Encoding", "gzip"))
	}

        //...

	var resp *s3.GetObjectOutput
	err = o.fs.pacer.Call(func() (bool, error) {
		var err error
		resp, err = o.fs.c.GetObject(ctx, &req, s3.WithAPIOptions(APIOptions...))
		return o.fs.shouldRetry(ctx, err)
	})

Observed Differences/Errors

With SDK v1 we see Accept-Encoding: gzip in the REQUEST

2024/10/22 11:00:27 DEBUG : HTTP REQUEST (req 0xc0009aa6c0)
2024/10/22 11:00:27 DEBUG : GET /rclone1/file.txt HTTP/1.1
Host: 14aad7c9ed489151b51557e321b246cf.r2.cloudflarestorage.com
User-Agent: rclone/v1.67.0
Accept-Encoding: gzip
Authorization: XXXX
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20241022T100027Z

2024/10/22 11:00:27 DEBUG : HTTP RESPONSE (req 0xc0009aa6c0)
2024/10/22 11:00:27 DEBUG : HTTP/1.1 200 OK
Content-Length: 3899
Accept-Ranges: bytes
Cf-Ray: 8d68a3f3bfeebea0-LHR
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/plain; charset=utf-8
Date: Tue, 22 Oct 2024 10:00:27 GMT
Etag: "1ca665693609f32ce05e6257c6a3b28d"
Last-Modified: Tue, 22 Oct 2024 09:59:55 GMT
Server: cloudflare
Vary: Accept-Encoding
X-Amz-Meta-Mtime: 1729590134.775546531

And with SDKv2 - Note the Accept-Encoding has been turned into identity in the REQUEST

2024/10/22 11:02:46 DEBUG : HTTP REQUEST (req 0xc000b1b540)
2024/10/22 11:02:46 DEBUG : GET /rclone1/file.txt?x-id=GetObject HTTP/1.1
Host: 14aad7c9ed489151b51557e321b246cf.r2.cloudflarestorage.com
User-Agent: rclone/v1.68.1
Accept-Encoding: identity
Amz-Sdk-Invocation-Id: 20bbf29b-eece-4cc7-a1b5-d401370cc611
Amz-Sdk-Request: attempt=1; max=10
Authorization: XXXX
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20241022T100246Z

2024/10/22 11:02:46 DEBUG : HTTP RESPONSE (req 0xc000b1b540)
2024/10/22 11:02:46 DEBUG : HTTP/1.1 200 OK
Transfer-Encoding: chunked
Cf-Ray: 8d68a757edce6557-LHR
Connection: keep-alive
Content-Type: text/plain; charset=utf-8
Date: Tue, 22 Oct 2024 10:02:46 GMT
Etag: W/"1ca665693609f32ce05e6257c6a3b28d"
Last-Modified: Tue, 22 Oct 2024 09:59:55 GMT
Server: cloudflare
Vary: Accept-Encoding
X-Amz-Meta-Mtime: 1729590134.775546531

Additional Context

This problem does not occur with AWS S3. However it could be worked around with the SDK v1 and can no longer be worked around with the SDK v2 so I think it counts as a regression.

@ncw ncw added needs-triage This issue or PR still needs to be triaged. v1-v2-inconsistency v1-v2-inconsistency Behavior has changed from v1 to v2, or feature is missing altogether labels Oct 22, 2024
@ncw
Copy link
Author

ncw commented Oct 22, 2024

With the help of @darthShadow we came up with this workaround

diff --git a/backend/s3/s3.go b/backend/s3/s3.go
index eda36a259..ef7a4d740 100644
--- a/backend/s3/s3.go
+++ b/backend/s3/s3.go
@@ -5862,6 +5862,14 @@ func (o *Object) downloadFromURL(ctx context.Context, bucketPath string, options
        return resp.Body, err
 }
 
+// middleware to remove the gzip encoding
+func removeDisableGzip() func(*middleware.Stack) error {
+       return func(stack *middleware.Stack) error {
+               _, err := stack.Finalize.Remove("DisableAcceptEncodingGzip")
+               return err
+       }
+}
+
 // Open an object for read
 func (o *Object) Open(ctx context.Context, options ...fs.OpenOption) (in io.ReadCloser, err error) {
        bucket, bucketPath := o.split()
@@ -5898,7 +5906,7 @@ func (o *Object) Open(ctx context.Context, options ...fs.OpenOption) (in io.Read
        // Override the automatic decompression in the transport to
        // download compressed files as-is
        if o.fs.opt.UseAcceptEncodingGzip.Value {
-               APIOptions = append(APIOptions, smithyhttp.AddHeaderValue("Accept-Encoding", "gzip"))
+               APIOptions = append(APIOptions, removeDisableGzip(), smithyhttp.AddHeaderValue("Accept-Encoding", "gzip"))
        }
 
        for _, option := range options {

This assumes that the DisableGzip middleware won't change its ID. We can't use the ID directly from the source code as it is in an internal package github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding

@RanVaknin RanVaknin self-assigned this Oct 23, 2024
@RanVaknin
Copy link
Contributor

Hi @ncw,

Thanks for reaching out. I had to dig around to get some context about this change, but this seems like an intentional change that was introduced because the Go standard library's HTTP client (which the Go SDK uses) will auto decompress the gzip file, and will cause issues with checksum validations. I tried to corroborate this by writing my own test code, but was not seeing any checksum validation errors when gzip accept encoding was present on an object with checksum on it.

In some clients there were config options introduced to disable this middleware (or rather enable gzip accept-encoding):

	ddb := dynamodb.NewFromConfig(cfg, func(options *dynamodb.Options) {
		options.EnableAcceptEncodingGzip = true
	})

The way you are disabling this middleware on the S3 client looks right to me.

To sum up, this is an intentional change. Let us know if you have additional questions / concerns.

Thanks,
Ran~

@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Oct 24, 2024
@ncw
Copy link
Author

ncw commented Oct 24, 2024

Thanks for your help and investigations @RanVaknin

Given that we've found an acceptable workaround I think we can close this issue now, and if anyone in the future has the same problem then hopefully they will find this issue.

@ncw ncw closed this as completed Oct 24, 2024
Copy link

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

@fenos
Copy link

fenos commented Oct 29, 2024

@RanVaknin, I think we should reconsider having the client default to accept-encoding: identity. This creates issues for third-party S3-compatible services that might sit behind proxies that overwrite the encoding.

We should either not use it in the signature or be an opt-in option.

Removing the middleware DisableAcceptEncodingGzip makes the client work fine, but I can see where apps built with this SDK are not very portable to other S3-compatible services or make it harder.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p3 This is a minor priority issue response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. v1-v2-inconsistency v1-v2-inconsistency Behavior has changed from v1 to v2, or feature is missing altogether
Projects
None yet
Development

No branches or pull requests

3 participants