No caching when using Docker image. #64

Open
kristoferlundgren opened this issue Nov 2, 2022 · 11 comments

@kristoferlundgren

Describe the bug
Using the latest Docker image, no data is being cached.

To Reproduce
Steps to reproduce the behavior:

  1. Start the container
    docker run --rm -ti -p 80:80 -e S3_SERVER=storage.googleapis.com -e S3_ACCESS_KEY_ID="<key>" -e S3_SECRET_KEY="<secret>" --env-file s3.env nginxinc/nginx-s3-gateway:latest-20221026

s3.env file:

S3_BUCKET_NAME=<bucket-name>
S3_SERVER_PORT=443
S3_SERVER_PROTO=https
S3_REGION=us-east-1
S3_STYLE=virtual
S3_DEBUG=false
AWS_SIGS_VERSION=4
ALLOW_DIRECTORY_LIST=true
PROVIDE_INDEX_PAGE=false
APPEND_SLASH_FOR_POSSIBLE_DIRECTORY=false
PROXY_CACHE_VALID_OK=1h
PROXY_CACHE_VALID_NOTFOUND=1m
PROXY_CACHE_VALID_FORBIDDEN=30s
  2. Pull multiple files from the S3 gateway at http://localhost

I can successfully browse the S3 bucket directory structure and download objects without any issue.
However, when downloading the same object multiple times, I cannot see any performance increase from a cache hit (see the timing sketch after these steps).

  3. Exec into the container
    docker exec -ti <container> bash
    Run the following command:
    ls -la /var/cache/nginx/s3_proxy/

The cache directory is empty.
I also looked for any disk usage increase with du -sh /*, but no cached data is being stored in the container.
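
A quick way to check for a cache hit is to time repeated downloads of the same object, for example (a rough sketch; /some-object.txt is a placeholder for any object in the bucket):

    for i in 1 2 3; do
      # cached responses should come back noticeably faster than the first request
      curl -s -o /dev/null -w "attempt $i: %{time_total}s\n" "http://localhost/some-object.txt"
    done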

Expected behavior
According to the documentation, data should be cached when accessed multiple times and not reloaded from the remote S3 bucket at each access.

Your environment

  • Version of the container used (if downloaded from Docker Hub or Github)
    Docker image: nginxinc/nginx-s3-gateway:latest-20221026
  • S3 backend implementation you are using (AWS, Ceph, NetApp StorageGrid, etc)
    Google Cloud Storage
  • How you are deploying Docker/Stand-alone, etc
    Docker on macOS using Rancher Desktop.
  • Authentication method (IAM, IAM with Fargate, IAM with K8S, AWS Credentials, etc)
    S3 authentication using Google Service Account with HMAC keys.
@dekobon
Collaborator

dekobon commented Nov 3, 2022

Thank you for writing up this issue in such detail.

So far, I've been unable to reproduce this bug using AWS. In my configuration, I put a text file in my S3 bucket and ran curl against it in a loop.

I saw that the cache files were correctly populated in the /var/cache/nginx/s3_proxy directory. I also monitored the instance with netstat and only saw outbound connections every minute or so.
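
Roughly, the test amounted to something like this (a sketch; test.txt is a placeholder object name):

    # fetch the same object once per second and record how long each request takes
    while true; do
      curl -s -o /dev/null -w "%{time_total}s\n" "http://localhost/test.txt"
      sleep 1
    done

    # in a second shell: look for outbound connections to the S3 endpoint
    netstat -tn | grep ':443'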

On my container, the contents of the cache directory look like:

root@88822b1c11cd:/var/cache/nginx/s3_proxy# find /var/cache/nginx/s3_proxy/
/var/cache/nginx/s3_proxy/
/var/cache/nginx/s3_proxy/1
/var/cache/nginx/s3_proxy/1/93
/var/cache/nginx/s3_proxy/1/93/b620bfa0e09b3cc11521660acb6e2931

I'll go and try to see if I can reproduce the issue on Google Cloud Storage.

@dekobon
Collaborator

dekobon commented Nov 3, 2022

I just ran the same configuration against Google Cloud Storage and I was able to reproduce the behavior.

@dekobon
Collaborator

dekobon commented Nov 3, 2022

I found the source of the issue. Google Cloud Storage diverges from the AWS S3 behavior by setting Cache-Control: private, max-age=0 by default for all objects. You need to edit the metadata for your object on Google Cloud Storage and change the value of Cache-Control to public in order to enable caching with the gateway. See the Cloud Storage Documentation for more information.
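
For a single object, the metadata can be updated with gsutil, for example (a sketch; bucket and object names are placeholders):

    # allow shared caches to store the object for up to an hour
    gsutil setmeta -h "Cache-Control:public, max-age=3600" gs://<bucket-name>/<object>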

It may also be possible to configure NGINX to ignore the Cache-Control header sent by Google Cloud Storage by using the proxy_ignore_headers directive.
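
Something along these lines, added to the gateway's NGINX configuration (untested so far, so treat it as a sketch):

    # cache responses even though GCS sends "Cache-Control: private, max-age=0"
    proxy_ignore_headers Cache-Control;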

@kristoferlundgren
Author

kristoferlundgren commented Nov 4, 2022

Many thanks for tracking down the root cause of this issue.

As you (@dekobon) suggested, I added proxy_ignore_headers Cache-Control; to the http {} block of /etc/nginx/nginx.conf and ran nginx -s reload inside the container. And voilà, it works!
Files are now cached, as expected.
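
For reference, the manual change boils down to something like this (a sketch; <container> is a placeholder and the sed edit assumes the stock nginx.conf layout):

    # inject the directive into the http {} block of the running container, then reload
    docker exec <container> sed -i 's|http {|http {\n    proxy_ignore_headers Cache-Control;|' /etc/nginx/nginx.conf
    docker exec <container> nginx -s reload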

I now have some choices.

  1. Mount my own /etc/nginx/nginx.conf into the container.
  2. Build and run a modified container image with this tiny modification.
  3. Ask this project to add proxy_ignore_headers Cache-Control; as part of the config, preferably configurable with an environment variable.

I would like to start by asking for option 3. What are your thoughts?

Again, thanks!

@dekobon
Collaborator

dekobon commented Nov 4, 2022

I think asking for number three is reasonable. We may need a generalized way to accomplish this because we also need to solve for #65 .

@dekobon
Collaborator

dekobon commented Nov 4, 2022

I've made some updates to the container so that you can now layer in additional NGINX configuration. See the documentation.

Also, I added a feature that allows you to strip out headers from the client response. For Google Cloud Storage you will want to do:

HEADER_PREFIXES_TO_STRIP=x-goog-;x-guploader-uploadid
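
For example, passed on the command line alongside the other settings (a sketch based on the run command from the original report):

    docker run --rm -ti -p 80:80 \
      -e S3_SERVER=storage.googleapis.com \
      -e S3_ACCESS_KEY_ID="<key>" -e S3_SECRET_KEY="<secret>" \
      -e HEADER_PREFIXES_TO_STRIP="x-goog-;x-guploader-uploadid" \
      --env-file s3.env \
      nginxinc/nginx-s3-gateway:latest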

Please let me know if this solution works for you. If it does, I'll mark this issue as closed.

@kristoferlundgren
Author

  1. Trying the new feature by adding the Cache-Control header:
    HEADER_PREFIXES_TO_STRIP="x-goog-;x-guploader-uploadid;Cache-Control"
    Resulted in the error:
    HEADER_PREFIXES_TO_STRIP must not contain uppercase characters (as documented)

  2. Second try (lowercase Cache-Control):
    HEADER_PREFIXES_TO_STRIP="x-goog-;x-guploader-uploadid;cache-control"
    Downloaded some files and then checked the cache directory. Empty, i.e. the cache is still disabled.

  3. Third try: (stripping x-goog headers and mounting nginx http config file)
    docker run --rm -ti -p 80:80 -e S3_SERVER=storage.googleapis.com -e S3_ACCESS_KEY_ID="<key>" -e S3_SECRET_KEY="<secret>" -e HEADER_PREFIXES_TO_STRIP="x-goog-;x-guploader-uploadid" --env-file s3.env -v $(pwd)/cache.conf:/etc/nginx/conf.d/cache.conf nginxinc/nginx-s3-gateway:latest
    Where the $(pwd)/cache.conf file contains:
    proxy_ignore_headers Cache-Control;
    Downloaded some files and then checked the cache directory. The cache directory has content, i.e. the cache is working! :)

I would have preferred an environment variable solution, but this config works as well.
Many thanks for the assessment and quick remediation of this issue, and also for reporting and fixing #65.

Before closing this issue, I believe the need for proxy_ignore_headers Cache-Control; ought to be documented, to help users whose S3 backends (e.g. Google Cloud Storage) emit caching preferences.

@dekobon
Collaborator

dekobon commented Nov 6, 2022

I agree it should be documented. We may also want to add an environment variable that allows ignoring Cache-Control, but I wanted to get the extensibility part done ASAP because we've gotten a lot of requests for similar things and the number of environment variables is starting to add up.

I'll leave this issue open until we can add a setting.

@akashgreninja
Collaborator

I made the mistake of exec'ing into the wrong running container (one with the same name), so I didn't find any cache.
Check whether this might also be the reason in your case.

@felipou

felipou commented Nov 28, 2024

I've just experienced this issue, and in addition to ignoring the Cache-Control header, I also had to ignore the Expires header for it to work:

proxy_ignore_headers Cache-Control;
proxy_ignore_headers Expires;
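
The two lines could presumably also be collapsed into one, since the directive accepts multiple field names:

    proxy_ignore_headers Cache-Control Expires;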

@kristoferlundgren
Author

@dekobon @4141done You two seem to be the current maintainers. I really appreciate your effort to keep the project alive!

From reading various discussions about caching in this GitHub project, there seems to be a general request for more control over the ingress and egress cache configuration.
Mounting my own cache.conf to replace the default still feels like a hack. Is there a more intuitive way to manage the cache configuration, or could one be developed with reasonable effort?
