
Configure CloudWatch to count occurrences of http.client.RemoteDisconnected #35

Open
tschaffter opened this issue Aug 26, 2021 · 8 comments

@tschaffter
Member

Example of the complete message from the container that sends curl requests to the data node and tools:

STDERR: 2021-08-24T16:19:41.193648030Z 	http.client.RemoteDisconnected: Remote end closed connection without response
@tschaffter
Member Author

Cannot find the string Disconnected in the logs of the controller.

docker logs workflow_orchestrator_workflow-orchestrator_1 2>&1 | less

@tschaffter
Member Author

tschaffter commented Aug 26, 2021

The logs of the controller are currently being saved to this file: /var/lib/docker/containers/276d89e10d8743d52e1411fc5f486cfcb08d7d21536c6ae34bc31aff03b95039/276d89e10d8743d52e1411fc5f486cfcb08d7d21536c6ae34bc31aff03b95039-json.log

@tschaffter
Member Author

@thomasyu888 The default logging driver used by Docker is json-file, which does not rotate logs by default and may lead to disk exhaustion.

Current default logging driver on the controller instance:

# docker info --format '{{.LoggingDriver}}'
json-file

The controller currently uses json-file:

# docker inspect workflow_orchestrator_workflow-orchestrator_1 | grep -i -C 5 log
        },
        "Image": "sha256:7e9ad7d2bfda444e002dea485465d2c1efb0c62fc47878199d4c605c1079dd45",
        "ResolvConfPath": "/var/lib/docker/containers/276d89e10d8743d52e1411fc5f486cfcb08d7d21536c6ae34bc31aff03b95039/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/276d89e10d8743d52e1411fc5f486cfcb08d7d21536c6ae34bc31aff03b95039/hostname",
        "HostsPath": "/var/lib/docker/containers/276d89e10d8743d52e1411fc5f486cfcb08d7d21536c6ae34bc31aff03b95039/hosts",
        "LogPath": "/var/lib/docker/containers/276d89e10d8743d52e1411fc5f486cfcb08d7d21536c6ae34bc31aff03b95039/276d89e10d8743d52e1411fc5f486cfcb08d7d21536c6ae34bc31aff03b95039-json.log",
        "Name": "/workflow_orchestrator_workflow-orchestrator_1",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
--
            "Binds": [
                "workflow_orchestrator_shared:/shared:rw",
                "/var/run/docker.sock:/var/run/docker.sock:rw"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "workflow_orchestrator_default",
            "PortBindings": {},

@thomasyu888 Docker recommends the local logging driver because it performs log rotation by default.
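
To check the logging setup across containers, here is a minimal Python sketch that reads the JSON printed by `docker inspect` and reports, per container, the logging driver and whether log size looks bounded. The function name `log_rotation_status` is hypothetical. It assumes that `local` rotates by default and that `json-file` is only bounded when the `max-size` option is set; other drivers are reported as unknown (`None`).

```python
import json


def log_rotation_status(inspect_json):
    """Map container name -> (logging driver, size_bounded) from `docker inspect` JSON.

    size_bounded is True, False, or None when the driver is not covered here.
    """
    status = {}
    for container in json.loads(inspect_json):
        log_config = container.get("HostConfig", {}).get("LogConfig", {})
        driver = log_config.get("Type", "")
        options = log_config.get("Config") or {}
        if driver == "local":
            bounded = True  # the local driver rotates by default
        elif driver == "json-file":
            bounded = "max-size" in options  # unbounded unless max-size is set
        else:
            bounded = None  # other drivers are not assessed in this sketch
        status[container.get("Name", "")] = (driver, bounded)
    return status
```

It could be fed the output of `docker inspect workflow_orchestrator_workflow-orchestrator_1`, for example by capturing that output with `subprocess.run(..., capture_output=True, text=True)` and passing `stdout` to the function.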

@tschaffter
Member Author

  • The controller container is constantly writing the string Top level loop: checking progress or starting new job to the log file. Use this as an example to count occurrences in log files and report the count to CloudWatch.
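
A rough Python sketch of that idea, assuming the controller keeps using the json-file driver, where each line of the log file is a JSON object whose `log` field holds the message. The function names, the namespace, and the metric name are hypothetical. `publish` relies on the third-party `boto3` package and on AWS credentials being available on the instance, so it is only imported when called.

```python
import json


def count_occurrences(log_path, needle):
    """Count entries in a Docker json-file log whose message contains `needle`."""
    count = 0
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partially written or corrupt lines
            if isinstance(entry, dict) and needle in entry.get("log", ""):
                count += 1
    return count


def build_metric_datum(metric_name, value):
    """Build one CloudWatch metric datum carrying the occurrence count."""
    return {"MetricName": metric_name, "Value": float(value), "Unit": "Count"}


def publish(namespace, datum):
    """Send the datum to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so counting works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[datum])
```

For example, `publish("WorkflowOrchestrator", build_metric_datum("TopLevelLoop", count_occurrences(path, "Top level loop")))` would report the count for the string above; swapping the needle for `RemoteDisconnected` gives the metric this issue is about. Note that this counts over the whole file each time, so a periodic job would need to track its position or report deltas.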

@thomasyu888
Member

thomasyu888 commented Aug 26, 2021

@tschaffter You can also see these logs in the ELK stack. I would query across all the containers to check; I don't think the error would be part of the workflow_orchestrator logs. I suspect the RemoteDisconnected error is caused by this image:

nlpsandbox/cli:v....

Here is the query in ELK stack

@tschaffter
Member Author

@thomasyu888 The remote disconnect error appears for two different types of containers: some containers have a fully random name (red in the screenshot below), and other container names include the submission ID and curl (in blue). What are the differences between these two types of container/docker.name?

image

@tschaffter
Member Author

tschaffter commented Aug 28, 2021

@thomasyu888 The logs below are all the logs available for the container /9712969_curl_1. What does this container do, and are there more containers started with the same name? For example, do you run this container once, making it responsible for all the curl queries related to a submission? Or do you start this container for individual curl queries?

It would be informative to include in the log for this container (or containers) the exact curl request performed, so that we can see which curl request generated the remote disconnect error. More specifically, I don't know at this point whether the remote disconnect error affects the data node, the tools, or both.
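
As an illustration of what such logging could look like on the client side, here is a minimal Python sketch. It is not the code used by nlpsandbox/cli; the function name `fetch_with_logging` and the logger name are hypothetical, and it assumes plain HTTP. It logs the method and URL before sending, and logs them again if the server closes the connection without a response.

```python
import http.client
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("request_audit")  # hypothetical logger name


def fetch_with_logging(url, method="GET", timeout=10.0,
                       connection_factory=http.client.HTTPConnection):
    """Send one request, logging the exact target before and after."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    logger.info("request: %s %s", method, url)
    conn = connection_factory(parts.netloc, timeout=timeout)
    try:
        conn.request(method, path)
        response = conn.getresponse()
        body = response.read()
        logger.info("response: %s %s -> %d", method, url, response.status)
        return response.status, body
    except http.client.RemoteDisconnected:
        # Record which request the remote end dropped, then re-raise.
        logger.error("RemoteDisconnected: %s %s", method, url)
        raise
    finally:
        conn.close()
```

With the URL in the error line, the host part would show directly whether the disconnect came from the data node or from a tool.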

image

@thomasyu888
Member

thomasyu888 commented Sep 7, 2021

@tschaffter

What are the differences between these two types of container/docker.name?

  • The docker.name of the form {submission_id}_curl_1 is interacting with the tool during the validate_tool.py step. Surprisingly, this is the step that gets the tool (Tool object).
  • The docker.name that is auto-generated by Docker (e.g. cool_bartik) appears when attempting to get the gold standard or the notes from the data node.
  • If you see a docker.name of the form {submissionid}_curl_{value between 10 and 1000} and the docker.image is curlimages/curl:7.73.0, these are curl commands run per note against the tool.

What does this container do, and are there more containers started with the same name? For example, do you run this container once, making it responsible for all the curl queries related to a submission? Or do you start this container for individual curl queries?

The container runs once and is only responsible for getting the tool. Each container is removed immediately afterwards.

  • {submission_id}_curl_1: Runs nlpsandbox/cli:tag tool get-tool --annotator_host tool.url
  • {submission_id}_curl_2: Runs nlpsandbox/cli:tag tool check-url --annotator_host tool.url/api/v1/ui (Makes sure /ui is implemented)
  • {submission_id}_curl_3: Runs nlpsandbox/cli:tag tool annotate-note .... Annotates an example note.
  • {submission_id}_curl_4: Runs nlpsandbox/cli:tag tool annotate-note .... Annotates the same note to make sure results are the same.
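
The naming rules above can be sketched as a small Python classifier, which may help when grouping RemoteDisconnected hits in ELK by target. The function name `classify_container` and the returned labels are hypothetical. The sketch assumes submission IDs are numeric (as in 9712969) and that Docker's auto-generated names are two lowercase words joined by an underscore (as in cool_bartik). Indices 5 to 9 are not described in this thread, so they are reported as unknown.

```python
import re

# Steps run by {submission_id}_curl_1 through _4, as listed above.
VALIDATION_STEPS = {
    1: "tool get-tool",
    2: "tool check-url",
    3: "tool annotate-note (example note)",
    4: "tool annotate-note (repeat, consistency check)",
}

_CURL_NAME = re.compile(r"^(?P<submission_id>\d+)_curl_(?P<index>\d+)$")
_AUTO_NAME = re.compile(r"^[a-z]+_[a-z]+$")  # assumed shape of Docker's random names


def classify_container(docker_name):
    """Return a label describing what a container with this docker.name talks to."""
    name = docker_name.lstrip("/")  # docker.name may carry a leading slash
    match = _CURL_NAME.match(name)
    if match:
        index = int(match.group("index"))
        if index in VALIDATION_STEPS:
            return "validate_tool: " + VALIDATION_STEPS[index]
        if 10 <= index <= 1000:
            return "per-note curl against the tool"
        return "unknown"
    if _AUTO_NAME.match(name):
        return "data node (gold standard or notes)"
    return "unknown"
```

Per the list above, a per-note container should also have docker.image curlimages/curl:7.73.0, which this name-only sketch does not check.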

It would be informative to include in the log for this container (or containers) the exact curl request performed, so that we can see which curl request generated the remote disconnect error. More specifically, I don't know at this point whether the remote disconnect error affects the data node, the tools, or both.

Unfortunately, ELK doesn't show the commands...
