
Doing a backup after a restore fails #1178

Open
rzvoncek opened this issue Jan 18, 2024 · 0 comments
Labels
bug Something isn't working

Comments


rzvoncek commented Jan 18, 2024

What happened?
While testing the DSE support, I ran into an issue where a backup of an already restored cluster does not complete. I did:

  • Used the most recent Medusa (with the DSE snapshot recursion bug) (DefaultMedusaVersion = "c8609c8-tmp").
  • Configured a bigger MinIO volume (30G).
  • Made a DSE cluster with `make single-up; E2E_TEST="TestOperator/CreateSingleDseSearchDatacenterCluster" make e2e-test`, but killed it before it created any backups.
  • Ran a backup on the 1-node cluster (see the manifest sketch after this list).
  • Scaled the cluster to 3 nodes.
  • Started up an Ubuntu pod, installed tlp-stress, and loaded some data.
  • Ran a few more backups.
  • Created an index and tested a query.
  • Did one more backup.
  • Did a restore.
  • Confirmed the data was back.
  • Rebuilt the search index and verified the search works again.
  • Did another backup, which failed: 1 node completed, 1 failed mid-way, 1 never started.
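For context, the backups and the restore above are driven through the operator's Medusa custom resources. A minimal sketch of the kind of manifests involved (assuming the usual MedusaBackupJob/MedusaRestoreJob resources; names and namespace are illustrative, not taken from the actual test run):

```yaml
# Illustrative only: resource names and namespace are hypothetical.
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupJob
metadata:
  name: backup-after-restore
  namespace: k8ssandra-operator
spec:
  cassandraDatacenter: dc1
---
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaRestoreJob
metadata:
  name: restore-1
  namespace: k8ssandra-operator
spec:
  cassandraDatacenter: dc1
  backup: backup-1   # name of an existing MedusaBackup to restore from
```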

On the failing node, there was this in the medusa log:

# a lot of
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: minio-service.minio.svc.cluster.local
# then
ERROR:root:Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
[2024-01-18 15:22:03,162] ERROR: Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
Traceback (most recent call last):
  File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
  File "/usr/lib/python3.10/http/client.py", line 995, in send
    self.sock.sendall(datablock)
ConnectionResetError: [Errno 104] Connection reset by peer

So it seems like the connection was closed by the storage backend mid-upload, but it's unclear where in Medusa we retry to handle this.
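For reference, a minimal sketch (not Medusa's actual code) of where such retries and connection-pool sizing would live at the botocore level, using the MinIO endpoint from the log above:

```python
# Minimal sketch, not Medusa's actual code: configure botocore-level retries
# and a larger connection pool for the S3 client doing the multipart uploads.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.minio.svc.cluster.local:9000",
    config=Config(
        # Retry transient failures such as "Connection reset by peer".
        retries={"max_attempts": 10, "mode": "standard"},
        # Avoid "Connection pool is full, discarding connection" warnings
        # when many parts are uploaded concurrently.
        max_pool_connections=50,
    ),
)
```

Whether Medusa exposes (or should expose) these knobs for its S3-compatible storage driver is part of the open question here.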

Doing another backup after this does not work. The operator reports a started backup job, but medusa-status does not recognise the backup.
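A hypothetical way to cross-check what actually reached the bucket (not something from the original report): list the objects under the failing node's prefix, using the bucket name and prefix visible in the error URL above. Credentials are assumed to be available in the environment.

```python
# Hypothetical debugging snippet: list what landed in the MinIO bucket under
# the failing node's prefix (bucket and prefix taken from the error URL above).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.minio.svc.cluster.local:9000",
)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(
    Bucket="k8ssandra-medusa", Prefix="test/test-dc1-default-sts-2/"
):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```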

Did you expect to see something different?

How to reproduce it (as minimally and precisely as possible):

Environment

  • K8ssandra Operator version:

    Insert image tag or Git SHA here

  • Kubernetes version information: `kubectl version`

  • Kubernetes cluster kind:

    insert how you created your cluster: kops, bootkube, etc.

  • Manifests:

    insert manifests relevant to the issue

  • K8ssandra Operator Logs:

    insert K8ssandra Operator logs relevant to the issue here


Anything else we need to know?:



┆Issue is synchronized with this [Jira Story](https://datastax.jira.com/browse/K8OP-50) by [Unito](https://www.unito.io)
┆Issue Number: K8OP-50
rzvoncek added the bug label Jan 18, 2024