Doing a backup after a restore fails #1178

rzvoncek · 2024-01-18T15:46:38Z

What happened?
While testing the DSE support, I ran into an issue where backing up an already restored cluster fails. What I did:

* Used the most recent medusa (with the DSE snapshot recursion bug) (`DefaultMedusaVersion = "c8609c8-tmp"`).
* Configured a bigger minio volume (30G).
* Made a DSE cluster with `make single-up ; E2E_TEST="TestOperator/CreateSingleDseSearchDatacenterCluster" make e2e-test`, but killed it before it created any backups.
* Ran a backup with a 1-node cluster.
* Scaled the cluster to 3 nodes.
* Started up an Ubuntu pod, installed tlp-stress, and loaded some data.
* Ran a few more backups.
* Created an index and tested a query.
* Did one more backup.
* Did a restore.
* Confirmed the data is back.
* Rebuilt the search index and verified the search works again.
* Did another backup, which failed: one node completed, one failed midway, and one never started.

On the failing node, the medusa log contained this:

# a lot of
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: minio-service.minio.svc.cluster.local
# then
ERROR:root:Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
[2024-01-18 15:22:03,162] ERROR: Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
Traceback (most recent call last):
  File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
  File "/usr/lib/python3.10/http/client.py", line 995, in send
    self.sock.sendall(datablock)
ConnectionResetError: [Errno 104] Connection reset by peer

So it looks like the connection was closed by the peer, but it's unclear where in Medusa we would retry to handle this.
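To make the retry question concrete, here is a minimal sketch of what retrying a single upload call could look like. It is not Medusa's code: `retry_on_connection_error` and its parameters are hypothetical names introduced for illustration. It retries on Python's built-in `ConnectionError` family, which includes the `ConnectionResetError` at the bottom of the traceback. The error that Medusa actually logs is raised by botocore (judging by the message, probably `botocore.exceptions.ConnectionClosedError`, which is an assumption), so that type would have to be added to `retryable` for this to catch it.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_connection_error(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying with exponential backoff on connection errors.

    Hypothetical helper, not part of Medusa. `operation` stands in for one
    upload call (for example, uploading a single part of an SSTable).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            # Out of attempts: surface the original error to the caller.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller would wrap the upload in a zero-argument callable, for example `retry_on_connection_error(lambda: upload_part(...))`, where `upload_part` is again a placeholder. Whether a retry at this level is safe for multipart uploads depends on how the upload is split and resumed, which this sketch does not address.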

Doing another backup after this does not work. The operator reports a started backup job, but medusa-status does not recognise the backup.

Did you expect to see something different?

How to reproduce it (as minimally and precisely as possible):

Environment

K8ssandra Operator version:

Insert image tag or Git SHA here
* Kubernetes version information: `kubectl version`
* Kubernetes cluster kind:

insert how you created your cluster: kops, bootkube, etc.


* Manifests:

insert manifests relevant to the issue


* K8ssandra Operator Logs:

insert K8ssandra Operator logs relevant to the issue here


**Anything else we need to know?**:



┆Issue is synchronized with this [Jira Story](https://datastax.jira.com/browse/K8OP-50) by [Unito](https://www.unito.io)
┆Issue Number: K8OP-50


rzvoncek added the bug Something isn't working label Jan 18, 2024

adejanovski added this to K8ssandra Jan 18, 2024
