Improve rebuild logic #5

Open · LeeSmet opened this issue Dec 18, 2020 · 20 comments · Fixed by #142
Labels: type_feature (New feature or request)

Comments

LeeSmet (Contributor) commented Dec 18, 2020

The current rebuild logic is fairly simple: retrieve the data, re-encode it, and send it to the new backends. However, we can check whether any of the new backends is also used in the old metadata. If it is, we can assign it the same shard, eliminating the write to that backend and saving some space.
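As a rough sketch of that idea (hypothetical types, not the actual zstor code), the rebuild would only need to re-write the shards whose old backend is no longer part of the new backend set:

use std::collections::HashSet;

// A shard as recorded in the old metadata (illustrative type).
struct ShardInfo {
    backend: String, // address of the 0-db that currently holds this shard
    index: usize,    // position of the shard in the erasure-coded set
}

// Returns the indices of shards that still have to be written: those whose
// old backend does not appear in the new backend set. Shards on backends
// that are kept can simply be reused.
fn shards_to_rewrite(old: &[ShardInfo], new_backends: &HashSet<String>) -> Vec<usize> {
    old.iter()
        .filter(|s| !new_backends.contains(&s.backend))
        .map(|s| s.index)
        .collect()
}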

Need to check whether the encoding is deterministic for this, especially for parity shards.

@LeeSmet LeeSmet added the type_feature New feature or request label Dec 18, 2020
@LeeSmet LeeSmet self-assigned this Dec 18, 2020
@sasha-astiadi sasha-astiadi added this to the Next milestone Dec 29, 2020
@sasha-astiadi sasha-astiadi modified the milestones: Next, Later Jun 14, 2021
@LeeSmet LeeSmet removed this from the Later milestone Nov 3, 2021
@LeeSmet LeeSmet removed their assignment Nov 3, 2021
@despiegk despiegk added this to the Later milestone Nov 3, 2021
@xmonader xmonader removed this from the Later milestone Jul 4, 2022
scottyeager commented

IMO this is fairly essential. Under the current scheme, each backend's data usage is multiplied by the number of rebuild operations that have been carried out, plus one for the initial write. So, starting from an initial backend configuration with some data stored, replacing a single backend and rebuilding doubles the data usage on all backends that were not replaced, since a duplicate of all the data is written to them again.
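As a rough illustration (numbers invented for this example): if each backend holds S bytes of shards after the initial write, then after r rebuilds it holds roughly (r + 1) × S, so a backend at 500 MiB grows to about 1 GiB after one rebuild and about 1.5 GiB after two.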

@iwanbk iwanbk self-assigned this Dec 9, 2024
iwanbk (Member) commented Dec 9, 2024

Need to check whether the encoding is deterministic for this, especially for parity shards.

I did a quick check for this by replacing two out of six backends and logging which shards are the same and which are different,

with this config:

minimal_shards = 4
expected_shards = 6

and got this:

2024-12-09 16:30:40 +07:00: INFO Shard 0 is DIFFERENT
2024-12-09 16:30:40 +07:00: INFO Shard 1 is DIFFERENT
2024-12-09 16:30:40 +07:00: INFO Shard 2 is DIFFERENT
2024-12-09 16:30:40 +07:00: INFO Shard 3 is DIFFERENT
...

.....
.....
2024-12-09 16:30:54 +07:00: INFO Shard 0 is the SAME
2024-12-09 16:30:54 +07:00: INFO Shard 1 is DIFFERENT
2024-12-09 16:30:54 +07:00: INFO Shard 2 is DIFFERENT
2024-12-09 16:30:54 +07:00: INFO Shard 3 is DIFFERENT
.....
2024-12-09 16:30:54 +07:00: INFO Rebuild file from 127.0.0.1:9903,127.0.0.1:9907,127.0.0.1:9906,127.0.0.1:9902,127.0.0.1:9908,127.0.0.1:9909 to 127.0.0.1:9902,127.0.0.1:9904,127.0.0.1:9908,127.0.0.1:9903,127.0.0.1:9905,127.0.0.1:9909
...
....
2024-12-09 16:31:17 +07:00: INFO Shard 0 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 1 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 2 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 3 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 4 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 5 is the SAME
...
2024-12-09 16:31:17 +07:00: INFO Rebuild file from 127.0.0.1:9908,127.0.0.1:9904,127.0.0.1:9903,127.0.0.1:9905,127.0.0.1:9902,127.0.0.1:9909 to 127.0.0.1:9905,127.0.0.1:9908,127.0.0.1:9909,127.0.0.1:9902,127.0.0.1:9904,127.0.0.1:9903

Not sure about the last one though: why are there 6 shards there?

What we can see from this:

  1. sometimes some shards are the same, but sometimes all of them are different
  2. even when a shard is the same, the zdb it was assigned to was different

For (1), we have to dig deeper.
For (2), I think we can improve it.

iwanbk (Member) commented Dec 10, 2024

Please disregard my previous comment.
I found that my test code was wrong.
The encoder always gives consistent results.

So, what we need to do is make sure each shard is put back on the same zdb.
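(As a side note on the determinism check: a minimal way to verify it, sketched here with a hypothetical encode closure rather than the actual test code, is to encode the same block twice and compare all shards, parity included.)

// Hypothetical determinism check: encode the same data twice with the same
// parameters and verify that all shards, including parity, are identical.
fn encoding_is_deterministic<E>(encode: E, data: &[u8]) -> bool
where
    E: Fn(&[u8]) -> Vec<Vec<u8>>, // returns data + parity shards
{
    encode(data) == encode(data)
}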

iwanbk (Member) commented Dec 11, 2024

PR #142 is working code for this: it doesn't re-write shards that are not broken/missing.
The PR still needs to be polished.

scottyeager commented

PR #142 is working code for this: it doesn't re-write shards that are not broken/missing.
The PR still needs to be polished.

Great, I'll take it for a test drive asap.

scottyeager commented Dec 14, 2024

I did a build of the dedup branch at commit 3accd7 and tested it.

Here's the status after the initial data is written:

+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                         | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-meta1 | Yes       |      32 |      15960 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-meta2 | Yes       |      32 |      15960 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1eb4:b62c:8ba6:7310:3f1c:54b3:846b]:9900 - 504-59256-meta3 | Yes       |      32 |      15960 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-meta5 | Yes       |      32 |      15960 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                         | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-data1 | Yes       |     271 |  502878178 | 1073741824 |               46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-data2 | Yes       |     271 |  502878178 | 1073741824 |               46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1eb4:b62c:8ba6:7310:3f1c:54b3:846b]:9900 - 504-59256-data3 | Yes       |     271 |  502878178 | 1073741824 |               46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-data5 | Yes       |     271 |  502878178 | 1073741824 |               46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+

After replacing one backend and letting the rebuild finish:

+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                         | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-meta1 | Yes       |      36 |      17520 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-meta2 | Yes       |      36 |      17520 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-meta5 | Yes       |      36 |      17520 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9fab:2cda:749d:aa8e:43c:aeee:dfef]:9900 - 504-59261-meta7  | Yes       |      36 |      17520 | 1073741824 |                0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                         | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-data1 | Yes       |     286 |  524878754 | 1073741824 |               48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-data2 | Yes       |     286 |  524878754 | 1073741824 |               48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-data5 | Yes       |     286 |  524878754 | 1073741824 |               48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9fab:2cda:749d:aa8e:43c:aeee:dfef]:9900 - 504-59261-data7  | Yes       |     271 |  524733279 | 1073741824 |               48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+

So this looks way better. There are still a few extra objects stored in the original backends though. I'll attach a log file in case it helps deduce why.

zstor-dedup.log

iwanbk (Member) commented Dec 16, 2024

There are still a few extra objects stored in the original backends though.

Interesting. I've checked the logs and found nothing suspicious.

I think I'll re-test and add more logging.

iwanbk (Member) commented Dec 16, 2024

I've checked the logs (I even created a Go script for this) and there is nothing wrong in the rebuild file section.

This is the section where we rebuild the shards.

It produces this kind of log (split into several lines here for readability):

1214:2024-12-14 00:40:20 +00:00: INFO Rebuild file from 
[302:1eb4:b62c:8ba6:7310:3f1c:54b3:846b]:9900,[300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900,[300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900,[300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 
to 
[300:9fab:2cda:749d:aa8e:43c:aeee:dfef]:9900,[300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900,[300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900,[300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900

See the order of the backends there.
All are the same, except the first (replaced) backend [302:1eb4:b62c:8ba6:7310:3f1c:54b3:846b]:9900.

iwanbk (Member) commented Dec 16, 2024

One thing I notice that is different from my test: you use the same backends for both meta and data, while I used a different one for the backend that I replaced.

But I don't see how that relates to this issue.

iwanbk (Member) commented Dec 17, 2024

Unfortunately I still can't reproduce it.

@scottyeager
One thing I notice here is that the meta objects in your test increased from 32 to 36. Did you add another store operation before the rebuild kicked in?

scottyeager commented

I'm testing in the context of qsfs, so the store operations are being triggered automatically. It's possible that the rotate timer in zdb expired during my test and another store was triggered. I'll try again and make sure to rule that out.

Just looking at the status output, though, I didn't suspect an additional store as the root of the discrepancy. The new backend has 271 objects, just like all of the original backends did before the rebuild operation. If more data had been stored, I'd expect all backends to have a larger number of objects.

scottyeager commented

I did another test, making absolutely sure that no extra data was added after changing the backends config. This time the result is a bit different.

Before rebuild:

+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                            | reachable | objects | used space | free space | usage percentage |
+====================================================================+===========+=========+============+============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-meta10 | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-meta11     | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 - 5545-728557-meta13 | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-meta8   | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| backend                                                            | reachable | objects | used space | free space  | usage percentage |
+====================================================================+===========+=========+============+=============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-data10 | Yes       |      31 |   52466633 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-data11     | Yes       |      31 |   52466633 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 - 5545-728557-data13 | Yes       |      31 |   52466633 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-data8   | Yes       |      31 |   52466633 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+

And after rebuild:

+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                            | reachable | objects | used space | free space | usage percentage |
+====================================================================+===========+=========+============+============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-meta10 | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-meta11     | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900 - 5545-728562-meta24 | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-meta8   | Yes       |       7 |       2921 | 1073741824 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| backend                                                            | reachable | objects | used space | free space  | usage percentage |
+====================================================================+===========+=========+============+=============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-data10 | Yes       |      31 |   52466633 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-data11     | Yes       |      31 |   52466633 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900 - 5545-728562-data24 | Yes       |      31 |   52466633 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-data8   | Yes       |      47 |   85973382 | 10737418240 |                0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+

So this time the metadata objects are constant, but there are still extra objects stored in one of the data backends that wasn't replaced. Logs for this run are below.

zstor-dedup2.log

@scottyeager scottyeager reopened this Dec 18, 2024
iwanbk (Member) commented Dec 18, 2024

OK, I found the issue.

The new logic is working as expected, but there is something else: a backend that was supposed to be healthy failed during the check, but was up again while the data was being restored.

The dead backend is

302:1d81:cef8:3049:b337:7aa0:19c7:e8ea

So we expect all errors to come only from this backend, not from the others.

This is the normal log, when everything goes as expected
(logs reformatted for better clarity):

2024-12-18 02:23:48 +00:00: WARN could not download shard 3: error during storage: ZDB at [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 5545-728557-data13, error operation READ caused by Namespace: not found
......
2024-12-18 02:23:49 +00:00: INFO Rebuild file from 
[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 to
[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900
  1. the download failed only from the ....:e8ea IP
  2. in the rebuild, ...:e8ea was replaced by ...:a4c1

These are the troubling logs:

2024-12-18 02:26:17 +00:00: WARN could not download shard 0: error during storage: ZDB at [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 5545-728558-data8, error operation READ caused by timeout
2024-12-18 02:26:18 +00:00: WARN could not download shard 3: error during storage: ZDB at [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 5545-728557-data13, error operation READ caused by Namespace: not found
......
2024-12-18 02:28:03 +00:00: INFO Rebuild file from 
[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 to 
[300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900
  1. the download failed from ...:777, which should be healthy
  2. in the rebuild, ...:777 was also replaced, but it is also storing another shard

There are some things I can think of right now:

  1. retry the download
     • with the risk of prolonging the rebuild
  2. only retry the download if the backend is listed in the config
     • in this case, ...:777 is in the config, so it is supposed to be healthy
     • this also has the risk of prolonging the rebuild
     • I want to implement this after implementing the unhealthy-backend cache check from Improvements to data rebuild #136
  3. the actual rebuild knows that ...:777 is alive, so it should retry storing there

iwanbk (Member) commented Dec 18, 2024

the actual rebuild knows that ...:777 is alive, so it should retry storing there

I'm not sure how 0-db in sequential mode behaves in this situation, i.e. whether the number of objects will increase or not.
I failed to verify it experimentally using redis-cli.

scottyeager commented

Indeed, this second test was done with some connectivity trouble to the backends, which I didn't notice in the first test.

iwanbk (Member) commented Dec 18, 2024

only retry the download if the backend is listed in the config

I'm thinking of more heuristics for this:

  1. retry only if the backend is listed in the config
  2. retry only if the error is considered temporary, for example a timeout; "Namespace not found" is definitely not temporary

Let me know if you have more ideas.
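A rough sketch of those two checks combined (illustrative error and config types, not the real zstor API):

// Illustrative error kinds; in practice these would come from the real
// storage error type.
enum DownloadError {
    Timeout,
    NamespaceNotFound,
}

// Retry only when the backend is still listed in the config (so it is
// expected to be healthy) and the error is plausibly temporary, such as a
// timeout; "Namespace not found" is permanent and is never retried.
fn should_retry_download(err: &DownloadError, backend: &str, configured: &[String]) -> bool {
    let in_config = configured.iter().any(|b| b.as_str() == backend);
    let temporary = matches!(err, DownloadError::Timeout);
    in_config && temporary
}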

LeeSmet (Contributor, Author) commented Dec 18, 2024

I agree that there needs to be something smarter for deciding whether a backend is dead, instead of the binary dead/alive we have now. I'd suggest treating connection failures as something like "if we can't reach the backend for X amount of time, consider it to be down" rather than immediately assuming it is down (and of course, as you said, if the namespace is not found it is immediately down). Whether or not the backend is in the config does not matter for this heuristic, since we are considering read operations here. The config is only relevant for store operations, so the system knows which dbs it can use. It is perfectly fine to remove a full namespace/db from the config but keep the data there for years.
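A minimal sketch of that grace-period idea (hypothetical types; the 15-minute window is just a placeholder for the X above):

use std::time::{Duration, Instant};

// Tracks how long a backend has been unreachable.
struct BackendHealth {
    unreachable_since: Option<Instant>,
}

impl BackendHealth {
    // Placeholder grace period; the real value would be configurable.
    const GRACE: Duration = Duration::from_secs(15 * 60);

    // Record a failed connection attempt; returns true once the backend has
    // been unreachable for longer than the grace period and should be
    // considered down.
    fn record_failure(&mut self) -> bool {
        let since = *self.unreachable_since.get_or_insert_with(Instant::now);
        since.elapsed() >= Self::GRACE
    }

    // Any successful connection clears the failure window.
    fn record_success(&mut self) {
        self.unreachable_since = None;
    }
}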

scottyeager commented

the actual rebuild knows that ...:777 is alive, so it should retry storing there

I'm not sure how 0-db in sequential mode behaves in this situation, i.e. whether the number of objects will increase or not. I failed to verify it experimentally using redis-cli.

I'm not 100% sure on this, but my own brief test suggests that attempting to SET the same data on the same key does not result in storage growth. Zdb returns a nil when attempting to do this, versus returning the key itself when new data is provided. I can't imagine that Zdb would write the same data again when it clearly recognizes that it's already stored under that key.

So I think this is a good solution, if we can try to put the shard back on the same zdb at the same key in this case where retrieving the shard failed.

iwanbk (Member) commented Dec 19, 2024

I'm not 100% sure on this, but my own brief test suggests that attempting to SET the same data on the same key does not result in storage growth. Zdb returns a nil when attempting to do this, versus returning the key itself when new data is provided. I can't imagine that Zdb would write the same data again when it clearly recognizes that it's already stored under that key.

So I think this is a good solution, if we can try to put the shard back on the same zdb at the same key in this case where retrieving the shard failed.

I checked the code and discussed it with Lee.
It still results in additional objects.

For now, I'll try to improve it by retrying the get if it hits a timeout error, with a maximum of 3 attempts.
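Something along these lines (a sketch only, with a hypothetical fetch closure and string errors; not the code in #151):

// Retry the shard download up to three times, but only for errors that look
// temporary (here crudely detected by matching "timeout" in the message).
fn download_with_retry<T, F>(mut fetch_shard: F) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    const MAX_ATTEMPTS: usize = 3;
    let mut last_err = String::new();
    for attempt in 1..=MAX_ATTEMPTS {
        match fetch_shard() {
            Ok(shard) => return Ok(shard),
            // retry only timeouts, and only while attempts remain
            Err(e) if e.contains("timeout") && attempt < MAX_ATTEMPTS => last_err = e,
            Err(e) => return Err(e),
        }
    }
    Err(last_err)
}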

iwanbk (Member) commented Dec 19, 2024

^ implemented in #151
