Feature request: Sharding metrics #71

penguinlav · 2023-01-08T18:13:53Z

Hi! Our bioyino cluster consumes a lot of resources on each host, and the cluster nodes will probably run out of resources soon. Are there any plans to shard the collected metrics?

Albibek · 2023-01-08T18:28:25Z

Hi. Are you using the agent-based approach so far? You could probably shard your metrics per agent first.

UPD:
I thought I had published documentation for the agent-based approach, but I have not. We do mention it briefly in this article, though.

The idea is to have separate non-clustered bioyino instances that listen on UDP, pre-aggregate metrics, and send them to a cluster. This offloads some of the cluster's load to the agents. It can also help with manual sharding if you have, for example, groups of servers whose metrics can easily be separated from each other. In that case you could point these groups to different agents and point the agents to different clusters. If your case doesn't fit what I've described above, it would be great if you gave more details or a use case, so that I could propose a solution for you.
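To illustrate why pre-aggregation on agents takes load off the cluster, here is a minimal sketch in Python. It is not bioyino code: `pre_aggregate` is a hypothetical helper, it assumes plain statsd-style counter lines (`name:value|c`), and it ignores sample rates and all other metric types.

```python
from collections import defaultdict


def pre_aggregate(lines):
    """Sum counter values per metric name from statsd-style lines.

    Hypothetical helper for illustration only; it handles just the
    "name:value|c" counter form and skips everything else.
    """
    totals = defaultdict(float)
    for line in lines:
        name, _, rest = line.partition(":")
        value, _, kind = rest.partition("|")
        if kind != "c":
            continue  # this sketch only handles plain counters
        totals[name] += float(value)
    return dict(totals)


# Four raw datagrams received by an agent over UDP...
raw = ["web.hits:1|c", "web.hits:1|c", "web.errors:1|c", "web.hits:3|c"]
# ...collapse into two pre-aggregated values sent on to the cluster.
snapshot = pre_aggregate(raw)
```

The cluster then only has to merge one value per metric per agent instead of handling every raw datagram.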

I was also thinking about "pure" sharding based on hash rings or something similar, but could not find a good use case for it. Usually what we actually wanted was routing of incoming metrics based on the prefix or the source rather than on a hash. Covering all the ways of distributing metrics among different destinations seemed to me worth an entirely separate product, something similar to carbon-c-relay. Adding all the possible routing algorithms to bioyino would bloat the codebase too much.
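The difference between the two routing styles can be sketched roughly as follows. This is an illustration in Python, not something bioyino provides: the routing table, the destination names, and both functions are hypothetical.

```python
import hashlib

# Hypothetical routing table: metric-name prefix -> destination cluster.
PREFIX_ROUTES = {
    "web.": "cluster-a",
    "db.": "cluster-b",
}
DEFAULT_DESTINATION = "cluster-a"
HASH_DESTINATIONS = ["cluster-a", "cluster-b", "cluster-c"]


def route_by_prefix(metric_name):
    """Pick a destination from the longest matching prefix."""
    matches = [p for p in PREFIX_ROUTES if metric_name.startswith(p)]
    if not matches:
        return DEFAULT_DESTINATION
    return PREFIX_ROUTES[max(matches, key=len)]


def route_by_hash(metric_name):
    """Pick a destination from a stable hash of the metric name."""
    digest = hashlib.sha256(metric_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(HASH_DESTINATIONS)
    return HASH_DESTINATIONS[index]
```

Prefix routing keeps related metrics together and is easy to reason about, while hash routing spreads load evenly but gives the operator no control over which metrics end up where.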

penguinlav · 2023-01-11T08:27:07Z

Yes, we already use this approach with local aggregating nodes, and we are close to the resource limit on the master nodes.

We can increase the period between sending snapshots (2 sec). As far as I can see, this will help reduce CPU consumption for snapshot merging by reducing the number of snapshots per interval (the interval at which metrics are sent to carbon).

But what if there were a sharding feature? It would give an unlimited ability to increase the number of metrics collected by bioyino :)

Albibek · 2023-02-18T19:59:28Z

To be clear, the snapshot interval is controlled by snapshot-interval, not by interval, but you are right: it may reduce some load because of lower TCP connection overhead and a lower total number of TCP connections.

Sharding is not so easy when it comes to failing nodes. Let's say you have a 3-node cluster and 10 agents. How would you distribute data among them, and how would you expect to configure it? What behavior do you expect when 1 or 2 cluster nodes fail?
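To make the failure question concrete, here is one possible scheme, sketched in Python under stated assumptions. It uses rendezvous (highest-random-weight) hashing, which is a swapped-in technique chosen for illustration and not something bioyino implements; the node names and the `score` and `owner` helpers are hypothetical.

```python
import hashlib


def score(node, metric_name):
    """Stable pseudo-random weight for a (node, metric) pair."""
    digest = hashlib.sha256(f"{node}|{metric_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(metric_name, alive_nodes):
    """Rendezvous hashing: the alive node with the highest score wins."""
    if not alive_nodes:
        raise ValueError("no cluster nodes are alive")
    return max(alive_nodes, key=lambda node: score(node, metric_name))


nodes = ["node-1", "node-2", "node-3"]
metrics = [f"app.metric.{i}" for i in range(1000)]

# Ownership with all three nodes alive, then with node-3 failed.
before = {m: owner(m, nodes) for m in metrics}
after = {m: owner(m, ["node-1", "node-2"]) for m in metrics}

# Only the metrics that node-3 owned are reassigned to the survivors.
moved = [m for m in metrics if before[m] != after[m]]
```

Even in this scheme the aggregation state that the failed node held for the current interval is lost unless it is replicated somewhere, which is the harder part of the question.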

penguinlav changed the title ~~Sharding metrics~~ Feature request: Sharding metrics Jan 8, 2023
