
Connectors fails to complete sync #2925

Open
sjors101 opened this issue Oct 29, 2024 · 10 comments
Labels
bug (Something isn't working), community-driven

Comments

@sjors101
Contributor

Bug Description

We have been using the connector framework for a while now, with over 100 connectors configured. For a few weeks we have been experiencing connector jobs failing with the following error: connectors.sync_job_runner.ConnectorJobNotRunningError: Connector job (ID: Wsl5npIBp9FxXy_8Hx2C) is not running but in status of JobStatus.ERROR. We can't really pinpoint the issue: some runs fail after a few seconds, others after 90 minutes, and others finish successfully. It seems related to having more than 100 connectors.

We did find a workaround: we noticed that when we run fewer than ~10 active connector containers on our Kubernetes platform, the issue does not occur. This makes me wonder if there is some queue on the Elastic side that is full. We also tried increasing the DEFAULT_PAGE_SIZE in connectors/es/index.py, but this did not solve the issue.

To Reproduce

Steps to reproduce the behavior:

  1. Configure more than 100 connectors
  2. Run > 10 containers
  3. Run multiple jobs in parallel for multiple hours

Environment

Elasticsearch 8.15
ConnectorFW: 8.15.3.0

Logs / config

Attached are the logs of one connector container and the connector config (I replaced the sensitive records). We don't see any related logs in Elasticsearch or Enterprise Search. We notice the same behaviour across different connector_types.

container-config.txt
container-logs.txt

sjors101 added the bug (Something isn't working) label on Oct 29, 2024
@seanstory
Member

Congrats on having over 100 connectors at once!
Thanks for reporting. We'll dig into this.

I'm wondering if this is related to elastic/kibana#195127, and if Kibana is marking syncs as "error".
Did you just notice this after an upgrade? Or did you only recently scale up to so many connectors?

@sjors101
Contributor Author

Great, thanks! Each connector we configure gets a dedicated container; we don't run multiple connectors in the same container. So I don't think it's related to elastic/kibana#195127. It's quite likely related to the scale-up, but a lot of development happens in parallel; in the meantime we also moved from ES stack 8.14 to 8.15.

We have a script to configure multiple connectors at once, which uses the connector APIs (https://www.elastic.co/guide/en/elasticsearch/reference/current/connector-apis.html). As we speak we have 109 connectors configured; I could try deleting 10 and see if the issue still exists.

@artem-shelkovnikov
Member

Hi @sjors101,

Is there any chance you can collect the logs from all of your connector hosts in one place and grep for the failed job id there (in your log file that would be Wsl5npIBp9FxXy_8Hx2C)?

Connectors should not affect each other, but somehow they seem to: it looks as if another service is marking the connector sync job as failed. Could it be that you have services running with identical config, so that they attempt to serve the same connector?
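For reference, something like this would surface every line mentioning the job id across collected log files (a minimal sketch; the directory name is hypothetical and assumes one log file per connector host):

    # scan collected connector logs for the failed sync job id
    from pathlib import Path

    JOB_ID = "Wsl5npIBp9FxXy_8Hx2C"  # the failing job id from this report
    LOG_DIR = Path("collected-connector-logs")  # hypothetical: one *.log file per connector host

    for log_file in sorted(LOG_DIR.glob("*.log")):
        for line_no, line in enumerate(log_file.read_text(errors="ignore").splitlines(), start=1):
            if JOB_ID in line:
                print(f"{log_file.name}:{line_no}: {line}")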

@seanstory
Member

seanstory commented Nov 6, 2024

as if another service is marking the connector sync job as failed. Could it be that you have services running with identical config, so that they attempt to serve the same connector?

This got me thinking: what if you had one service configured to be responsible for more than 100 connectors all at once? Do we correctly fetch all connectors from Elasticsearch to compare against what's configured in YAML?

I don't think we do.

  • Note that 100 is our page size: https://github.com/elastic/connectors/blob/main/connectors/es/index.py#L13
    That's sus, given this bug report.
  • This logic looks buggy to me:

    hits = resp["hits"]["hits"]
    total = resp["hits"]["total"]["value"]
    count += len(hits)
    for hit in hits:
        yield self._create_object(hit)
    if count >= total:
        break
    offset += len(hits)

    total gets reset each iteration, but count gets incremented. This probably can't return more than 2 pages of hits, right? Because on the second page, count will be larger than total, so we'll break.

@sjors101 you might be able to test this faster than we can set up an env with 100 connectors. Can you change that hardcoded page size to something like 1000 and see if that fixes things? (Obviously not a good long-term fix, just an investigation step.)
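Concretely, that investigation step is just a one-line change in connectors/es/index.py (a sketch; 1000 is an arbitrary test value, not a recommended setting):

    # connectors/es/index.py
    DEFAULT_PAGE_SIZE = 1000  # hardcoded to 100 today; bumped only to test the pagination theory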

@artem-shelkovnikov
Member

total gets reset each iteration, but count gets incremented. This probably can't return more than 2 pages of hits, right? Because on the second page, count will be larger than total, so we'll break.

I think total is independent - it's just the number of documents matching the query, so it's okay to overwrite it. Although we don't really use a PIT (point in time) here, so modification of the collection during paging can cause weird bugs. On the other hand, if indices are not added/removed, it should not be a problem, and this inconsistency would only show up very occasionally.
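For context, that consistency concern could be avoided with a point in time plus search_after, roughly like this (a sketch against the elasticsearch-py 8.x client with a placeholder index and query; this is not what the connectors code does today):

    from elasticsearch import Elasticsearch

    client = Elasticsearch("http://localhost:9200")
    PAGE_SIZE = 100

    # open a PIT so all pages see one consistent view of the index
    pit = client.open_point_in_time(index=".elastic-connectors-v1", keep_alive="1m")
    search_after = None
    try:
        while True:
            resp = client.search(
                size=PAGE_SIZE,
                query={"match_all": {}},
                pit={"id": pit["id"], "keep_alive": "1m"},
                sort=[{"_shard_doc": "asc"}],  # tie-breaker sort required for search_after
                search_after=search_after,
            )
            hits = resp["hits"]["hits"]
            if not hits:
                break
            for hit in hits:
                print(hit["_id"])  # placeholder for whatever consumes each document
            search_after = hits[-1]["sort"]
    finally:
        client.close_point_in_time(id=pit["id"])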

@seanstory
Member

🤦 you're right, total isn't the number of hits in the hits array, it's the total that matches the query. My misread.

I still think the hardcoded DEFAULT_PAGE_SIZE = 100 is sus, even if I don't spot where the bug is.

@sjors101
Contributor Author

Hi @artem-shelkovnikov

I checked the logs of all our Elastic nodes, but there are no log records with the job-id or connector-id. The only log messages I saw during a crash are the following, but I don't think they are relevant:

{"@timestamp":"2024-11-21T18:07:14.550Z", "log.level": "INFO", "message":"could not get token document [token_weLo5m5WH1O3jKTxoV377C_p_azAkVfCiGm_rZd0eqA] that should have been created, retrying", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-es-coordinating-0][transport_worker][T#15]","log.logger":"org.elasticsearch.xpack.security.authc.TokenService","trace.id":"520380caaa394bac07f6ab93cc2a15c9","elasticsearch.cluster.uuid":"mKB46yE8TTyd9j_87znstg","elasticsearch.node.id":"1tDQZUAPTpKu1-07XRep6A","elasticsearch.node.name":"elasticsearch-es-coordinating-0","elasticsearch.cluster.name":"elasticsearch"}

@artem-shelkovnikov
Member

Hi @sjors101, it seems to be a log from Elasticsearch. We're looking for logs from connector containers :)

@sjors101
Contributor Author

sjors101 commented Dec 6, 2024

@artem-shelkovnikov ah right, unfortunately the attached logs are the only ones I have. To clarify our situation: we run a single connector per container on top of Kubernetes, so each connector gets a dedicated container. What I did notice is that a single connector, when idle, sends a lot of queries to the .elastic-connectors-v1 index. Perhaps that is related; we will try to tweak service.idling a bit and see if it improves.

@artem-shelkovnikov
Member

Thanks! We've also noticed that the number of requests connectors send to the data indices is too high, and we will try to address it in the future. For now, adjusting service.idling might indeed help: it will slow down scheduling, of course, but it will decrease RPS.
