Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES 8.x / v2 integration test often fails #6094

Open
yurishkuro opened this issue Oct 15, 2024 · 9 comments · May be fixed by #6157 or #6229
Open

ES 8.x / v2 integration test often fails #6094

yurishkuro opened this issue Oct 15, 2024 · 9 comments · May be fixed by #6157 or #6229
Labels
area/storage good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement storage/elasticsearch v2

Comments

@yurishkuro
Copy link
Member

https://github.com/jaegertracing/jaeger/actions/runs/11341694226/job/31567998272?pr=6091

The test is running this script: bash scripts/es-integration-test.sh elasticsearch 8.x v2

some errors that might explain

This one looks like there was a retry save that resulted in duplicate spans in the storage:

=== NAME  TestElasticsearchStorage/GetTrace
    integration.go:279: 
        	Error Trace:	/home/runner/work/jaeger/jaeger/plugin/storage/integration/integration.go:279
        	Error:      	Should be true
        	Test:       	TestElasticsearchStorage/GetTrace
    trace_compare.go:66: Removed spans with duplicate span IDs; before=10, after=5

This one looks like the test might have started running before the index mappings were created:

  2024-10-15T16:23:38.766Z	error	config/config.go:151	Elasticsearch part of bulk request failed	{"kind": "extension", "name": "jaeger_storage", "map-key": "index", "response": {"_index":"jaeger-main-jaeger-service-2024-10-15","_id":"f0b7d270dd871f4","status":404,"error":{"type":"index_not_found_exception","reason":"no such index [jaeger-main-jaeger-service-2024-10-15]","resource.type":"index_or_alias","resource.id":"jaeger-main-jaeger-service-2024-10-15","index":"jaeger-main-jaeger-service-2024-10-15"}}}
  github.com/jaegertracing/jaeger/pkg/es/config.NewClient.func2
  	/home/runner/work/jaeger/jaeger/pkg/es/config/config.go:151
@yurishkuro yurishkuro added good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement labels Oct 15, 2024
@yurishkuro
Copy link
Member Author

The logs from ES for creating the index and failing to query it are quite close in the timestamps:

create index @ 2024-10-15T16:23:38.491Z:

  elasticsearch-1  | {"@timestamp":"2024-10-15T16:23:38.491Z", "log.level": "INFO", "message":"[jaeger-main-jaeger-span-2024-10-15] creating index, cause [auto(bulk api)], templates [jaeger-main-jaeger-span], shards [5]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[25d7d119ecca][masterService#updateTask][T#2]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"yjRpwgx_Rpeo0HIgerdzjQ","elasticsearch.node.id":"96DWKX6kSLCH7fWb7vVjpw","elasticsearch.node.name":"25d7d119ecca","elasticsearch.cluster.name":"docker-cluster"}

query failure @ 2024-10-15T16:23:38.669Z (only 170ms later):

  elasticsearch-1  | {"@timestamp":"2024-10-15T16:23:38.669Z", "log.level": "WARN", "message":"path: /jaeger-main-jaeger-service-2024-10-15%2Cjaeger-main-jaeger-service-2024-10-14%2Cjaeger-main-jaeger-service-2024-10-13%2Cjaeger-main-jaeger-service-2024-10-12/_search, params: {rest_total_hits_as_int=true, ignore_unavailable=true, index=jaeger-main-jaeger-service-2024-10-15,jaeger-main-jaeger-service-2024-10-14,jaeger-main-jaeger-service-2024-10-13,jaeger-main-jaeger-service-2024-10-12}, status: 503", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[25d7d119ecca][transport_worker][T#1]","log.logger":"rest.suppressed","elasticsearch.cluster.uuid":"yjRpwgx_Rpeo0HIgerdzjQ","elasticsearch.node.id":"96DWKX6kSLCH7fWb7vVjpw","elasticsearch.node.name":"25d7d119ecca","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.action.search.SearchPhaseExecutionException","error.message":"all shards failed","error.stack_trace":"Failed to execute phase [query], all shards failed; shardFailures {[_na_][jaeger-main-jaeger-service-2024-10-15][0]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][1]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][2]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][3]: org.elasticsearch.action.NoShardAvailableActionException\n}{[_na_][jaeger-main-jaeger-service-2024-10-15][4]: org.elasticsearch.action.NoShardAvailableActionException\n}\n\tat 

This could indicate a race condition where writes are happening too quickly, while the templates are still being created. I have not been seeing this error in v1 integration tests.

@won-js
Copy link
Contributor

won-js commented Oct 16, 2024

When running in the local environment, the same issue does not occur. 😅

@yurishkuro
Copy link
Member Author

Depends on how many times you run it. It's intermittent.

@madmecodes
Copy link

madmecodes commented Oct 22, 2024

The test attempts to create an index and immediately tries to query it, leading to race conditions:

create index @ 2024-10-15T16:23:38.491Z
query failure @ 2024-10-15T16:23:38.669Z (only 170ms later)

Image

Hi @yurishkuro

I'm new to the Jaeger project and would like to work on fixing this ES integration test race condition. I've already reproduced the issue locally and here are my observations:

  1. The race condition occurs when queries are attempted ~170ms after index creation
  2. I can reproduce it consistently using the integration test

There are multiple errors indicating that Elasticsearch is either failing to store spans correctly or that the test is querying indices that have not yet been created.

Would you like to assign issue #6094 to me?

@madmecodes
Copy link

The test attempts to create an index and immediately tries to query it, leading to race conditions:

create index @ 2024-10-15T16:23:38.491Z query failure @ 2024-10-15T16:23:38.669Z (only 170ms later)

Image

Hi @yurishkuro

I'm new to the Jaeger project and would like to work on fixing this ES integration test race condition. I've already reproduced the issue locally and here are my observations:

  1. The race condition occurs when queries are attempted ~170ms after index creation
  2. I can reproduce it consistently using the integration test

There are multiple errors indicating that Elasticsearch is either failing to store spans correctly or that the test is querying indices that have not yet been created.

Would you like to assign issue #6094 to me?

Got dublicate span error also

Image

@yurishkuro
Copy link
Member Author

How are you planning to fix it?

madmecodes added a commit to madmecodes/jaeger that referenced this issue Nov 3, 2024
This change fixes two issues in ES integration tests:
1. Prevents index not found exceptions by proper index readiness checks
2. Prevents duplicate span detection issues

The solution uses sync.Map for thread-safe:
- Index existence caching
- Span tracking

Fixes jaegertracing#6094

Signed-off-by: ayush-gupta-dev <[email protected]>
madmecodes added a commit to madmecodes/jaeger that referenced this issue Nov 3, 2024
This change fixes two issues in ES integration tests:
1. Prevents index not found exceptions by proper index readiness checks
2. Prevents duplicate span detection issues

The solution uses sync.Map for thread-safe:
- Index existence caching
- Span tracking

Fixes jaegertracing#6094

Signed-off-by: ayush-gupta-dev <[email protected]>
@madmecodes madmecodes linked a pull request Nov 3, 2024 that will close this issue
@madmecodes
Copy link

Hi @yurishkuro,

Sorry for the delay in addressing this issue. I've implemented a fix for the ES integration test race conditions that addresses both problems:

Index Creation Race Condition:

  • Added proper index readiness verification with wait/retry logic
  • Added index caching to prevent duplicate creation attempts

Duplicate Span Detection Issue:

  • Added thread-safe span tracking using sync.Map

The fix has been submitted in PR #6157

Would appreciate your review and feedback.

madmecodes added a commit to madmecodes/jaeger that referenced this issue Nov 20, 2024
- Add template creation retry with timeout in factory
- Add context to CreateTemplates in writer
- Ensure templates are created before returning writer

Fixes jaegertracing#6094
madmecodes added a commit to madmecodes/jaeger that referenced this issue Nov 20, 2024
- Add template creation retry with timeout in factory
- Add context to CreateTemplates in writer
- Ensure templates are created before returning writer

Fixes jaegertracing#6094
@madmecodes madmecodes linked a pull request Nov 20, 2024 that will close this issue
@Manik2708
Copy link

Hi @madmecodes Are you still working on this? Or is there something where I could assist?

@madmecodes
Copy link

Hi @Manik2708 i am having my university examination at the moment, will start contributing again from 11 december, mean while yes you can assist from the conversation we had.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/storage good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement storage/elasticsearch v2
Projects
None yet
4 participants