[Discuss] Store an encoded copy of the original document for fields validation #2016
I've done some tests manually in an Elastic stack locally, updating the final pipeline of a data stream with a script processor that copies the incoming document into a new field. Some fields are filtered out in that code, since it looks like they are added by Elasticsearch or taken from the document itself. Adding that script processor results in a new field like this:

```
{
"doc.before_ingested": [
"agent: {name=elastic-agent-83113, id=4711410d-f9bf-416b-8b5e-eb829b9866c1, type=metricbeat, ephemeral_id=6a54ede2-9e62-4b0b-ad0e-2614c02e489b, version=8.15.2}",
"@timestamp: 2024-10-03T14:27:14.869Z",
"nginx: {stubstatus={hostname=svc-nginx:80, current=10, waiting=0, accepts=343, handled=343, writing=1, dropped=0, reading=0, active=1, requests=378}}",
"ecs: {version=8.0.0}",
"service: {address=http://svc-nginx:80/server-status, type=nginx}",
"data_stream: {namespace=81181, type=metrics, dataset=nginx.stubstatus}",
"elastic_agent: {id=4711410d-f9bf-416b-8b5e-eb829b9866c1, version=8.15.2, snapshot=false}",
"host: {hostname=elastic-agent-83113, os={kernel=6.8.0-45-generic, codename=focal, name=Ubuntu, type=linux, family=debian, version=20.04.6 LTS (Focal Fossa), platform=ubuntu}, containerized=false, ip=[172.19.0.2, 172.18.0.7], name=elastic-agent-83113, id=93db770e92a444c98362aee1860ae326, mac=[02-42-AC-12-00-07, 02-42-AC-13-00-02], architecture=x86_64}",
"metricset: {period=10000, name=stubstatus}",
"event: {duration=282897, agent_id_status=verified, ingested=2024-10-03T14:27:15Z, module=nginx, dataset=nginx.stubstatus}",
"_version_type: internal", <-- filtered
"_index: metrics-nginx.stubstatus-81181", <-- filtered
"_id: null", <-- filtered
"_version: -4" <-- filtered
]
}
```

The last 4 fields (`_version_type`, `_index`, `_id`, `_version`) are the ones that get filtered out. In order to avoid failures in tests run by elastic-package, it is also required to skip that new field. This skip can be added in the fields validation code of elastic-package.
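For reference, a minimal sketch of a script processor that produces a field like the one above (the field name `doc.before_ingested` is taken from the output, while the pipeline name is hypothetical; this is an approximation, not the exact code used in these tests):

```
PUT _ingest/pipeline/copy-doc-before-ingested
{
  "description": "Keep a string-encoded copy of the document as it arrives at the pipeline",
  "processors": [
    {
      "script": {
        "source": """
          def copy = [];
          for (entry in ctx.entrySet()) {
            // Objects and arrays end up as their string representation here.
            // Metadata entries such as _index, _id, _version and _version_type
            // also show up, and are the ones that have to be filtered out later.
            copy.add(entry.getKey() + ': ' + entry.getValue());
          }
          ctx['doc.before_ingested'] = copy;
        """
      }
    }
  ]
}
```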
Even with this new field, whose value is an encoded copy of the document, there would be similar issues, since it does not keep the same format as the original document: objects and arrays are flattened into their string representations, as can be seen in the example above.
I've tried to look for another method or processor in the ingest pipeline to transform this into a JSON string, but I didn't find any way to achieve it. Could that be possible by defining some other processor? Would there be another option to get a copy of the document before it is ingested?

Example of a script processor keeping the same structure (objects, arrays, ...)

For completeness, a script processor can also copy the document fields keeping the same structure. However, it would have the same issues with synthetic source, runtime fields, or other such features.
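A minimal sketch of such a processor, as an approximation rather than the exact code behind this idea (same hypothetical field name as above; note that it is a shallow copy, so later processors that mutate nested objects would also affect the copy):

```
{
  "script": {
    "source": """
      // Skip the metadata entries added by Elasticsearch.
      def skip = ['_index', '_id', '_version', '_version_type'];
      def copy = [:];
      for (entry in ctx.entrySet()) {
        if (skip.contains(entry.getKey())) {
          continue;
        }
        // Keep objects and arrays as they are instead of turning them into strings.
        copy[entry.getKey()] = entry.getValue();
      }
      ctx['doc.before_ingested'] = copy;
    """
  }
}
```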
Summary
Store an encoded copy of the original document with a processor in the final pipeline, before ingestion, and use this copy to validate fields and generate sample documents, instead of rebuilding the document from the ingested data.
Split field validations into two sets: one that uses this encoded copy, and another one for the indexed data.
Background
When validating fields we use the documents as they are stored in Elasticsearch. With the adoption of features like `constant_keyword`, runtime fields, `synthetic` index mode, or `index: false` in packages, it can be difficult to rebuild the original document. Some mappings can also introduce additional multi-fields that in some cases we are ignoring, or have to ignore.

We now have quite some code attempting to handle all these cases, and corner cases in combinations between them. Every time a new feature of this kind is added, new corner cases appear.
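As an illustration of the problem, consider the hypothetical mapping sketched below (not taken from any specific package). With `synthetic` source, the `_source` returned by the search API is rebuilt from the indexed values, so for example the values of an `ip` field come back sorted and deduplicated, and a `constant_keyword` field takes its value from the mapping rather than from the document itself, so the document as the package sent it cannot be recovered exactly:

```
PUT metrics-example
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "event": {
        "properties": {
          "dataset": { "type": "constant_keyword", "value": "nginx.stubstatus" }
        }
      },
      "host": {
        "properties": {
          "ip": { "type": "ip" }
        }
      }
    }
  }
}
```

A document sent with `"host": {"ip": ["172.19.0.2", "172.18.0.7"]}`, for instance, would come back with the addresses in a different order.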
Going back to the original objectives of these tests, we want to validate these two things:
- The data generated by the package conforms to the fields it defines.
- This data is ingested and mapped in Elasticsearch as expected.
With the current approach of checking the documents ingested as returned by the search API, we are missing the first point, as in many cases we don't have the data the package is generating, and we attempt to rebuild the documents from the indexed data.
So the proposal would be to explicitly split validations in two:
- Validations based on the encoded copy of the original document, stored by a processor in the `final_pipeline` before ingestion.
- Validations based on the indexed data, as returned by the search API.

Some tests will do only one set of validations, others both. The encoded copy could additionally be used for the generation of sample documents.
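As a sketch of how the copy could be wired up (hypothetical pipeline name; Fleet-managed data streams already define their own final pipeline, so in practice the copy processor would more likely have to be appended to it), a pipeline with one of the copy processors above could be set as the final pipeline of the data stream under test:

```
PUT metrics-nginx.stubstatus-81181/_settings
{
  "index.final_pipeline": "copy-doc-before-ingested"
}
```

The validations of what the package generates would then read `doc.before_ingested` from each document, while the validations of the indexed data would keep using the documents as returned by the search API.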