forked from pflooky/data-caterer-docs
Commit
Clean up roadmap to make it clear from a feature perspective, add in …
…blog, first blog post about shifting data quality left
Showing 46 changed files with 9,936 additions and 929 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Design | ||
|
||
This document explains the thought process behind the design of Data Caterer, giving insight into how and why it | ||
became what it is today. It is also a living document: future design decisions will be recorded here as they are | ||
made. | ||
|
||
## Motivation | ||
|
||
The main difficulties that I faced as a developer and team lead relating to testing were: | ||
|
||
- Difficulty in testing with multiple data sources, both batch and real time | ||
- Reliance on other teams for stable environments or domain knowledge | ||
- Test environments with no reliable or consistent data flows | ||
- Complex data masking/anonymization solutions | ||
- Relying on production data (potential privacy and data breach issues) | ||
- Data issues in production can be very costly | ||
- Unknown unknowns staying hidden until problems occur in production | ||
- Underutilised metadata | ||
|
||
## Guiding Principles | ||
|
||
These difficulties helped form the basis of the principles that Data Caterer should follow: | ||
|
||
- **Data source agnostic**: Connect to any batch or real time data sources for data generation or validation | ||
- **Configurable**: Run the application the way you want | ||
- **Extensible**: Allow for new innovations to seamlessly integrate with Data Caterer | ||
- **Integrate with existing solutions**: Utilise existing metadata to make it easy for users to use straight away | ||
- **Secure**: No production connections required, metadata based solution | ||
- **Fast**: Give developers fast feedback loops to encourage them to thoroughly test data flows | ||
|
||
## High level flow | ||
|
||
``` mermaid | ||
graph LR | ||
subgraph userTasks [User Configuration] | ||
dataGen[Data Generation] | ||
dataValid[Data Validation] | ||
runConf[Runtime Config] | ||
end | ||
subgraph dataProcessor [Processor] | ||
dataCaterer[Data Caterer] | ||
end | ||
subgraph existingMetadata [Metadata] | ||
metadataService[Metadata Services] | ||
metadataDataSource[Data Sources] | ||
end | ||
subgraph output [Output] | ||
outputDataSource[Data Sources] | ||
report[Report] | ||
end | ||
dataGen --> dataCaterer | ||
dataValid --> dataCaterer | ||
runConf --> dataCaterer | ||
direction TB | ||
dataCaterer -.-> metadataService | ||
dataCaterer -.-> metadataDataSource | ||
direction LR | ||
dataCaterer ---> outputDataSource | ||
dataCaterer ---> report | ||
``` | ||
|
||
1. User Configuration | ||
1. Users define data generation, validation and runtime configuration | ||
2. Processor | ||
1. The engine uses the user configuration to decide how to run | ||
2. User-defined configuration is merged with metadata from external sources | ||
3. Metadata | ||
1. Automatically retrieve schema, data profiling, relationship or validation rule metadata from data sources or metadata services | ||
4. Output | ||
1. Execute data generation and validation tasks on data sources | ||
2. Generate report summarising outcome | ||
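
The merge step in the Processor stage can be pictured with a small sketch. This is illustrative only and is not Data Caterer's implementation: the function name `merge_config`, the dictionary layout, and the rule that user-defined values override metadata values are all assumptions made for the example.

```python
# Hypothetical sketch: combine field metadata from an external source with
# user-defined configuration. Assumption: user-defined values take precedence.

def merge_config(metadata: dict, user_config: dict) -> dict:
    """Return a new mapping of field name -> options, without mutating inputs."""
    # Start from a copy of the metadata so the original stays untouched.
    merged = {name: dict(spec) for name, spec in metadata.items()}
    for name, spec in user_config.items():
        # Fields known only to the user are added; shared keys are overridden.
        merged.setdefault(name, {}).update(spec)
    return merged


# Example inputs (made up for illustration).
metadata = {
    "account_id": {"type": "string"},
    "balance": {"type": "double"},
}
user_config = {
    "account_id": {"regex": "ACC[0-9]{8}"},
    "balance": {"min": 0, "max": 1000},
}

merged = merge_config(metadata, user_config)
```

Here the schema comes from metadata while the user only adds the generation hints they care about, which is the division of labour the flow above describes.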
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# Shifting Data Quality Left with Data Catering | ||
|
||
## Empowering Proactive Data Management | ||
|
||
In the ever-evolving landscape of data-driven decision-making, ensuring data quality is non-negotiable. Traditionally, | ||
data quality has been a concern addressed late in the development lifecycle, often leading to reactive measures and | ||
increased costs. However, a paradigm shift is underway with the adoption of a "shift left" approach, placing data | ||
quality at the forefront of the development process. | ||
|
||
### Today | ||
|
||
``` mermaid | ||
graph LR | ||
subgraph badQualityData[<b>Manually generated data, data quality always passes</b>] | ||
local[<b>Local</b>\nManual test, unit test] | ||
dev[<b>Dev</b>\nManual test, integration test] | ||
stg[<b>Staging</b>\nSanity checks] | ||
end | ||
subgraph qualityData[<b>Reliable data, the true test</b>] | ||
prod[<b>Production</b>\nData quality checks, monitoring, observability] | ||
end | ||
style badQualityData fill:#d9534f,fill-opacity:0.7 | ||
style qualityData fill:#5cb85c,fill-opacity:0.7 | ||
local --> dev | ||
dev --> stg | ||
stg --> prod | ||
``` | ||
|
||
### With Data Caterer | ||
|
||
|
||
``` mermaid | ||
graph LR | ||
subgraph qualityData[<b>Reliable data for testing anywhere</b>] | ||
direction LR | ||
local[<b>Local</b>\nManual test, unit test] | ||
dev[<b>Dev</b>\nManual test, integration test] | ||
stg[<b>Staging</b>\nSanity checks] | ||
prod[<b>Production</b>\nData quality checks, monitoring, observability] | ||
end | ||
style qualityData fill:#5cb85c,fill-opacity:0.7 | ||
local --> dev | ||
dev --> stg | ||
stg --> prod | ||
``` | ||
|
||
## Understanding the Shift Left Approach | ||
|
||
"Shift left" is a philosophy that advocates for addressing tasks and concerns earlier in the development lifecycle. | ||
Applied to data quality, it means tackling data issues as early as possible, ideally during the development and testing | ||
phases. This approach aims to catch data anomalies, inaccuracies, or inconsistencies before they propagate through the | ||
system, reducing the likelihood of downstream errors. | ||
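
As a rough illustration of catching data issues during development, the sketch below runs a few validation rules over sample records, the kind of check that could sit in a local or CI test. It does not use Data Caterer's API; `validate_records`, the rule names, and the sample data are hypothetical.

```python
# Hypothetical sketch of an early data quality check, independent of any tool.

def validate_records(records, rules):
    """Return (record_index, rule_name) for every rule a record fails."""
    failures = []
    for index, record in enumerate(records):
        for rule_name, check in rules.items():
            if not check(record):
                failures.append((index, rule_name))
    return failures


# Rules are plain predicates over a record (assumed shape: a dict).
rules = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "currency_present": lambda r: bool(r.get("currency")),
}

# Made-up sample records: the second and third each break one rule.
records = [
    {"amount": 10.5, "currency": "AUD"},
    {"amount": -3.0, "currency": "AUD"},
    {"amount": 7.0, "currency": ""},
]

failures = validate_records(records, rules)
```

Running a check like this before code is merged surfaces the bad records at the point where they are cheapest to fix.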
|
||
## Data Caterer: The Catalyst for Shifting Left | ||
|
||
Enter Data Caterer, a metadata-driven data generation and validation tool designed to empower organizations in shifting | ||
data quality left. By incorporating Data Caterer into the early stages of development, teams can proactively test | ||
complex data flows, validate data sources, and ensure data quality before it reaches downstream processes. | ||
|
||
## Key Advantages of Shifting Data Quality Left with Data Caterer | ||
|
||
1. **Early Issue Detection:** | ||
- Identify data quality issues early in the development process, reducing the risk of errors downstream. | ||
2. **Proactive Validation:** | ||
- Validate data sources and complex data flows in a simplified manner, promoting a proactive approach to data quality. | ||
3. **Efficient Testing Across Sources:** | ||
- Seamlessly test data across various sources, including databases, file formats, HTTP, and messaging, all from | ||
your laptop or development environment. | ||
- Fast feedback loops encourage developers to test data scenarios thoroughly. | ||
4. **Integration with Development Pipelines:** | ||
- Easily integrate Data Caterer as a task in your development pipelines, ensuring that data quality is a continuous | ||
consideration rather than an isolated event. | ||
5. **Integration with Existing Metadata:** | ||
- By harnessing the power of existing metadata from data catalogs, schema registries, or other data validation tools, | ||
Data Caterer streamlines the process, automating the generation and validation of your data effortlessly. | ||
6. **Improved Collaboration:** | ||
- Facilitate collaboration between developers, testers, and data professionals by providing a common platform for | ||
early data validation. | ||
- Reduce reliance on domain experts or external teams for data testing. | ||
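
One way to make data quality a continuous part of a development pipeline is to turn validation outcomes into a process exit code, so a failed check fails the build. The sketch below assumes the outcomes are available as a mapping of check name to pass/fail; that shape and the function name `exit_code_for` are hypothetical and not taken from Data Caterer.

```python
# Hypothetical sketch: map validation outcomes to a CI-friendly exit code.

def exit_code_for(results: dict) -> int:
    """Return 0 when every validation passed, 1 if any failed."""
    return 0 if all(results.values()) else 1


# Made-up outcomes for illustration; a pipeline step could pass the returned
# value to sys.exit() so that a failed check stops the build.
passing = exit_code_for({"accounts_schema": True, "balance_non_negative": True})
failing = exit_code_for({"accounts_schema": True, "balance_non_negative": False})
```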
|
||
## Realizing the Vision of Proactive Data Quality | ||
|
||
As organizations strive for excellence in their data-driven endeavors, the shift left approach with Data Caterer | ||
becomes a strategic imperative. By instilling a proactive data quality culture, teams can minimize the risk of costly | ||
errors, enhance the reliability of their data, and streamline the entire development lifecycle. | ||
|
||
In conclusion, the marriage of the shift left philosophy and Data Caterer brings forth a new era of data management, | ||
where data quality is not just a final checkpoint but an integral part of every development milestone. Embrace the shift | ||
left approach with Data Caterer and empower your teams to build robust, high-quality data solutions from the very | ||
beginning. | ||
|
||
*Shift Left, Validate Early, and Accelerate with Data Caterer.* |