Clean up roadmap to make it clear from a feature perspective, add in blog, first blog post about shifting data quality left
data-catering · Nov 29, 2023 · d047410
1 parent 3b455ed
commit d047410
Showing 46 changed files with 9,936 additions and 929 deletions.
diff --git a/docs/setup/design.md b/docs/setup/design.md
@@ -0,0 +1,76 @@
+# Design
+
+This document explains the thought process behind the design of Data Caterer, giving you insight into how and why
+it became what it is today. It also serves as a reference for future design decisions, which will be recorded
+here, making this a living document.
+
+## Motivation
+
+The main difficulties that I faced as a developer and team lead relating to testing were:
+
+- Difficulty in testing with multiple data sources, both batch and real-time
+  - Reliance on other teams for stable environments or domain knowledge
+- Test environments with no reliable or consistent data flows
+  - Complex data masking/anonymization solutions
+  - Relying on production data (potential privacy and data breach issues)
+- Cost of data issues in production can be very high
+- Unknown unknowns staying hidden until problems occur in production
+- Underutilised metadata
+
+## Guiding Principles
+
+These difficulties helped form the basis of the principles that Data Caterer should follow:
+
+- **Data source agnostic**: Connect to any batch or real-time data source for data generation or validation
+- **Configurable**: Run the application the way you want
+- **Extensible**: Allow for new innovations to seamlessly integrate with Data Caterer
+- **Integrate with existing solutions**: Utilise existing metadata to make it easy for users to use straight away
+- **Secure**: No production connections required; the solution is metadata-based
+- **Fast**: Give developers fast feedback loops to encourage them to thoroughly test data flows
+
+## High level flow
+
+``` mermaid
+graph LR
+  subgraph userTasks [User Configuration]
+  dataGen[Data Generation]
+  dataValid[Data Validation]
+  runConf[Runtime Config]
+  end
+  
+  subgraph dataProcessor [Processor]
+  dataCaterer[Data Caterer]
+  end
+  
+  subgraph existingMetadata [Metadata]
+  metadataService[Metadata Services]
+  metadataDataSource[Data Sources]
+  end
+  
+  subgraph output [Output]
+  outputDataSource[Data Sources]
+  report[Report]
+  end
+  
+  dataGen --> dataCaterer
+  dataValid --> dataCaterer
+  runConf --> dataCaterer
+  direction TB
+  dataCaterer -.-> metadataService
+  dataCaterer -.-> metadataDataSource
+  direction LR
+  dataCaterer ---> outputDataSource
+  dataCaterer ---> report
+```
+
+1. User Configuration
+    1. Users define data generation, validation and runtime configuration
+2. Processor
+    1. The engine uses the user configuration to decide how to run
+    2. User-defined configuration is merged with metadata from external sources
+3. Metadata
+    1. Automatically retrieve schema, data profiling, relationship or validation rule metadata from data sources or metadata services
+4. Output
+    1. Execute data generation and validation tasks on data sources
+    2. Generate report summarising outcome
+
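The "Processor" step in the high level flow of `design.md` above says that user-defined configuration is merged with metadata from external sources. A minimal sketch of one way such a merge could work is shown below. It is not Data Caterer's actual implementation: `FieldConfig` and `merge_field_configs` are hypothetical names, and the rule that user-defined values take precedence over metadata values is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldConfig:
    """Hypothetical per-field settings; not Data Caterer's real model."""
    name: str
    data_type: Optional[str] = None
    options: dict = field(default_factory=dict)


def merge_field_configs(metadata_fields, user_fields):
    """Merge metadata-derived fields with user-defined fields.

    Assumption: when both sides define the same field, the user's
    data type and options win over the values found in metadata.
    """
    # Start from the metadata view, keyed by field name (order preserved).
    merged = {f.name: f for f in metadata_fields}
    for user_field in user_fields:
        base = merged.get(user_field.name)
        if base is None:
            # Field exists only in the user configuration.
            merged[user_field.name] = user_field
            continue
        merged[user_field.name] = FieldConfig(
            name=user_field.name,
            data_type=(user_field.data_type
                       if user_field.data_type is not None
                       else base.data_type),
            # User options override metadata options key by key.
            options={**base.options, **user_field.options},
        )
    return list(merged.values())


# Illustrative inputs: schema metadata plus one user override.
metadata = [
    FieldConfig("account_id", "string", {"regex": "ACC[0-9]{8}"}),
    FieldConfig("balance", "double", {"min": 0}),
]
user = [FieldConfig("balance", options={"max": 1000})]
merged = merge_field_configs(metadata, user)
```

Here `balance` keeps the `double` type and `min` option taken from metadata and gains the user-supplied `max`, while `account_id` passes through unchanged.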
diff --git a/docs/use-case/blog/shift-left-data-quality.md b/docs/use-case/blog/shift-left-data-quality.md
@@ -0,0 +1,97 @@
+# Shifting Data Quality Left with Data Catering
+
+## Empowering Proactive Data Management
+
+In the ever-evolving landscape of data-driven decision-making, ensuring data quality is non-negotiable. Traditionally,
+data quality has been a concern addressed late in the development lifecycle, often leading to reactive measures and
+increased costs. However, a paradigm shift is underway with the adoption of a "shift left" approach, placing data
+quality at the forefront of the development process.
+
+### Today
+
+``` mermaid
+graph LR
+  subgraph badQualityData[<b>Manually generated data, data quality always passes</b>]
+  local[<b>Local</b>\nManual test, unit test]
+  dev[<b>Dev</b>\nManual test, integration test]
+  stg[<b>Staging</b>\nSanity checks]
+  end
+  
+  subgraph qualityData[<b>Reliable data, the true test</b>]
+  prod[<b>Production</b>\nData quality checks, monitoring, observability]
+  end
+  
+  style badQualityData fill:#d9534f,fill-opacity:0.7
+  style qualityData fill:#5cb85c,fill-opacity:0.7
+  
+  local --> dev
+  dev --> stg
+  stg --> prod
+```
+
+### With Data Caterer
+
+
+``` mermaid
+graph LR
+  subgraph qualityData[<b>Reliable data for testing anywhere</b>]
+  direction LR
+  local[<b>Local</b>\nManual test, unit test]
+  dev[<b>Dev</b>\nManual test, integration test]
+  stg[<b>Staging</b>\nSanity checks]
+  prod[<b>Production</b>\nData quality checks, monitoring, observability]
+  end
+  
+  style qualityData fill:#5cb85c,fill-opacity:0.7
+  
+  local --> dev
+  dev --> stg
+  stg --> prod
+```
+
+## Understanding the Shift Left Approach
+
+"Shift left" is a philosophy that advocates for addressing tasks and concerns earlier in the development lifecycle.
+Applied to data quality, it means tackling data issues as early as possible, ideally during the development and testing
+phases. This approach aims to catch data anomalies, inaccuracies, or inconsistencies before they propagate through the
+system, reducing the likelihood of downstream errors.
+
+## Data Caterer: The Catalyst for Shifting Left
+
+Enter Data Caterer, a metadata-driven data generation and validation tool designed to empower organizations in shifting
+data quality left. By incorporating Data Caterer into the early stages of development, teams can proactively test
+complex data flows, validate data sources, and ensure data quality before it reaches downstream processes.
+
+## Key Advantages of Shifting Data Quality Left with Data Caterer
+
+1. **Early Issue Detection:**
+    - Identify data quality issues early in the development process, reducing the risk of errors downstream.
+2. **Proactive Validation:**
+    - Validate data sources and complex data flows in a simplified manner, promoting a proactive approach to data quality.
+3. **Efficient Testing Across Sources:**
+    - Test data across various sources, including databases, file formats, HTTP, and messaging, all from
+      your laptop or development environment.
+    - Provide a fast feedback loop that motivates developers to test data scenarios thoroughly.
+4. **Integration with Development Pipelines:**
+    - Easily integrate Data Caterer as a task in your development pipelines, ensuring that data quality is a continuous 
+      consideration rather than an isolated event.
+5. **Integration with Existing Metadata:**
+    - By harnessing the power of existing metadata from data catalogs, schema registries, or other data validation tools,
+      Data Caterer streamlines the process, automating the generation and validation of your data effortlessly.
+6. **Improved Collaboration:**
+    - Facilitate collaboration between developers, testers, and data professionals by providing a common platform for
+      early data validation.
+    - Remove the need to rely on domain experts or external teams for data testing.
+
+## Realizing the Vision of Proactive Data Quality
+
+As organizations strive for excellence in their data-driven endeavors, the shift left approach with Data Caterer
+becomes a strategic imperative. By instilling a proactive data quality culture, teams can minimize the risk of costly
+errors, enhance the reliability of their data, and streamline the entire development lifecycle.
+
+In conclusion, the marriage of the shift left philosophy and Data Caterer brings forth a new era of data management,
+where data quality is not just a final checkpoint but an integral part of every development milestone. Embrace the shift
+left approach with Data Caterer and empower your teams to build robust, high-quality data solutions from the very
+beginning.
+
+*Shift Left, Validate Early, and Accelerate with Data Caterer.*