awesome-lakehouse

A curated list of awesome Lakehouse frameworks, applications, etc.

Table of Contents

  • Table Format
  • Lakehouse System
  • Metadata Service
  • Machine Learning
  • Benchmark

Table Format

  • Apache Iceberg [Java] - a high-performance format for huge analytic tables, bringing the reliability and simplicity of SQL tables to big data.
  • Apache Hudi [Java] - a transactional data lake platform that brings database and data warehouse capabilities to the data lake.
  • Apache Paimon (incubating) [Java] - a streaming data lake platform with high-speed data ingestion, changelog tracking and efficient real-time analytics.
  • Apache XTable (incubating) [Java] - a cross-table converter for table formats that facilitates omni-directional interoperability across data processing systems and query engines.
  • Delta [Scala] - an open-source storage framework that enables building a Lakehouse architecture with various compute engines and languages.

Lakehouse System

  • Apache Amoro (incubating) [Java] - a management system built on open data lake formats, bringing pluggable and self-managed features for Lakehouse.
  • GeoLake [Java] - a universal solution for geospatial data tailored to data lakehouse systems.
  • LakeSoul [Rust] - a cloud-native Lakehouse framework that supports scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified streaming & batch processing.
  • Lakehouse Engine [Python] - a configuration-driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.
  • Smart Data Lake [Scala] - a data lake automation framework that makes loading and transforming data a breeze.

Metadata Service

  • Apache Gravitino [Java] - a high-performance, geo-distributed, and federated metadata lake.
  • Apache Polaris (incubating) [Java] - an interoperable, open source catalog for Apache Iceberg.
  • DeltaCAT [Python] - a Pythonic Data Catalog powered by Ray.
  • lakeFS [Go] - data version control for data lakes.
  • Lakekeeper [Rust] - a Rust-native Iceberg REST catalog.
  • Metacat [Java] - a unified metadata exploration API service.
  • Nessie [Java] - a transactional catalog for data lakes with Git-like semantics.
  • OpenHouse [Java] - an open source control plane designed for efficient management of tables within open data lakehouse deployments.
  • UnityCatalog [Java] - an open and interoperable catalog for data and AI.

Machine Learning

  • Space [Python] - a unified storage framework for the entire machine learning lifecycle.

Benchmark

  • LHBench [Scala] - a benchmark for Lakehouse storage systems.
  • LST-Bench [Java] - a framework that allows users to run benchmarks specifically designed for evaluating the performance, efficiency, and stability of Log-Structured Tables (LSTs).