cs286A

General Repo for CS286A projects

This document is under heavy revision.

In the large, the class project is intended to prototype infrastructure for managing metadata and lineage in the Apache Big Data context. The project should have three phases:

  1. requirements gathering (2 weeks)
  2. functional specification and design (2 weeks)
  3. prototype implementation (4 weeks)

These phases will likely overlap and have feedback loops. That's OK.

We envision three basic components:

  1. A metadata repository that (a) has a schema to capture relevant information from a set of prototypical tasks and tools, (b) is extensible to new tasks and tools with varying degrees of opacity, and (c) can scale up to large volumes of metadata and high access rates. (A possible schema is sketched just after this list.)
  2. A crawler that can walk large repositories of information and call out to an extensible set of external data "recognizers" or "profilers" that can assess the contents of individual data files or sets. (A sketch of a possible recognizer interface also follows this list.) Candidate datastores include HDFS, POSIX filesystems, relational databases (via standards like JDBC), and perhaps special file types like IPython notebooks. The crawler should interface with a standard scheduling infrastructure at two levels:
     1. macro: fire off crawls on a schedule (nightly, weekly, etc.)
     2. micro: execute the crawl through the scheduler, i.e., visit files and feed them up for REST calls at a load-sensitive pace.
  3. A metadata mover facility that provides (a) an API for inserting metadata into the repository, (b) a facility for reliable bulk movement of large volumes of data into the repository, and (c) an interface to the same scheduling infrastructure as the crawler for executing bulk metadata movement.
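
To make the repository component concrete, here is a minimal sketch of what an extensible schema might look like if we take the relational route discussed below. Everything here is an assumption for illustration: the PostgreSQL URL, the table names (dataset, dataset_property), and the columns are placeholders, not a settled design.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MetadataSchemaSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL; any JDBC-accessible database would do.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/metadata");
             Statement stmt = conn.createStatement()) {
            // A core "dataset" entity records what the crawler found and where.
            stmt.execute("CREATE TABLE IF NOT EXISTS dataset ("
                    + " id BIGSERIAL PRIMARY KEY,"
                    + " uri TEXT NOT NULL UNIQUE,"   // e.g. hdfs://..., file://..., jdbc:...
                    + " kind TEXT NOT NULL,"         // file, table, notebook, ...
                    + " crawled_at TIMESTAMP NOT NULL)");
            // An open-ended key/value side table keeps the schema extensible
            // to new tasks and tools without requiring migrations.
            stmt.execute("CREATE TABLE IF NOT EXISTS dataset_property ("
                    + " dataset_id BIGINT REFERENCES dataset(id),"
                    + " name TEXT NOT NULL,"
                    + " value TEXT,"
                    + " PRIMARY KEY (dataset_id, name))");
        }
    }
}
```

The key/value side table trades query power for extensibility; a real design would revisit that trade-off per task and tool.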

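Similarly, the crawler's extension point for recognizers/profilers could be as small as a single interface. The name Recognizer and both method signatures below are hypothetical, meant only to show the shape of the plug-in contract:

```java
import java.io.InputStream;
import java.util.Map;

/** Hypothetical plug-in contract for the crawler's external "recognizers"/"profilers". */
public interface Recognizer {
    /** Cheap test (file extension, magic bytes) so the crawler can skip irrelevant files. */
    boolean canHandle(String uri);

    /**
     * Inspect the contents and return extracted metadata as key/value pairs,
     * ready to be handed to the repository via REST or the metadata mover.
     */
    Map<String, String> profile(String uri, InputStream contents) throws Exception;
}
```

Under this sketch, the crawler would hold a registry of Recognizer implementations and fan each visited file out to those whose canHandle check passes, at whatever pace the scheduler allows.
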
The goal here is not to write everything from scratch, but rather to use and augment well-supported open source components. Potentially useful components include:

  • Gobblin. This is an excellent starting point for the crawler and metadata mover projects.
  • A variety of open-source databases could be used for the metadata repository. We will have to decide very early on whether we want to use a relational database or a key-value store. A concern with relational databases is that there isn't currently a well-used scale-out (parallel/distributed) relational database in the typical Apache Big Data environment.
  • Kafka is a standard open source tool for reliable bulk data movement that could be useful here, and it will likely work well with Gobblin, as both originated at LinkedIn. (A minimal producer sketch follows this list.)
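
To give a feel for the Kafka option, a minimal producer that publishes one extracted metadata record might look like the sketch below. The broker address, topic name, and JSON payload are placeholders for illustration, not part of any agreed design.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MetadataPublisherSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // One record per extracted metadata document: key = dataset URI,
        // value = the metadata itself (JSON here, purely for illustration).
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("metadata-ingest",
                    "hdfs://cluster/data/events.parquet",
                    "{\"kind\":\"file\",\"rows\":120000}"));
        } // close() flushes any pending sends
    }
}
```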

We will endeavor to support some real-world use cases from scientists on campus, as well as partners in industry. On campus, we have access to experts on two important toolchains:
