
SampleClean User Guide

Sanjay Krishnan edited this page May 14, 2014 · 6 revisions

Creating and Manipulating Samples

Creating A Sample

SCINITIALIZE creates a SampleClean view. The view is maintained in memory, and its schema is recorded in an in-memory catalog. We currently support only uniform samples drawn from the entire base table.

    SCINITIALIZE sampleName(arg1, arg2, ..., argk) FROM table SAMPLEWITH ratio;

Internally, we maintain two views, suffixed _clean and _dirty, each of which also contains a hash and a duplicate count. Before any data cleaning technique is run, the two views are identical. Example creation query:

    sampleclean> SCINITIALIZE cities_sample (city, country, population, area, density) FROM cities SAMPLEWITH 0.1;
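
To make the bookkeeping concrete, here is a minimal Python sketch of what initializing a sample could look like. It is not SampleClean's implementation: the function name `init_sample`, the per-row coin flip, the SHA-1 row hash, and the initial duplicate count of 1 are all assumptions chosen for illustration.

```python
import copy
import hashlib
import random


def init_sample(rows, ratio, seed=None):
    """Draw a uniform sample and return identical (clean, dirty) views."""
    rng = random.Random(seed)
    dirty = []
    for row in rows:
        # Assumption: keep each base-table row independently with probability `ratio`.
        if rng.random() < ratio:
            record = dict(row)
            # Hypothetical bookkeeping columns: a row hash and a duplicate count.
            key = "|".join(str(row[c]) for c in sorted(row))
            record["hash"] = hashlib.sha1(key.encode("utf-8")).hexdigest()
            record["dup"] = 1
            dirty.append(record)
    # Before any cleaning, the clean view is an exact copy of the dirty view.
    clean = copy.deepcopy(dirty)
    return clean, dirty


# Made-up base table; every attribute is stored as a string.
cities = [
    {"city": f"city{i}", "country": "X", "population": str(1000 * i)}
    for i in range(1000)
]
clean, dirty = init_sample(cities, ratio=0.1, seed=0)
print(len(dirty), clean == dirty)
```

Keeping an untouched copy alongside the working copy is what makes it possible to compare cleaned and uncleaned results, or to discard the cleaning later.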

Currently, we only support samples of entire tables. A workaround is to create a temporary table with a CREATE TABLE ... AS SELECT statement and sample from that table. Furthermore, we do not currently support typed data: all attributes are represented as Java Strings, which are automatically processed as numbers if you run a numerical query.
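
The following Python sketch illustrates the idea of storing every attribute as a string and converting to a number only at query time. The helper names `to_number` and `avg` are hypothetical, and skipping values that fail to parse is an assumption made for this example; it is not a statement of how SampleClean treats such values.

```python
def to_number(value):
    """Parse a string attribute as a float, or return None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def avg(rows, attr):
    """Average a string-typed attribute, skipping values that do not parse."""
    nums = [n for n in (to_number(r[attr]) for r in rows) if n is not None]
    return sum(nums) / len(nums) if nums else None


sample = [
    {"city": "A", "population": "1000"},
    {"city": "B", "population": "3000"},
    {"city": "C", "population": "n/a"},  # dirty value, skipped in this sketch
]
print(avg(sample, "population"))  # 2000.0
```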

Resetting A Sample

If we are unhappy with our data cleaning results, we can reset our clean sample back to its original state with:

    screset sampleName;
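
Because the _clean and _dirty views start out identical, one way to picture a reset is as re-copying the dirty view over the clean one. The Python sketch below shows only that semantics; the `Sample` class is a hypothetical stand-in, and how the reset is actually implemented is not described here.

```python
import copy


class Sample:
    """Hypothetical in-memory stand-in for a sample's _clean and _dirty views."""

    def __init__(self, rows):
        self.dirty = copy.deepcopy(rows)
        self.clean = copy.deepcopy(rows)

    def reset(self):
        # Discard all cleaning edits by re-copying the untouched dirty view.
        self.clean = copy.deepcopy(self.dirty)


s = Sample([{"city": "new york", "country": "US"}])
s.clean[0]["city"] = "New York"  # a cleaning step modifies only the clean view
changed = s.clean != s.dirty
s.reset()
print(changed, s.clean == s.dirty)  # True True
```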

Deleting A Sample

To delete a sample, drop both of its views:

    sampleclean> drop table sampleName_clean;
    sampleclean> drop table sampleName_dirty;

Data Cleaning

Text Formatting

TODO