Skip to content

SampleClean Retreat Summer 2014 Demo

Sanjay Krishnan edited this page May 16, 2014 · 8 revisions

Create the table

   sampleclean> CREATE TABLE wikipedia (title string, text string)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' 
   LINES TERMINATED BY '\n';
   OK
   Time taken: 0.035 seconds

Load the data into the table

   sampleclean> LOAD DATA LOCAL INPATH 'data/files/wikipedia_abstracts.csv' 
   OVERWRITE INTO TABLE wikipedia;
   OK
   Time taken: 14.109 seconds

Run some queries on the dataset

   sampleclean> SELECT COUNT(1) FROM wikipedia;                                                       
   OK
   4004479
   Time taken: 8.535 seconds

Number of "dirty" articles referring to "Apple"

   sampleclean> SELECT COUNT(1) FROM wikipedia where lower(text) like '%apple%';
   OK
   11261
   Time taken: 36.853 seconds

Number of "dirty" articles referring to "Google"

   sampleclean> SELECT COUNT(1) FROM wikipedia where lower(text) like '%google%';
   OK
   7400
   Time taken: 33.109 seconds

We don't support stratified sampling yet, so in there interim run the following command:

  sampleclean> CREATE TABLE wikipedia_stratified AS SELECT * FROM WIKIPEDIA
  WHERE (lower(text) LIKE '%google%') 
  OR (lower(text) LIKE '%apple%') 
  OR (rand() < 0.01);
  OK
  Time taken: 65.346 seconds
  sampleclean> select count(1) from wikipedia_stratified;
  OK
  58491
  Time taken: 1.545 seconds

Initialize the SampleClean view

  sampleclean> scinitialize wikipedia_stratified_sample (title,text) 
  from wikipedia_stratified samplewith 0.1;

Get the sample size:

  sampleclean> sccount wikipedia_stratified_sample;
  OK
  5795
  Time taken: 3.249 seconds
  sampleclean> set sampleclean.dataset.size = 58491;
  sampleclean> set sampleclean.sample.size = 5795;

Get the approximate versions of the initial query:

 sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%apple%';       
 OK
 11719.995426829268 +/- 1880.0 (95% Confidence Clean) 
 Time taken: 1.661 seconds
 sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%google%';
 225.500: [GC 321692K->29898K(1025536K), 0.0028120 secs]
 226.379: [GC 331978K->30925K(1025024K), 0.0030140 secs]
 OK
 7222.211890243902 +/- 1545.0 (95% Confidence Clean)

Filter out "fruits":

 sampleclean> scfilter wikipedia_stratified_sample lower(text) not like '%fruit%';
 OK
 Time taken: 2.052 seconds
 sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%apple%'; 
 OK
 10659.94512195122 +/- 1813.0 (95% Confidence Clean) 
 Time taken: 1.47 seconds

Filter out "plants" (eg. apple blossom):

 sampleclean>scfilter wikipedia_stratified_sample lower(text) not like '%plant%'; 
 OK
 Time taken: 1.991 seconds
 sampleclean>selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%apple%';
 OK
 10154.6875 +/- 1779.0 (95% Confidence Clean) 
 Time taken: 1.452 seconds

Find apple only on word boundaries (eg. Appleton). First, we replace all punctuation with spaces:

 sampleclean> scformat wikipedia_stratified_sample text replace ',' ' ';
 OK
 Time taken: 1.812 seconds
 sampleclean> scformat wikipedia_stratified_sample text replace '\\.' ' ';                             
 OK
 Time taken: 1.987 seconds
 sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '% apple %';
 OK
 5181.367378048781 +/- 1335.0 (95% Confidence Clean)
 Time taken: 1.287 seconds

Apple now has a smaller count that Google. Next, we filter out lists.

 sampleclean> scfilter wikipedia_stratified_sample lower(title) not like 'list%';
 sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '% apple %';
 4963.413109756098 +/- 1309.0 (95% Confidence Clean) 
 Time taken: 1.324 seconds
 sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%google%';  
 1578.959: [GC 355552K->28115K(1038336K), 0.0028100 secs]
 OK
 5934.300304878048 +/- 1418.0 (95% Confidence Clean) 
 Time taken: 1.249 seconds

We can now run any query we want on the cleaned dataset and estimate results. For example, if we want to know how often they show up in the same article.

sampleclean> selectrawsc count(1) from wikipedia_stratified_sample where lower(text) like '%google%' and    
lower(text) like '% apple %';
OK
188.23323170731706 +/- 266.0 (95% Confidence Clean) 
Time taken: 1.304 seconds