---
title: "sparklytd example notebook"
output:
  html_document:
    df_print: paged
---
sparklytd is an extension for reading and writing TD (Treasure Data) data from R.
## Installation
You can install this package via devtools with:
```{r eval=FALSE}
install.packages("devtools")
devtools::install_github("chezou/sparklytd")
```
## Example
This is a basic example of handling TD data:
```{r}
# Assuming the TD API key is set in the environment variable "TD_API_KEY".
library(sparklyr)
library(sparklytd)
# Enable Java 9 if needed
options(sparklyr.java9 = TRUE)
```
Before the first execution, you need to download the td-spark jar:
```{r eval=FALSE}
download_jar()
```
```{r}
# Connect to the td-spark API
default_conf <- spark_config()
default_conf$spark.td.apikey <- Sys.getenv("TD_API_KEY")
default_conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
default_conf$spark.sql.execution.arrow.enabled <- "true"
sc <- spark_connect(master = "local", config = default_conf)
```
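To confirm the connection is working, you can print the Spark version of the session. This is a quick sanity check using plain sparklyr, not something specific to sparklytd:
```{r eval=FALSE}
# Sanity check: show the Spark version of the connected session
spark_version(sc)
```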
Read and manipulate data on TD with dplyr.
```{r}
library(dplyr)
df <- spark_read_td(sc, "www_access", "sample_datasets.www_access")
df %>% count()
```
```{r}
df %>%
  filter(method == "GET", code != 200) %>%
  head(50) %>%
  collect()
```
Next, plot summarized info from the access log.
```{r}
library(ggplot2)
codes <- df %>%
  group_by(code) %>%
  summarise(count = n(), size_mean = mean(size)) %>%
  collect()
codes$code <- as.character(codes$code)
ggplot(data = codes, aes(x = code, y = count)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(trans = "log10")
```
You can copy an R data.frame to TD.
```{r}
iris_tbl <- copy_to(sc, iris)
spark_write_td(iris_tbl, "aki.iris", mode="overwrite")
```
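As a quick check, you can read the table back with `spark_read_td`. This sketch assumes the write above succeeded and that you can read from the `aki` database; the view name `iris_check` is arbitrary:
```{r eval=FALSE}
# Read back the table written above and count its rows (hypothetical check)
iris_td <- spark_read_td(sc, "iris_check", "aki.iris")
iris_td %>% count()
```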
## Execute SQL with Presto/Hive
If you want to execute Presto SQL on TD, you can use the `spark_read_td_presto` function. The execution result will be written as a temporary view in Spark.
```{r}
spark_read_td_presto(sc,
"sample",
"sample_datasets",
"select count(1) from www_access") %>% collect()
```
You can run DDL using the `spark_execute_td_presto` function.
```{r}
spark_execute_td_presto(sc,
"aki",
"create table if not exists orders (key bigint, status varchar, price double)")
```
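To verify the DDL took effect, you can query the new table via `spark_read_td_presto`. This is a sketch assuming the statement above succeeded; the view name `orders_check` is arbitrary:
```{r eval=FALSE}
# Count rows in the newly created table; an empty table should return 0
spark_read_td_presto(sc,
                     "orders_check",
                     "aki",
                     "select count(1) from orders") %>% collect()
```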
You can execute Presto or Hive SQL on TD as a regular TD job.
```{r}
spark_read_td_query(sc,
"sample",
"sample_datasets",
"select count(1) from www_access",
engine="presto") %>% collect()
```
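When you are done, close the connection as usual with sparklyr:
```{r eval=FALSE}
# Disconnect from the local Spark session
spark_disconnect(sc)
```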