<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/Article">
<head>
<title>Collect pageviews with Flask and Cassandra</title>
<meta charset="utf-8">
<meta property="og:title" content="Collect pageviews with Flask and Cassandra">
<meta property="og:site_name" content="Modesto Mas | Blog">
<meta property="og:image" content="https://mmas.github.io/images/profile.jpg">
<meta property="og:image:width" content="200">
<meta property="og:image:height" content="200">
<meta property="og:url" content="https://mmas.github.io/pageviews-flask-cassandra">
<meta property="og:locale" content="en_GB">
<meta name="twitter:image" content="https://mmas.github.io/images/profile.jpg">
<meta name="twitter:url" content="https://mmas.github.io/pageviews-flask-cassandra">
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="mmas.github.io">
<meta name="twitter:title" content="Collect pageviews with Flask and Cassandra">
<meta name="description" content="Here is a simple example of collecting pageviews using Flask and Cassandra. The correct way from the client side to make a cross-site request to save...">
<meta name="twitter:description" content="Here is a simple example of collecting pageviews using Flask and Cassandra. The correct way from the client side to make a cross-site request to save...">
<meta property="og:description" content="Here is a simple example of collecting pageviews using Flask and Cassandra. The correct way from the client side to make a cross-site request to save...">
<meta name="keywords" content="cassandra,data-warehousing,flask,python,web-analytics">
<meta property="og:type" content="blog">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:type" content="article">
<meta property="article:author" content="https://github.com/mmas">
<meta property="article:section" content="cassandra">
<meta property="article:tag" content="cassandra,data-warehousing,flask,python,web-analytics">
<meta property="article:published_time" content="2016-07-10">
<meta property="article:modified_time" content="2016-07-10">
<link rel="stylesheet" type="text/css" href="/css/main.css">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
CommonHTML: {
scale: 93,
showMathMenu: false
},
tex2jax: {
"inlineMath": [["$","$"], ["\\(","\\)"]]
}
});
</script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</head>
<body class="entry-detail">
<header>
<div>
<img src="https://mmas.github.io/images/profile.jpg">
<a class="brand" href="/">Modesto Mas</a>
<span>Data/Python/DevOps Engineer</span>
<nav>
<ul>
<li><a href="/tags">Tags</a></li>
<li><a href="https://github.com/mmas/mmas.github.io/issues" target="_blank">Issues</a></li>
</ul>
</nav>
</div>
</header>
<section id="content" role="main">
<article>
<header>
<h1><a href="/pageviews-flask-cassandra">Collect pageviews with Flask and Cassandra</a></h1>
<time datetime="2016-07-10">Jul 10, 2016</time>
<a class="tag" href="/tags?tag=cassandra">cassandra</a>
<a class="tag" href="/tags?tag=data-warehousing">data-warehousing</a>
<a class="tag" href="/tags?tag=flask">flask</a>
<a class="tag" href="/tags?tag=python">python</a>
<a class="tag" href="/tags?tag=web-analytics">web-analytics</a>
</header>
<aside id="article-nav"></aside>
<section class="body">
<p>Here is a simple example of collecting pageviews using <a target="_blank" href="http://flask.pocoo.org/">Flask</a> and <a target="_blank" href="https://cassandra.apache.org/">Cassandra</a>. The cleanest way for the client to send a cross-site request that saves a pageview is <a target="_blank" href="https://en.wikipedia.org/wiki/Cross-origin_resource_sharing">CORS</a>, but since old browsers don't support it, each page will instead request a tiny image and pass the browser details as query-string arguments.</p>
<p>Python requirements:</p>
<pre><code>Flask==0.11
cassandra-driver==3.5.0
pytz==2016.4</code></pre>
<p>Create a Cassandra keyspace:</p>
<pre class="cql"><code>CREATE KEYSPACE wa
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE wa;</code></pre>
<p>To support multiple applications, create the table <code>apps</code>:</p>
<pre class="cql"><code>CREATE TABLE apps (
    id uuid PRIMARY KEY,
    name text,
    url text
);</code></pre>
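<p>Each application needs a row in <code>apps</code> before its pageviews are accepted. As a minimal sketch, the statement can be built in Python and then passed to the driver; the helper name <code>new_app_statement</code> and the example name and URL are assumptions for illustration:</p>
<pre class="sourceCode python"><code class="sourceCode python">from uuid import uuid4


def new_app_statement(name, url):
    """Build the INSERT statement and parameters for a new row in apps.

    Hypothetical helper: returning the pair (query, params) is a choice
    made for this example only.
    """
    app_id = uuid4()  # random UUID that identifies the application
    query = 'INSERT INTO apps (id, name, url) VALUES (%s, %s, %s)'
    return query, (app_id, name, url)


query, params = new_app_statement('My blog', 'https://example.com')
# With a running cluster: Cluster().connect('wa').execute(query, params)
print(params[0])</code></pre>
<p>The generated <code>id</code> is the value each tracked site has to send as its <code>app</code> argument.</p>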
<p>The pageviews will be saved in a table with a composite primary key: <code>app</code> is the partition key, so pageviews for the same app are stored physically together, and <code>date</code> is the clustering column that orders them within the partition. Create the table <code>pageviews</code>:</p>
<pre class="cql"><code>CREATE TABLE pageviews (
    app uuid,
    date timestamp,
    utma uuid,
    utmb uuid,
    path text,
    title text,
    ip text,
    referrer text,
    useragent text,
    platform text,
    language text,
    screensize text,
    pixelratio float,
    PRIMARY KEY (app, date)
);</code></pre>
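<p>Because <code>date</code> is the clustering column, the rows inside one app's partition are kept sorted by date, so the latest pageviews of an app can be read from a single partition. A rough sketch, where the helper name <code>latest_pageviews_statement</code> and the selected columns are assumptions:</p>
<pre class="sourceCode python"><code class="sourceCode python">from uuid import uuid4


def latest_pageviews_statement(app_id, limit=10):
    """Build a query for the most recent pageviews of one app.

    Hypothetical helper: it only returns the statement and parameters,
    which would then be passed to the session's execute method.
    """
    query = ('SELECT date, path, title FROM pageviews '
             'WHERE app=%s ORDER BY date DESC LIMIT %s')
    return query, (app_id, limit)


query, params = latest_pageviews_statement(uuid4(), limit=5)
print(query)</code></pre>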
<p>Inside the <code>&lt;head&gt;</code> of each page that should collect pageviews, add the following script, replacing <code>APP_UUID</code> with the app's <code>id</code> in the database and <code>ANALYTICS_URL</code> with the URL where the Flask app is running (it is also a good idea to <a target="_blank" href="http://javascript-minifier.com/">minify the script</a>):</p>
<pre class="javacript"><code>&lt;script type="text/javascript"&gt;
var d,i,q;
d = {
    app: '{{APP_UUID}}',
    path: location.pathname,
    title: document.title,
    platform: navigator.platform,
    language: navigator.language,
    screensize: screen.width+'x'+screen.height,
    pixelratio: window.devicePixelRatio || 1,
    referrer: document.referrer
};
q = [];
for (i in d) q.push([i,encodeURIComponent(d[i])].join('='));
new Image().src = '{{ANALYTICS_URL}}?'+q.join('&amp;');
&lt;/script&gt;</code></pre>
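<p>To see what the Flask app will receive, the request built by the script can be imitated in Python. This is only an approximation: <code>quote</code> from the standard library stands in for <code>encodeURIComponent</code>, and the analytics URL and the field values are made up for the example:</p>
<pre class="sourceCode python"><code class="sourceCode python">from urllib.parse import parse_qsl, quote, urlencode, urlsplit

ANALYTICS_URL = 'https://analytics.example.com/'  # hypothetical address


def pageview_url(base, fields):
    """Join the fields into a query string, as the script does."""
    return base + '?' + urlencode(fields, quote_via=quote)


fields = {
    'app': '123e4567-e89b-12d3-a456-426614174000',  # made-up app id
    'path': '/pageviews-flask-cassandra',
    'title': 'Collect pageviews',
    'screensize': '1280x800',
}
url = pageview_url(ANALYTICS_URL, fields)
# Decode the query string again, as the server side would.
decoded = dict(parse_qsl(urlsplit(url).query))
print(url)
print(decoded == fields)</code></pre>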
<p>Now save the pageviews in Flask. The cookies <code>_utma</code> and <code>_utmb</code> are used in the same way Google Analytics uses them (more <a target="_blank" href="https://developers.google.com/analytics/devguides/collection/analyticsjs/cookie-usage#gajs">here</a>): <code>_utma</code> "remembers" a user (expires in two years) and <code>_utmb</code> identifies a visit, so it can be used to record the visit duration (expires in 30 minutes):</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">from</span> datetime <span class="ch">import</span> datetime, timedelta
<span class="ch">from</span> uuid <span class="ch">import</span> UUID, uuid4

<span class="ch">from</span> flask <span class="ch">import</span> Flask, request, send_file
<span class="ch">from</span> cassandra.cluster <span class="ch">import</span> Cluster
<span class="ch">import</span> pytz

app = Flask(<span class="ot">__name__</span>)
app.config.from_pyfile(<span class="st">'config.py'</span>)

<span class="co"># Query-string arguments accepted from the client script. The keys end up</span>
<span class="co"># as column names in the INSERT below, so anything else is dropped.</span>
CLIENT_COLUMNS = (<span class="st">'app'</span>, <span class="st">'path'</span>, <span class="st">'title'</span>, <span class="st">'platform'</span>, <span class="st">'language'</span>,
                  <span class="st">'screensize'</span>, <span class="st">'pixelratio'</span>, <span class="st">'referrer'</span>)


<span class="ot">@app.before_request</span>
<span class="kw">def</span> before_request():
    app.cluster = Cluster()
    app.db = app.cluster.<span class="ot">connect</span>(<span class="st">'wa'</span>)


<span class="ot">@app.teardown_request</span>
<span class="kw">def</span> teardown_request(exception):
    app.cluster.shutdown()


<span class="kw">def</span> cookie_uuid(name):
    <span class="co"># Return the cookie as a UUID, or None if it is missing or malformed.</span>
    <span class="kw">try</span>:
        <span class="kw">return</span> UUID(request.cookies[name])
    <span class="kw">except</span> (<span class="ot">KeyError</span>, <span class="ot">ValueError</span>):
        <span class="kw">return</span> <span class="ot">None</span>


<span class="ot">@app.route</span>(<span class="st">'/'</span>)
<span class="kw">def</span> pageview():
    data = {k: v <span class="kw">for</span> k, v in request.args.items() <span class="kw">if</span> k in CLIENT_COLUMNS}
    response = send_file(<span class="st">'img.gif'</span>, mimetype=<span class="st">'image/gif'</span>)

    <span class="co"># Verify app (a missing or malformed id just returns the image).</span>
    <span class="kw">try</span>:
        data[<span class="st">'app'</span>] = UUID(data[<span class="st">'app'</span>])
    <span class="kw">except</span> (<span class="ot">KeyError</span>, <span class="ot">ValueError</span>):
        <span class="kw">return</span> response
    query = <span class="st">'SELECT id FROM apps WHERE id=</span><span class="ot">%s</span><span class="st">'</span>
    <span class="kw">if</span> not <span class="dt">list</span>(app.db.execute(query, [data[<span class="st">'app'</span>]])):
        <span class="kw">return</span> response

    <span class="co"># Tracking cookies.</span>
    now = datetime.now(pytz.timezone(<span class="st">'Europe/London'</span>))
    utma = cookie_uuid(<span class="st">'_utma'</span>)
    <span class="kw">if</span> utma is <span class="ot">None</span>:
        utma = uuid4()
        response.set_cookie(<span class="st">'_utma'</span>, <span class="dt">str</span>(utma), expires=now+timedelta(days=<span class="dv">730</span>))
    utmb = cookie_uuid(<span class="st">'_utmb'</span>)
    <span class="kw">if</span> utmb is <span class="ot">None</span>:
        utmb = uuid4()
        response.set_cookie(
            <span class="st">'_utmb'</span>, <span class="dt">str</span>(utmb), expires=now+timedelta(seconds=<span class="dv">1800</span>))

    <span class="co"># Save pageview.</span>
    <span class="kw">try</span>:
        pixelratio = <span class="dt">float</span>(data.get(<span class="st">'pixelratio'</span>) or <span class="dv">1</span>)
    <span class="kw">except</span> <span class="ot">ValueError</span>:
        pixelratio = <span class="dv">1.0</span>
    data.update(utma=utma,
                utmb=utmb,
                date=now,
                ip=request.remote_addr,
                useragent=request.headers.get(<span class="st">'User-Agent'</span>, <span class="st">''</span>),
                pixelratio=pixelratio)
    columns = <span class="dt">list</span>(data)
    query = <span class="st">'INSERT INTO pageviews (</span><span class="ot">%s</span><span class="st">) VALUES (</span><span class="ot">%s</span><span class="st">)'</span> % (
        <span class="st">','</span>.join(columns), <span class="st">','</span>.join([<span class="st">'</span><span class="ot">%s</span><span class="st">'</span>]*<span class="dt">len</span>(columns)))
    app.db.execute(query, [data[c] <span class="kw">for</span> c in columns])

    <span class="co"># Prevent HTTP caching.</span>
    response.last_modified = now
    response.headers[<span class="st">'Cache-Control'</span>] = <span class="st">'no-cache, no-store, must-revalidate'</span>
    response.headers[<span class="st">'Pragma'</span>] = <span class="st">'no-cache'</span>
    response.headers[<span class="st">'Expires'</span>] = <span class="st">'0'</span>
    <span class="kw">return</span> response


<span class="kw">if</span> <span class="ot">__name__</span> == <span class="st">'__main__'</span>:
    app.run()</code></pre>
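<p>With <code>utmb</code> stored on every row, the duration of a visit can be estimated as the time between its first and last pageview. A small sketch over rows already fetched from <code>pageviews</code>; the function name <code>visit_durations</code> and the sample rows are hypothetical:</p>
<pre class="sourceCode python"><code class="sourceCode python">from datetime import datetime
from uuid import uuid4


def visit_durations(rows):
    """Map each utmb to the time between its first and last pageview.

    rows is an iterable of (utmb, date) pairs.
    """
    first, last = {}, {}
    for utmb, date in rows:
        first[utmb] = min(first.get(utmb, date), date)
        last[utmb] = max(last.get(utmb, date), date)
    return {utmb: last[utmb] - first[utmb] for utmb in first}


visit = uuid4()
rows = [
    (visit, datetime(2016, 7, 10, 12, 0, 0)),
    (visit, datetime(2016, 7, 10, 12, 4, 30)),
]
print(visit_durations(rows)[visit])  # 0:04:30</code></pre>
<p>A visit with a single pageview gets a duration of zero, so this estimate is a lower bound.</p>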
<p>Remember to add the <code>img.gif</code> file and a <code>config.py</code> file with extra settings (like <code>SERVER_NAME</code>).</p>
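<p>If no image is at hand, a one-pixel transparent GIF is enough. The sketch below writes one from a widely used base64 string; the helper name <code>write_pixel</code> and the temporary target folder are assumptions, and in practice the file would go in the application's root folder, where a relative path given to <code>send_file</code> is resolved:</p>
<pre class="sourceCode python"><code class="sourceCode python">import base64
import os
import tempfile

# Widely used 1x1 transparent GIF, base64-encoded (42 bytes once decoded).
PIXEL = base64.b64decode(
    'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7')


def write_pixel(path):
    """Write the 1x1 GIF to path and return the number of bytes written."""
    with open(path, 'wb') as f:
        return f.write(PIXEL)


# A temporary folder keeps the example harmless; use the app folder instead.
target = os.path.join(tempfile.mkdtemp(), 'img.gif')
print(write_pixel(target))</code></pre>
<p>A <code>config.py</code> can be as small as one line such as <code>SERVER_NAME = 'analytics.example.com'</code>, where the host name is a placeholder.</p>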
</section>
</article>
</section>
<footer></footer>
<script type="text/javascript" src="/js/article.js"></script>
</body>
</html>