<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/Article">
<head>
<title>Analyze Apache HTTP server access log with Pandas</title>
<meta charset="utf-8">
<meta property="og:title" content="Analyze Apache HTTP server access log with Pandas">
<meta property="og:site_name" content="Modesto Mas | Blog">
<meta property="og:image" content="https://mmas.github.io/images/profile.jpg">
<meta property="og:image:width" content="200">
<meta property="og:image:height" content="200">
<meta property="og:url" content="https://mmas.github.io/analyze-apache-access-log-pandas">
<meta property="og:locale" content="en_GB">
<meta name="twitter:image" content="https://mmas.github.io/images/profile.jpg">
<meta name="twitter:url" content="https://mmas.github.io/analyze-apache-access-log-pandas">
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="mmas.github.io">
<meta name="twitter:title" content="Analyze Apache HTTP server access log with Pandas">
<meta name="description" content="In the last post we saw how to read an Apache HTTP server access log with pandas. We ended with a dataframe structured like this: ip timestamp size re...">
<meta name="twitter:description" content="In the last post we saw how to read an Apache HTTP server access log with pandas. We ended with a dataframe structured like this: ip timestamp size re...">
<meta property="og:description" content="In the last post we saw how to read an Apache HTTP server access log with pandas. We ended with a dataframe structured like this: ip timestamp size re...">
<meta name="keywords" content="data-analysis,pandas,python">
<meta property="og:type" content="blog">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:type" content="article">
<meta property="article:author" content="https://github.com/mmas">
<meta property="article:section" content="data-analysis">
<meta property="article:tag" content="data-analysis,pandas,python">
<meta property="article:published_time" content="2015-11-23">
<meta property="article:modified_time" content="2015-11-23">
<link rel="stylesheet" type="text/css" href="/css/main.css">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
CommonHTML: {
scale: 93,
showMathMenu: false
},
tex2jax: {
"inlineMath": [["$","$"], ["\\(","\\)"]]
}
});
</script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</head>
<body class="entry-detail">
<header>
<div>
<img src="https://mmas.github.io/images/profile.jpg">
<a class="brand" href="/">Modesto Mas</a>
<span>Data/Python/DevOps Engineer</span>
<nav>
<ul>
<li><a href="/tags">Tags</a></li>
<li><a href="https://github.com/mmas/mmas.github.io/issues" target="_blank">Issues</a></li>
</ul>
</nav>
</div>
</header>
<section id="content" role="main">
<article>
<header>
<h1><a href="/analyze-apache-access-log-pandas">Analyze Apache HTTP server access log with Pandas</a></h1>
<time datetime="2015-11-23">Nov 23, 2015</time>
<a class="tag" href="/tags?tag=data-analysis">data-analysis</a>
<a class="tag" href="/tags?tag=pandas">pandas</a>
<a class="tag" href="/tags?tag=python">python</a>
</header>
<aside id="article-nav"></aside>
<section class="body">
<p>In <a target="_blank" href="/read-apache-access-log-pandas">the last post</a> we saw how to read an <a target="_blank" href="http://httpd.apache.org/">Apache HTTP server</a> access log with <a target="_blank" href="http://pandas.pydata.org/">pandas</a>. We ended with a dataframe structured like this:</p>
<table>
<thead>
<tr>
<th>ip</th>
<th>timestamp</th>
<th>size</th>
<th>referer</th>
<th>user_agent</th>
<th>resource</th>
</tr>
</thead>
<tbody>
<tr>
<td>X.X.X.X</td>
<td>2015-11-23 18:17:40+00:00</td>
<td>5303</td>
<td>NaN</td>
<td>Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...</td>
<td>/</td>
</tr>
<tr>
<td>X.X.X.X</td>
<td>2015-11-23 18:52:14+00:00</td>
<td>1550</td>
<td>https://duckduckgo.com</td>
<td>Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42....</td>
<td>/</td>
</tr>
<tr>
<td>X.X.X.X</td>
<td>2015-11-23 19:16:48+00:00</td>
<td>1513</td>
<td>NaN</td>
<td>Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...</td>
<td>/</td>
</tr>
<tr>
<td>X.X.X.X</td>
<td>2015-11-23 19:16:56+00:00</td>
<td>5303</td>
<td>NaN</td>
<td>Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...</td>
<td>/</td>
</tr>
<tr>
<td>X.X.X.X</td>
<td>2015-11-23 19:24:38+00:00</td>
<td>2754</td>
<td>https://www.google.com/</td>
<td>Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.3...</td>
<td>/querying_hive</td>
</tr>
</tbody>
</table>
<p>Here we'll do some data wrangling and aggregation to display information about this website's visits. Let's start:</p>
<pre><code class="python">import pandas as pd
import matplotlib.pyplot as plt
</code></pre>
<h2>Referer</h2>
<p>Information about the pages that linked to a resource on this website.</p>
<pre><code class="python">referers = data['referer'].dropna()
</code></pre>
<h3>Referers domain</h3>
<p>The two most common referer domains, as a fraction of all referers:</p>
<pre><code class="python"># Escape the dot so that "www." is matched literally
domains = referers.str.extract(r'^(https?://)?(www\.)?([^/]*)')[2].str.lower()
domains.value_counts()[:2].divide(domains.count())
</code></pre>
<pre><code class="stdout">mastortosa.com 0.280564
google.com 0.145877
Name: 2, dtype: float64
</code></pre>
<p>The most common referers are pages of this website.</p>
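<p>As a minimal sketch of the same extraction on a few made-up referers (the URLs below are hypothetical, not taken from the log), each capture group becomes a column, and <code>value_counts(normalize=True)</code> is an alternative way to get the fractions:</p>

```python
import pandas as pd

# Hypothetical referers, invented for illustration only.
referers = pd.Series([
    'https://www.google.com/search?q=pandas',
    'http://example.org/some/page',
    'https://example.org/',
    'https://duckduckgo.com',
])

# Column 0: scheme, column 1: "www." prefix, column 2: domain.
parts = referers.str.extract(r'^(https?://)?(www\.)?([^/]*)')
domains = parts[2].str.lower()

# normalize=True divides each count by the total number of values.
shares = domains.value_counts(normalize=True)
print(shares)
```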
<h3>Google searches</h3>
<p>Google queries that led to this website:</p>
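<pre><code class="python"># Non-capturing groups avoid the pandas warning about match groups in str.contains
google_searches = referers[referers.str.contains(
    r'^(?:https?://)?(?:www\.)?google\.[^/]*/search\?')]
# expand=False returns a Series, so the .str accessor can be used afterwards
google_queries = google_searches.str.extract(r'[?&]q=([^&]*)', expand=False)
google_queries = google_queries.str.replace('+', ' ', regex=False)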
google_queries[:5]
</code></pre>
<pre><code class="stdout">3812 scikit image vs opencv
4143 pandas code datetime.timedelta
5276 opencv get skimage
5277 opencv get skimage
5974 comparison opencv scipy.ndimage
Name: referer, dtype: object
</code></pre>
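<p>A regular expression leaves percent-encoded characters undecoded. As an alternative sketch, not the approach used in this post, the standard library's <code>urllib.parse</code> can parse the query string and decode it. The helper name <code>google_query</code> and the sample URL are assumptions made for illustration:</p>

```python
from urllib.parse import urlparse, parse_qs


def google_query(referer):
    """Return the decoded 'q' parameter of a Google search referer, or None."""
    parsed = urlparse(referer)
    host = parsed.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    # Only accept google.<tld>/search URLs.
    if not host.startswith('google.') or parsed.path != '/search':
        return None
    # parse_qs decodes '+' as a space and resolves %XX escapes.
    values = parse_qs(parsed.query).get('q')
    return values[0] if values else None


print(google_query('https://www.google.com/search?q=scikit+image+vs+opencv&ie=utf-8'))
```

<p>It could be applied to the whole column with <code>referers.apply(google_query).dropna()</code>.</p>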
<h2>Time</h2>
<p>Information about the visits over time.</p>
<h3>Visits by week day</h3>
<p>Share of visits by day of the week, normalized so the values sum to 1 (1 is Monday, following ISO numbering):</p>
<pre><code class="python">data['weekday'] = data['timestamp'].apply(lambda x: x.isoweekday())
weekdays = data.groupby('weekday')['ip'].agg(len)
weekdays = weekdays.divide(weekdays.sum())
weekdays
</code></pre>
<pre><code class="stdout">weekday
1 0.145757
2 0.142374
3 0.134118
4 0.144539
5 0.139261
6 0.152389
7 0.141562
Name: ip, dtype: float64
</code></pre>
<pre><code class="python">weekdays.index = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
weekdays.plot(kind='barh')
plt.title('Visits over the week')
plt.xlabel('visits (normed)')
plt.show()
</code></pre>
<p><img alt="Pandas barh plot" src="/images/pandas_barh_plot.png"></p>
<p>There are no big differences between days, but Saturday has the most visits.</p>
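<p>The same shares can also be computed without <code>apply</code> and <code>groupby</code>. This is a sketch on hypothetical timestamps, using the vectorized <code>.dt</code> accessor:</p>

```python
import pandas as pd

# Hypothetical timestamps; 2015-11-23 was a Monday.
timestamps = pd.Series(pd.to_datetime([
    '2015-11-23 18:17:40', '2015-11-23 18:52:14',
    '2015-11-24 09:00:00', '2015-11-28 12:30:00',
], utc=True))

# dayofweek is 0 for Monday, so add 1 to match isoweekday().
weekday = timestamps.dt.dayofweek + 1

# Fraction of visits per weekday, ordered from Monday to Sunday.
shares = weekday.value_counts(normalize=True).sort_index()
print(shares)
```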
<h3>Total daily visits</h3>
<p>Plot the daily visit counts since June 2015:</p>
<pre><code class="python">visits = data['resource'].copy()
visits.index = data['timestamp']
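# The how= and kind= arguments are gone from recent pandas versions:
# aggregate on the resampler, then convert the index to periods
visits = visits.resample('D').count().to_period('D')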
visits.index.name = 'date'
visits['6/2015':].plot()
plt.title('Total visits')
plt.ylabel('visits')
plt.show()
</code></pre>
<p><img alt="Pandas timeseries plot" src="/images/pandas_timeseries_plot.png"></p>
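<p>To make the resampling step concrete, here is a small sketch on made-up data. It assumes a recent pandas version, where the aggregation is called on the resampler object. Days without any visit still get a row, with a count of 0:</p>

```python
import pandas as pd

# Hypothetical visits, indexed by timestamp.
index = pd.to_datetime([
    '2015-06-01 08:00', '2015-06-01 21:30', '2015-06-03 10:15',
])
visits = pd.Series(['/', '/querying_hive', '/'], index=index)

# One bin per calendar day; empty days are kept with a count of 0.
daily = visits.resample('D').count()
print(daily)
```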
<h2>Content</h2>
<p>Information about the content visited within the website.</p>
<h3>Tags searches</h3>
<p>Searches by tag made from within the site:</p>
<pre><code class="python">visits = data['resource'].copy()
tags = visits[visits.str.match(r'/tags/')]
# expand=False returns a Series, which value_counts() needs below
tags = tags.str.extract(r'/tags/(.*)', expand=False)
tags.value_counts().plot(kind='pie', colors=list('rgbymc'))
plt.title('Tag searches')
plt.xlabel('')
plt.ylabel('')
plt.show()
</code></pre>
<p><img alt="Pandas pie plot" src="/images/pandas_pie_plot.png"></p>
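<p>The counts behind the pie chart can be checked on a toy example. The resources below are hypothetical, and the sketch extracts the tag in a single step, dropping the rows that do not match instead of filtering first:</p>

```python
import pandas as pd

# Hypothetical requested resources.
resources = pd.Series([
    '/tags/python', '/', '/tags/pandas', '/tags/python', '/querying_hive',
])

# Rows that do not match the pattern become NaN and are dropped.
tags = resources.str.extract(r'^/tags/(.+)', expand=False).dropna()
counts = tags.value_counts()
print(counts)
```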
<h3>Entries visited</h3>
<p>To get the visits per entry we need to clean the resource URI. First, we filter out visits to the home page (/) and searches by tag (/tags/{tag}). This website also used to have two sections, <em>blog</em> and <em>lab</em>, so visits to the section pages themselves (/blog/ and /lab/) are filtered out too. Finally, entries that lived under these sections are mapped to the current URL scheme: everything after /blog/ or /lab/ is treated as if it were appended directly to the root /:</p>
<pre><code class="python">visits = data['resource'].copy()
visits.index = data['timestamp']
entries = visits[visits.str.match(
    r'(?!.*/tags)(?!^/blog\/$)(?!^/lab/$)/[^\?]+$')]
# regex=True is required in recent pandas versions for a pattern replacement
entries = entries.str.replace(r'/blog/|/lab/', '/', regex=True)
entries[:10]
</code></pre>
<pre><code class="stdout">timestamp
2015-02-13 08:25:06+00:00 /simple-web-analytics-python-pandas
2015-02-13 08:25:17+00:00 /web-visits-data-analytics-javascript-python--...
2015-02-13 08:42:14+00:00 /web-visits-data-analytics-javascript-python--...
2015-02-13 08:42:24+00:00 /simple-web-analytics-python-pandas
2015-02-13 08:58:19+00:00 /simple-web-analytics-python-pandas
2015-02-13 08:58:26+00:00 /web-visits-data-analytics-javascript-python--...
2015-02-13 09:01:38+00:00 /web-visits-data-analytics-javascript-python--...
2015-02-13 09:02:04+00:00 /simple-web-analytics-python-pandas
2015-02-13 10:18:36+00:00 /web-visits-data-analytics-javascript-python--...
2015-02-13 10:18:39+00:00 /simple-web-analytics-python-pandas
Name: resource, dtype: object
</code></pre>
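<p>The negative lookaheads in the filter are easier to follow on a handful of sample URIs. This sketch applies the same pattern with the standard <code>re</code> module (the sample paths are made up; <code>str.match</code> anchors at the start of the string in the same way as <code>re.match</code>):</p>

```python
import re

# Same pattern as in the filter above.
pattern = re.compile(r'(?!.*/tags)(?!^/blog\/$)(?!^/lab/$)/[^\?]+$')

# Hypothetical resource URIs.
samples = [
    '/',                        # home page: nothing after the slash
    '/tags/python',             # tag search: rejected by the first lookahead
    '/blog/',                   # old section page
    '/lab/',                    # old section page
    '/blog/querying_hive',      # entry under an old section
    '/optimization_scipy',      # entry with the current URL scheme
    '/optimization_scipy?x=1',  # query string: rejected by [^\?]+$
]

kept = [s for s in samples if pattern.match(s)]
# Map the old section prefixes to the root.
cleaned = [re.sub(r'/blog/|/lab/', '/', s) for s in kept]
print(cleaned)
```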
<p>The set of entries:</p>
<pre><code class="python">for i in entries.unique():
    print(i)
</code></pre>
<pre><code class="stdout">/simple-web-analytics-python-pandas
/web-visits-data-analytics-javascript-python--geoip
/python-image-processing-libraries-performance-opencv-scipy-scikit-image
/freelance-invoices-manager
/hadoop_practical_introduction_mapreduce_python
/hadoop_streaming_practical_introduction_mapreduce_python
/data_analysis_apache_pig_practical_introduction
/data_analyisis_apache_hive_practical_introduction
/loading_data_hive
/querying_hive
/interpolation_scipy
/optimization_scipy
/least_squares_fitting_numpy_scipy
/read_apache_access_log_pandas
</code></pre>
<p>To get the number of visits to an entry within a period of time, such as <a target="_blank" href="/optimization_scipy">Optimization methods in Scipy</a> in November 2015:</p>
<pre><code class="python">(entries['11/2015']=='/optimization_scipy').sum()
</code></pre>
<pre><code class="stdout">21
</code></pre>
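<p>Selecting a whole month with a string relies on partial string indexing of a datetime index. A minimal sketch with hypothetical data, using <code>.loc</code> to make the label-based selection explicit:</p>

```python
import pandas as pd

# Hypothetical entry visits, indexed by timestamp (sorted).
index = pd.to_datetime([
    '2015-10-30 10:00', '2015-11-02 09:00',
    '2015-11-15 18:00', '2015-11-20 07:00',
])
entries = pd.Series([
    '/optimization_scipy', '/optimization_scipy',
    '/interpolation_scipy', '/optimization_scipy',
], index=index)

# '2015-11' selects every row whose timestamp falls in November 2015.
november = entries.loc['2015-11']

# Comparing gives a boolean Series; summing it counts the True values.
count = (november == '/optimization_scipy').sum()
print(count)
```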
<p>Show the daily visits over November for three different entries, together with the total visits to all entries:</p>
<pre><code class="python">urls = [
    '/python-image-processing-libraries-performance-opencv-scipy-scikit-image',
    '/interpolation_scipy',
    '/optimization_scipy'
]
entries = entries['11/2015']
# The how= and kind= arguments are gone from recent pandas versions:
# aggregate on the resampler, then convert the index to periods
all_entries = entries.resample('D').count().to_period('D')
for i, url in enumerate(urls):
    entry = entries[entries == url].resample('D').count().to_period('D')
    entry.index.name = 'date'
    plt.subplot(3, 1, i + 1)  # one row per entry
    plt.title(url)
    all_entries.plot(kind='area', color='k', alpha=.1)
    entry.plot(kind='area', color='g', alpha=.7)
    plt.legend(['all entries', 'entry'], prop={'size': 12})
    plt.xlabel('')
    plt.ylabel('visits')
plt.subplots_adjust(hspace=.5)
plt.show()
</code></pre>
<p><img alt="Pandas timeseries area plot" src="/images/pandas_timeseries_area_plot.png"></p>
<p>This month, the entry <a target="_blank" href="/python-image-processing-libraries-performance-opencv-scipy-scikit-image">Python image processing libraries performance: OpenCV vs Scipy vs Scikit-Image</a> is the most popular on this blog.</p>
</section>
</article>
</section>
<footer></footer>
<script type="text/javascript" src="/js/article.js"></script>
</body>
</html>