analyze-apache-access-log-pandas.html

<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/Article">
  <head>
    <title>Analyze Apache HTTP server access log with Pandas</title>
    

    <meta charset="utf-8">
    <meta property="og:title" content="Analyze Apache HTTP server access log with Pandas">
    <meta property="og:site_name" content="Modesto Mas | Blog">
    <meta property="og:image" content="https://mmas.github.io/images/profile.jpg">
    <meta property="og:image:width" content="200">
    <meta property="og:image:height" content="200">
    <meta property="og:url" content="https://mmas.github.io/analyze-apache-access-log-pandas">
    <meta property="og:locale" content="en_GB">
    <meta name="twitter:image" content="https://mmas.github.io/images/profile.jpg">
    <meta name="twitter:url" content="https://mmas.github.io/analyze-apache-access-log-pandas">
    <meta name="twitter:card" content="summary">
    <meta name="twitter:domain" content="mmas.github.io">
    <meta name="twitter:title" content="Analyze Apache HTTP server access log with Pandas">
    <meta name="description" content="In the last post we saw how to read an Apache HTTP server access log with pandas. We ended with a dataframe structured like this: ip timestamp size re...">
    <meta name="twitter:description" content="In the last post we saw how to read an Apache HTTP server access log with pandas. We ended with a dataframe structured like this: ip timestamp size re...">
    <meta property="og:description" content="In the last post we saw how to read an Apache HTTP server access log with pandas. We ended with a dataframe structured like this: ip timestamp size re...">
    <meta name="keywords" content="data-analysis,pandas,python">
    <meta property="og:type" content="blog">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    
<meta property="og:type" content="article">
<meta property="article:author" content="https://github.com/mmas">
<meta property="article:section" content="data-analysis">
<meta property="article:tag" content="data-analysis,pandas,python">
<meta property="article:published_time" content="2015-11-23">
<meta property="article:modified_time" content="2015-11-23">

    <link rel="stylesheet" type="text/css" href="/css/main.css">
    <script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        CommonHTML: {
            scale: 93,
            showMathMenu: false
        },
        tex2jax: {
            "inlineMath": [["$","$"], ["\\(","\\)"]]
        }
    });
    </script>
    <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
  </head>
  <body class="entry-detail">

    
    <header>
  <div>
    <img src="https://mmas.github.io/images/profile.jpg">
    <a class="brand" href="/">Modesto Mas</a>
    <span>Data/Python/DevOps Engineer</span>
    <nav>
      <ul>
        <li><a href="/tags">Tags</a></li>
        <li><a href="https://github.com/mmas/mmas.github.io/issues" target="_blank">Issues</a></li>
      </ul>
    </nav>
  </div>
</header>
    

    <section id="content" role="main">
    

<article>
  
<header>
  <h1><a href="/analyze-apache-access-log-pandas">Analyze Apache HTTP server access log with Pandas</a></h1>
  <time datetime="2015-11-23">Nov 23, 2015</time>
  
  <a class="tag" href="/tags?tag=data-analysis">data-analysis</a>
  
  <a class="tag" href="/tags?tag=pandas">pandas</a>
  
  <a class="tag" href="/tags?tag=python">python</a>
  
</header>

  <aside id="article-nav"></aside>
  <section class="body">
    <p>In <a target="_blank" href="/read-apache-access-log-pandas">the last post</a> we saw how to read an <a target="_blank" href="http://httpd.apache.org/">Apache HTTP server</a> access log with <a target="_blank" href="http://pandas.pydata.org/">pandas</a>. We ended with a dataframe structured like this:</p>
<table>
  <thead>
    <tr>
      <th>ip</th>
      <th>timestamp</th>
      <th>size</th>
      <th>referer</th>
      <th>user_agent</th>
      <th>resource</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>X.X.X.X</td>
      <td>2015-11-23 18:17:40+00:00</td>
      <td>5303</td>
      <td>NaN</td>
      <td>Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...</td>
      <td>/</td>
    </tr>
    <tr>
      <td>X.X.X.X</td>
      <td>2015-11-23 18:52:14+00:00</td>
      <td>1550</td>
      <td>https://duckduckgo.com</td>
      <td>Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42....</td>
      <td>/</td>
    </tr>
    <tr>
      <td>X.X.X.X</td>
      <td>2015-11-23 19:16:48+00:00</td>
      <td>1513</td>
      <td>NaN</td>
      <td>Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...</td>
      <td>/</td>
    </tr>
    <tr>
      <td>X.X.X.X</td>
      <td>2015-11-23 19:16:56+00:00</td>
      <td>5303</td>
      <td>NaN</td>
      <td>Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...</td>
      <td>/</td>
    </tr>
    <tr>
      <td>X.X.X.X</td>
      <td>2015-11-23 19:24:38+00:00</td>
      <td>2754</td>
      <td>https://www.google.com/</td>
      <td>Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.3...</td>
      <td>/querying_hive</td>
    </tr>
  </tbody>
</table>

<p>Here's we'll do some data wrangling and aggregation to display information about this website visits. Let's start:</p>
<pre><code class="python">import pandas as pd
import matplotlib.pyplot as plt
</code></pre>

<h2>Referer</h2>
<p>Information about the page that linked to a resource of our page.</p>
<pre><code class="python">referers = data['referer'].dropna()
</code></pre>

<h3>Referers domain</h3>
<p>The two more common referer domains from the total referers (normed):</p>
<pre><code class="python">domains = referers.str.extract(r'^(https?://)?(www.)?([^/]*)')[2].str.lower()
domains.value_counts()[:2].divide(domains.count())
</code></pre>

<pre><code class="stdout">mastortosa.com    0.280564
google.com        0.145877
Name: 2, dtype: float64
</code></pre>

<p>The most common referers are pages of this website.</p>
<h3>Google searches</h3>
<p>Google queries that linked this website:</p>
<pre><code class="python">google_searches = referers[referers.str.contains(
    r'^(https?://)?(www.)?(google.[^/]*)/search?')]
google_queries = google_searches.str.extract(r'[?&amp;]q=([^&amp;]*)&amp;?')
google_queries = google_queries.str.replace('+', ' ')
google_queries[:5]
</code></pre>

<pre><code class="stdout">3812             scikit image vs opencv
4143     pandas code datetime.timedelta
5276                 opencv get skimage
5277                 opencv get skimage
5974    comparison opencv scipy.ndimage
Name: referer, dtype: object
</code></pre>

<h2>Time</h2>
<p>Information about the visits over time.</p>
<h3>Visits by week day</h3>
<p>Normed count of visits by week day (being Monday the first day of the week):</p>
<pre><code class="python">data['weekday'] = data['timestamp'].apply(lambda x: x.isoweekday())
weekdays = data.groupby('weekday')['ip'].agg(len)
weekdays = weekdays.divide(weekdays.sum())
weekdays
</code></pre>

<pre><code class="stdout">weekday
1    0.145757
2    0.142374
3    0.134118
4    0.144539
5    0.139261
6    0.152389
7    0.141562
Name: ip, dtype: float64
</code></pre>

<pre><code class="python">weekdays.index = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
weekdays.plot(kind='barh')
plt.title('Visits over the week')
plt.xlabel('visits (normed)')
plt.show()
</code></pre>

<p><img alt="Pandas barh plot" src="/images/pandas_barh_plot.png"></p>
<p>Not big differences between them, but Saturday leads the number of visits.</p>
<h3>Total daily visits</h3>
<p>Plot the daily visits counts since June 2015:</p>
<pre><code class="python">visits = data['resource'].copy()
visits.index = data['timestamp']
visits = visits.resample('D', how='count', kind='period')
visits.index.name = 'date'
visits['6/2015':].plot()
plt.title('Total visits')
plt.ylabel('visits')
plt.show()
</code></pre>

<p><img alt="Pandas timeseries plot" src="/images/pandas_timeseries_plot.png"></p>
<h2>Content</h2>
<p>Information about the content visited within the website.</p>
<h3>Tags searches</h3>
<p>Searches from the site made by tag:</p>
<pre><code class="python">visits = data['resource'].copy()
tags = visits[visits.str.match(r'/tags/')]
tags = tags.str.extract(r'/tags/(.*)')
tags.value_counts().plot(kind='pie', colors=list('rgbymc'))
plt.title('Tag searches')
plt.xlabel('')
plt.ylabel('')
plt.show()
</code></pre>

<p><img alt="Pandas pie plot" src="/images/pandas_pie_plot.png"></p>
<h3>Entries visited</h3>
<p>To get the visits per entry we need to clean the resource URI. In this case, we have to filter out the home page (/) visits as well as the searches by tag (/tags/{tag}). Also, previously there were two sections in this website, <em>blog</em> and <em>lab</em>, so the visits to these sections have to be filtered out too and, later, assume the entries within these sections as entries with the current URL map (this means everything after /blog/ or /lab/ will be assumed as directly appended to the root /):</p>
<pre><code class="python">visits = data['resource'].copy()
visits.index = data['timestamp']
entries = visits[visits.str.match(
    r'(?!.*/tags)(?!^/blog\/$)(?!^/lab/$)/[^\?]+$')]
entries = entries.str.replace(r'/blog/|/lab/', '/')
entries[:10]
</code></pre>

<pre><code class="stdout">timestamp
2015-02-13 08:25:06+00:00                  /simple-web-analytics-python-pandas
2015-02-13 08:25:17+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 08:42:14+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 08:42:24+00:00                  /simple-web-analytics-python-pandas
2015-02-13 08:58:19+00:00                  /simple-web-analytics-python-pandas
2015-02-13 08:58:26+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 09:01:38+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 09:02:04+00:00                  /simple-web-analytics-python-pandas
2015-02-13 10:18:36+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 10:18:39+00:00                  /simple-web-analytics-python-pandas
Name: resource, dtype: object
</code></pre>

<p>The set of entries:</p>
<pre><code class="python">for i in entries.unique():
    print i
</code></pre>

<pre><code class="stdout">/simple-web-analytics-python-pandas
/web-visits-data-analytics-javascript-python--geoip
/python-image-processing-libraries-performance-opencv-scipy-scikit-image
/freelance-invoices-manager
/hadoop_practical_introduction_mapreduce_python
/hadoop_streaming_practical_introduction_mapreduce_python
/data_analysis_apache_pig_practical_introduction
/data_analyisis_apache_hive_practical_introduction
/loading_data_hive
/querying_hive
/interpolation_scipy
/optimization_scipy
/least_squares_fitting_numpy_scipy
/read_apache_access_log_pandas
</code></pre>

<p>To get the number of visits of an entry within a period of time, like <a target="_blank" href="/optimization_scipy">Optimization methods in Scipy</a> in November, just:</p>
<pre><code class="python">(entries['11/2015']=='/optimization_scipy').sum()
</code></pre>

<pre><code class="stdout">21
</code></pre>

<p>Show the daily visits over November of three different entries and the total of entries visits:</p>
<pre><code class="python">urls = [
    '/python-image-processing-libraries-performance-opencv-scipy-scikit-image',
    '/interpolation_scipy',
    '/optimization_scipy'
]
entries = entries['11/2015']
all_entries = entries.resample('D', how='count', kind='period')
for i, url in enumerate(urls):
    entry = entries[entries==url].resample('D', how='count', kind='period')
    entry.index.name = 'date'
    plt.subplot(int('31%d' % (i+1)))
    plt.title(url)
    all_entries.plot(kind='area', color='k', alpha=.1)
    entry.plot(kind='area', color='g', alpha=.7)
    plt.legend(['all entries', 'entry'], prop={'size': 12})
    plt.xlabel('')
    plt.ylabel('visits')
plt.subplots_adjust(hspace=.5)
plt.show()
</code></pre>

<p><img alt="Pandas timeseries area plot" src="/images/pandas_timeseries_area_plot.png"></p>
<p>This month, the entry <a target="_blank" href="/python-image-processing-libraries-performance-opencv-scipy-scikit-image">Python image processing libraries performance: OpenCV vs Scipy vs Scikit-Image</a> is the most popular in this blog.</p>
  </section>
</article>


    </section>
    <footer></footer>

    
<script type="text/javascript" src="/js/article.js"></script>

  </body>
</html>