-
Notifications
You must be signed in to change notification settings - Fork 0
/
week13.html
190 lines (184 loc) · 21.4 KB
/
week13.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Week 13</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
</head>
<body class="markdown github">
<header class="navbar-inverse navbar-fixed-top">
<div class="container">
<nav role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="index.html" class="navbar-brand">J298 Data Journalism</a>
</div> <!-- /.navbar-header -->
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Class notes<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="week1.html">What is data?</a></li>
<li><a href="week2.html">Types of stories</a></li>
<li><a href="week3.html">Working with spreadsheets</a></li>
<li><a href="week4.html">Acquiring, cleaning, and formatting data</a></li>
<li><a href="week5.html">R, RStudio, and the tidyverse</a></li>
<li><a href="week6.html">Data journalism in the tidyverse</a></li>
<li><a href="week7.html">Don't let the data lie to you</a></li>
<li><a href="week8.html">Databases and SQL</a></li>
<li><a href="week9.html">Finding stories using maps</a></li>
<li><a href="week10.html">Maps meet databases</a></li>
<li><a href="week11.html">More PostGIS</a></li>
<li><a href="week12.html">R practice</a></li>
<li><a href="week13.html">PostGIS practice</a></li>
<li><a href="week14.html">More fun with R</a></li>
</ul>
</li>
<li><a href="software.html">Software</a></li>
<li><a href="datasets.html">Data</a></li>
<li><a href="questions.html">If you get stuck</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Email instructors<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="mailto:[email protected]">Peter Aldhous</a></li>
<li><a href="mailto:[email protected]">Amanda Hickman</a></li>
</ul>
</li>
</ul>
</div><!-- /.navbar-collapse -->
</nav>
</div> <!-- /.navbar-header -->
</header>
<div class="container all">
<h2 id="week-13----april-19,-2018"><a name="week-13----april-19,-2018" href="#week-13----april-19,-2018"></a>Week 13 — April 19, 2018</h2><p><em>Instructor: Amanda Hickman</em></p><h1 id="fixing-your-path"><a name="fixing-your-path" href="#fixing-your-path"></a>Fixing your path</h1><p>This is a good explanation of the command path issues we were navigating last time we worked in the terminal:</p><p><a href="http://www.tech-recipes.com/rx/2621/os_x_change_path_environment_variable/k">http://www.tech-recipes.com/rx/2621/os_x_change_path_environment_variable/k</a></p><p>Roughly, your command line interpreter, <code>bash</code>, uses a variable called <code>PATH</code> to decide where to look for programs to run. If you run <code>echo $PATH</code> at the command line, you can see a list of all the directories your shell looks in for executable programs. Yours is probably different from mine, but on my primary laptop (which runs Ubuntu, not OSX), my path looks like this:</p><pre class="bash hljs"><code class="bash" data-origin="<pre><code class="bash">amanda@mona:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/heroku/bin:/opt/venvs/vdirsyncer-latest/bin/
</code></pre>">amanda@mona:~$ <span class="hljs-built_in">echo</span> <span class="hljs-variable">$PATH</span>
/usr/<span class="hljs-built_in">local</span>/sbin:/usr/<span class="hljs-built_in">local</span>/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/<span class="hljs-built_in">local</span>/games:/snap/bin:/usr/<span class="hljs-built_in">local</span>/heroku/bin:/opt/venvs/vdirsyncer-latest/bin/
</code></pre><p>Note that I manually added a few paths there: <code>/usr/local/heroku/bin</code> lets me run a handful of specialized heroku commands, and <code>/opt/venvs/vdirsyncer-latest/bin/</code> lets me run vidirsyncer, which I’m super happy to talk about but is kind of a rabbit hole in this context.</p><p>If you want to run a program that isn’t in your <code>PATH</code> variable, you have to use the “absolute path” — the full path to the program, starting with <code>/</code> for the root directory. Alternatively, you can edit your <code>PATH</code> to include the path to the directory that includes that program.</p><p>To add <code>shp2pgsql</code> so you can access it easily, you want to tell your shell that it needs to look in <code>/Applications/Postgres.app/Contents/Versions/10/bin/</code> for programs.</p><p>To do that, you’re going to edit a hidden file in your home directory (probably <code>/Users/yourname/</code>) called <code>.profile</code>, to add a line like:</p><pre><code data-origin="<pre><code>export PATH=/Applications/Postgres.app/Contents/Versions/10/bin/:$PATH
</code></pre>">export PATH=/Applications/Postgres.app/Contents/Versions/10/bin/:$PATH
</code></pre><p><code>PATH</code> is the <a href="https://www.gnu.org/software/bash/manual/html_node/Shell-Parameters.html#index-shell-variable">shell variable</a> that contains the array of directories that bash looks in to find executable programs, and we print the contents of that variable with <code>$PATH</code>. This command <code>exports</code> or “sets” a new <code>PATH</code> by concatenating the new directory (<code>/Applications/Postgres.app/Contents/Versions/10/bin/</code>) that you want to include, with the existing values stored in <code>PATH</code>.</p><h1 id="do-you-have-enough-ram-for-this?"><a name="do-you-have-enough-ram-for-this?" href="#do-you-have-enough-ram-for-this?"></a>Do you have enough RAM for this?</h1><p>I was surprised to discover that quite a few of you couldn’t actually load the US zipcode file into postgres via Postico. My suspicion is that you could load it if you use the command line tools, but Postico wasn’t able to load the whole thing into its preview window — queueing it up so you can scroll around in it takes more memory than just loading it.</p><p>Use <a href="https://support.apple.com/en-us/HT201260">these instructions</a> to find some basic information about your computer. Please complete this survey: <a href="https://goo.gl/forms/oIsHQAUuyWREvlrC3">https://goo.gl/forms/oIsHQAUuyWREvlrC3</a> so we can get a handle on the problem.</p><h1 id="data-of-the-week"><a name="data-of-the-week" href="#data-of-the-week"></a>Data of the Week</h1><p>Charlotte and Susie are up:<br><a href="https://docs.google.com/spreadsheets/d/11JLkkyWZf3fvVz3aebgMjcZ6mxV-j5Gw7hEpeiPAGY4/edit#gid=0">https://docs.google.com/spreadsheets/d/11JLkkyWZf3fvVz3aebgMjcZ6mxV-j5Gw7hEpeiPAGY4/edit#gid=0</a></p><h1 id="sql-vocabulary"><a name="sql-vocabulary" href="#sql-vocabulary"></a>SQL Vocabulary</h1><p>Quick follow up to a conversation we had two weeks ago about why FLOAT is a synonym for DOUBLE PRECISION, and what that means to begin with. The short answer is “it has something to do with how computers store data” — As for almost any arcane topic in mathematics and computing there is a wikipedia article that goes deep on the finer points. <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic">Floating point arithmetic</a> is no exception.</p><p>You can see the full list of data types that Postgres supports in <a href="https://www.postgresql.org/docs/10/static/datatype.html">the Postgres documentation</a>. You’re almost always going to be using one of only a few: numeric, float, integer, text, char, varchar, and geometry.</p><table>
<thead>
<tr>
<th>format</th>
<th>usage</th>
</tr>
</thead>
<tbody>
<tr>
<td> NUMERIC</td>
<td>any number — this is the preferable format but some systems default to <code>FLOAT</code> if you don’t specify <code>NUMERIC</code></td>
</tr>
<tr>
<td> FLOAT</td>
<td>a decimal number, synonymous with <code>DOUBLE PRECISION</code>— if you plan to do any math on the number, <code>NUMERIC</code> is preferable</td>
</tr>
<tr>
<td> INTEGER</td>
<td>any whole number</td>
</tr>
<tr>
<td> BIGINT</td>
<td>very large integer (> 2.1B or < -2.1B)</td>
</tr>
<tr>
<td> TEXT</td>
<td>any text — this is an efficient format to use in Postgres but in other SQL databases you may not be able to index, search or sort a <code>TEXT</code> column. It isn’t standard across SQL implementations.</td>
</tr>
<tr>
<td> CHAR</td>
<td>text that is always the same length</td>
</tr>
<tr>
<td> VARCHAR</td>
<td>text of varying length</td>
</tr>
<tr>
<td> GEOMETRY</td>
<td>geographic data (points, lines, polygons)</td>
</tr>
</tbody>
</table><h1 id="hands-on-postgis"><a name="hands-on-postgis" href="#hands-on-postgis"></a>Hands On PostGIS</h1><p>We started this two weeks ago and hit some walls that I wasn’t expecting, so let’s try again. Download the <a href="data/week13.zip">Week 13 data</a> bundle so you have everything in one place.</p><h2 id="where-are-you-working?"><a name="where-are-you-working?" href="#where-are-you-working?"></a>Where are you working?</h2><p><a href="week11.html">Remember</a> that to do PostGIS queries you will need to make sure you’ve enabled it on the database you’re working in, with <code>CREATE EXTENSION postgis;</code> — so make sure that you’re working in a database where you’ve enabled postgis.</p><h2 id="find-alameda-county-zipcodes"><a name="find-alameda-county-zipcodes" href="#find-alameda-county-zipcodes"></a>Find Alameda County zipcodes</h2><p>what are the zipcodes that are overlap with Alameda county. Start by spelling out how you’d approach this. Can you describe in words what you’re trying to do? Write this in a comment at the top of your script.</p><p>You’re looking for a way to capture all the shapes in one layer that intersect with a single shape in another layer. The census publishes <a href="https://www.census.gov/geo/maps-data/data/cbf/cbf_zcta.html">zipcode boundaries for the whole US</a>, but for some of your computers that was too much data, so I cut it down to just California zipcodes for you. You’ll need the <a href="https://ucb-dataj.github.io/2018/data/week13/ca_zips.sql">sql</a> to create a table and the data itself is in a <a href="https://ucb-dataj.github.io/2018/data/week13/ca_zips.csv">separate csv</a>.</p><p>Alameda County’s open data portal publishes a <a href="https://data.acgov.org/Geospatial-Data/County-Boundary/rygg-x9nr">county boundary file</a>. I already used <a href="http://bostongis.com/pgsql2shp_shp2pgsql_quickguide.bqg"><code>shp2pgsql</code></a> to convert the shapefile to SQL.</p><ol>
<li>Use <code>shp2pgsql</code> to convert the Shapefile into SQL.</li><li>Import the SQL into Postgres. You can either<br>a. use Postico’s “Load Query” button to load <code>alameda_county_boundary.sql</code>, or<br>b. use <code>psql</code> at the command line with <code>psql -d week11_postgis -f alameda_county_boundary.sql</code> (but note that <code>week11_postgis</code> is my database name. You may not have named your database <code>week11_postgis</code>.)</li><li>Create a spatial index. You can either<br>a. Load it into QGIS and click “create index” or<br>b. run <code>CREATE INDEX sidx_alameda_geom ON alameda USING gist (geom);</code></li></ol><h3 id="find-your-postgis-command"><a name="find-your-postgis-command" href="#find-your-postgis-command"></a>Find Your PostGIS command</h3><p> There are a few that sound like they might be what we want: <a href="http://postgis.net/docs/ST_Within.html">ST_Within</a>, <a href="http://postgis.net/docs/ST_Contains.html">ST_Contains</a>, <a href="http://postgis.net/docs/ST_Intersection.html">ST_Intersection </a>, <a href="http://postgis.net/docs/ST_Intersects.html">ST_Intersects</a>. Take a look at the documentation: how do these differ?</p><p>We’re going to use <a href="http://postgis.net/docs/ST_Intersects.html">ST_Intersects</a> to find all the California zipcode shapes that intersect with Alameda county.</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class="SQL">
SELECT
alameda.geom AS county_geom,
alameda.name AS county,
zipcodes.zcta5ce10 AS zipcode,
zipcodes.geom as zip_geom
FROM
alameda, zipcodes
WHERE
ST_Intersects(zipcodes.geom, alameda.geom) AND alameda.name = 'Alameda County';
</code></pre>">
<span class="hljs-operator"><span class="hljs-keyword">SELECT</span>
alameda.geom <span class="hljs-keyword">AS</span> county_geom,
alameda.name <span class="hljs-keyword">AS</span> county,
zipcodes.zcta5ce10 <span class="hljs-keyword">AS</span> zipcode,
zipcodes.geom <span class="hljs-keyword">as</span> zip_geom
<span class="hljs-keyword">FROM</span>
alameda, zipcodes
<span class="hljs-keyword">WHERE</span>
ST_Intersects(zipcodes.geom, alameda.geom) <span class="hljs-keyword">AND</span> alameda.name = <span class="hljs-string">'Alameda County'</span>;</span>
</code></pre><p>Since that seems like it worked, make a table out of that:</p><pre class="sql hljs"><code class="sql" data-origin="<pre><code class="sql">CREATE TABLE alameda_zipcodes AS
SELECT zipcodes.* FROM zipcodes, alameda
WHERE ST_Intersects(zipcodes.geom, alameda.geom) AND alameda.name = 'Alameda County';
</code></pre>"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> alameda_zipcodes <span class="hljs-keyword">AS</span>
<span class="hljs-keyword">SELECT</span> zipcodes.* <span class="hljs-keyword">FROM</span> zipcodes, alameda
<span class="hljs-keyword">WHERE</span> ST_Intersects(zipcodes.geom, alameda.geom) <span class="hljs-keyword">AND</span> alameda.name = <span class="hljs-string">'Alameda County'</span>;</span>
</code></pre><h3 id="troubleshooting"><a name="troubleshooting" href="#troubleshooting"></a>Troubleshooting</h3><p>You’re going to wind up capturing a few zipcodes that only cross the county line in tiny spots. Can you brainstorm some ways to address those?</p><h2 id="find-san-francisco-zipcodes"><a name="find-san-francisco-zipcodes" href="#find-san-francisco-zipcodes"></a>Find San Francisco zipcodes</h2><p>San Francisco doesn’t publish a handy county file, so I pulled down the <a href="https://www.census.gov/geo/maps-data/data/cbf/cbf_counties.html">TIGER county shapefiles</a>. This is smaller than the zipcode file but it’s still pretty big, so I also ran it through <a href="http://bostongis.com/pgsql2shp_shp2pgsql_quickguide.bqg"><code>shp2pgsql</code></a> with <code>shp2pgsql cb_2017_us_county_20m.shp counties postgres > counties.sql</code>, and then pruned it by loading it into Postgres and running:</p><pre><code data-origin="<pre><code>DELETE FROM counties WHERE statefp != '06';
</code></pre>">DELETE FROM counties WHERE statefp != '06';
</code></pre><p>You can load the smaller, California only file from the <a href="data/week13.zip">week 13 data</a> file — it’s <code>counties.sql</code>.</p><h3 id="find-all-the-zipcodes-in-san-francisco-county?-what-about-fresno-county?"><a name="find-all-the-zipcodes-in-san-francisco-county?-what-about-fresno-county?" href="#find-all-the-zipcodes-in-san-francisco-county?-what-about-fresno-county?"></a>Find all the zipcodes in San Francisco county? What about Fresno county?</h3><p>Start by writing down the steps you need to take, then … take them.</p><h2 id="create-mappable-points"><a name="create-mappable-points" href="#create-mappable-points"></a>Create mappable points</h2><p>San Francisco publishes <a href="https://data.sfgov.org/Public-Safety/Car-Break-ins/6har-q36k">car break in data</a>.</p><p>Can you …</p><ul>
<li>Create a new table for this data</li><li>Import the data into a PostGIS database</li><li>Do you remember the function you need to make WKT point out of latitude and longitude values?</li><li>View it in QGIS</li></ul><p>I did use <a href="https://csvkit.readthedocs.io/en/1.0.3/scripts/csvsql.html">csvsql</a> to generate a CREATE statement for you.</p><pre class="sql hljs"><code class="sql" data-origin="<pre><code class="sql">CREATE TABLE car_breakins (
incidntnum VARCHAR(9) NOT NULL,
category VARCHAR(13) NOT NULL,
descript VARCHAR(28) NOT NULL,
dayofweek VARCHAR(9) NOT NULL,
event_date DATE NOT NULL,
event_time TIME WITHOUT TIME ZONE NOT NULL,
pddistrict VARCHAR(10) NOT NULL,
resolution VARCHAR(21),
address VARCHAR(42) NOT NULL,
x FLOAT NOT NULL,
y FLOAT NOT NULL,
location VARCHAR(41) NOT NULL
);
</code></pre>"><span class="hljs-operator"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> car_breakins (
incidntnum <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">9</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
category <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">13</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
descript <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">28</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
<span class="hljs-keyword">dayofweek</span> <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">9</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
event_date <span class="hljs-built_in">DATE</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
event_time <span class="hljs-keyword">TIME</span> WITHOUT <span class="hljs-keyword">TIME</span> ZONE <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
pddistrict <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">10</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
resolution <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">21</span>),
address <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">42</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
x <span class="hljs-built_in">FLOAT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
y <span class="hljs-built_in">FLOAT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
location <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">41</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>
);</span>
</code></pre><p>Bonus questions: where is this data from? Who compiled it? Can you sniff anything out from the Socrata metadata?</p><h3 id="assigment"><a name="assigment" href="#assigment"></a>Assigment</h3><p>Please submit your final data journalism project by <strong>Tuesday May 1 at 8pm</strong>.</p><ul>
<li><p>In bCourses, submit a full write-up of your project. This should be written so that someone with no prior knowledge of your project can understand why this is interesting as well as what you have done. It should include:</p>
<ul>
<li><p>A clear and direct lede that tells us why we should keep reading, and what the main points of your work are. This should be written for a new audience and should demonstrate the writing skills you have learned in your other classes. Think of this as your chance to sell an editor on this work, by showing that you can write and that you understand this data.</p>
</li><li><p>A description of the data you used, and how it was cleaned and processed for analysis. Again, imagine you are writing for an editor who doesn’t know you and needs to know that you have made smart decisions about data to base your reporting on and that you truly understand what you are doing.</p>
</li><li><p>The questions you asked of the data, and the tools you used to ask them.</p>
</li><li><p>The conclusions you have drawn. Include any charts/maps that are relevant.</p>
</li><li><p>Any further analyses you would need to run to turn this project into a story.</p>
</li><li><p>The additional reporting, beyond data analysis, that would be required to turn this project into a story.</p>
</li></ul>
</li><li><p>Share the data and code for your analysis. Code should be commented so that your instructors are able to understand each step of your analysis. You can share this in a zipped folder and email if the data files are not too large. If they are too large to email, you can share via a Dropbox link.</p>
</li><li><p>Plan to come to class on May 3 prepared to give a 5-7 minute presentation on your project, and answer questions from other students. Plan, also, to listen thoughtfully to your classroom colleagues and engage them on their work this semester.</p>
</li></ul><h1 id="keep-learning"><a name="keep-learning" href="#keep-learning"></a>Keep Learning</h1><ul>
<li><a href="https://multimedia.journalism.berkeley.edu/tutorials/qgis-basics-journalists/">QGIS Tutorial</a></li><li><a href="https://www.lynda.com/search?q=qgis">Lynda QGIS</a> | <a href="https://www.lynda.com/search?q=postgis">Lynda PostGIS</a></li></ul>
</div> <!-- /.container all -->
<script src="https://code.jquery.com/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
</body>
</html>