<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Working With Data on the Web</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Working With Data on the Web</h1>
<h2 class="subtitle">Generalizing and Handling Errors</h2>
<div id="learning-objectives" class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Turn a script into a function.</li>
<li>Make a function more robust by explicitly handling errors.</li>
</ul>
</div>
</div>
<p>Now that we know how to get the data for Canada, let’s create a function that will do the same thing for an arbitrary country. The steps are simple: copy the code we’ve written into a function that takes a 3-letter country code as a parameter, and insert that country code into the URL at the appropriate place:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># Modules the function relies on.</span>
<span class="kw">import</span> csv
<span class="kw">import</span> io
<span class="kw">import</span> requests

<span class="kw">def</span> get_annual_mean_temp_by_country(country):
    <span class="co">'''Get the annual mean temperature for a country given its 3-letter ISO code (such as "CAN").'''</span>
    url = <span class="st">'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/'</span> + country + <span class="st">'.csv'</span>
    response = requests.get(url)
    <span class="kw">if</span> response.status_code != <span class="dv">200</span>:
        <span class="dt">print</span>(<span class="st">'Failed to get data:'</span>, response.status_code)
    <span class="kw">else</span>:
        reader = io.StringIO(response.text)
        wrapper = csv.reader(reader)
        results = []
        <span class="kw">for</span> record in wrapper:
            <span class="kw">if</span> record[<span class="dv">0</span>] != <span class="st">'year'</span>:
                year = <span class="dt">int</span>(record[<span class="dv">0</span>])
                value = <span class="dt">float</span>(record[<span class="dv">1</span>])
                results.append([year, value])
        <span class="kw">return</span> results</code></pre>
<p>This works:</p>
<pre class="sourceCode python"><code class="sourceCode python">canada = get_annual_mean_temp_by_country(<span class="st">'CAN'</span>)
<span class="dt">print</span>(<span class="st">'first five entries for Canada:'</span>, canada[:<span class="dv">5</span>])</code></pre>
<pre class="output"><code>first five entries for Canada: [[1901, -7.67241907119751], [1902, -7.862711429595947], [1903, -7.910782814025879], [1904, -8.155729293823242], [1905, -7.547311305999756]]</code></pre>
<p>but there’s a problem. Look what happens when we pass in an invalid country identifier:</p>
<pre class="sourceCode python"><code class="sourceCode python">latveria = get_annual_mean_temp_by_country(<span class="st">'LTV'</span>)
<span class="dt">print</span>(<span class="st">'first five entries for Latveria:'</span>, latveria[:<span class="dv">5</span>])</code></pre>
<pre class="output"><code>first five entries for Latveria: []</code></pre>
<p>Latveria doesn’t exist, so why is our function returning an empty list rather than printing an error message? The non-appearance of an error message must mean that the response code was 200; if so, we would have gone into the <code>else</code> branch, assigned an empty list to <code>results</code>, and then… hm… All right, if the response code was 200 and there was no data, that would explain what we’re seeing. Let’s check:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> get_annual_mean_temp_by_country(country):
    <span class="co">'''Get the annual mean temperature for a country given its 3-letter ISO code (such as "CAN").'''</span>
    url = <span class="st">'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/'</span> + country + <span class="st">'.csv'</span>
    <span class="dt">print</span>(<span class="st">'url used is'</span>, url)
    response = requests.get(url)
    <span class="dt">print</span>(<span class="st">'response code:'</span>, response.status_code)
    <span class="dt">print</span>(<span class="st">'length of data:'</span>, <span class="dt">len</span>(response.text))
    <span class="kw">if</span> response.status_code != <span class="dv">200</span>:
        <span class="dt">print</span>(<span class="st">'Failed to get data:'</span>, response.status_code)
    <span class="kw">else</span>:
        reader = io.StringIO(response.text)
        wrapper = csv.reader(reader)
        results = []
        <span class="kw">for</span> record in wrapper:
            <span class="kw">if</span> record[<span class="dv">0</span>] != <span class="st">'year'</span>:
                year = <span class="dt">int</span>(record[<span class="dv">0</span>])
                value = <span class="dt">float</span>(record[<span class="dv">1</span>])
                results.append([year, value])
        <span class="kw">return</span> results

latveria = get_annual_mean_temp_by_country(<span class="st">'LTV'</span>)
<span class="dt">print</span>(<span class="st">'number of records for Latveria:'</span>, <span class="dt">len</span>(latveria))</code></pre>
<pre class="output"><code>url used is http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/LTV.csv
response code: 200
length of data: 0
number of records for Latveria: 0</code></pre>
<p>Great: after a bit more experimenting, we discover that the site <em>always</em> returns a 200 status code. The only way to tell if there’s real data or not will be to check if <code>response.text</code> is empty. Here’s the updated function:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> get_annual_mean_temp_by_country(country):
    <span class="co">'''</span>
<span class="co">    Get the annual mean temperature for a country given its 3-letter ISO code (such as "CAN").</span>
<span class="co">    Returns an empty list if the country code is invalid.</span>
<span class="co">    '''</span>
    url = <span class="st">'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/'</span> + country + <span class="st">'.csv'</span>
    response = requests.get(url)
    results = []
    <span class="kw">if</span> <span class="dt">len</span>(response.text) &gt; <span class="dv">0</span>:
        reader = io.StringIO(response.text)
        wrapper = csv.reader(reader)
        <span class="kw">for</span> record in wrapper:
            <span class="kw">if</span> record[<span class="dv">0</span>] != <span class="st">'year'</span>:
                year = <span class="dt">int</span>(record[<span class="dv">0</span>])
                value = <span class="dt">float</span>(record[<span class="dv">1</span>])
                results.append([year, value])
    <span class="kw">return</span> results

<span class="dt">print</span>(<span class="st">'number of records for Canada:'</span>, <span class="dt">len</span>(get_annual_mean_temp_by_country(<span class="st">'CAN'</span>)))
<span class="dt">print</span>(<span class="st">'number of records for Latveria:'</span>, <span class="dt">len</span>(get_annual_mean_temp_by_country(<span class="st">'LTV'</span>)))</code></pre>
<pre class="output"><code>number of records for Canada: 109
number of records for Latveria: 0</code></pre>
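<p>The parsing half of this function can be tried without a network connection. The sketch below is an illustration rather than part of the lesson's code: the helper name <code>parse_annual_records</code> and the sample text are made up, and the sample assumes the CSV begins with a header row whose first field is <code>year</code>, which is what the check inside the loop implies.</p>

```python
import csv
import io

def parse_annual_records(text):
    '''Hypothetical helper: turn CSV text into a list of [year, value] pairs.'''
    results = []
    # An empty body is how the service signals "no data", so return [] for it.
    if len(text) > 0:
        for record in csv.reader(io.StringIO(text)):
            # Skip the header row; every other row is "year,value".
            if record[0] != 'year':
                results.append([int(record[0]), float(record[1])])
    return results

# Made-up sample data standing in for response.text.
sample = 'year,data\n1901,-7.5\n1902,-7.75\n'
print(parse_annual_records(sample))  # prints [[1901, -7.5], [1902, -7.75]]
print(parse_annual_records(''))      # prints []
```

<p>Keeping the parsing separate from the download makes both the "real data" and "empty response" cases easy to check by hand.</p>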
<p>Now that we can get surface temperatures for different countries, we can write a function to compare those values. (We’ll jump straight into writing a function because by now it’s clear that’s what we’re eventually going to do anyway.) Here’s our first cut:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> diff_records(left, right):
    <span class="co">'''Given lists of [year, value] pairs, return list of [year, difference] pairs.'''</span>
    num_years = <span class="dt">len</span>(left)
    results = []
    <span class="kw">for</span> i in <span class="dt">range</span>(num_years):
        left_year, left_value = left[i]
        right_year, right_value = right[i]
        difference = left_value - right_value
        results.append([left_year, difference])
    <span class="kw">return</span> results</code></pre>
<p>Here, we’re using the number of entries in <code>left</code> (which we find with <code>len(left)</code>) to control our loop. The line:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">for</span> i in <span class="dt">range</span>(num_years):</code></pre>
<p>runs <code>i</code> from 0 to <code>num_years-1</code>, which corresponds exactly to the legal indices of <code>left</code>. Inside the loop we unpack the left and right years and values from the list entries, then append a pair containing a year and a difference to <code>results</code>, which we return at the end.</p>
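<p>As a quick check of that claim, the snippet below lists the values <code>range</code> produces for a three-entry list and uses each one as an index. The data is made up for illustration.</p>

```python
# Made-up [year, value] pairs, in the same shape diff_records expects.
left = [[1900, 1.0], [1901, 10.0], [1902, 5.0]]

num_years = len(left)
indices = list(range(num_years))
print(indices)  # prints [0, 1, 2]

# Every value produced by range(num_years) is a legal index into left.
for i in indices:
    print(i, left[i])
```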
<p>To see if this function works, we can run a couple of tests on made-up data:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span>(<span class="st">'one record:'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>]],
                                  [[<span class="dv">1900</span>, <span class="fl">2.0</span>]]))
<span class="dt">print</span>(<span class="st">'two records:'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>], [<span class="dv">1901</span>, <span class="fl">10.0</span>]],
                                   [[<span class="dv">1900</span>, <span class="fl">2.0</span>], [<span class="dv">1901</span>, <span class="fl">20.0</span>]]))</code></pre>
<pre class="output"><code>one record: [[1900, -1.0]]
two records: [[1900, -1.0], [1901, -10.0]]</code></pre>
<p>That looks pretty good—but what about these cases?</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span>(<span class="st">'mis-matched years:'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>]],
                                         [[<span class="dv">1999</span>, <span class="fl">2.0</span>]]))
<span class="dt">print</span>(<span class="st">'left is shorter'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>]],
                                      [[<span class="dv">1900</span>, <span class="fl">10.0</span>], [<span class="dv">1901</span>, <span class="fl">20.0</span>]]))
<span class="dt">print</span>(<span class="st">'right is shorter'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>], [<span class="dv">1901</span>, <span class="fl">2.0</span>]],
                                       [[<span class="dv">1900</span>, <span class="fl">10.0</span>]]))</code></pre>
<pre class="error"><code>mis-matched years: [[1900, -1.0]]
left is shorter [[1900, -9.0]]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
&lt;ipython-input-15-7582f56db8bf&gt; in &lt;module&gt;()
      4                                       [[1900, 10.0], [1901, 20.0]]))
      5 print('right is shorter', diff_records([[1900, 1.0], [1901, 2.0]],
----&gt; 6                                        [[1900, 10.0]]))

&lt;ipython-input-13-67464343fd99&gt; in diff_records(left, right)
      5     for i in range(num_years):
      6         left_year, left_value = left[i]
----&gt; 7         right_year, right_value = right[i]
      8         difference = left_value - right_value
      9         results.append([left_year, difference])

IndexError: list index out of range</code></pre>
<p>The first test gives us an answer even though the years didn’t match: we get a result, but it’s meaningless. The second case gives us a partial result, again without telling us there’s a problem, while the third crashes because we’re using <code>left</code> to determine the number of records, but <code>right</code> doesn’t have that many.</p>
<p>The first two problems are actually worse than the third because they are <a href="reference.html#silent-failure">silent failures</a>: the function does the wrong thing, but doesn’t indicate that in any way. Let’s fix that:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> diff_records(left, right):
    <span class="co">'''</span>
<span class="co">    Given lists of [year, value] pairs, return list of [year, difference] pairs.</span>
<span class="co">    Fails if the inputs are not for exactly corresponding years.</span>
<span class="co">    '''</span>
    <span class="kw">assert</span> <span class="dt">len</span>(left) == <span class="dt">len</span>(right), \
           <span class="co">'Inputs have different lengths.'</span>
    num_years = <span class="dt">len</span>(left)
    results = []
    <span class="kw">for</span> i in <span class="dt">range</span>(num_years):
        left_year, left_value = left[i]
        right_year, right_value = right[i]
        <span class="kw">assert</span> left_year == right_year, \
               <span class="co">'Record {0} is for different years: {1} vs {2}'</span>.<span class="dt">format</span>(i, left_year, right_year)
        difference = left_value - right_value
        results.append([left_year, difference])
    <span class="kw">return</span> results</code></pre>
<p>Do our “good” tests pass?</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span>(<span class="st">'one record:'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>]],
                                  [[<span class="dv">1900</span>, <span class="fl">2.0</span>]]))
<span class="dt">print</span>(<span class="st">'two records:'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>], [<span class="dv">1901</span>, <span class="fl">10.0</span>]],
                                   [[<span class="dv">1900</span>, <span class="fl">2.0</span>], [<span class="dv">1901</span>, <span class="fl">20.0</span>]]))</code></pre>
<pre class="output"><code>one record: [[1900, -1.0]]
two records: [[1900, -1.0], [1901, -10.0]]</code></pre>
<p>What about the three tests that we now expect to fail?</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span>(<span class="st">'mis-matched years:'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>]],
                                         [[<span class="dv">1999</span>, <span class="fl">2.0</span>]]))</code></pre>
<pre class="error"><code>---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
&lt;ipython-input-18-c101917a748e&gt; in &lt;module&gt;()
      1 print('mis-matched years:', diff_records([[1900, 1.0]],
----&gt; 2                                          [[1999, 2.0]]))

&lt;ipython-input-16-d41327791c15&gt; in diff_records(left, right)
     10         left_year, left_value = left[i]
     11         right_year, right_value = right[i]
---&gt; 12         assert left_year == right_year, 'Record {0} is for different years: {1} vs {2}'.format(i, left_year, right_year)
     13         difference = left_value - right_value
     14         results.append([left_year, difference])

AssertionError: Record 0 is for different years: 1900 vs 1999</code></pre>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span>(<span class="st">'left is shorter'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>]],
                                      [[<span class="dv">1900</span>, <span class="fl">10.0</span>], [<span class="dv">1901</span>, <span class="fl">20.0</span>]]))</code></pre>
<pre class="error"><code>---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
&lt;ipython-input-19-682d448d921e&gt; in &lt;module&gt;()
      1 print('left is shorter', diff_records([[1900, 1.0]],
----&gt; 2                                       [[1900, 10.0], [1901, 20.0]]))

&lt;ipython-input-16-d41327791c15&gt; in diff_records(left, right)
      4     Fails if the inputs are not for exactly corresponding years.
      5     '''
----&gt; 6     assert len(left) == len(right), 'Inputs have different lengths.'
      7     num_years = len(left)
      8     results = []

AssertionError: Inputs have different lengths.</code></pre>
<pre class="sourceCode python"><code class="sourceCode python"><span class="dt">print</span>(<span class="st">'right is shorter'</span>, diff_records([[<span class="dv">1900</span>, <span class="fl">1.0</span>], [<span class="dv">1901</span>, <span class="fl">2.0</span>]],
                                       [[<span class="dv">1900</span>, <span class="fl">10.0</span>]]))</code></pre>
<pre class="error"><code>---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
&lt;ipython-input-20-a475e608dd70&gt; in &lt;module&gt;()
      1 print('right is shorter', diff_records([[1900, 1.0], [1901, 2.0]],
----&gt; 2                                        [[1900, 10.0]]))

&lt;ipython-input-16-d41327791c15&gt; in diff_records(left, right)
      4     Fails if the inputs are not for exactly corresponding years.
      5     '''
----&gt; 6     assert len(left) == len(right), 'Inputs have different lengths.'
      7     num_years = len(left)
      8     results = []

AssertionError: Inputs have different lengths.</code></pre>
<p>Excellent: the assertions we’ve added will now alert us if we try to work with badly-formatted or inconsistent data.</p>
<div id="theres-a-better-way-to-do-it" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>There’s a Better Way to Do It</h2>
</div>
<div class="panel-body">
<p>We had to run each test in a cell of its own because Python stops executing the code in a cell as soon as an assertion fails, and we want to make sure all three tests actually run. A <a href="reference.html#unit-testing-tool">unit testing tool</a> would handle this for us, and do much else as well.</p>
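<p>As a rough sketch of what that could look like, the example below uses <code>unittest</code> from Python's standard library. The class and method names are made up for illustration, and <code>diff_records</code> is repeated in condensed form only so the example runs on its own. Each test method runs independently, so one failing assertion does not stop the others.</p>

```python
import unittest

def diff_records(left, right):
    '''Condensed copy of the assertion-checking version of diff_records.'''
    assert len(left) == len(right), 'Inputs have different lengths.'
    results = []
    for i in range(len(left)):
        left_year, left_value = left[i]
        right_year, right_value = right[i]
        assert left_year == right_year, \
            'Record {0} is for different years: {1} vs {2}'.format(i, left_year, right_year)
        results.append([left_year, left_value - right_value])
    return results

class TestDiffRecords(unittest.TestCase):
    '''Hypothetical test case; the names here are illustrative only.'''

    def test_two_records(self):
        result = diff_records([[1900, 1.0], [1901, 10.0]],
                              [[1900, 2.0], [1901, 20.0]])
        self.assertEqual(result, [[1900, -1.0], [1901, -10.0]])

    def test_mismatched_years(self):
        with self.assertRaises(AssertionError):
            diff_records([[1900, 1.0]], [[1999, 2.0]])

    def test_left_is_shorter(self):
        with self.assertRaises(AssertionError):
            diff_records([[1900, 1.0]], [[1900, 10.0], [1901, 20.0]])

    def test_right_is_shorter(self):
        with self.assertRaises(AssertionError):
            diff_records([[1900, 1.0], [1901, 2.0]], [[1900, 10.0]])

# Build and run the suite directly so this also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDiffRecords)
result = unittest.TextTestRunner().run(suite)
print('tests run:', result.testsRun, 'all passed:', result.wasSuccessful())
```

<p>All four tests run in a single cell, and the runner reports each failure separately instead of stopping at the first one.</p>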
</div>
</div>
<div id="when-to-complain" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>When to Complain?</h2>
</div>
<div class="panel-body">
<p>Should <code>get_annual_mean_temp_by_country</code> print an error message when it doesn’t get data? Should it use an assertion to fail if it doesn’t get data? Why or why not?</p>
</div>
</div>
<div id="enumerating" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Enumerating</h2>
</div>
<div class="panel-body">
<p>Python includes a function called <code>enumerate</code> that’s often used in <code>for</code> loops. This loop:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">for</span> (i, c) in <span class="dt">enumerate</span>(<span class="st">'abc'</span>):
    <span class="dt">print</span>(i, <span class="st">'='</span>, c)</code></pre>
<p>prints:</p>
<pre class="output"><code>0 = a
1 = b
2 = c</code></pre>
<p>Rewrite <code>diff_records</code> to use <code>enumerate</code>.</p>
</div>
</div>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>