<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>upton.rb</title>
<link rel="stylesheet" href="http://jashkenas.github.com/docco/resources/docco.css">
</head>
<body>
<div id='container'>
<div id="background"></div>
<table cellspacing=0 cellpadding=0>
<thead>
<tr>
<th class=docs><h1>upton.rb</h1></th>
<th class=code></th>
</tr>
</thead>
<tbody>
<tr id='section-1'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-1">¶</a>
</div>
</td>
<td class=code>
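<div class='highlight'><pre><span class="nb">require</span> <span class="s1">'nokogiri'</span>
<span class="nb">require</span> <span class="s1">'restclient'</span></pre></div>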
</td>
</tr>
<tr id='section-2'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-2">¶</a>
</div>
<p><strong>Upton</strong> is a framework for easy web-scraping with a useful debug mode
that doesn’t hammer your target’s servers. It does the repetitive parts of
writing scrapers, so you only have to write the unique parts for each site.</p>
<p>Upton operates on the theory that, for most scraping projects, you need to
scrape two types of pages:</p>
<ol>
<li>Index pages, which list instance pages. For example, a job search
site’s search page or a newspaper’s homepage.</li>
<li>Instance pages, which represent the goal of your scraping, e.g.
job listings or news articles.</li>
</ol>
</td>
<td class=code>
<div class='highlight'><pre><span class="k">module</span> <span class="nn">Upton</span></pre></div>
</td>
</tr>
<tr id='section-3'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-3">¶</a>
</div>
<p><code>Upton::Scraper</code> is implemented as an abstract class. To use it, write
your own class that inherits from <code>Upton::Scraper</code>.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">class</span> <span class="nc">Scraper</span>
<span class="kp">attr_accessor</span> <span class="ss">:verbose</span><span class="p">,</span> <span class="ss">:debug</span><span class="p">,</span> <span class="ss">:nice_sleep_time</span><span class="p">,</span> <span class="ss">:stash_folder</span></pre></div>
</td>
</tr>
<tr id='section-Basic_use-case_methods.'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-Basic_use-case_methods.">¶</a>
</div>
<h2>Basic use-case methods.</h2>
</td>
<td class=code>
<div class='highlight'><pre></pre></div>
</td>
</tr>
<tr id='section-5'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-5">¶</a>
</div>
<p>This is the main user-facing method for a basic scraper.
Call <code>scrape</code> with a block; the block is called with
the text of each instance page and, optionally, its URL and its index
in the list of instance URLs returned by <code>get_index</code>.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">scrape</span> <span class="o">&</span><span class="n">blk</span>
<span class="nb">self</span><span class="o">.</span><span class="n">scrape_from_list</span><span class="p">(</span><span class="nb">self</span><span class="o">.</span><span class="n">get_index</span><span class="p">,</span> <span class="n">blk</span><span class="p">)</span>
<span class="k">end</span></pre></div>
</td>
</tr>
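<tr>
<td class=docs>
<p>A minimal sketch of how the block passed to <code>scrape</code> gets called.
It assumes a stand-in class, <code>StandInScraper</code> (hypothetical, not part of Upton),
that follows the same call flow as <code>scrape</code> and <code>scrape_from_list</code>
but serves pages from an in-memory hash instead of the network. The URLs and
page text are made up for illustration.</p>
<p>The block may take one, two, or three parameters; the value it returns for
each instance page is collected into the array that <code>scrape</code> returns.</p>
</td>
<td class=code>
<div class='highlight'><pre>
```ruby
# Stand-in for Upton::Scraper (hypothetical): same call flow as scrape and
# scrape_from_list in this file, but no network access and no stashing.
class StandInScraper
  # Made-up "site": instance URL mapped to page text.
  PAGES = {
    "http://example.com/a" => "first article",
    "http://example.com/b" => "second article"
  }

  # Calls the block with (page text, URL, index) for every instance URL
  # and returns the block's results as an array.
  def scrape
    get_index.each_with_index.map do |instance_url, index|
      yield(get_instance(instance_url), instance_url, index)
    end
  end

  protected

  # The list of instance URLs; a real scraper would parse an index page.
  def get_index
    PAGES.keys
  end

  # The text of one instance page; a real scraper would fetch it.
  def get_instance(url)
    PAGES.fetch(url, "")
  end
end

scraper = StandInScraper.new
results = scraper.scrape do |text, url, index|
  "#{index}: #{url} has #{text.size} characters"
end
puts results
```
</pre></div>
</td>
</tr>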
<tr id='section-Configuration_Variables'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-Configuration_Variables">¶</a>
</div>
<h2>Configuration Variables</h2>
</td>
<td class=code>
<div class='highlight'><pre></pre></div>
</td>
</tr>
<tr id='section-7'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-7">¶</a>
</div>
<ul>
<li><code>index_url</code>: The URL of the page containing the list of instances.</li>
<li><code>selector</code>: The XPath or CSS selector that specifies the anchor elements within
the page.</li>
<li><code>selector_method</code>: <code>:xpath</code> or <code>:css</code>. Defaults to <code>:xpath</code>.</li>
</ul>
<p>These options are a shortcut. If you override <code>get_index</code>, you may not
need to set them.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="n">index_url</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span> <span class="n">selector</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span> <span class="n">selector_method</span><span class="o">=</span><span class="ss">:xpath</span><span class="p">)</span></pre></div>
</td>
</tr>
<tr id='section-8'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-8">¶</a>
</div>
<p>If true, then Upton prints information about when it gets
files from the internet and when it gets them from its stash.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="vi">@verbose</span> <span class="o">=</span> <span class="kp">false</span></pre></div>
</td>
</tr>
<tr id='section-9'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-9">¶</a>
</div>
<p>If true, then Upton fetches each page only once;
future requests for that page are served from the locally stashed
version.
You may want to set <code>@debug</code> to false for production (but maybe not).
You can also control stashing behavior on a per-call basis with the
optional second argument to <code>get_page</code>, if, for instance, you want to
stash instance pages, but not index pages.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="vi">@debug</span> <span class="o">=</span> <span class="kp">true</span></pre></div>
</td>
</tr>
<tr id='section-10'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-10">¶</a>
</div>
<p>To avoid hammering servers, Upton waits 30 seconds by default
between requests to the remote server.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="vi">@nice_sleep_time</span> <span class="o">=</span> <span class="mi">30</span> <span class="c1">#seconds</span></pre></div>
</td>
</tr>
<tr id='section-11'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-11">¶</a>
</div>
<p>Name of the folder where stashes are stored. Change it if you want them
stored somewhere else, e.g. under /tmp.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="vi">@stash_folder</span> <span class="o">=</span> <span class="s2">"stashes"</span>
<span class="k">unless</span> <span class="no">Dir</span><span class="o">.</span><span class="n">exist?</span><span class="p">(</span><span class="vi">@stash_folder</span><span class="p">)</span>
<span class="no">Dir</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="vi">@stash_folder</span><span class="p">)</span>
<span class="k">end</span>
<span class="vi">@index_url</span> <span class="o">=</span> <span class="n">index_url</span>
<span class="vi">@index_selector</span> <span class="o">=</span> <span class="n">selector</span>
<span class="vi">@index_selector_method</span> <span class="o">=</span> <span class="n">selector_method</span>
<span class="k">end</span></pre></div>
</td>
</tr>
<tr id='section-Advanced_use-case_methods.'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-Advanced_use-case_methods.">¶</a>
</div>
<h2>Advanced use-case methods.</h2>
</td>
<td class=code>
<div class='highlight'><pre></pre></div>
</td>
</tr>
<tr id='section-13'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-13">¶</a>
</div>
<p>If instance pages are paginated, <strong>you must override</strong>
this method to return the next URL, given the current URL and its index.</p>
<p>If instance pages aren’t paginated, there’s no need to override this.</p>
<p>Returned URLs that are empty strings are ignored (and recursion stops).
For example, <code>next_instance_page_url("http://whatever.com/article/upton-sinclairs-the-jungle?page=1", 2)</code>
ought to return <code>"http://whatever.com/article/upton-sinclairs-the-jungle?page=2"</code>.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">next_instance_page_url</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span>
<span class="s2">""</span>
<span class="k">end</span></pre></div>
</td>
</tr>
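<tr>
<td class=docs>
<p>One way such an override might look, as a sketch only. It assumes the site
numbers its pages with a <code>page=N</code> query parameter, as in the example URL,
and it stops after a made-up cap, <code>MAX_PAGES</code> (hypothetical). How
<code>index</code> lines up with page numbers depends on the site and on the index
the caller starts counting from, so treat the mapping here as an assumption.</p>
</td>
<td class=code>
<div class='highlight'><pre>
```ruby
# Hypothetical cap on pages per article. A real scraper might instead rely
# on an empty response to end the recursion.
MAX_PAGES = 3

# Sketch of an override: swap the page number in the query string for the
# index passed in, or return "" to signal that there is no next page.
def next_instance_page_url(url, index)
  return "" if index > MAX_PAGES
  url.sub(/page=\d+/, "page=#{index}")
end

puts next_instance_page_url("http://whatever.com/article/upton-sinclairs-the-jungle?page=1", 2)
puts next_instance_page_url("http://whatever.com/article/upton-sinclairs-the-jungle?page=3", 4).inspect
```
</pre></div>
</td>
</tr>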
<tr id='section-14'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-14">¶</a>
</div>
<p>The same as <code>next_instance_page_url</code>, except for index pages.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">next_index_page_url</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span>
<span class="s2">""</span>
<span class="k">end</span>
<span class="kp">protected</span></pre></div>
</td>
</tr>
<tr id='section-15'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-15">¶</a>
</div>
<p>Fetches a page with RestClient, or reads it from the local stash if a stashed
copy exists. A 404 or 500 response yields an empty string.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">get_page</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">stash</span><span class="o">=</span><span class="kp">false</span><span class="p">)</span>
<span class="k">return</span> <span class="s2">""</span> <span class="k">if</span> <span class="n">url</span><span class="o">.</span><span class="n">empty?</span></pre></div>
</td>
</tr>
<tr id='section-16'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-16">¶</a>
</div>
<p>The filename for each stashed copy is the URL with every character other than
letters, digits, and hyphens removed.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">if</span> <span class="n">stash</span> <span class="o">&&</span> <span class="no">File</span><span class="o">.</span><span class="n">exist?</span><span class="p">(</span> <span class="no">File</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="vi">@stash_folder</span><span class="p">,</span> <span class="n">url</span><span class="o">.</span><span class="n">gsub</span><span class="p">(</span><span class="sr">/[^A-Za-z0-9\-]/</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span> <span class="p">)</span> <span class="p">)</span>
<span class="nb">puts</span> <span class="s2">"usin' a stashed copy of "</span> <span class="o">+</span> <span class="n">url</span> <span class="k">if</span> <span class="vi">@verbose</span>
<span class="n">resp</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span> <span class="no">File</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="vi">@stash_folder</span><span class="p">,</span> <span class="n">url</span><span class="o">.</span><span class="n">gsub</span><span class="p">(</span><span class="sr">/[^A-Za-z0-9\-]/</span><span class="p">,</span> <span class="s2">""</span><span class="p">)),</span> <span class="s1">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span>
<span class="k">else</span>
<span class="k">begin</span>
<span class="nb">puts</span> <span class="s2">"getting "</span> <span class="o">+</span> <span class="n">url</span> <span class="k">if</span> <span class="vi">@verbose</span>
<span class="nb">sleep</span> <span class="vi">@nice_sleep_time</span>
<span class="n">resp</span> <span class="o">=</span> <span class="no">RestClient</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="p">{</span><span class="ss">:accept</span><span class="o">=></span> <span class="s2">"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"</span><span class="p">})</span>
<span class="k">rescue</span> <span class="no">RestClient</span><span class="o">::</span><span class="no">ResourceNotFound</span>
<span class="n">resp</span> <span class="o">=</span> <span class="s2">""</span>
<span class="k">rescue</span> <span class="no">RestClient</span><span class="o">::</span><span class="no">InternalServerError</span>
<span class="n">resp</span> <span class="o">=</span> <span class="s2">""</span>
<span class="k">end</span>
<span class="k">if</span> <span class="n">stash</span>
<span class="nb">puts</span> <span class="s2">"I just stashed (</span><span class="si">#{</span><span class="n">resp</span><span class="o">.</span><span class="n">code</span> <span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">respond_to?</span><span class="p">(</span><span class="ss">:code</span><span class="p">)</span><span class="si">}</span><span class="s2">): </span><span class="si">#{</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span> <span class="k">if</span> <span class="vi">@verbose</span>
<span class="nb">open</span><span class="p">(</span> <span class="no">File</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="vi">@stash_folder</span><span class="p">,</span> <span class="n">url</span><span class="o">.</span><span class="n">gsub</span><span class="p">(</span><span class="sr">/[^A-Za-z0-9\-]/</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span> <span class="p">),</span> <span class="s1">'w:UTF-8'</span><span class="p">){</span><span class="o">|</span><span class="n">f</span><span class="o">|</span> <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s2">"UTF-8"</span><span class="p">,</span> <span class="ss">:invalid</span> <span class="o">=></span> <span class="ss">:replace</span><span class="p">,</span> <span class="ss">:undef</span> <span class="o">=></span> <span class="ss">:replace</span> <span class="p">))}</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">resp</span>
<span class="k">end</span></pre></div>
</td>
</tr>
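<tr>
<td class=docs>
<p>To make the stash naming concrete, here is the same cleaning rule pulled out
into two small helpers. The helper names <code>stash_filename</code> and
<code>stash_path</code> are hypothetical; only the regular expression and the default
folder name come from the code in this file.</p>
<p>Because punctuation is simply dropped, two different URLs can map to the
same stash file, as the last line shows.</p>
</td>
<td class=code>
<div class='highlight'><pre>
```ruby
# Same cleaning rule as get_page: keep only letters, digits and hyphens.
def stash_filename(url)
  url.gsub(/[^A-Za-z0-9\-]/, "")
end

# Hypothetical helper: where the stashed copy of a URL would live.
# "stashes" is the default value of @stash_folder.
def stash_path(url, stash_folder = "stashes")
  File.join(stash_folder, stash_filename(url))
end

puts stash_path("http://example.com/jobs?page=2")

# Two distinct URLs that collapse to the same filename.
puts stash_filename("http://example.com/a/bc") == stash_filename("http://example.com/ab/c")
```
</pre></div>
</td>
</tr>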
<tr id='section-17'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-17">¶</a>
</div>
<p>Return a list of URLs for the instances you want to scrape.
This can optionally be overridden if, for example, the list of instances
comes from an API.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">get_index</span>
<span class="n">parse_index</span><span class="p">(</span><span class="n">get_index_pages</span><span class="p">(</span><span class="vi">@index_url</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="vi">@index_selector</span><span class="p">,</span> <span class="vi">@index_selector_method</span><span class="p">)</span>
<span class="k">end</span></pre></div>
</td>
</tr>
<tr id='section-18'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-18">¶</a>
</div>
<p>Using the XPath or CSS selector (as chosen by <code>selector_method</code>) that uniquely
locates the links in the index, returns the <code>href</code> of each link as a string.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">parse_index</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">selector</span><span class="p">,</span> <span class="n">selector_method</span><span class="o">=</span><span class="ss">:xpath</span><span class="p">)</span>
<span class="no">Nokogiri</span><span class="o">::</span><span class="no">HTML</span><span class="p">(</span><span class="n">text</span><span class="p">)</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">selector_method</span><span class="p">,</span> <span class="n">selector</span><span class="p">)</span><span class="o">.</span><span class="n">to_a</span><span class="o">.</span><span class="n">map</span><span class="p">{</span><span class="o">|</span><span class="n">l</span><span class="o">|</span> <span class="n">l</span><span class="o">[</span><span class="s2">"href"</span><span class="o">]</span> <span class="p">}</span>
<span class="k">end</span></pre></div>
</td>
</tr>
<tr id='section-19'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-19">¶</a>
</div>
<p>Returns the concatenated output of each member of a paginated index,
e.g. a site listing links with 2+ pages.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">get_index_pages</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span> <span class="c1">#maybe needs better name</span>
<span class="n">resp</span> <span class="o">=</span> <span class="nb">self</span><span class="o">.</span><span class="n">get_page</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="vi">@debug</span><span class="p">)</span>
<span class="k">if</span> <span class="o">!</span><span class="n">resp</span><span class="o">.</span><span class="n">empty?</span>
<span class="n">next_url</span> <span class="o">=</span> <span class="nb">self</span><span class="o">.</span><span class="n">next_index_page_url</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">unless</span> <span class="n">next_url</span> <span class="o">==</span> <span class="n">url</span>
<span class="n">next_resp</span> <span class="o">=</span> <span class="nb">self</span><span class="o">.</span><span class="n">get_index_pages</span><span class="p">(</span><span class="n">next_url</span><span class="p">,</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">to_s</span>
<span class="n">resp</span> <span class="o">+=</span> <span class="n">next_resp</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">resp</span>
<span class="k">end</span></pre></div>
</td>
</tr>
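<tr>
<td class=docs>
<p>The recursion above can be traced without touching the network. This sketch
swaps in stand-ins for <code>get_page</code> and <code>next_index_page_url</code>
(the <code>fake_</code> names and the three-page index are hypothetical) and keeps the
same control flow: fetch, ask for the next URL, recurse, append.</p>
<p>The recursion ends when a fetch returns an empty string, which here happens
on the nonexistent fourth page, or when the next URL equals the current one.</p>
</td>
<td class=code>
<div class='highlight'><pre>
```ruby
# Made-up three-page index, held in memory instead of fetched.
FAKE_INDEX = {
  "http://example.com/list?page=1" => "[page 1]",
  "http://example.com/list?page=2" => "[page 2]",
  "http://example.com/list?page=3" => "[page 3]"
}

# Stand-in for get_page: unknown URLs yield "", like a 404 does.
def fake_get_page(url)
  FAKE_INDEX.fetch(url, "")
end

# Stand-in for an overridden next_index_page_url.
def fake_next_index_page_url(url, index)
  url.sub(/page=\d+/, "page=#{index}")
end

# Same shape as get_index_pages: concatenate this page with all later ones.
def fake_get_index_pages(url, index)
  resp = fake_get_page(url)
  unless resp.empty?
    next_url = fake_next_index_page_url(url, index + 1)
    resp += fake_get_index_pages(next_url, index + 1) unless next_url == url
  end
  resp
end

puts fake_get_index_pages("http://example.com/list?page=1", 1)
```
</pre></div>
</td>
</tr>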
<tr id='section-20'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-20">¶</a>
</div>
<p>Returns the concatenated output of each member of a paginated instance,
e.g. a news article with 2 pages.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">get_instance</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">resp</span> <span class="o">=</span> <span class="nb">self</span><span class="o">.</span><span class="n">get_page</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="vi">@debug</span><span class="p">)</span>
<span class="k">if</span> <span class="o">!</span><span class="n">resp</span><span class="o">.</span><span class="n">empty?</span>
<span class="n">next_url</span> <span class="o">=</span> <span class="nb">self</span><span class="o">.</span><span class="n">next_instance_page_url</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">unless</span> <span class="n">next_url</span> <span class="o">==</span> <span class="n">url</span>
<span class="n">next_resp</span> <span class="o">=</span> <span class="nb">self</span><span class="o">.</span><span class="n">get_instance</span><span class="p">(</span><span class="n">next_url</span><span class="p">,</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">to_s</span>
<span class="n">resp</span> <span class="o">+=</span> <span class="n">next_resp</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">resp</span>
<span class="k">end</span></pre></div>
</td>
</tr>
<tr id='section-21'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-21">¶</a>
</div>
<p>Just a helper for <code>scrape</code>.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">scrape_from_list</span><span class="p">(</span><span class="n">list</span><span class="p">,</span> <span class="n">blk</span><span class="p">)</span>
<span class="nb">puts</span> <span class="s2">"Scraping </span><span class="si">#{</span><span class="n">list</span><span class="o">.</span><span class="n">size</span><span class="si">}</span><span class="s2"> instances"</span> <span class="k">if</span> <span class="vi">@verbose</span>
<span class="n">list</span><span class="o">.</span><span class="n">each_with_index</span><span class="o">.</span><span class="n">map</span> <span class="k">do</span> <span class="o">|</span><span class="n">instance_url</span><span class="p">,</span> <span class="n">index</span><span class="o">|</span>
<span class="n">blk</span><span class="o">.</span><span class="n">call</span><span class="p">(</span><span class="n">get_instance</span><span class="p">(</span><span class="n">instance_url</span><span class="p">),</span> <span class="n">instance_url</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span></pre></div>
</td>
</tr>
<tr id='section-22'>
<td class=docs>
<div class="pilwrap">
<a class="pilcrow" href="#section-22">¶</a>
</div>
<p>It’s often useful to have this slug method for identifying pages (almost certainly) uniquely.</p>
</td>
<td class=code>
<div class='highlight'><pre> <span class="k">def</span> <span class="nf">slug</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="s2">"wapo:"</span> <span class="o">+</span> <span class="n">url</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"/"</span><span class="p">)</span><span class="o">[-</span><span class="mi">1</span><span class="o">].</span><span class="n">gsub</span><span class="p">(</span><span class="sr">/\?.*/</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span><span class="o">.</span><span class="n">gsub</span><span class="p">(</span><span class="sr">/\.html.*/</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span></pre></div>
</td>
</tr>
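<tr>
<td class=docs>
<p>A standalone sketch of the same slug steps, for trying them on a URL.
The name <code>slug_for</code> and the <code>prefix</code> parameter are assumptions made
for illustration; the method in this file uses a fixed <code>"wapo:"</code> prefix.
In this sketch the dot before <code>html</code> is escaped so that it matches only a
literal dot. The example URL is made up.</p>
</td>
<td class=code>
<div class='highlight'><pre>
```ruby
# Take the last path segment, drop the query string, drop ".html" and
# anything after it, then add a prefix.
def slug_for(url, prefix = "wapo:")
  last_segment = url.split("/")[-1]
  prefix + last_segment.gsub(/\?.*/, "").gsub(/\.html.*/, "")
end

puts slug_for("http://example.com/politics/the-jungle.html?hpid=z1")
```
</pre></div>
</td>
</tr>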
</tbody>
</table>
</div>
</body>
</html>