// Copyright 2014 Martin Angers and Contributors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
/*
Package fetchbot provides a simple and flexible web crawler that follows the robots.txt
policies and crawl delays.
It is very much a rewrite of gocrawl (https://github.com/PuerkitoBio/gocrawl) with a
simpler API and fewer features built in, but at the same time more flexibility. As for
Go itself, sometimes less is more!
Installation
To install, simply run in a terminal:
	go get github.com/PuerkitoBio/fetchbot
The package has a single external dependency, robotstxt
(https://github.com/temoto/robotstxt). It also integrates code from the iq package
(https://github.com/kylelemons/iq).
The API documentation is available on godoc.org
(http://godoc.org/github.com/PuerkitoBio/fetchbot).
Usage
The following example (taken from /example/short/main.go) shows how to create and
start a Fetcher, one way to send commands, and how to stop the fetcher once all
commands have been handled.
	package main

	import (
		"fmt"
		"net/http"

		"github.com/PuerkitoBio/fetchbot"
	)

	func main() {
		f := fetchbot.New(fetchbot.HandlerFunc(handler))
		queue := f.Start()
		queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
		queue.Close()
	}

	func handler(ctx *fetchbot.Context, res *http.Response, err error) {
		if err != nil {
			fmt.Printf("error: %s\n", err)
			return
		}
		fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
	}
A more complex and complete example can be found in the repository, at /example/full/.
Fetcher
Basically, a Fetcher is an instance of a web crawler, independent of other Fetchers.
It receives Commands via the Queue, executes the requests, and calls a Handler to
process the responses. A Command is an interface that tells the Fetcher which URL to
fetch, and which HTTP method to use (i.e. "GET", "HEAD", ...).
A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the
thread-safe object that can be used to send commands, or to stop the crawler.
Both the Command and the Handler are interfaces, and may be implemented in various ways.
They are defined like so:
	type Command interface {
		URL() *url.URL
		Method() string
	}

	type Handler interface {
		Handle(*Context, *http.Response, error)
	}
A Context is a struct that holds the Command and the Queue, so that the Handler always
knows which Command initiated this call, and has a handle to the Queue.
A Handler is similar to the net/http Handler, and middleware-style combinations can
be built on top of it. A HandlerFunc type is provided so that simple functions
with the right signature can be used as Handlers (like net/http.HandlerFunc), and there
is also a multiplexer Mux that can be used to dispatch calls to different Handlers
based on some criteria.
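The middleware pattern described above can be sketched as follows. This is a minimal,
standalone illustration: the Context and Handler types below are local stand-ins that
mirror the shape of the package's own definitions, and logWrapper/runExample are
hypothetical names, not part of fetchbot's API.

```go
package main

import (
	"fmt"
	"net/http"
)

// Minimal local stand-ins for fetchbot's Context and Handler so the
// sketch compiles standalone; the real types live in the package.
type Context struct{ URL string }

type Handler interface {
	Handle(*Context, *http.Response, error)
}

// HandlerFunc adapts a plain function to the Handler interface,
// in the same spirit as net/http.HandlerFunc.
type HandlerFunc func(*Context, *http.Response, error)

func (h HandlerFunc) Handle(ctx *Context, res *http.Response, err error) { h(ctx, res, err) }

// logWrapper is middleware: it records the visit, then delegates to next.
func logWrapper(log *[]string, next Handler) Handler {
	return HandlerFunc(func(ctx *Context, res *http.Response, err error) {
		*log = append(*log, "visiting "+ctx.URL)
		next.Handle(ctx, res, err)
	})
}

// runExample wires the logging middleware around a final handler and
// returns the recorded events in order.
func runExample() []string {
	var log []string
	h := logWrapper(&log, HandlerFunc(func(ctx *Context, res *http.Response, err error) {
		log = append(log, "handled "+ctx.URL)
	}))
	h.Handle(&Context{URL: "http://example.com"}, nil, nil)
	return log
}

func main() {
	fmt.Println(runExample())
}
```

Because Handler is a one-method interface, any number of such wrappers can be chained, each one seeing the call before the inner handler does.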
Command-related Interfaces
The Fetcher recognizes a number of interfaces that the Command may implement, for
more advanced needs.
* BasicAuthProvider: Implement this interface to specify the basic authentication
credentials to set on the request.
* CookiesProvider: If the Command implements this interface, the provided Cookies
will be set on the request.
* HeaderProvider: Implement this interface to specify the headers to set on the
request.
* ReaderProvider: Implement this interface to set the body of the request, via
an io.Reader.
* ValuesProvider: Implement this interface to set the body of the request, as
form-encoded values. If the Content-Type is not specifically set via a HeaderProvider,
it is set to "application/x-www-form-urlencoded". ReaderProvider and ValuesProvider
should be mutually exclusive as they both set the body of the request. If both are
implemented, the ReaderProvider interface is used.
* Handler: Implement this interface if the Command's response should be handled
by a specific callback function. By default, the response is handled by the Fetcher's
Handler, but if the Command implements this, this handler function takes precedence
and the Fetcher's Handler is ignored.
Since the Command is an interface, it can be a custom struct that holds additional
information, such as an ID for the URL (e.g. from a database), or a depth counter
so that the crawling stops at a certain depth, etc. For basic commands that don't
require additional information, the package provides the Cmd struct that implements
the Command interface. This is the Command implementation used when using the
various Queue.SendString* methods.
There is also a convenience HandlerCmd struct for the commands that should be handled
by a specific callback function. It is a Command with a Handler interface implementation.
Fetcher Options
The Fetcher has a number of fields that provide further customization:
* HttpClient : By default, the Fetcher uses the net/http default Client to make requests. A
different client can be set on the Fetcher.HttpClient field.
* CrawlDelay : The delay to apply between requests to a given host; this value
is used only if there is no delay specified by the robots.txt of that host.
* UserAgent : Sets the user agent string to use for the requests and to validate
against the robots.txt entries.
* WorkerIdleTTL : Sets the duration that a worker goroutine can wait without receiving
new commands to fetch. If the idle time-to-live is reached, the worker goroutine
is stopped and its resources are released. This can be especially useful for
long-running crawlers.
* AutoClose : If true, closes the queue automatically once the number of active hosts
reaches 0.
* DisablePoliteness : If true, ignores the robots.txt policies of the hosts.
Unlike gocrawl, fetchbot does not keep track of already visited URLs, and it does
not normalize URLs. This is outside the scope of this package - all commands sent
on the Queue will be fetched.
Normalization can easily be done (e.g. using
https://github.com/PuerkitoBio/purell) before sending the Command to the Fetcher.
How to keep track of visited URLs depends on the use-case of the specific crawler,
but for an example, see /example/full/main.go.
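One common approach is a mutex-protected set consulted before enqueuing, in the same
spirit as the full example. This is only a sketch of the idea, not fetchbot API; the
visited type and tryVisit method are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// visited is a thread-safe set of already-enqueued URLs. Handlers run
// concurrently, so the mutex guards the map against concurrent access.
type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

// tryVisit reports whether the URL is new, marking it as seen if so.
// A crawler enqueues the URL only when tryVisit returns true.
func (v *visited) tryVisit(u string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[u] {
		return false
	}
	v.seen[u] = true
	return true
}

func main() {
	v := &visited{seen: make(map[string]bool)}
	fmt.Println(v.tryVisit("http://golang.org")) // first time: new
	fmt.Println(v.tryVisit("http://golang.org")) // duplicate: skipped
}
```

Pairing this with URL normalization (e.g. purell) before the lookup keeps trivially different spellings of the same URL from being fetched twice.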
License
The BSD 3-Clause license (http://opensource.org/licenses/BSD-3-Clause), the same as
the Go language. The iq_slice.go file is under the CDDL-1.0 license (details in
the source file).
*/
package fetchbot