Notes:
Handle malicious and faulty pages
Rate limit
Notes:
- Audience question
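
The "Rate limit" point can be sketched as a per-host minimum delay between fetches. This is only an illustration: the class name `PolitenessLimiter`, the `min_delay` parameter, and the injectable `clock` are assumptions made for this sketch, not something the slides specify.

```python
import time


class PolitenessLimiter:
    """Hypothetical per-host rate limiter: at most one fetch per min_delay seconds."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self._min_delay = min_delay
        self._clock = clock            # injectable so the limiter can be tested
        self._next_allowed = {}        # host -> earliest time of the next fetch

    def wait_time(self, host):
        """Seconds the caller should still wait before fetching from host."""
        now = self._clock()
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record_fetch(self, host):
        """Note that a fetch from host just happened."""
        self._next_allowed[host] = self._clock() + self._min_delay
```

A crawler loop would call `wait_time(host)` before each request, sleep for that long, and call `record_fetch(host)` afterwards. Hosts are tracked independently, so a slow host does not hold up the others.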
Scalable, efficient
Crawl "good" pages more frequently
Keep index up-to-date
Data formats, protocols
Notes:
- Audience question
Notes:
- a.k.a. crawl frontier
- Priority queue
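
A crawl frontier backed by a priority queue might look roughly like the sketch below. The class name `Frontier`, the convention that a lower number means "crawl sooner", and the duplicate check are assumptions made for illustration.

```python
import heapq
import itertools


class Frontier:
    """Hypothetical crawl frontier: the URL with the lowest priority value is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()                 # URLs that were ever enqueued
        self._counter = itertools.count()  # tie-breaker: equal priorities pop in FIFO order

    def add(self, url, priority):
        """Enqueue url unless it was seen before; return True if it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the next URL to crawl."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

How the priority is computed (page quality, change frequency, and so on) is left open here; the sketch only shows the queue mechanics.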
Notes:
- Partition by domain
- Cache DNS
- Locality
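
One possible reading of "partition by domain" and "cache DNS" is sketched below: URLs are assigned to workers by hashing the host name, and each worker remembers the addresses it has already resolved. The names `partition_for` and `DnsCache` are hypothetical, and the cache ignores DNS TTLs, which a real crawler would need to respect.

```python
import hashlib
import socket
from urllib.parse import urlsplit


def partition_for(url, num_workers):
    """Map a URL to a worker index so all URLs of one host land on the same worker."""
    host = urlsplit(url).hostname or ""
    # hashlib gives a hash that is stable across processes, unlike built-in hash().
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class DnsCache:
    """Hypothetical in-memory DNS cache (no expiry, for illustration only)."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver   # injectable so lookups can be faked in tests
        self._cache = {}

    def resolve(self, host):
        """Return the address for host, resolving it at most once."""
        if host not in self._cache:
            self._cache[host] = self._resolver(host)
        return self._cache[host]
```

Because one worker sees every URL of a given host, its DNS cache and open connections are reused, which is the locality benefit the notes mention.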
Notes:
- Connectivity servers / indices
- Store web graph, in- and out-links
- Support graph queries: in- / out-links, in- / out-degree, traversal
- Used for link analysis, etc.
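
The graph queries listed above can be illustrated with a small in-memory structure. This is a sketch only: the `WebGraph` class and its method names are assumptions, and a real connectivity server would use a compressed, disk-backed representation.

```python
from collections import defaultdict, deque


class WebGraph:
    """Hypothetical in-memory web graph storing in- and out-links per URL."""

    def __init__(self):
        self._out = defaultdict(set)   # url -> URLs it links to
        self._in = defaultdict(set)    # url -> URLs linking to it

    def add_link(self, src, dst):
        self._out[src].add(dst)
        self._in[dst].add(src)

    def out_links(self, url):
        return set(self._out.get(url, ()))

    def in_links(self, url):
        return set(self._in.get(url, ()))

    def out_degree(self, url):
        return len(self._out.get(url, ()))

    def in_degree(self, url):
        return len(self._in.get(url, ()))

    def traverse(self, start):
        """Breadth-first traversal along out-links; neighbours are visited in sorted order."""
        seen = {start}
        queue = deque([start])
        order = []
        while queue:
            url = queue.popleft()
            order.append(url)
            for nxt in sorted(self._out.get(url, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return order
```

Keeping both directions of each link makes in-link queries as cheap as out-link queries, which link analysis typically relies on.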
Notes: