section_web_crawling.md

Web crawling

Notes:


Web crawler requirements

Robustness

Handle malicious and faulty pages

Politeness

Rate-limit requests per host
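Politeness can be sketched as a per-host rate limiter: track when each host was last hit and sleep until a minimum delay has passed. The class name and the one-second default are illustrative, not from the slides.

```python
import time
from collections import defaultdict

class PolitenessLimiter:
    """Enforce a minimum delay between requests to the same host (sketch)."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay          # seconds between hits per host (assumed value)
        self.last_hit = defaultdict(float)  # host -> timestamp of last request

    def wait(self, host):
        # Sleep just long enough to respect the per-host delay, then record the hit.
        elapsed = time.monotonic() - self.last_hit[host]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_hit[host] = time.monotonic()
```

A crawler thread calls `wait(host)` before each fetch; different hosts are throttled independently, so politeness does not slow the crawl as a whole.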

Notes:

  • Audience question

Web crawler requirements

Distributed

Scalable, efficient

Quality

Crawl "good" pages more frequently

Freshness

Keep index up-to-date

Extensible

Data formats, protocols

Notes:

  • Audience question

Crawling

Web crawler architecture

Source

Notes:


URL Queue

  • a.k.a. crawl frontier
  • Priority queue

$$\text{priority} = f(\text{quality}, \text{importance}, \text{change rate})$$
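A minimal sketch of the frontier as a priority queue over that score. The weights in `priority` are illustrative; any monotone combination of quality, importance, and change rate fits the formula above. Python's `heapq` is a min-heap, so priorities are negated.

```python
import heapq

def priority(quality, importance, change_rate):
    # Hypothetical weighting of the three signals from the slide's formula;
    # the 0.4/0.4/0.2 split is an assumption, not from the source.
    return 0.4 * quality + 0.4 * importance + 0.2 * change_rate

class Frontier:
    """Crawl frontier: URLs popped in descending priority order (sketch)."""

    def __init__(self):
        self._heap = []

    def push(self, url, quality, importance, change_rate):
        score = priority(quality, importance, change_rate)
        heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def pop(self):
        return heapq.heappop(self._heap)[1]
```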

Notes:


Distributed Crawling

  • Partition by domain
  • Cache DNS
  • Locality
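The bullets above can be sketched as hash partitioning: all URLs of a host go to the same crawler node, which keeps politeness state local and makes DNS caching effective. The cluster size and function names are hypothetical.

```python
import hashlib
import socket
from functools import lru_cache
from urllib.parse import urlparse

NUM_CRAWLERS = 16  # illustrative cluster size

def assign_crawler(url):
    """Partition by domain: hash the host so every URL of a host
    lands on the same crawler node."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

@lru_cache(maxsize=100_000)
def resolve(host):
    # Cache DNS: repeated lookups for the same host are answered locally.
    return socket.gethostbyname(host)
```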

Notes:


Link graph

web graph

Notes:


Link graph

  • Connectivity servers / indices
  • Store web graph, in- and out-links
  • Support graph queries: in- / out-links, in- / out-degree, traversal
  • Used for link analysis, etc.
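A connectivity index as described above can be sketched as two adjacency maps, one per direction, answering the listed queries in O(1) per edge set. This is a toy in-memory version; real connectivity servers compress the adjacency lists heavily.

```python
from collections import defaultdict

class LinkGraph:
    """Minimal connectivity index: adjacency lists in both directions (sketch)."""

    def __init__(self):
        self._out = defaultdict(set)  # page -> pages it links to
        self._in = defaultdict(set)   # page -> pages linking to it

    def add_link(self, src, dst):
        self._out[src].add(dst)
        self._in[dst].add(src)

    # Graph queries: in-/out-links and in-/out-degree.
    def out_links(self, page):
        return sorted(self._out[page])

    def in_links(self, page):
        return sorted(self._in[page])

    def out_degree(self, page):
        return len(self._out[page])

    def in_degree(self, page):
        return len(self._in[page])
```

Link-analysis algorithms such as PageRank only need `out_links` / `in_links` traversal, so this interface is sufficient for them.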

Notes: