Skip to content

Latest commit

 

History

History
126 lines (87 loc) · 4.51 KB

section_term_document_matrix.md

File metadata and controls

126 lines (87 loc) · 4.51 KB

Term-document Matrix

Aka incidence matrix

Reduce time complexity of grep operation

Rows
Distinct query terms
Columns
Documents
Fields
1 if term in doc else 0

Documents need to be pre-processed

  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information

#1 #2 #3
Book 1 1 1
Information 1 1 1
Retrieval 1 0 0
Search 0 1 0

Notes: Audience question

#1 #2 #3
Book 1 1 1
Information 1 1 1
Retrieval 1 0 0
Search 0 1 0

Document vector
#1: (1, 1, 1, 0)
Term vector
Retrieval: (1, 0, 0)

Notes: Audience question

How to query

Replace every term in query with term vector

  • $\text{(information AND retrieval) OR search}$
  • $= (111 \cap 100) \cup 010$
  • $= 100 \cup 010$
  • $= 110$
  • = #1 and #2
    • ­ #1: a book about information retrieval
    • ­ #2: a book about the search for information

Notes: Audience question

Time complexity

$$O(\text{num query terms} \times \text{num distinct terms})$$

Example

  • ­English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
  • ­grep: 2 query terms × 12GB = 24 billion string comparisons
  • ­Term-Document Matrix: 2 query terms × 1.2M distinct terms = 2.4M lookups

Notes: Audience question

Size

$$\text{num(distinct terms)} \times \text{num(docs)} \times \text{size of bit}$$

Example

  • ­English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
  • ­1.2M distinct terms × 6M articles = 7.200.000.000.000 cells = 7.2 terrabit = 900 gigabyte

That's way too large! How can the size be decreased?

­Matrix is very sparse, has mostly zeros

Notes: Audience question

­ ...

Notes: