Aka incidence matrix
Reduce time complexity of grep
operation
- Rows
- Distinct query terms
- Columns
- Documents
- Fields
- 1 if term in doc else 0
- #1: a book about information retrieval
- #2: a book about the search for information
- #3: a book about information
↓
#1 | #2 | #3 | |
---|---|---|---|
Book | 1 | 1 | 1 |
Information | 1 | 1 | 1 |
Retrieval | 1 | 0 | 0 |
Search | 0 | 1 | 0 |
#1 | #2 | #3 | |
---|---|---|---|
Book | 1 | 1 | 1 |
Information | 1 | 1 | 1 |
Retrieval | 1 | 0 | 0 |
Search | 0 | 1 | 0 |
↓
- Document vector
- #1:
(1, 1, 1, 0)
- Term vector
- Retrieval:
(1, 0, 0)
Replace every term in query with term vector
-
$\text{(information AND retrieval) OR search}$ -
$= (111 \cap 100) \cup 010$ -
$= 100 \cup 010$ -
$= 110$ - = #1 and #2
- #1: a book about information retrieval
- #2: a book about the search for information
- English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
- grep: 2 query terms × 12GB = 24 billion string comparisons
- Term-Document Matrix: 2 query terms × 1.2M distinct terms = 2.4M lookups
- English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
- 1.2M distinct terms × 6M articles = 7.200.000.000.000 cells = 7.2 terrabit = 900 gigabyte
That's way too large! How can the size be decreased?
↓
Matrix is very sparse, has mostly zeros
Notes: