Phrase queries

fh salzburg should not match

In Salzburg there is a University and in Vienna there is an FH

How?


Phrase queries

  • "fh salzburg" should not match "In Salzburg there is a University and in Vienna there is an FH"
  • Search for names and concepts: "fh salzburg", "mountain bike"
  • Well accepted by users
  • Needs a more advanced index that stores positional information

Notes:

  • How could we implement this?
  • Can the current index handle this?

Positional index

  • #1: retrieving more information about information retrieval
  • #2: searching and retrieving a book about the search for information
  • #3: a book about information

| Terms (excluding stop words) | Doc IDs: [positions] |
|---|---|
| book | #2:[3], #3:[1] |
| information | #1:[2, 3], #2:[5], #3:[2] |
| retriev | #1:[1, 4], #2:[2] |
| search | #2:[1, 4] |
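
A minimal sketch of how the table above could be built. The stop word list and the `stem` helper are assumptions made for this example only: `stem` is a toy suffix stripper that happens to produce the stems shown on the slide, not a real stemming algorithm.

```python
from collections import defaultdict

# Assumption: a tiny stop word list that covers the three example documents.
STOP_WORDS = {"a", "about", "and", "for", "more", "the"}


def stem(token):
    """Toy suffix stripper (hypothetical, not a real stemmer)."""
    for suffix in ("ing", "al"):
        if token.endswith(suffix) and len(token) > len(suffix) + 3:
            return token[: -len(suffix)]
    return token


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}.

    Positions count from 1 and are assigned after stop words are dropped,
    which matches the numbering in the table.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        terms = [stem(t) for t in text.lower().split() if t not in STOP_WORDS]
        for position, term in enumerate(terms, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: "retrieving more information about information retrieval",
    2: "searching and retrieving a book about the search for information",
    3: "a book about information",
}

index = build_positional_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Because positions are assigned after stop word removal, "retrieving more information" ends up with adjacent positions 1 and 2. Whether that is desirable depends on the search engine's design.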

Notes:

  • Audience question

Intersection algorithm

"information retrieval"

  1. Fetch postings for each query term:
    • information: #1:[2, 3], #2:[5], #3:[2]
    • retriev: #1:[1, 4], #2:[2]
  2. Calculate term pair distances per document, e.g. retrieval - information for #1 (retrieving more information about information retrieval), where retriev = [1, 4] and information = [2, 3]:
    • 1 - 2 = -1 != 1
    • 1 - 3 = -2 != 1
    • 4 - 2 = 2 != 1
    • 4 - 3 = 1 → match

Expensive calculation: every pair of positions is compared in each candidate document
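
The two steps can be sketched as follows. This is an illustrative, naive version that compares every position pair exactly as on the slide. The name `phrase_match` and the hard-coded `INDEX` (copied from the positional index table) are assumptions for the example.

```python
# Postings copied from the positional index table: term -> {doc_id: [positions]}.
INDEX = {
    "book": {2: [3], 3: [1]},
    "information": {1: [2, 3], 2: [5], 3: [2]},
    "retriev": {1: [1, 4], 2: [2]},
    "search": {2: [1, 4]},
}


def phrase_match(index, first, second):
    """Return sorted doc IDs in which `second` occurs directly after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    matches = []
    # Only documents that contain both terms can match.
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        # Compare every position pair; a distance of exactly 1 is a phrase hit.
        if any(
            p2 - p1 == 1
            for p1 in first_postings[doc_id]
            for p2 in second_postings[doc_id]
        ):
            matches.append(doc_id)
    return matches


print(phrase_match(INDEX, "information", "retriev"))  # [1]
```

Document #2 contains both terms but is rejected because its only distance is 2 - 5 = -3.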

Notes:

  • Can this use proximity regardless of order, e.g., match "retrieval information" as well?
  • Can this support phrase gaps, i.e. information … retrieval?

Positional index

Supports phrase gaps: "dwayne johnson"~2 matches dwayne the rock johnson
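
A hedged sketch of gap matching on positions. It assumes a simplified rule: the second term may follow the first with at most `max_gap` tokens in between, and stop words such as "the" keep their position. Real search engines may define the gap differently. `positions_of` and `within_gap` are hypothetical helper names.

```python
def positions_of(tokens, term):
    """Return the 1-based positions at which `term` occurs in `tokens`."""
    return [i for i, token in enumerate(tokens, start=1) if token == term]


def within_gap(first_positions, second_positions, max_gap=0, ordered=True):
    """True if some pair of positions is separated by at most `max_gap` tokens.

    With ordered=True the second term has to come after the first one;
    with ordered=False only the absolute distance counts.
    """
    for first in first_positions:
        for second in second_positions:
            distance = second - first
            if not ordered:
                distance = abs(distance)
            if 1 <= distance <= max_gap + 1:
                return True
    return False


tokens = "dwayne the rock johnson".split()
dwayne = positions_of(tokens, "dwayne")
johnson = positions_of(tokens, "johnson")

print(within_gap(dwayne, johnson, max_gap=2))  # True
print(within_gap(dwayne, johnson, max_gap=1))  # False
print(within_gap(johnson, dwayne, max_gap=2, ordered=False))  # True
```

With `max_gap=0` this reduces to the exact phrase check, and `ordered=False` gives proximity matching regardless of term order.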

Notes:

  • The most common case is to search for two consecutive words. The intersection algorithm is a bit expensive. Can we speed this up?

Biword index

  • Speed up common phrase queries
  • Auxiliary index
  • Index term pairs
  • Fast lookup of term pairs

#1: "Study at FH Salzburg"

| Term | Doc IDs |
|---|---|
| study at | #1 |
| at fh | #1 |
| fh salzburg | #1 |

fh salzburg → #1
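
A small sketch of such an auxiliary index, assuming lowercased whitespace tokens and no stop word removal. `build_biword_index` is a hypothetical name, and the second document (the counterexample from the first slide) is given the ID 2 purely for illustration.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Map every pair of consecutive tokens to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # zip pairs each token with its right-hand neighbour.
        for left, right in zip(tokens, tokens[1:]):
            index[f"{left} {right}"].add(doc_id)
    return dict(index)


docs = {
    1: "Study at FH Salzburg",
    # Doc ID 2 is an assumption made for this example.
    2: "In Salzburg there is a University and in Vienna there is an FH",
}

biwords = build_biword_index(docs)
print(biwords.get("fh salzburg", set()))  # {1}
```

A two-word phrase query becomes a single dictionary lookup. The cost is a much larger vocabulary, which is why the biword index is kept as an auxiliary structure next to the positional index.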

Notes: