Remember if a page has previously been matched
This is a small step towards preventing false positives: remember if a
page has previously been matched to a ward.

This means that if we match "broadheath" and then later try to find
"heath" in the same PDF, we'll never try to match it against the page
containing "broadheath".
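
The substring problem described above can be sketched as follows. This is a minimal, self-contained illustration, not the project's code: the `Page` class and `find_pages` function are hypothetical stand-ins for the real page objects and `get_pages_by_ward_name`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for a parsed PDF page."""
    heading: str
    matched: Optional[str] = None  # ward this page was matched to, if any


def find_pages(pages: List[Page], ward: str) -> List[Page]:
    """Return pages whose heading contains `ward`, skipping claimed pages."""
    found = []
    for page in pages:
        if page.matched:
            continue  # already claimed by an earlier ward search
        if ward in page.heading:
            page.matched = ward
            found.append(page)
    return found


pages = [Page("broadheath"), Page("heath")]

# "heath" is a substring of "broadheath", so a plain substring search
# for "heath" would also hit the first page.
broadheath_pages = find_pages(pages, "broadheath")
heath_pages = find_pages(pages, "heath")
```

Because the "broadheath" page is marked as matched by the first call, the second call only returns the "heath" page. This only helps when the longer name is searched first, which is in keeping with the "small step" wording above.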

This has the added advantage of letting us list unmatched pages, which
might come in handy for debugging later.
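
As a rough sketch of that debugging use, assuming pages carry a `page_number` and a `matched` attribute: the `SimpleNamespace` objects and the `unmatched_page_numbers` helper below are hypothetical and only mirror the shape of the real page objects.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for parsed pages after all wards have been searched.
pages = [
    SimpleNamespace(page_number=1, matched="broadheath"),
    SimpleNamespace(page_number=2, matched=None),
    SimpleNamespace(page_number=3, matched="heath"),
]


def unmatched_page_numbers(pages):
    """Return the numbers of pages that no ward search has claimed."""
    return [p.page_number for p in pages if not p.matched]


leftover = unmatched_page_numbers(pages)
print(f"Unmatched pages: {leftover}")
```

Any page left in this list after every ward has been searched is a candidate for a missed or misspelled heading.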
symroe committed Mar 11, 2020
1 parent 367e2c7 commit 0dc349d
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion ynr/apps/sopn_parsing/helpers/pdf_helpers.py
@@ -53,21 +53,26 @@ def parse_pages(self):
     def get_pages_by_ward_name(self, ward):
         ward = clean_text(ward)
         matched_pages = []
-        for page in self.pages:
+        for page in self.unmatched_pages():
             if page.is_top_page:
                 if matched_pages:
                     return matched_pages
                 search_text = clean_text(page.get_page_heading())
                 wards = ward.split("/")
                 for ward in wards:
                     if ward in search_text:
+                        page.matched = ward
                         matched_pages.append(page)
             else:
                 if matched_pages:
+                    page.matched = ward
                     matched_pages.append(page)
         if matched_pages:
             return matched_pages

+    def unmatched_pages(self):
+        return [p for p in self.pages if not p.matched]
+

 class SOPNPageText:
     """
@@ -79,6 +84,7 @@ def __init__(self, page_number, text):
         self.raw_text = text
         self.text = clean_text(text)
         self.is_top_page = True
+        self.matched = None

     def get_page_heading_set(self):
         """