Commit
* add: initial extract from left-bar layout
* update: classify and parse url for images-multimedia
* update: clean code
* version: 0.4.1.dev0
* update: add handling for layouts circa Mar 2024
* version: 0.4.1.dev1
* update: rename text-based header classifier for clarity
* update: rename header file for addtl clarity
* add: pipeline for query notices component
* update: filter empty divs util function
* fix: initialize output dict
* update: move top image parser to header section parsers
* version: 0.4.1.dev2
* fix: missing cmpt_ranks due to empty ad components, filter before adding to the list
* fix: broader filtering, sub_types, better title and url parser for medium
* fix: handle ads and shopping ads extracted from same serp
* update: handling for no subcomponents, pass error and text
* clean: quotation formatting
* update: readme example
* version: 0.4.1.dev3
* update: add query suggestion variation, handle multiple suggestions, drop internal url
* update: assert parsed list is not empty
* update: reorg, clearer header extractors, handle shopping ads in ads
* update: catch location query notices
* update: rename query_notice to notice, includes location notices
* version: 0.4.1.dev4
* fix: renaming, include more query edit notices
* version: 0.4.1.dev5
* update: refactor notices parser as class
* version: 0.4.1.dev6
* update: add language tip sub type
* update: grab notice divs more directly
* fix: wrong get_url usage for images urls
* version: 0.4.1.dev7
* Bump to 0.4.1
Showing 15 changed files with 526 additions and 257 deletions.
@@ -1,3 +1,4 @@
-from .headers import ClassifyByHeader
+from .header_text import ClassifyHeaderText
+from .header_components import ClassifyHeaderComponent
 from .main import ClassifyMain
 from .footer import ClassifyFooter
@@ -0,0 +1,15 @@
+from .. import webutils
+import bs4
+
+
+class ClassifyHeaderComponent:
+    """Classify a component from the header section based on its bs4.element.Tag"""
+
+    @staticmethod
+    def classify(cmpt: bs4.element.Tag) -> str:
+        """Classify the component type based on header text"""
+
+        cmpt_type = "unknown"
+        if webutils.check_dict_value(cmpt.attrs, "id", ["taw", "topstuff"]):
+            cmpt_type = "notice"
+        return cmpt_type
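
The new classifier labels a header component as "notice" when its `id` attribute is `taw` or `topstuff`, and as "unknown" otherwise. A minimal sketch of that behavior follows, using only the standard library. The body of `check_dict_value` is an assumption, since `webutils` is not part of this diff: it is guessed here to return whether `d[key]` is one of the given values. The `SimpleNamespace` objects are hypothetical stand-ins for `bs4.element.Tag`, of which only `.attrs` is used.

```python
from types import SimpleNamespace


def check_dict_value(d: dict, key: str, values: list) -> bool:
    """Assumed behavior of webutils.check_dict_value (not shown in the diff):
    True when the key is present and its value is one of `values`."""
    return key in d and d[key] in values


def classify(cmpt) -> str:
    """Same logic as ClassifyHeaderComponent.classify, with the stand-in helper."""
    cmpt_type = "unknown"
    if check_dict_value(cmpt.attrs, "id", ["taw", "topstuff"]):
        cmpt_type = "notice"
    return cmpt_type


# Hypothetical stand-ins for bs4 Tags; only the .attrs mapping matters here
notice_div = SimpleNamespace(attrs={"id": "taw"})
topstuff_div = SimpleNamespace(attrs={"id": "topstuff"})
other_div = SimpleNamespace(attrs={"id": "rso"})
no_id_div = SimpleNamespace(attrs={"class": ["g"]})

for div in (notice_div, topstuff_div, other_div, no_id_div):
    print(div.attrs, "->", classify(div))
```

Under these assumptions, a component without an `id` attribute falls through to "unknown" rather than raising. Whether the real helper behaves the same way depends on its implementation.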