Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DMED-119 - BC-8072 - Update "master_schulcloud" from "master" #50

Merged
merged 673 commits into from
Oct 7, 2024

Conversation

bergatco
Copy link

@bergatco bergatco commented Jul 16, 2024

- extracts language codes (ISO 639-1) from natural language strings or potentially ambiguous/malformed language strings ("en-US", "en_US", "fr_FR", "französisch", "englisch" etc.)
- the validity of the language code recognition can be tested (and extended) by using the test_language_mapper.py class
…des according to ISO 693-1)

- add: NormLanguagePipeline to settings.py
- feat: implement language string normalization for "lom.general.language" and "lom.educational.language"
- change: "lom.educational.language" to multi-value field (otherwise it would drop all other strings besides the first one)

Signed-off-by: criamos <[email protected]>
feat: language string normalization pipeline (ISO 639-1) and serlo_spider v0.3.2
- upgrade all (safely) upgradeable dependencies to the newest versions
- tested against Python 3.11.6 with serlo_spider

Signed-off-by: criamos <[email protected]>
- change: default 'httpx' timeout value from 5s to 30s
- fix: get_headers()-method omits NoneType values from now on
-- the method previously built a dictionary, where "Content-Type" would more often than not be of "NoneType", which would throw (correct) 'httpx'-Exceptions when trying to encode NoneTypes
- 'httpx' requires 'asyncio' to be enabled in Scrapy
- see: https://docs.scrapy.org/en/latest/topics/asyncio.html

Signed-off-by: criamos <[email protected]>
- change the web_tools implementation to handle requests asynchronously
-- currently uses a Semaphore from 'asyncio' to stay below 10 concurrent connections to the docker container
--- when increasing the "MAX_QUEUE_SIZE"-setting in the "browserless/chrome" container, we might be able to increase the semaphore value accordingly for higher performance
-- attention: if too many requests are made in parallel, the docker container will refuse those connections and items will be dropped!

Signed-off-by: criamos <[email protected]>
- sodix_spider shares a Session object between all GraphQL calls from now on
-- this should (significantly) increase performance and connection reliability (see: https://requests.readthedocs.io/en/latest/user/advanced/#session-objects)
- fix: Type Warning ("Response"-parameter expected 'LomBase'-type because LomBase.shouldImport() was called (instead of 'self.shouldImport()'-override))

Signed-off-by: criamos <[email protected]>
- fix: DeprecationWarning that appeared with Scrapy 2.10 by setting its value to the recommended value "2.7"
-- see: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation

Signed-off-by: criamos <[email protected]>
- ports Python's built-in "functools.lru_cache"-function to asyncio
- see: https://github.com/aio-libs/async-lru

Signed-off-by: criamos <[email protected]>
…e, improve Error-Handling

feat/perf: use LRU cache for Thumbnail-Pipeline

- feat: implements a cache for "thumbnail"-URls that discards 'least recently used' items first
-- while debugging large crawlers it was observed that many website hosters serve generic placeholder-like images for items that don't have a unique thumbnail: by keeping the most commonly requested URLs (and their response) in a cache, we should be able to significantly reduce the amount of outgoing requests and traffic (and increase performance)

change: replace 'httpx' in Thumbnail-Pipeline with Scrapy Requests for Splash

- the screenshot and thumbnail pipeline worked previously in parallel to Scrapy's built-in scheduler, which comes with its own set of problems
-- (ignoring the Scrapy scheduler means that we cannot control the load/traffic in a reasonable / responsible manner)

feat: improve Splash Error-Handling for unsupported URLs

- feat: use 'scrapy.FormRequest'-objects to handle Splash Requests (Splash is queried within the Scrapy Scheduler from now on)
- feat: fallback to Playwright if Splash failed to retrieve Thumbnails
-- if Splash fails to render a websites, it will set the splash_success flag to False and use Playwright instead
-- Splash will be DEPRECATED in the future since it has proven itself more and more unreliable with modern web-pages
- fix: several warnings in regard to shadowed variables
- by limiting concurrent access to 'getUrlData'-method by using two semaphores (one for Splash, one for Playwright), we can increase crawler performance without running into PoolTimeout Exceptions (which typically happen when too many requests are fired at once)

Signed-off-by: criamos <[email protected]>
- work-in-progress: for now, 25 seems to be a reasonable limit

Signed-off-by: criamos <[email protected]>
- fix: during SVG handling the "Content-Type"-field of "response.body" is a bytes-object, but we compared it to a string value
-- feat: made sure that we also typecheck the "mimetype" (and convert it if needed) before trying to save it to its 'thumbnail.mimetype'-field
- change: initiate "splash"-success-flag with None instead of True
-- this should give a better indication if the Thumbnail-Pipeline tried to use Splash up until a specific point in time
- change: observed connection timeouts in both v1 and v2, therefore changed the timeout setting to 60s (from default: 30s)

Signed-off-by: criamos <[email protected]>
- feat: set WebTools (Playwright) default to "trafilatura"
-- when extracting data with Playwright, we'll try to use "trafilatura" for fulltext extraction first and only fall back to html2text if trafilatura returned None
- docs: clean up ToDos, add TypeHints
… URL

- if Splash fails to render a website or a thumbnail URL couldn't be downloaded due to unexpected HTTP Status Codes, try to grab a screenshot of the website as a last resort / fallback

Signed-off-by: criamos <[email protected]>
- during testing/debugging it was observed that even the lenient timeout of 60s (default: 30s) per job is oftentimes not enough time for some website responses

Signed-off-by: criamos <[email protected]>
- change methods to async where necessary to await the coroutines of WebTools and be able to work with their data
- fix/optimize imports

Signed-off-by: criamos <[email protected]>
- change: set default WebEngine to Playwright (Splash isn't able to render most webpages anymore)
- change: make 'parse'-method async
- fix: drop items where 'jslde' cannot extract a JSON-LD (because it doesn exist)
-- this happens typically on "overview"-pages

Signed-off-by: criamos <[email protected]>
bergatco and others added 26 commits September 25, 2024 13:54
- to mitigate "httpx.ReadError"s upon dropped or reset HTTP connections (to / from the edu-sharing repository), the edu-sharing connector will use a shared "requests.Session"-object from now on

Background:
- "httpx.ReadError"s were observed for HTTP Requests that contained (potentially huge) payloads, especially during Thumbnail uploads (via set_node_preview()) and fulltext uploads (via set_node_text())
  - since we cannot reasonably limit the size of the uploaded data, switching back to "requests" to handle these requests (hopefully more graceful than httpx) should fix these HTTP Connection Pool issues
  - the httpx discussion at encode/httpx#3067 pointed towards similar errors which users observed for payloads above 1 MiB

PS: Thank you, Constantin (@bergatco) and Paul, for the collective debugging session!
- when using the 'resetVersion=true' Spider Argument, logging messages did not correctly reflect what was happening during the hash check in the EduSharingCheckPipeline
- LomBase now stores a custom_setting key ("EDU_SHARING_FORCE_UPDATE"), which can be accessed via "spider.custom_settings" for later readouts
- if an active 'resetVersion' or 'forceUpdate' setting was detected, the pipeline's debug message should be easier to understand that even though an item's hash hasn't changed, the item will get updated nonetheless
- after refactoring LomBase, some method calls in merlin_spider were missing awaits and async declarations
- fix ValidationError for getHash() method:
   - getHash used to submit an 'int'-value to the edu-sharing API, but the API actually expects a string value
- optimized imports

PS: thanks Constantin (@bergatco) for the debug logs!
Fix "httpx"-related ReadErrors in `es_connector`
Merge recent HTTPX-related fixes into `master`
…_connector (expected type: str)

- feat: additional check in the pipelines for "lom.technical.duration"-values and conversion to string
background:
- some old crawlers (e.g. merlin_spider) returned an Integer value in their "getHash()"-method, which now causes pydantic ValidationErrors with the new API client (since the API expects a string value for this property)
- this fixes pydantic ValidationErrors in the es_connector during POST requests
…peline

- fix AttributeErrors when a website-screenshot fails:
  - if the first website-screenshot fails (e.g. when the first URL in LOM technical location points towards a .mp3 or .mp4), screenshot_bytes won't be available to work with
  - this caused an AttributeError when the pipeline tried to convert a "None"-type to a thumbnail
- refactor: extracted the functionality of taking a website screenshot with playwright in the pipeline into its own method
- feat: second website screenshot fallback
  - if a second URL is available in LOM technical location, the pipeline will try to take a screenshot of that URL before finally giving up
  - this could happen in edge-cases where the first URL in LOM technical location is unavailable for website-screenshots, but the array contains a second URL that might be more fruitful for a screenshot (e.g. a landing page)
- fix: 2 warnings w.r.t. too broad exception clauses
Improved Exception Handling during website-screenshot fallback and several fixes for `pydantic` ValidationErrors
Merge fixes from PR 115 into `master`
@mamutmk5 mamutmk5 changed the title DMED-119 - Update "master_schulcloud" from "master" DMED-119 - BC-8072 - Update "master_schulcloud" from "master" Oct 2, 2024
@bergatco bergatco merged commit ebd5c96 into master_schulcloud Oct 7, 2024
2 of 4 checks passed
@bergatco bergatco deleted the update_master-schulcloud_from_master branch October 7, 2024 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants