DMED-119 - BC-8072 - Update "master_schulcloud" from "master" #50

bergatco · 2024-07-16T09:54:48Z

Description

Add latest changes from hpi-schul-cloud:master to master_schulcloud.

Links to Tickets or other pull requests

https://ticketsystem.dbildungscloud.de/browse/DMED-119
https://ticketsystem.dbildungscloud.de/browse/BC-8072
DMED-119 - Spike: direktes Einbinden der edu-sharing-Suchumgebung in den "Lern-Store" der SVS dof_app_deploy#713
DMED-119 - Spike: direktes Einbinden der edu-sharing-Suchumgebung in den "Lern-Store" der SVS schulcloud-server#4692
DMED-119 - Spike: direktes Einbinden der edu-sharing-Suchumgebung in den "Lern-Store" der SVS schulcloud-client#3387
DMED-119 - Spike: direktes Einbinden der edu-sharing-Suchumgebung in den "Lern-Store" der SVS nuxt-client#2933

Links to deployments :
https://dmed-119-integration-of-search-environment.dbc.dbildungscloud.dev/
https://dmed-119-integration-of-search-environment.nbc.dbildungscloud.dev/
https://dmed-119-integration-of-search-environment.brb.dbildungscloud.dev/

Changes

Datasecurity

Deployment

New Repos, NPM pakages or vendor scripts

Screenshots of UI changes

Approval for review

All points were discussed with the ticket creator, support-team or product owner. The code upholds all quality guidelines from the PR-template.

Notice: Please remove the WIP label if the PR is ready to review, otherwise nobody will review it.

- extracts language codes (ISO 639-1) from natural language strings or potentially ambiguous/malformed language strings ("en-US", "en_US", "fr_FR", "französisch", "englisch" etc.) - the validity of the language code recognition can be tested (and extended) by using the test_language_mapper.py class

…des according to ISO 693-1) - add: NormLanguagePipeline to settings.py - feat: implement language string normalization for "lom.general.language" and "lom.educational.language" - change: "lom.educational.language" to multi-value field (otherwise it would drop all other strings besides the first one) Signed-off-by: criamos <[email protected]>

Signed-off-by: criamos <[email protected]>

feat: language string normalization pipeline (ISO 639-1) and serlo_spider v0.3.2

- upgrade all (safely) upgradeable dependencies to the newest versions - tested against Python 3.11.6 with serlo_spider Signed-off-by: criamos <[email protected]>

Signed-off-by: criamos <[email protected]>

- change: default 'httpx' timeout value from 5s to 30s - fix: get_headers()-method omits NoneType values from now on -- the method previously built a dictionary, where "Content-Type" would more often than not be of "NoneType", which would throw (correct) 'httpx'-Exceptions when trying to encode NoneTypes

- 'httpx' requires 'asyncio' to be enabled in Scrapy - see: https://docs.scrapy.org/en/latest/topics/asyncio.html Signed-off-by: criamos <[email protected]>

Signed-off-by: criamos <[email protected]>

- change the web_tools implementation to handle requests asynchronously -- currently uses a Semaphore from 'asyncio' to stay below 10 concurrent connections to the docker container --- when increasing the "MAX_QUEUE_SIZE"-setting in the "browserless/chrome" container, we might be able to increase the semaphore value accordingly for higher performance -- attention: if too many requests are made in parallel, the docker container will refuse those connections and items will be dropped! Signed-off-by: criamos <[email protected]>

Signed-off-by: criamos <[email protected]>

- see: https://docs.scrapy.org/en/latest/topics/extensions.html#periodic-log-extension Signed-off-by: criamos <[email protected]>

Signed-off-by: criamos <[email protected]>

- sodix_spider shares a Session object between all GraphQL calls from now on -- this should (significantly) increase performance and connection reliability (see: https://requests.readthedocs.io/en/latest/user/advanced/#session-objects) - fix: Type Warning ("Response"-parameter expected 'LomBase'-type because LomBase.shouldImport() was called (instead of 'self.shouldImport()'-override)) Signed-off-by: criamos <[email protected]>

- fix: DeprecationWarning that appeared with Scrapy 2.10 by setting its value to the recommended value "2.7" -- see: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation Signed-off-by: criamos <[email protected]>

Signed-off-by: criamos <[email protected]>

- ports Python's built-in "functools.lru_cache"-function to asyncio - see: https://github.com/aio-libs/async-lru Signed-off-by: criamos <[email protected]>

…e, improve Error-Handling feat/perf: use LRU cache for Thumbnail-Pipeline - feat: implements a cache for "thumbnail"-URls that discards 'least recently used' items first -- while debugging large crawlers it was observed that many website hosters serve generic placeholder-like images for items that don't have a unique thumbnail: by keeping the most commonly requested URLs (and their response) in a cache, we should be able to significantly reduce the amount of outgoing requests and traffic (and increase performance) change: replace 'httpx' in Thumbnail-Pipeline with Scrapy Requests for Splash - the screenshot and thumbnail pipeline worked previously in parallel to Scrapy's built-in scheduler, which comes with its own set of problems -- (ignoring the Scrapy scheduler means that we cannot control the load/traffic in a reasonable / responsible manner) feat: improve Splash Error-Handling for unsupported URLs - feat: use 'scrapy.FormRequest'-objects to handle Splash Requests (Splash is queried within the Scrapy Scheduler from now on) - feat: fallback to Playwright if Splash failed to retrieve Thumbnails -- if Splash fails to render a websites, it will set the splash_success flag to False and use Playwright instead -- Splash will be DEPRECATED in the future since it has proven itself more and more unreliable with modern web-pages - fix: several warnings in regard to shadowed variables

- by limiting concurrent access to 'getUrlData'-method by using two semaphores (one for Splash, one for Playwright), we can increase crawler performance without running into PoolTimeout Exceptions (which typically happen when too many requests are fired at once) Signed-off-by: criamos <[email protected]>

- work-in-progress: for now, 25 seems to be a reasonable limit Signed-off-by: criamos <[email protected]>

- fix: during SVG handling the "Content-Type"-field of "response.body" is a bytes-object, but we compared it to a string value -- feat: made sure that we also typecheck the "mimetype" (and convert it if needed) before trying to save it to its 'thumbnail.mimetype'-field - change: initiate "splash"-success-flag with None instead of True -- this should give a better indication if the Thumbnail-Pipeline tried to use Splash up until a specific point in time

- change: observed connection timeouts in both v1 and v2, therefore changed the timeout setting to 60s (from default: 30s) Signed-off-by: criamos <[email protected]>

- feat: set WebTools (Playwright) default to "trafilatura" -- when extracting data with Playwright, we'll try to use "trafilatura" for fulltext extraction first and only fall back to html2text if trafilatura returned None - docs: clean up ToDos, add TypeHints

… URL - if Splash fails to render a website or a thumbnail URL couldn't be downloaded due to unexpected HTTP Status Codes, try to grab a screenshot of the website as a last resort / fallback Signed-off-by: criamos <[email protected]>

- during testing/debugging it was observed that even the lenient timeout of 60s (default: 30s) per job is oftentimes not enough time for some website responses Signed-off-by: criamos <[email protected]>

- change methods to async where necessary to await the coroutines of WebTools and be able to work with their data - fix/optimize imports Signed-off-by: criamos <[email protected]>

- change: set default WebEngine to Playwright (Splash isn't able to render most webpages anymore) - change: make 'parse'-method async - fix: drop items where 'jslde' cannot extract a JSON-LD (because it doesn exist) -- this happens typically on "overview"-pages Signed-off-by: criamos <[email protected]>

…on in h5p uploader

…erlin_spider

@bergatco

- to mitigate "httpx.ReadError"s upon dropped or reset HTTP connections (to / from the edu-sharing repository), the edu-sharing connector will use a shared "requests.Session"-object from now on Background: - "httpx.ReadError"s were observed for HTTP Requests that contained (potentially huge) payloads, especially during Thumbnail uploads (via set_node_preview()) and fulltext uploads (via set_node_text()) - since we cannot reasonably limit the size of the uploaded data, switching back to "requests" to handle these requests (hopefully more graceful than httpx) should fix these HTTP Connection Pool issues - the httpx discussion at encode/httpx#3067 pointed towards similar errors which users observed for payloads above 1 MiB PS: Thank you, Constantin (@bergatco) and Paul, for the collective debugging session!

- when using the 'resetVersion=true' Spider Argument, logging messages did not correctly reflect what was happening during the hash check in the EduSharingCheckPipeline - LomBase now stores a custom_setting key ("EDU_SHARING_FORCE_UPDATE"), which can be accessed via "spider.custom_settings" for later readouts - if an active 'resetVersion' or 'forceUpdate' setting was detected, the pipeline's debug message should be easier to understand that even though an item's hash hasn't changed, the item will get updated nonetheless

…auses

@bergatco

- after refactoring LomBase, some method calls in merlin_spider were missing awaits and async declarations - fix ValidationError for getHash() method: - getHash used to submit an 'int'-value to the edu-sharing API, but the API actually expects a string value - optimized imports PS: thanks Constantin (@bergatco) for the debug logs!

Fix "httpx"-related ReadErrors in `es_connector`

Merge recent HTTPX-related fixes into `master`

…_connector (expected type: str) - feat: additional check in the pipelines for "lom.technical.duration"-values and conversion to string

background: - some old crawlers (e.g. merlin_spider) returned an Integer value in their "getHash()"-method, which now causes pydantic ValidationErrors with the new API client (since the API expects a string value for this property)

- this fixes pydantic ValidationErrors in the es_connector during POST requests

…peline - fix AttributeErrors when a website-screenshot fails: - if the first website-screenshot fails (e.g. when the first URL in LOM technical location points towards a .mp3 or .mp4), screenshot_bytes won't be available to work with - this caused an AttributeError when the pipeline tried to convert a "None"-type to a thumbnail - refactor: extracted the functionality of taking a website screenshot with playwright in the pipeline into its own method - feat: second website screenshot fallback - if a second URL is available in LOM technical location, the pipeline will try to take a screenshot of that URL before finally giving up - this could happen in edge-cases where the first URL in LOM technical location is unavailable for website-screenshots, but the array contains a second URL that might be more fruitful for a screenshot (e.g. a landing page) - fix: 2 warnings w.r.t. too broad exception clauses

Improved Exception Handling during website-screenshot fallback and several fixes for `pydantic` ValidationErrors

Merge fixes from PR 115 into `master`

Criamos added 30 commits October 27, 2023 15:05

docs: explain SERLO_INSTANCE default setting

cb2d694

Signed-off-by: criamos <[email protected]>

Merge pull request openeduhub#94 from openeduhub/feat_languagemapper

d9a998f

feat: language string normalization pipeline (ISO 639-1) and serlo_spider v0.3.2

build/chore: upgrade dependencies (Scrapy 2.11 ...)

6a37ea8

- upgrade all (safely) upgradeable dependencies to the newest versions - tested against Python 3.11.6 with serlo_spider Signed-off-by: criamos <[email protected]>

build: upgrade Dockerfile to Python v3.11.6

34c24cf

Signed-off-by: criamos <[email protected]>

build: upgrade valuespace_converter.Dockerfile to Python v3.11.6

2c38142

Signed-off-by: criamos <[email protected]>

build: introduce 'httpx' dependency for async requests

9528934

Signed-off-by: criamos <[email protected]>

change: enable 'asyncio'-support in Scrapy settings.py

fb0512f

- 'httpx' requires 'asyncio' to be enabled in Scrapy - see: https://docs.scrapy.org/en/latest/topics/asyncio.html Signed-off-by: criamos <[email protected]>

change: replace synchronous 'requests'-calls with async 'httpx'-requests

31e5459

Signed-off-by: criamos <[email protected]>

change: async "parse"-method / await web_tools

2b1c12d

Signed-off-by: criamos <[email protected]>

change: enable Scrapy "PeriodicLog"-extension

37a5044

- see: https://docs.scrapy.org/en/latest/topics/extensions.html#periodic-log-extension Signed-off-by: criamos <[email protected]>

change: async parse method

ce6a5d8

Signed-off-by: criamos <[email protected]>

change: shared 'httpx.AsyncClient()' for es_connector.py

9e10759

Signed-off-by: criamos <[email protected]>

change: shared 'httpx.AsyncClient()' for Thumbnail-Pipeline

f1cfb39

Signed-off-by: criamos <[email protected]>

build: introduce 'async-lru'-package to dependencies

fdbbd7f

- ports Python's built-in "functools.lru_cache"-function to asyncio - see: https://github.com/aio-libs/async-lru Signed-off-by: criamos <[email protected]>

change: disable timeouts, use Semaphore

be2e6d7

- work-in-progress: for now, 25 seems to be a reasonable limit Signed-off-by: criamos <[email protected]>

build: upgrade to browserless v2

80fb90c

- change: observed connection timeouts in both v1 and v2, therefore changed the timeout setting to 60s (from default: 30s) Signed-off-by: criamos <[email protected]>

change: browserless/chrome 'timeout'-setting to 120s

4bae38f

- during testing/debugging it was observed that even the lenient timeout of 60s (default: 30s) per job is oftentimes not enough time for some website responses Signed-off-by: criamos <[email protected]>

change: LomBase parse / mapResponse / getUrlData methods to async

8b7541f

- change methods to async where necessary to await the coroutines of WebTools and be able to work with their data - fix/optimize imports Signed-off-by: criamos <[email protected]>

bergatco and others added 26 commits September 25, 2024 13:54

DMED-119 - fix issue with wrong initialization of FoundTooManyExcepti…

cc5038d

…on in h5p uploader

DMED-119 - add missing await to LomBase.getUrlData function call in m…

2aa1a2a

…erlin_spider

DMED-119 - fix getUrlData self issue

34d4748

fix 10 weak warnings w.r.t. variable names and too broad exception cl…

ab2c80a

…auses

DMED-119 - fix merlin_spider issues

5a9dc04

add merlin_spider pyCharm runConfiguration

dd6f082

Merge pull request openeduhub#113 from openeduhub/fix_httpx_readerrors

eda6ea4

Fix "httpx"-related ReadErrors in `es_connector`

DMED-119 - fixed delete_too_many_children function in h5p_upload

e5e2446

Merge pull request openeduhub#114 from openeduhub/develop

8e43c33

Merge recent HTTPX-related fixes into `master`

DMED-119 - (hopefully) fix await issue in mediothek_pixiothek_spider

993aa2f

DMED-119 - remove env.local

62c7e50

DMED-119 - fix validation error for "cclom:duration"

090da20

fix: ValidationError during handling of "cclom:duration"-values in es…

af3abe4

…_connector (expected type: str) - feat: additional check in the pipelines for "lom.technical.duration"-values and conversion to string

fix: add missing 'license.internal' mapping for "NONPUBLIC" licenses

b75e140

feat: convert "BaseItem.hash"-values to a string

ffc9f75

background: - some old crawlers (e.g. merlin_spider) returned an Integer value in their "getHash()"-method, which now causes pydantic ValidationErrors with the new API client (since the API expects a string value for this property)

fix: convert LOM General aggregationLevel int values to str

084804f

- this fixes pydantic ValidationErrors in the es_connector during POST requests

style: code formatting via black

d453684

Merge pull request openeduhub#115 from openeduhub/2024-09-26-fixes

6c196ab

Improved Exception Handling during website-screenshot fallback and several fixes for `pydantic` ValidationErrors

Merge pull request openeduhub#116 from openeduhub/develop

63aa77b

Merge fixes from PR 115 into `master`

Merge branch 'openeduhub-master' into update-from-openeduhub

984a6de

Merge pull request #55 from hpi-schul-cloud/update-from-openeduhub

211fbb7

Merge branch 'master' into update_master-schulcloud_from_master

9084dc8

mamutmk5 approved these changes Oct 2, 2024

View reviewed changes

mamutmk5 changed the title ~~DMED-119 - Update "master_schulcloud" from "master"~~ DMED-119 - BC-8072 - Update "master_schulcloud" from "master" Oct 2, 2024

bergatco merged commit ebd5c96 into master_schulcloud Oct 7, 2024
2 of 4 checks passed

bergatco deleted the update_master-schulcloud_from_master branch October 7, 2024 06:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DMED-119 - BC-8072 - Update "master_schulcloud" from "master" #50

DMED-119 - BC-8072 - Update "master_schulcloud" from "master" #50

bergatco commented Jul 16, 2024 •

edited

Loading

DMED-119 - BC-8072 - Update "master_schulcloud" from "master" #50

DMED-119 - BC-8072 - Update "master_schulcloud" from "master" #50

Conversation

bergatco commented Jul 16, 2024 • edited Loading

Description

Links to Tickets or other pull requests

Changes

Datasecurity

Deployment

New Repos, NPM pakages or vendor scripts

Screenshots of UI changes

Approval for review

bergatco commented Jul 16, 2024 •

edited

Loading