Crash on JSONDecodeError from body of YouTube page #171

wjdp · 2021-01-26T01:59:49Z

I have some code that pulls metadata from YouTube:

response = requests.get(video_url)
metadata = extruct.extract(response.text, base_url="https://youtube.com")

I've recently noticed crashes, but only on some videos.

No crash: https://www.youtube.com/watch?v=ZY48KUAZKhM https://www.youtube.com/watch?v=ZlVI7YJGHq0
Crash: https://www.youtube.com/watch?v=987wzJ2NHBE https://www.youtube.com/watch?v=0-EF60neguk

The common factor among the videos that crash is an apostrophe in the channel name!

Traceback (most recent call last):
  File "/home/will/local/breda/src/dredger/ingest/tests/test_youtube.py", line 72, in test_one
    youtube.get_video_data("https://www.youtube.com/watch?v=987wzJ2NHBE")
  File "/home/will/local/breda/src/dredger/ingest/youtube.py", line 46, in get_video_data
    metadata = extruct.extract(response.text, base_url="https://youtube.com")
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/_extruct.py", line 108, in extract
    output[syntax] = list(extract(document, base_url=base_url))
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in extract_items
    return [
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in <listcomp>
    return [
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 38, in _extract_items
    data = jstyleson.loads(HTML_OR_JS_COMMENTLINE.sub('', script),strict=False)
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/jstyleson.py", line 123, in loads
    return json.loads(dispose(text), **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 211 (char 210)

I haven't had a chance today to dig much further than triaging the above.


udit19281 · 2022-05-13T13:50:51Z

I haven't been able to reproduce the issue. Both of your "Crash" links point to videos that have since been removed, which may be why you were getting this error. I suggest checking that the video links are still valid before passing the page to extruct.extract().
Here is the code that I used:

Code:
import extruct
import requests
from w3lib.html import get_base_url

crash_links = [
    'https://www.youtube.com/watch?v=987wzJ2NHBE',
    'https://www.youtube.com/watch?v=0-EF60neguk',
]

for video_url in crash_links:
    response = requests.get(video_url)
    # Resolve relative URLs against the page's own base URL.
    base_url = get_base_url(response.text, response.url)
    metadata = extruct.extract(
        response.text,
        base_url=base_url,
        uniform=True,
        syntaxes=['json-ld', 'microdata', 'opengraph'],
    )
    print(metadata)

Output:
{'microdata': [], 'json-ld': [], 'opengraph': []}
{'microdata': [], 'json-ld': [], 'opengraph': []}

AbhinavSE · 2022-05-13T15:07:25Z

I reproduced the issue with these YouTube links: https://www.youtube.com/watch?v=-J2e8OlBdPs, https://www.youtube.com/watch?v=qP07oyFTRXc, https://www.youtube.com/watch?v=BUrnfkxwozM.

As @wjdp suggested, the cause is the apostrophe in the channel name. The page's JSON-LD encodes it as the hex escape "\x27", which is valid in a JavaScript string literal but not in JSON (JSON only accepts \uXXXX for code-point escapes), so json.loads() raises the "Invalid \escape" error shown in the traceback. I created pull request #195, which replaces such hex escapes with the characters they represent before the text is passed to json.loads().
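
For anyone who needs a stopgap, here is a minimal standalone sketch of the idea, using only the standard library. It is not the code from #195 and not part of extruct: `replace_hex_escapes` and the sample string are hypothetical names made up for illustration. It also differs slightly from the approach described above: it rewrites "\xNN" as the JSON-legal "\u00NN" instead of substituting the raw character, on the assumption that this is safer when the escaped character is a double quote ("\x22").

```python
import json
import re

# Hypothetical helper, not extruct's or PR #195's actual code.
# Matches a \xNN escape whose backslash is not itself escaped: any run of
# backslashes directly before it must come in pairs (captured in group 1).
_HEX_ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\x([0-9a-fA-F]{2})')


def replace_hex_escapes(text):
    """Rewrite JavaScript-style \\xNN escapes as JSON-legal \\u00NN escapes."""
    return _HEX_ESCAPE.sub(lambda m: m.group(1) + "\\u00" + m.group(2), text)


# Made-up sample resembling the failing input: an apostrophe written as \x27.
raw = r'{"author": "Bob\x27s Channel"}'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("before:", exc.msg)

data = json.loads(replace_hex_escapes(raw))
print("after:", data["author"])
```

With the sample above, the first json.loads() call fails and the second returns the author as "Bob's Channel". An escaped backslash followed by "x27" (that is, "\\x27") is left alone, since in JSON that already means a literal backslash followed by the text x27.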

AbhinavSE mentioned this issue May 13, 2022

Solves issue #171 #195

Open
