Galleries with broken thumbnail urls are generating .webp files with html content. #349

maltbeverage · 2024-11-08T18:06:07Z

After webp images started showing up, I noticed a few galleries were pulling in broken webp images. On closer inspection, the downloaded files contain html that shows a 404 error.

Example 538028, the first thumbnail is referencing an invalid url:

/galleries/3115455/1t.jpg.webp

Looks like an issue with nhentai. I can remove the .webp extension from the thumbnail url

/galleries/3115455/1t.jpg

and the thumbnail image will load in.

The broken thumbnail links to page 1 of the doujin, which does indeed have a working image:

/galleries/3115455/1.jpg

So this broken thumbnail might be messing up the parsing somehow. When coming across these broken thumbnails, I think it attempts to download

/galleries/3115455/1.webp

which does not exist, the actual file is

/galleries/3115455/1.jpg

and this successfully saves as a .webp file, but the contents are the html of the 404 error.

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

I'm thinking this might be a parsing logic issue if the thumbnail url is somehow used to determine the file extension of the downloaded image file.

This only affects around 5 galleries at the moment.


maltbeverage · 2024-11-08T18:29:03Z

Forgot to add, I'm on the latest commit, f30ff59.

maltbeverage · 2024-11-08T19:02:35Z

Found it I think:

nhentai/nhentai/parser.py

Line 156 in f30ff59

_, ext_name = os.path.basename(i.img.attrs['data-src']).rsplit('.', 1)

If I'm reading it right, this explains the parsing of the invalid webp url. In this case it's a broken thumbnail url, but if there was ever a case where nhentai implemented a different thumbnail extension from the actual doujin image extension, that might bork the downloader.
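To make the failure concrete, here is a small sketch of what that split does with two thumbnail paths from this thread. The `guess_extension` helper is a hypothetical name used only for illustration; the `rsplit('.', 1)` call is the part taken from parser.py.

```python
import os


def guess_extension(thumb_url):
    # Same split as the parser.py line above: keep only what follows the last dot.
    _, ext_name = os.path.basename(thumb_url).rsplit('.', 1)
    return ext_name


# A thumbnail with a single extension gives the expected result.
print(guess_extension('/galleries/3115455/1t.jpg'))       # jpg
# The double-extension thumbnail yields 'webp', although page 1 is really 1.jpg.
print(guess_extension('/galleries/3115455/1t.jpg.webp'))  # webp
```

With 'webp' as the guessed extension, the downloader would ask for 1.webp, which is the url that returns the 404 page.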

I suppose the alternative would be to follow each url on the gallery page and extract the image url one at a time, but that sounds a bit expensive. Maybe error checking for a 404 code when downloading and failing with an error might be a good way to go. I'd rather have a failure on rare edge cases than an archive with missing images.
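A minimal sketch of the "fail on 404" idea, assuming the downloader has the HTTP status code at hand before it writes the file. `DownloadError` and `ensure_ok` are hypothetical names, not part of the nhentai code base.

```python
class DownloadError(Exception):
    """Raised when an image request does not return HTTP 200."""


def ensure_ok(url, status_code):
    # Refuse to save anything that is not a successful response,
    # so a 404 error page never ends up on disk as an image.
    if status_code != 200:
        raise DownloadError(f'{url} returned HTTP {status_code}')


ensure_ok('/galleries/3115455/1.jpg', 200)  # passes silently

try:
    ensure_ok('/galleries/3115455/1.webp', 404)
except DownloadError as exc:
    print(f'download failed: {exc}')
```

Raising instead of logging means the whole gallery download stops, which matches the preference for a visible failure over an archive with missing images.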

RicterZ · 2024-11-09T03:31:03Z

I'll check it out

RicterZ · 2024-11-09T03:53:21Z

Need more sample doujinshi, https://t5.nhentai.net/galleries/3115455/1t.jpg.webp returns 403

RicterZ · 2024-11-09T04:12:09Z

After some investigation, I found that:

The preview url, https://t5.nhentai.net/galleries/3115455/1t.jpg.webp, returns 404
The real image url is https://i.nhentai.net/galleries/3115455/1.jpg, which looks like an uploader's mistake or a bug on the nhentai website
Other webp urls in this doujinshi, such as https://i3.nhentai.net/galleries/3115455/136.webp, work fine

Need to determine whether it is an isolated case or the norm.

maltbeverage · 2024-11-09T05:13:15Z

Here are all of the codes with bad thumbnail image urls. I've scraped everything released recently to check.

538005
538006
538020
538028
538045

I agree it's an issue with the website and not widespread.
Since this popped up yesterday, I haven't seen the issue on any new releases.

Thanks for taking a look.

maltbeverage · 2024-11-09T21:36:53Z

I noticed some more galleries with issues:

538053
538058
538063
538087
538088
538090
538098
538148
538159

Looks like this'll be an issue until it's fixed on the nhentai side.

Here is a quick and dirty workaround if anyone needs to get this working:

maltbeverage@ea52cff

This should string split the two extensions and then use the first one.... probably introducing new edge cases with this, but it works for now.
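For readers who cannot open the commit, the idea described above can be sketched roughly as follows. This is an illustration of "split on the dots and take the first extension", not the contents of ea52cff; `first_extension` is a hypothetical name.

```python
import os


def first_extension(thumb_url):
    # '1t.jpg.webp' -> ['1t', 'jpg', 'webp']; '1t.jpg' -> ['1t', 'jpg']
    parts = os.path.basename(thumb_url).split('.')
    # Index 1 is the first extension after the file stem.
    return parts[1]


print(first_extension('/galleries/3115455/1t.jpg.webp'))  # jpg
print(first_extension('/galleries/3115455/1t.jpg'))       # jpg
```

One of the edge cases hinted at: a file stem that itself contains a dot would shift the index, and a name without any dot raises an IndexError.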

DeadlyShadow71 · 2024-11-10T11:03:11Z

Not sure if my problem is related; if not, please ignore and I will open a new issue. Since the downtime of nhentai earlier this week I can't download certain doujinshi. I hadn't updated the script until earlier today to try and fix it, but it keeps happening in the same way.

[11:56:00] doujinshi_parser: Fetching doujinshi information of id 538003
[11:56:01] doujinshi_parser: Tried yo get image id failed

It stays there for some time and just dies. I use the favorites method, but even when trying to download just that one I get the same result.
I'm not knowledgeable enough in either python or the script's interaction with the site to know what might cause it, so I'm not sure if the same workaround would work for me.

maltbeverage · 2024-11-10T16:41:20Z

> Not sure if my problem is related; if not, please ignore and I will open a new issue. Since the downtime of nhentai earlier this week I can't download certain doujinshi. I hadn't updated the script until earlier today to try and fix it, but it keeps happening in the same way.
>
> [11:56:00] doujinshi_parser: Fetching doujinshi information of id 538003
> [11:56:01] doujinshi_parser: Tried yo get image id failed
>
> It stays there for some time and just dies. I use the favorites method, but even when trying to download just that one I get the same result. I'm not knowledgeable enough in either python or the script's interaction with the site to know what might cause it, so I'm not sure if the same workaround would work for me.

Not the same issue as this one. nhentai started using webp images, which the parser did not support until f30ff59 was committed a couple of days ago.

If you git clone and install from source, it should work. I'm not sure if this fix has been pushed out to any other install methods.

DeadlyShadow71 · 2024-11-10T16:57:52Z

I'm using the nhentaiGUI, so I just edited the couple of lines in the files I have, works now, thanks for the info and fix. Solved on my part.

poohzaza166 · 2024-11-14T22:13:33Z

I have the same problem. It seems to happen with any doujin uploaded recently, e.g. 538703,
but when running the parser module as a standalone python file the code seems to run normally.

print(doujinshi_parser("538703"))

NyahKen · 2024-11-19T13:04:55Z

> I'm using the nhentaiGUI, so I just edited the couple of lines in the files I have, works now, thanks for the info and fix. Solved on my part.

How did you fix it? I have the same issue.

DeadlyShadow71 · 2024-11-19T13:19:30Z

> > I'm using the nhentaiGUI, so I just edited the couple of lines in the files I have, works now, thanks for the info and fix. Solved on my part.
>
> how did you fix it, I have the same issue

Find the nhentai files in your python folder. Mine is "AppData\Local\Programs\Python\Python312\Lib\site-packages\nhentai", for example; yours should be about the same unless you installed it differently.
doujinshi.py - parser.py - utils.py
Open them with notepad or some other text editor, look for the lines that the commit changed, and edit them to match. That's what I did at least, since just replacing the files didn't work in my case.

f30ff59 - look for the differences and the number at the left of each line, or just find the lines without it; edit them, save, and check whether it works. If it doesn't, maybe replacing the files works for you.

EZ-Melon · 2024-11-20T01:33:44Z

I've encountered another weird bug relating to broken webp images. I'm trying to download 538999 on the latest release (0.5.13) and it seems that the first image is corrupt somehow, which is preventing the pdf file from being created.

DeadlyShadow71 · 2024-11-20T07:16:44Z

> I've encountered another weird bug relating to broken webp images. I'm trying to download 538999 on the latest release (0.5.13) and it seems that the first image is corrupt somehow, which is preventing the pdf file from being created.

I save everything as .cbz, which works fine apart from roughly every 10th doujin's first page being corrupted, so I think the webp is failing to download correctly. I guess img2pdf.py can't handle the first page not working and just dies. Not sure if the problem is fixable by adjusting the script or if it's a problem on nhentai's end. My advice is to use the .cbz function for the ones that don't work as a pdf until it's fixed, unless you find a way to fix it yourself.

maltbeverage · 2024-11-20T08:15:38Z

Previously these broken thumbnails on nhentai with double extensions (example 3122455/1t.jpg.webp) were invalid and didn't load on the gallery page. Now they do load. This means nhentai fixed the issue by changing the actual thumbnail image files to match the broken urls in the html, which probably indicates that the thumbnail filenames will not be brought back to the normal convention on the nhentai website. The current parsing method will therefore fail to download the affected images on this handful of releases.

If you don't mind updating two lines of code, here is a workaround: maltbeverage@ea52cff

I think to resolve the issue:

There should be an error check on image download, and if a non-200 status code is encountered, the entire download fails.
There should be an image validation check to ensure that the downloaded file's mime type is an image. Additional checking for corruption would be nice, but tricky, because there are a number of images on nhentai that fail validation but are viewable otherwise. Generally it's preferable to archive these.
Instead of predicting the download file extensions from the thumbnails, it may be necessary to actually follow each of the pages on the gallery and extract the full-size image paths... this would be more expensive to do and cause more traffic to the site.
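The image validation point could be sketched as a check of the leading bytes of the downloaded data against well-known file signatures. `looks_like_image` is a hypothetical helper, and the set of formats covered here is an assumption about what the site serves.

```python
def looks_like_image(data):
    # Compare the leading bytes against well-known file signatures.
    if data.startswith(b'\xff\xd8\xff'):         # JPEG
        return True
    if data.startswith(b'\x89PNG\r\n\x1a\n'):    # PNG
        return True
    if data.startswith((b'GIF87a', b'GIF89a')):  # GIF
        return True
    # WebP is a RIFF container with 'WEBP' at byte offset 8.
    if data[:4] == b'RIFF' and data[8:12] == b'WEBP':
        return True
    return False


error_page = b'<html>\r\n<head><title>404 Not Found</title></head>'
print(looks_like_image(error_page))                       # False
print(looks_like_image(b'RIFF\x00\x00\x00\x00WEBPVP8 '))  # True
```

Because only the header is inspected, images that are slightly damaged but still viewable would pass, while an html error page saved under an image extension would be rejected.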

xchud · 2024-12-17T00:16:09Z

Yesterday I had a batch with broken thumbnails, so I looked into it too. It seems there is now a mix of thumbnail names such as 1t.jpg.webp and 1t.webp, depending on the extension of the full image. Since the real extension always seems to be indicated, I used a regex to fix it for myself, as follows.

example gallery: /g/538696/

Change lines 158-160 in parser.py
from:

    for i in html.find_all('div', attrs={'class': 'thumb-container'}):
        _, ext_name = os.path.basename(i.img.attrs['data-src']).rsplit('.', 1)
        ext.append(ext_name)

to:

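    # Assumes `re` is imported at the top of parser.py.
    for i in html.find_all('div', attrs={'class': 'thumb-container'}):
        # Capture an optional inner extension plus the final one ('.jpg.webp' or '.webp').
        ext_reg = re.search(r'(?:\.(\w+))?\.(\w+)$', os.path.basename(i.img.attrs['data-src']))
        # Prefer the inner extension when present, otherwise use the final one.
        ext_name = ext_reg.group(1) if ext_reg.group(1) else ext_reg.group(2)
        ext.append(ext_name)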

I agree that it not throwing an error here is annoying.

maltbeverage changed the title ~~Galleries with broken thumnail urls are generating .webp files with html content.~~ Galleries with broken thumbnail urls are generating .webp files with html content. Nov 8, 2024
