Galleries with broken thumbnail urls are generating .webp files with html content. #349

Open
maltbeverage opened this issue Nov 8, 2024 · 17 comments

Comments

@maltbeverage

After webp images started showing up, I noticed a few galleries were pulling in broken webp images. On closer inspection, the downloaded files contain HTML that shows a 404 error.

For example, in 538028 the first thumbnail references an invalid URL:

/galleries/3115455/1t.jpg.webp

Looks like an issue with nhentai. I can remove the .webp extension from the thumbnail url

/galleries/3115455/1t.jpg

and the thumbnail image will load in.

The broken thumbnail links to page 1 of the doujin, which does indeed have a working image:

/galleries/3115455/1.jpg

So this broken thumbnail might be messing up the parsing somehow. When coming across these broken thumbnails, I think it attempts to download

/galleries/3115455/1.webp

which does not exist, the actual file is

/galleries/3115455/1.jpg

The download still saves successfully as a .webp file, but the contents are the HTML of the 404 error:

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

I'm thinking this might be a parsing logic issue if the thumbnail url is somehow used to determine the file extension of the downloaded image file.

This only affects around 5 galleries at the moment.

@maltbeverage changed the title from "Galleries with broken thumnail urls are generating .webp files with html content." to "Galleries with broken thumbnail urls are generating .webp files with html content." on Nov 8, 2024
@maltbeverage
Author

Forgot to add, I'm on the latest commit, f30ff59.

@maltbeverage
Author

Found it, I think:

_, ext_name = os.path.basename(i.img.attrs['data-src']).rsplit('.', 1)

If I'm reading it right, this explains the parsing of the invalid webp URL. In this case it's a broken thumbnail URL, but if nhentai ever used a thumbnail extension different from the actual doujin image extension, that might bork the downloader.
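To illustrate, here is a minimal sketch of what that line does with the two thumbnail paths from this issue. The helper name ext_from_thumbnail is hypothetical and only wraps the quoted line; it is not part of the project.

```python
import os

def ext_from_thumbnail(data_src):
    """Mimic the quoted parser line: keep whatever follows the last dot."""
    _, ext_name = os.path.basename(data_src).rsplit('.', 1)
    return ext_name

# Double-extension thumbnail: the last extension wins, so the
# downloader would go looking for a .webp page image.
print(ext_from_thumbnail('/galleries/3115455/1t.jpg.webp'))

# Conventional thumbnail name: the extension matches the page image.
print(ext_from_thumbnail('/galleries/3115455/1t.jpg'))
```

For the first path the result is webp even though the full-size image is a .jpg, which matches the failed request for 1.webp described above.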

I suppose the alternative would be to follow each url on the gallery page and extract the image url one at a time, but that sounds a bit expensive. Maybe error checking for a 404 code when downloading and failing with an error might be a good way to go. I'd rather have a failure on rare edge cases than an archive with missing images.
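A rough sketch of that kind of check, assuming the HTTP client exposes a status code and the response body. DownloadError and image_bytes_or_fail are hypothetical names for illustration, not the project's downloader API.

```python
class DownloadError(Exception):
    """Raised when the server did not return the requested image."""

def image_bytes_or_fail(status_code, body, url):
    """Return the body only for a 200 response; otherwise fail loudly."""
    if status_code != 200:
        # Failing here avoids writing an HTML error page to disk
        # under an image file name.
        raise DownloadError(f'{url} returned HTTP {status_code}')
    return body

try:
    image_bytes_or_fail(404, b'<html>...</html>', '/galleries/3115455/1.webp')
except DownloadError as err:
    print(err)
```

The idea is simply that the download aborts with an error instead of leaving an archive with a silently broken page.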

@RicterZ
Owner

RicterZ commented Nov 9, 2024

I'll check it out

@RicterZ
Owner

RicterZ commented Nov 9, 2024

Need more sample doujinshi; https://t5.nhentai.net/galleries/3115455/1t.jpg.webp returns 403

@RicterZ
Owner

RicterZ commented Nov 9, 2024

After some investigation, I found that:

  • The preview URL, https://t5.nhentai.net/galleries/3115455/1t.jpg.webp, returns 404
  • The real image URL is https://i.nhentai.net/galleries/3115455/1.jpg, which looks like either the uploader's mistake or a bug in the nhentai website
  • Other webp URLs in this doujinshi, such as https://i3.nhentai.net/galleries/3115455/136.webp, work fine

Need to determine whether it is an isolated case or the norm.

@maltbeverage
Author

Here are all of the codes with bad thumbnail image urls. I've scraped everything released recently to check.

538005
538006
538020
538028
538045

  • I agree it's an issue with the website and not widespread.
  • Since this popped up yesterday, I haven't seen the issue on any new releases.

Thanks for taking a look.

@maltbeverage
Author

I noticed some more galleries with issues:

538053
538058
538063
538087
538088
538090
538098
538148
538159

Looks like this'll be an issue until it's fixed on the nhentai side.

Here is a quick and dirty workaround if anyone needs to get this working:

maltbeverage@ea52cff

This should split the filename on its two extensions and then use the first one. It probably introduces new edge cases, but it works for now.

@DeadlyShadow71

Not sure if my problem is related; if not, please ignore this and I will open a new issue. Since the downtime of nhentai earlier this week I can't download certain doujinshi. I hadn't updated the script until earlier today to try to fix it, but it keeps failing in the same way.

[11:56:00] doujinshi_parser: Fetching doujinshi information of id 538003
[11:56:01] doujinshi_parser: Tried yo get image id failed

It stays there for some time and just dies. I use the favorites method, but even when trying to download just that one, I get the same result.
I'm not knowledgeable enough in either Python or the script's interaction with the site to know what might cause it, so I'm not sure if the same workaround would work for me.

@maltbeverage
Author

> Not sure if my problem is related; if not, please ignore this and I will open a new issue. Since the downtime of nhentai earlier this week I can't download certain doujinshi. I hadn't updated the script until earlier today to try to fix it, but it keeps failing in the same way.
>
> [11:56:00] doujinshi_parser: Fetching doujinshi information of id 538003
> [11:56:01] doujinshi_parser: Tried yo get image id failed
>
> It stays there for some time and just dies. I use the favorites method, but even when trying to download just that one, I get the same result. I'm not knowledgeable enough in either Python or the script's interaction with the site to know what might cause it, so I'm not sure if the same workaround would work for me.

Not the same issue as this one. nhentai started using webp images, which the parser did not support until f30ff59 was committed a couple of days ago.

If you git clone and install from source, it should work. I'm not sure if this fix has been pushed out to any other install methods.

@DeadlyShadow71

I'm using the nhentaiGUI, so I just edited the couple of lines in the files I have. It works now; thanks for the info and the fix. Solved on my part.

@poohzaza166

I have the same problem. It seems to happen with any doujin uploaded recently, e.g. 538703.
But when running the parser module as a standalone Python file, the code seems to run normally:

print(doujinshi_parser("538703"))

@NyahKen

NyahKen commented Nov 19, 2024

> I'm using the nhentaiGUI, so I just edited the couple of lines in the files I have. It works now; thanks for the info and the fix. Solved on my part.

How did you fix it? I have the same issue.

@DeadlyShadow71

> > I'm using the nhentaiGUI, so I just edited the couple of lines in the files I have. It works now; thanks for the info and the fix. Solved on my part.
>
> How did you fix it? I have the same issue.

Find the nhentai files in your Python folder. Mine is "AppData\Local\Programs\Python\Python312\Lib\site-packages\nhentai", for example; yours should be about the same unless you installed it differently.
The files are doujinshi.py, parser.py, and utils.py.
Open them with Notepad or another text editor, look for the lines that the commit changed, and edit them to match. That's what I did at least, since just replacing the files didn't work in my case.

The commit is f30ff59. Look at the differences and use the line numbers at the left to find each line (or just search for it), then edit, save, and try again. If that doesn't work, maybe replacing the files will work for you.
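If the folder is hard to find, Python itself can report it. This is a minimal sketch, assuming the package is importable as nhentai (as the site-packages path above suggests); package_dir is a hypothetical helper name, not something the project provides.

```python
import importlib.util
import os

def package_dir(name):
    """Return the directory an installed package lives in, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None  # not installed, or no single file location
    return os.path.dirname(spec.origin)

# Prints the folder holding doujinshi.py, parser.py and utils.py,
# or None when the package is not installed for this interpreter.
print(package_dir('nhentai'))
```

Run it with the same Python interpreter that the GUI or script uses, otherwise it may point at a different installation.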

@EZ-Melon

EZ-Melon commented Nov 20, 2024

I've encountered another weird bug relating to broken webp images. I'm trying to download 538999 on the latest release (0.5.13) and it seems that the first image is corrupt somehow, which is preventing the pdf file from being created.

Screenshot 2024-11-19 203718

@DeadlyShadow71

> I've encountered another weird bug relating to broken webp images. I'm trying to download 538999 on the latest release (0.5.13) and it seems that the first image is corrupt somehow, which is preventing the pdf file from being created.
>
> Screenshot 2024-11-19 203718

I save everything as .cbz, which works fine apart from the first page being corrupted in roughly every 10th doujin, so I think the webp is failing to download correctly or something. I guess img2pdf.py can't handle the first page not working and just dies. Not sure if the problem is fixable by adjusting the script or if it's a problem on nhentai's end. My advice is to use the .cbz function for the ones that don't work as a pdf until it's fixed, unless you find a way to fix it yourself.

@maltbeverage
Author

maltbeverage commented Nov 20, 2024

Previously these broken thumbnails on nhentai with double extensions (example 3122455/1t.jpg.webp) were invalid and didn't load on the gallery page. Now they do load. This means nhentai fixed that issue by just changing the actual thumbnail image file to match the broken URL in the HTML. It probably indicates that the thumbnail filenames will not be updated to the normal convention on the nhentai website, so the current parsing method will fail to download the affected images on this handful of releases.

If you don't mind updating two lines of code, here is a workaround: maltbeverage@ea52cff

I think to resolve the issue:

  1. There should be an error check on image download, and if a non-200 code is encountered, the entire download fails.
  2. There should be an image validation check to ensure that the downloaded file's MIME type is an image. Additional checking for corruption would be nice, but tricky, because there are a number of images on nhentai that fail validation but are otherwise viewable. Generally it's preferable to archive these.
  3. Instead of predicting the download file extensions from the thumbnails, it may be necessary to actually follow each of the pages in the gallery and extract the full-size image paths. This would be more expensive to do and cause more traffic to the site.
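As a rough sketch of point 2, the leading bytes of a download can be compared against well-known image signatures before the file is kept. This is not the project's code; sniff_image_type is a hypothetical helper. It only inspects the header, so it would catch the HTML 404 page but not a truncated or otherwise damaged image.

```python
def sniff_image_type(data):
    """Guess the image type from magic bytes; None if it is not an image."""
    if data.startswith(b'\xff\xd8\xff'):
        return 'jpeg'
    if data.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    if data[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    # WebP is a RIFF container whose form type at offset 8 is "WEBP".
    if data[:4] == b'RIFF' and data[8:12] == b'WEBP':
        return 'webp'
    return None

# The nginx error page saved as a .webp file is rejected.
not_an_image = b'<html>\n<head><title>404 Not Found</title></head>'
print(sniff_image_type(not_an_image))
```

Checking only the signature is deliberately lenient, which fits the preference above for still archiving images that fail stricter validation but remain viewable.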

@xchud

xchud commented Dec 17, 2024

Yesterday I had a batch with broken thumbnails, so I looked into it too. It seems that there's now a mix of thumbnail names such as 1t.jpg.webp and 1t.webp, depending on the extension of the full image. Since the full image's extension always seems to be indicated in the thumbnail name, I used a regex to fix it for myself, as follows.

example gallery: /g/538696/

Change lines 158-160 in parser.py from:

    for i in html.find_all('div', attrs={'class': 'thumb-container'}):
        _, ext_name = os.path.basename(i.img.attrs['data-src']).rsplit('.', 1)
        ext.append(ext_name)

to:

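    # Assumes `re` is imported at the top of parser.py.
    for i in html.find_all('div', attrs={'class': 'thumb-container'}):
        # Capture an optional inner extension plus the final one, e.g. "1t.jpg.webp" or "1t.webp".
        ext_reg = re.search(r'(?:\.(\w+))?\.(\w+)$', os.path.basename(i.img.attrs['data-src']))
        # Prefer the inner extension (the full image's) when there are two.
        ext_name = ext_reg.group(1) if ext_reg.group(1) else ext_reg.group(2)
        ext.append(ext_name)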

I agree that it's annoying that no error is thrown here.
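The regex can also be tried on its own, outside parser.py. The sketch below reuses the same pattern; full_image_ext is a hypothetical wrapper, and the None guard for names without any extension is an addition that the snippet above does not have.

```python
import re

# Same pattern as in the parser.py change above.
EXT_RE = re.compile(r'(?:\.(\w+))?\.(\w+)$')

def full_image_ext(thumb_name):
    """Pick the extension to use for the full-size image."""
    match = EXT_RE.search(thumb_name)
    if match is None:
        return None  # no extension at all
    # group(1) is the inner extension of a double-extension name,
    # group(2) is the final extension.
    return match.group(1) or match.group(2)

for name in ('1t.jpg.webp', '1t.webp', '1t.jpg'):
    print(name, '->', full_image_ext(name))
```

With a double extension the inner one is returned, and with a single extension the only one is returned, which is the behaviour described for the mixed thumbnail names.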
