Parse embedded metadata in PDF files #3108
base: develop
Preliminary look. I'm mainly concerned with the implementation to fetch the data from the pdf. I would like to investigate grabbing the string via docnet, which we already use for other things in Kavita.
Implemented review feedback. Was not sure about the placement of |
Overall code looks good, minus a few style points. Once the code is in shape, I can spare some time to pull it down and ensure it doesn't break anything.
It would also be stellar if you could update the wiki along with this: document the field mapping, as we do for the epub reader, and perhaps add to our calibre guide any settings needed to ensure the metadata is written into the PDF.
Addressed the second round of comments and added documentation in Kareadita/Wiki-Nextra#13.
Did some initial testing of the PR and came across 2 aspects that need to be looked into:
1. File locking. Once the scanner hits the files, it doesn't seem to let go of the file lock. I couldn't move files around or modify them on the host file system after Kavita scanned them.
2. Long scan times per file. Even on my fairly powerful desktop, scans were taking 5-6 seconds per file. It took almost an hour to bring in the 749 files I have in a test library. This is only going to get worse on lower-powered hardware, especially if it has to load the entire file to search for metadata. People's TTRPG collections are going to have 300MB+ files, and hundreds of them, if not thousands. That can easily start to add up to weeks of scanning time.
@DieselTech do you see the same behavior re: (1) for both epub and pdf? For (2), were those 749 files all PDFs? If so, do you have comparison numbers for epub, so I know what I ought to aim for? It might inherently be a difficult problem, as there may not be a way around reading the whole file, but there may be an exploitable file structure in PDF that could be used.
For (2), they were all PDF files. I didn't test any epub with this branch, but I can, just to get some comparison numbers.
It turns out that PDF files indeed have an index that allows much faster access to the metadata than the implementation here, though accessing it is a bit trickier. I'm still working on the new approach.
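For anyone following along, here is a rough sketch of what that index lookup starts from. It is not the code in this PR (which is C#); it is a Python illustration, and the helper names `read_tail` and `find_startxref` are made up for the example. A PDF ends with a `startxref` keyword followed by the byte offset of the last cross-reference section, so a reader can seek to the end of the file and read a small tail instead of the whole document. Resolving the trailer's `/Info` entry from that offset is the trickier part and is left out here, as are files that store the trailer in a cross-reference stream.

```python
import re


def read_tail(path: str, size: int = 2048) -> bytes:
    """Read at most `size` bytes from the end of the file (hypothetical helper)."""
    # The with-block closes the handle as soon as the tail has been read.
    with open(path, "rb") as handle:
        handle.seek(0, 2)  # jump to the end to learn the file length
        length = handle.tell()
        handle.seek(max(0, length - size))
        return handle.read()


def find_startxref(tail: bytes) -> int:
    """Return the offset that follows the last `startxref` keyword in `tail`."""
    # rfind picks the last occurrence, which is the one that applies
    # when a file has been updated incrementally.
    index = tail.rfind(b"startxref")
    if index == -1:
        raise ValueError("no startxref keyword found in the tail")
    match = re.match(rb"startxref\s+(\d+)", tail[index:])
    if match is None:
        raise ValueError("startxref is not followed by a byte offset")
    return int(match.group(1))
```

With these two helpers, `find_startxref(read_tail("book.pdf"))` gives the position to seek to next, so the amount read no longer depends on the size of the file. The 2048-byte tail size is an arbitrary choice for the sketch.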
Added
Implements PDF metadata processing as discussed in #3103. Some of this metadata is generic, and available in many PDFs. Other fields (series and index) are not as standardized, and for these we're using the calibre version of the fields.
This is not only my first Kavita contribution, but also my first time programming C#, so I hope I managed to write reasonable code, and would appreciate feedback if I did not.