-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v5 devel branch #307
base: main
Are you sure you want to change the base?
v5 devel branch #307
Conversation
This backports (and slightly improves) the new bookmark API from devel_new. Test suite TBD.
Note the following test script: ``` import io import sys import logging import contextlib logger = logging.getLogger("testLogger") logger.setLevel(logging.DEBUG) buf = io.StringIO() logger.addHandler(logging.StreamHandler(buf)) # ! with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf): print("print to stdout") print("print to stderr", file=sys.stderr) logger.info("info message") logger.warning("warning message") print(f"{buf.getvalue()!r}") ``` Like this, we get: > 'print to stdout\nprint to stderr\ninfo message\nwarning message\n' Without handler: > 'print to stdout\nprint to stderr\nwarning message\n' With default handler: > info message > warning message > 'print to stdout\nprint to stderr\n' Weird.
e18a049
to
94342f8
Compare
Removed PdfDocument.render() & PdfBitmapInfo. Implemented context manager support for PdfDocument. Test suite integration TBD.
2d1c805
to
8049d8e
Compare
Use bool() rather than checking against None. See findings in get_toc(): "We need bool(ptr) here to handle cases where .contents is a null pointer (raises exception on access). Don't use ptr != None, it's always true."
This is longer, but cleaner. Imagine you have to edit it and assignment order gets wrong :P BTW, normalize PdfFormEnv constructor param order.
The flatten function exposed it's a pain-killer, we're struggling about to extract some information of a pdf with form fields filled. |
I see. FWIW, you can already use the semi-private |
func = { | ||
False: pdfium_c.FPDFImageObj_GetImageDataRaw, | ||
True: pdfium_c.FPDFImageObj_GetImageDataDecoded, | ||
}[decode_simple] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: want to change these {False: ..., True: ...}[var]
dicts back to ... if var else ...
after all. so any object whose bool()
is True can be used.
Also it seems like overkill to create a dict just for True/False.
@mara004 I have tried using page._flatten() with the instructions you have given as follows: pdf = pypdfium2.PdfDocument(pdf_path)
pdf.init_forms()
for page_idx in page_range: # page_range --> 2
page = pdf.get_page(page_idx)
page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY) # return 1
text_page = page.get_textpage()
page = pdf.get_page(page_idx)
page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY) # return 2
text_page = page.get_textpage()
...
total_chars = text_page.count_chars() I have to repeat the _flatten() twice to get all the editable values from the form. number of characters without repeating _ flatten code --> total_chars # 4619 |
I can't really comment on that behavior as I'm only providing the bindings, and what the underlying APIs actually do is down to pdfium. However, given that the second So perhaps you just have to re-initialize the page handle? |
Finally, I just had to re-initialize the page handle. page.gen_content() option didn't work. Thanks |
That makes sense. I can add a note to the future docs that flattening invalidates existing handles to the page. |
15b9478
to
6736a5d
Compare
It is not clear to me if PDFium is "BSD-3-Clause OR Apache-2.0" or "BSD-3-Clause AND Apache-2.0". The pypdfium2 codebase previously stated "OR", but recently it hit me we don't actually have any evidence for that. In the end, I figured it was probably a presumption from the early days of the project that might as well be wrong, and that "BSD-3-Clause AND Apache-2.0" would have been the safer assumption. Sorry :( IANAL, but to my understanding both licenses are liberal and in similar spirit, so hopefully this should not have negative legal consequences downstream. Note that there is (and always was) ABSOLUTELY NO WARRANTY for any information provided with the pypdfium2 project. For pypdfium2's Readme, see the CC-BY-4.0 license (e.g. "Section 5 -- Disclaimer of Warranties and Limitation of Liability."). For pypdfium2's code (including any information provided therein), see the Apache-2.0 or BSD-3-Clause licenses, which have similar disclaimers. This patch avoids any "OR" or "AND", instead changing to a generic comma. This is not valid SPDX/reuse syntax and serves as a placeholder until we know better. Note that pypdfium2's Python code continues to be "Apache-2.0 OR BSD-3-Clause". This issue is only about PDFium itself.
had two consecutive use_syslibs if-blocks that could be merged into one.
No description provided.