Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v5 devel branch #307

Draft
wants to merge 138 commits into
base: main
Choose a base branch
from
Draft

v5 devel branch #307

wants to merge 138 commits into from

Conversation

mara004
Copy link
Member

@mara004 mara004 commented Apr 4, 2024

No description provided.

This backports (and slightly improves) the new bookmark API from
devel_new. Test suite TBD.
Note the following test script:
```
import io
import sys
import logging
import contextlib

logger = logging.getLogger("testLogger")
logger.setLevel(logging.DEBUG)

buf = io.StringIO()
logger.addHandler(logging.StreamHandler(buf))  # !

with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
    print("print to stdout")
    print("print to stderr", file=sys.stderr)
    logger.info("info message")
    logger.warning("warning message")

print(f"{buf.getvalue()!r}")
```

Like this, we get:
> 'print to stdout\nprint to stderr\ninfo message\nwarning message\n'
Without handler:
> 'print to stdout\nprint to stderr\nwarning message\n'
With default handler:
> info message
> warning message
> 'print to stdout\nprint to stderr\n'

Weird.
@mara004 mara004 force-pushed the devel_new branch 2 times, most recently from e18a049 to 94342f8 Compare April 4, 2024 20:41
Removed PdfDocument.render() & PdfBitmapInfo.
Implemented context manager support for PdfDocument.

Test suite integration TBD.
@mara004 mara004 force-pushed the devel_new branch 10 times, most recently from 2d1c805 to 8049d8e Compare April 4, 2024 21:39
Use bool() rather than checking against None. See findings in get_toc():
"We need bool(ptr) here to handle cases where .contents is a null
pointer (raises exception on access). Don't use ptr != None, it's always
true."
This is longer, but cleaner.
Imagine you have to edit it and assignment order gets wrong :P

BTW, normalize PdfFormEnv constructor param order.
@semoal
Copy link

semoal commented Aug 8, 2024

@semoal Thanks. There is no set time frame, and I'm currently immersed in some other things, but I'd be hoping to merge sooner than later, to avoid another stalled and diverged branch, as had unfortunately happened on the previous go at this.

However, this project has grown a bit over my head TBH, and I'm somewhat scared of breaking anything or making wrong API decisions, as this may affect many downstreams. Also, I'd like to address all API-breaking or otherwise significant changes I had in mind before going ahead with this.

Out of interest, is there any particular change you're looking forward to?

The flatten function exposed it's a pain-killer, we're struggling about to extract some information of a pdf with form fields filled.
Once it's stabilized i would create a pre-release or release candidate and start receiving feedback from there, there's no future without breaking changes ;)

@mara004
Copy link
Member Author

mara004 commented Aug 8, 2024

The flatten function exposed it's a pain-killer, we're struggling about to extract some information of a pdf with form fields filled.

I see. FWIW, you can already use the semi-private page._flatten() if you make sure init_forms() was called on the parent pdf before page retrieval (ideally, directly after construction).
The bindings code is the same, just a check added and docs updated. You could also copy the flatten() implementation over into your own code.
Sorry for the inconvenience; this originated from a time where form initialization wasn't integrated properly.

func = {
False: pdfium_c.FPDFImageObj_GetImageDataRaw,
True: pdfium_c.FPDFImageObj_GetImageDataDecoded,
}[decode_simple]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: want to change these {False: ..., True: ...}[var] dicts back to ... if var else ... after all. so any object whose bool() is True can be used.
Also it seems like overkill to create a dict just for True/False.

@hector-sherpas
Copy link

hector-sherpas commented Aug 13, 2024

The flatten function exposed it's a pain-killer, we're struggling about to extract some information of a pdf with form fields filled.

I see. FWIW, you can already use the semi-private page._flatten() if you make sure init_forms() was called on the parent pdf before page retrieval (ideally, directly after construction). The bindings code is the same, just a check added and docs updated. You could also copy the flatten() implementation over into your own code. Sorry for the inconvenience; this originated from a time where form initialization wasn't integrated properly.

@mara004 I have tried using page._flatten() with the instructions you have given as follows:

pdf = pypdfium2.PdfDocument(pdf_path)
pdf.init_forms()
for page_idx in page_range: # page_range --> 2
        page = pdf.get_page(page_idx)
        page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY) # return 1
        text_page = page.get_textpage()
        page = pdf.get_page(page_idx)
        page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY) # return 2
        text_page = page.get_textpage()
        ...
        total_chars = text_page.count_chars()

I have to repeat the _flatten() twice to get all the editable values ​​from the form.

number of characters without repeating _ flatten code --> total_chars # 4619
number of characters with repeating _ flatten code --> total_chars # 5014

@mara004
Copy link
Member Author

mara004 commented Aug 13, 2024

I can't really comment on that behavior as I'm only providing the bindings, and what the underlying APIs actually do is down to pdfium.

However, given that the second _flatten() call returns 2, which is equal to pdfium_c.FLATTEN_NOTHINGTODO, it should be a no-op (FWIW, you can take a look at the fpdf_flatten.cpp code and see when FLATTEN_NOTHINGTODO is returned).

So perhaps you just have to re-initialize the page handle? Or maybe call page.gen_content() after flattening?

@hector-sherpas
Copy link

Finally, I just had to re-initialize the page handle. page.gen_content() option didn't work. Thanks

@mara004
Copy link
Member Author

mara004 commented Aug 20, 2024

That makes sense. I can add a note to the future docs that flattening invalidates existing handles to the page.

It is not clear to me if PDFium is "BSD-3-Clause OR Apache-2.0" or
"BSD-3-Clause AND Apache-2.0". The pypdfium2 codebase previously stated
"OR", but recently it hit me we don't actually have any evidence for
that.
In the end, I figured it was probably a presumption from the early days
of the project that might as well be wrong, and that "BSD-3-Clause AND
Apache-2.0" would have been the safer assumption. Sorry :(

IANAL, but to my understanding both licenses are liberal and in similar
spirit, so hopefully this should not have negative legal consequences
downstream.
Note that there is (and always was) ABSOLUTELY NO WARRANTY for any
information provided with the pypdfium2 project. For pypdfium2's Readme,
see the CC-BY-4.0 license (e.g. "Section 5 -- Disclaimer of Warranties
and Limitation of Liability."). For pypdfium2's code (including any
information provided therein), see the Apache-2.0 or BSD-3-Clause
licenses, which have similar disclaimers.

This patch avoids any "OR" or "AND", instead changing to a generic
comma. This is not valid SPDX/reuse syntax and serves as a placeholder
until we know better.

Note that pypdfium2's Python code continues to be "Apache-2.0 OR
BSD-3-Clause". This issue is only about PDFium itself.
had two consecutive use_syslibs if-blocks that could be merged into one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants