v5 devel branch #307

mara004 · 2024-04-04T15:42:59Z

No description provided.

This backports (and slightly improves) the new bookmark API from devel_new. Test suite TBD.

Note the following test script:

```
import io
import sys
import logging
import contextlib

logger = logging.getLogger("testLogger")
logger.setLevel(logging.DEBUG)

buf = io.StringIO()
logger.addHandler(logging.StreamHandler(buf))  # !

with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
    print("print to stdout")
    print("print to stderr", file=sys.stderr)
    logger.info("info message")
    logger.warning("warning message")

print(f"{buf.getvalue()!r}")
```

Like this, we get:

> 'print to stdout\nprint to stderr\ninfo message\nwarning message\n'

Without handler:

> 'print to stdout\nprint to stderr\nwarning message\n'

With default handler:

> info message
> warning message
> 'print to stdout\nprint to stderr\n'

Weird.
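A possible explanation for the "default handler" case, offered as a sketch rather than a confirmed diagnosis of the script above: `logging.StreamHandler()` without an argument stores whatever `sys.stderr` is at construction time, so a handler created before `contextlib.redirect_stderr()` keeps writing to the original stream. The helper `run()` and the logger names below are made up for illustration.

```python
import contextlib
import io
import logging


def run(create_handler_inside: bool) -> str:
    """Return what ends up in the redirect buffer (hypothetical helper)."""
    logger = logging.getLogger(f"demo_{create_handler_inside}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep root handlers out of the experiment
    buf = io.StringIO()
    handler = None
    if not create_handler_inside:
        # Binds the *current* sys.stderr, i.e. the stream before redirection.
        handler = logging.StreamHandler()
        logger.addHandler(handler)
    with contextlib.redirect_stderr(buf):
        if create_handler_inside:
            # Binds the redirected sys.stderr, i.e. the buffer.
            handler = logging.StreamHandler()
            logger.addHandler(handler)
        logger.warning("warning message")
    logger.removeHandler(handler)
    return buf.getvalue()


early = run(create_handler_inside=False)  # message bypasses the buffer
late = run(create_handler_inside=True)    # message lands in the buffer
print(repr(early), repr(late))
```

The "without handler" result may have a related cause: when no handler is found, the logging module falls back to a last-resort handler that has level WARNING and looks up `sys.stderr` at emit time, which would explain why only the warning reaches the buffer.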

Removed PdfDocument.render() & PdfBitmapInfo. Implemented context manager support for PdfDocument. Test suite integration TBD.

Use bool() rather than checking against None. See findings in get_toc(): "We need bool(ptr) here to handle cases where .contents is a null pointer (raises exception on access). Don't use ptr != None, it's always true."
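The pitfall can be reproduced with plain ctypes, independent of pypdfium2. This is a minimal sketch; the variable names are illustrative only.

```python
import ctypes

# A default-constructed pointer instance is a NULL pointer.
null_ptr = ctypes.POINTER(ctypes.c_int)()

# A pointer to a real object, for comparison.
value = ctypes.c_int(42)
valid_ptr = ctypes.pointer(value)

# Comparing against None says nothing about NULL-ness:
# the pointer object itself is never None.
not_none = null_ptr != None  # noqa: E711 (deliberate, to show the pitfall)

# Truthiness reflects whether the pointer is non-NULL.
null_is_truthy = bool(null_ptr)
valid_is_truthy = bool(valid_ptr)

# Dereferencing a NULL pointer raises instead of returning None.
try:
    null_ptr.contents
    raised = False
except ValueError:
    raised = True

print(not_none, null_is_truthy, valid_is_truthy, raised)
```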

This is longer, but cleaner. Imagine you have to edit it and assignment order gets wrong :P BTW, normalize PdfFormEnv constructor param order.

semoal · 2024-08-08T14:16:50Z

> @semoal Thanks. There is no set time frame, and I'm currently immersed in some other things, but I'd be hoping to merge sooner rather than later, to avoid another stalled and diverged branch, as had unfortunately happened on the previous go at this.
>
> However, this project has grown a bit over my head TBH, and I'm somewhat scared of breaking anything or making wrong API decisions, as this may affect many downstreams. Also, I'd like to address all API-breaking or otherwise significant changes I had in mind before going ahead with this.
>
> Out of interest, is there any particular change you're looking forward to?

Having the flatten function exposed is a painkiller: we're struggling to extract some information from a PDF with filled form fields.
Once it's stabilized, I would create a pre-release or release candidate and start receiving feedback from there. There's no future without breaking changes ;)

mara004 · 2024-08-08T14:27:43Z

> Having the flatten function exposed is a painkiller: we're struggling to extract some information from a PDF with filled form fields.

I see. FWIW, you can already use the semi-private page._flatten() if you make sure init_forms() was called on the parent pdf before page retrieval (ideally, directly after construction).
The bindings code is the same, just a check added and docs updated. You could also copy the flatten() implementation over into your own code.
Sorry for the inconvenience; this originated from a time where form initialization wasn't integrated properly.

see changelog entry

mara004 · 2024-08-12T12:11:48Z

src/pypdfium2/_helpers/pageobjects.py

+        func = {
+            False: pdfium_c.FPDFImageObj_GetImageDataRaw,
+            True: pdfium_c.FPDFImageObj_GetImageDataDecoded,
+        }[decode_simple]


TODO: I want to change these {False: ..., True: ...}[var] dicts back to ... if var else ... after all, so that any object whose bool() is True can be used.
Also, it seems like overkill to create a dict just for True/False.
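The difference between the two forms can be sketched as follows. The function names and the returned strings are placeholders, not the actual pypdfium2 code: the dict lookup only accepts keys equal to `True` or `False`, while the conditional expression accepts any object with a truth value.

```python
def pick_with_dict(flag):
    # Lookup by key: works only for values equal to True or False.
    return {False: "raw", True: "decoded"}[flag]


def pick_with_conditional(flag):
    # Truthiness test: works for any object.
    return "decoded" if flag else "raw"


# 1 == True and both hash equally, so the dict still accepts it.
dict_with_one = pick_with_dict(1)

# A truthy non-bool such as a non-empty string is not a valid key.
try:
    pick_with_dict("yes")
    dict_failed = False
except KeyError:
    dict_failed = True

conditional_with_str = pick_with_conditional("yes")
conditional_with_empty = pick_with_conditional("")

print(dict_with_one, dict_failed, conditional_with_str, conditional_with_empty)
```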

hector-sherpas · 2024-08-13T09:46:38Z

> > Having the flatten function exposed is a painkiller: we're struggling to extract some information from a PDF with filled form fields.
>
> I see. FWIW, you can already use the semi-private page._flatten() if you make sure init_forms() was called on the parent pdf before page retrieval (ideally, directly after construction). The bindings code is the same, just a check added and docs updated. You could also copy the flatten() implementation over into your own code. Sorry for the inconvenience; this originated from a time where form initialization wasn't integrated properly.

@mara004 I have tried using page._flatten() with the instructions you have given as follows:

pdf = pypdfium2.PdfDocument(pdf_path)
pdf.init_forms()
for page_idx in page_range:  # page_range --> 2
    page = pdf.get_page(page_idx)
    page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY)  # returns 1
    text_page = page.get_textpage()
    page = pdf.get_page(page_idx)
    page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY)  # returns 2
    text_page = page.get_textpage()
    ...
    total_chars = text_page.count_chars()

I have to call _flatten() twice to get all the editable values from the form.

Number of characters without repeating the _flatten() code: total_chars = 4619
Number of characters with repeating the _flatten() code: total_chars = 5014

mara004 · 2024-08-13T12:42:54Z

I can't really comment on that behavior as I'm only providing the bindings, and what the underlying APIs actually do is down to pdfium.

However, given that the second _flatten() call returns 2, which is equal to pdfium_c.FLATTEN_NOTHINGTODO, it should be a no-op (FWIW, you can take a look at the fpdf_flatten.cpp code and see when FLATTEN_NOTHINGTODO is returned).

So perhaps you just have to re-initialize the page handle? ~~Or maybe call page.gen_content() after flattening?~~

hector-sherpas · 2024-08-20T15:42:56Z

In the end, I just had to re-initialize the page handle. The page.gen_content() option didn't work. Thanks.

mara004 · 2024-08-20T15:58:57Z

That makes sense. I can add a note to the future docs that flattening invalidates existing handles to the page.
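The workaround that came out of this exchange can be sketched as below. The helper name `flatten_and_count_chars` is hypothetical; it only uses the calls that appear in the snippet earlier in this thread (`get_page()`, `_flatten(flag=...)`, `get_textpage()`, `count_chars()`), and it assumes that `pdf` is a PdfDocument on which `init_forms()` was already called. It has not been verified against pdfium here.

```python
def flatten_and_count_chars(pdf, page_index, flatten_flag):
    """Flatten one page, then count its characters via a fresh page handle.

    Assumes init_forms() was called on `pdf` before any page retrieval.
    """
    page = pdf.get_page(page_index)
    result = page._flatten(flag=flatten_flag)

    # Per the thread, the handle used for flattening should not be reused:
    # retrieve the page again before extracting text.
    page = pdf.get_page(page_index)
    textpage = page.get_textpage()
    return result, textpage.count_chars()
```

With the names from the earlier snippet, a call might look like `flatten_and_count_chars(pdf, page_idx, pdfium_c.FLAT_NORMALDISPLAY)`. Only a single `_flatten()` call is needed this way, since the second call in the original snippet returned FLATTEN_NOTHINGTODO anyway.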

docs/devel/changelog_staging.md

It is not clear to me if PDFium is "BSD-3-Clause OR Apache-2.0" or "BSD-3-Clause AND Apache-2.0". The pypdfium2 codebase previously stated "OR", but recently it hit me we don't actually have any evidence for that. In the end, I figured it was probably a presumption from the early days of the project that might as well be wrong, and that "BSD-3-Clause AND Apache-2.0" would have been the safer assumption. Sorry :(

IANAL, but to my understanding both licenses are liberal and in similar spirit, so hopefully this should not have negative legal consequences downstream.

Note that there is (and always was) ABSOLUTELY NO WARRANTY for any information provided with the pypdfium2 project. For pypdfium2's Readme, see the CC-BY-4.0 license (e.g. "Section 5 -- Disclaimer of Warranties and Limitation of Liability."). For pypdfium2's code (including any information provided therein), see the Apache-2.0 or BSD-3-Clause licenses, which have similar disclaimers.

This patch avoids any "OR" or "AND", instead changing to a generic comma. This is not valid SPDX/reuse syntax and serves as a placeholder until we know better.

Note that pypdfium2's Python code continues to be "Apache-2.0 OR BSD-3-Clause". This issue is only about PDFium itself.

had two consecutive use_syslibs if-blocks that could be merged into one.

mara004 added 8 commits April 4, 2024 15:48

Work on API-breaking changes (bookmarks)

9856cfc

This backports (and slightly improves) the new bookmark API from devel_new. Test suite TBD.

toc: update API test

0183d80

Update test expectations

8d0d36f

toc: better explain level == maxdepth scenario

847281c

Start tracking changes

f235226

slightly improve docs for get_count()

4bfb461

address various nits

ac7903f

mara004 force-pushed the devel_new branch 2 times, most recently from e18a049 to 94342f8 Compare April 4, 2024 20:41

Continue on document and bitmap

517630a

Removed PdfDocument.render() & PdfBitmapInfo. Implemented context manager support for PdfDocument. Test suite integration TBD.

mara004 force-pushed the devel_new branch 10 times, most recently from 2d1c805 to 8049d8e Compare April 4, 2024 21:39

Work on PdfImage.extract()

677c498

mara004 force-pushed the devel_new branch from 8049d8e to 677c498 Compare April 4, 2024 21:55

mara004 added 5 commits April 5, 2024 00:02

Fix some object pointer checks against None

4de863d

Use bool() rather than checking against None. See findings in get_toc(): "We need bool(ptr) here to handle cases where .contents is a null pointer (raises exception on access). Don't use ptr != None, it's always true."

Address run check findings

ccfe923

Expand constructor assignments

2360165

This is longer, but cleaner. Imagine you have to edit it and assignment order gets wrong :P BTW, normalize PdfFormEnv constructor param order.

autorelease: add task

c581f5a

slightly improve wording for v4.25 changelog

81f2b4a

mara004 force-pushed the devel_new branch from 5731aed to 81f2b4a Compare April 4, 2024 22:27

Remove deprecated version API

ddc3f3a

mara004 added 2 commits August 11, 2024 21:26

PdfMatrix.mirror(): Fix misleading terminology

bbc7f98

see changelog entry

changelog: explicitly mention previous _flatten()

98ed536

mara004 commented Aug 12, 2024

changelog nit

ee2f035

mara004 had a problem deploying to github-pages September 16, 2024 23:23 — with GitHub Actions Failure

mara004 commented Sep 16, 2024

mara004 force-pushed the main branch 7 times, most recently from 15b9478 to 6736a5d Compare September 19, 2024 14:44

mara004 added 3 commits September 19, 2024 17:28

changelog: fix typo

ef0854e

PdfPage.flatten(): add note regarding invalidation of handles

d54d041

mara004 force-pushed the devel_new branch from d105d6e to dc5db75 Compare October 26, 2024 22:22

PdfBitmap.to_numpy() Use 2d shape for single-channel bitmap

51d8899

mara004 force-pushed the devel_new branch from dc5db75 to 51d8899 Compare October 26, 2024 22:38

version.py: minor cleanup

7f12cee

mara004 force-pushed the devel_new branch from d371310 to 1c65b7e Compare October 30, 2024 22:10

CLI(renderer/pageobjects): slightly improve code style

195ce71

mara004 force-pushed the devel_new branch from 1c65b7e to 195ce71 Compare October 30, 2024 22:19

Fix some dirty code in pdfium build script

5362127

had two consecutive use_syslibs if-blocks that could be merged into one.
