How to get Japanese text from Office files using PyMuPDF Pro #4133

waonpad · 2024-12-10T09:53:20Z

waonpad
Dec 10, 2024

I am a Japanese PyMuPDF user.

I wanted to extract text from Office files, and found out that the Pro version of PyMuPDF, which I am familiar with, supports Office files.

I tried it out and confirmed that English was extracted correctly.
However, every Japanese character came out garbled, as the Unicode replacement character U+FFFD (\ufffd, �).

What do I need to do to extract Japanese correctly?
I read the PyMuPDF documentation, but I couldn't find a solution on my own.
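For anyone reproducing this, one quick way to tell whether an extraction came out garbled is to count U+FFFD characters in the output. This is a minimal sketch using only the standard library; `count_replacement_chars` is a hypothetical helper name, not part of PyMuPDF.

```python
import html

REPLACEMENT_CHAR = "\ufffd"


def count_replacement_chars(extracted: str) -> int:
    """Count U+FFFD characters in extracted text or HTML."""
    # HTML output (as from extractHTML()) encodes the character as the
    # entity &#xfffd;, so decode entities before counting.
    return html.unescape(extracted).count(REPLACEMENT_CHAR)
```

A result of 0 suggests the text was mapped to real characters; a count close to the length of the Japanese string suggests no suitable font was found.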

Versions

Python : 3.13.0
pymupdf : 1.25.0
pymupdfpro : 1.25.0 (Restricted Mode)

Sample Office files

test.docx
test.pptx
test.xlsx

Code

import pymupdf.pro

pymupdf.pro.unlock()

test_docx = "test.docx"
test_pptx = "test.pptx"
test_xlsx = "test.xlsx"


def main():
    print("==== Testing docx ====")
    docx = pymupdf.open(test_docx)
    docx_page = docx.load_page(0)
    print(docx_page.get_textpage().extractHTML())

    print("==== Testing pptx ====")
    pptx = pymupdf.open(test_pptx)
    pptx_page = pptx.load_page(0)
    print(pptx_page.get_textpage().extractHTML())

    print("==== Testing xlsx ====")
    xlsx = pymupdf.open(test_xlsx)
    xlsx_page = xlsx.load_page(0)
    print(xlsx_page.get_textpage().extractHTML())


if __name__ == "__main__":
    main()

Outputs of Code

PyMuPDFPro: Restricted Mode. Please visit https://pymupdf.io/try-pro to request your trial key.
==== Testing docx ====
<div id="page0" style="width:612.0pt;height:792.0pt">
<p style="top:73.5pt;left:72.0pt;line-height:11.0pt"><span style="font-family:Arial,serif;font-size:11.0pt;color:#000000">Hello</span></p>
<p style="top:102.6pt;left:72.0pt;line-height:11.0pt"><span style="font-family:Arial,serif;font-size:11.0pt;color:#000000">&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;</span></p>
</div>

==== Testing pptx ====
<div id="page0" style="width:720.0pt;height:405.0pt">
<p style="top:161.2pt;left:300.8pt;line-height:52.0pt"><span style="font-family:Arial,serif;font-size:52.0pt;color:#000000">Hello</span></p>
<p style="top:236.0pt;left:213.0pt;line-height:28.0pt"><span style="font-family:Arial,serif;font-size:28.0pt;color:#000000">&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;</span></p>
</div>

==== Testing xlsx ====
<div id="page0" style="width:721.1pt;height:488.3pt">
<p style="top:12.5pt;left:9.6pt;line-height:11.0pt"><span style="font-family:Arial,serif;font-size:11.0pt;color:#000000">Hello</span></p>
<p style="top:28.3pt;left:9.6pt;line-height:11.0pt"><span style="font-family:Arial,serif;font-size:11.0pt;color:#000000">&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;&#xfffd;</span></p>
</div>

Answered by julian-smith-artifex-com

Dec 10, 2024


JorjMcKie · 2024-12-10T10:16:47Z

JorjMcKie
Dec 10, 2024
Maintainer

Please tell us your OS. The Office document converter uses the fonts installed on the system, so its behavior differs by platform (Windows, Linux, macOS).

12 replies

JorjMcKie Dec 10, 2024
Maintainer

You can also set the environment variable PYMUPDFPRO_FONT_PATH yourself (before calling unlock()). Maybe that helps.

jamie-lemon Dec 10, 2024
Maintainer

"I tried to get a trial key about 5 hours ago but still haven't received an email ..."

Please let me know if you don't receive the email. We need to make sure the email system operates as expected.

waonpad Dec 10, 2024
Author

@jamie-lemon
I still haven't received the email...

jamie-lemon Dec 10, 2024
Maintainer

@waonpad I can see that you registered okay - I will email your key to you.

waonpad Dec 10, 2024
Author

@jamie-lemon
I received my key! Thank you for your fast response.

julian-smith-artifex-com · 2024-12-10T14:05:03Z

julian-smith-artifex-com
Dec 10, 2024
Maintainer

I think unlock()'s undocumented fontpath and fontpath_auto args should work even without a key; fontpath should be either a list of strings or a single string of os.pathsep-separated paths. The same applies to setting os.environ['PYMUPDFPRO_FONT_PATH'] before calling unlock(), as mentioned by @JorjMcKie; its value should be a string of os.pathsep-separated paths.

If you set (undocumented) os.environ['PYMUPDFPRO_FONT_PATH_VERBOSE']='1', unlock() will output diagnostics showing what font directories it is adding. This might show some useful information.
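The options described above could be combined roughly as follows. This is only a sketch: `join_font_dirs` and `unlock_with_font_dirs` are hypothetical helper names, and the `fontpath` / `fontpath_auto` keyword arguments are the undocumented ones named in this thread, so their behavior may change between releases.

```python
import os


def join_font_dirs(font_dirs):
    """Join directories into one os.pathsep-separated string of absolute paths."""
    return os.pathsep.join(os.path.abspath(d) for d in font_dirs)


def unlock_with_font_dirs(font_dirs, fontpath_auto=True):
    """Call pymupdf.pro.unlock() with explicit font directories."""
    import pymupdf.pro  # imported lazily; requires pymupdfpro to be installed

    # Optional: ask unlock() to print the font directories it adds.
    os.environ["PYMUPDFPRO_FONT_PATH_VERBOSE"] = "1"
    pymupdf.pro.unlock(
        fontpath=join_font_dirs(font_dirs),
        fontpath_auto=fontpath_auto,
    )
```

The same joined string could instead be assigned to os.environ['PYMUPDFPRO_FONT_PATH'] before calling unlock() without arguments.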

1 reply

waonpad Dec 10, 2024
Author

Thank you.

I was able to confirm that fontpath works even without the key.
It seems that I was specifying the path incorrectly.

Thanks for the additional information too.
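Since a wrong path was the culprit here, a small sanity check before calling unlock() may save time. The sketch below lists the font files in a directory; `list_font_files` is a hypothetical helper, and the set of suffixes is an assumption (the thread does not say which font formats PyMuPDF Pro picks up, or whether it scans subdirectories).

```python
from pathlib import Path

# Assumed set of font file suffixes; adjust as needed.
FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def list_font_files(font_dir):
    """Return the font files found directly inside font_dir, sorted by path."""
    directory = Path(font_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"not a directory: {directory}")
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    )
```

An empty result, or a NotADirectoryError, would point to a mistyped path rather than a problem in the converter.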

waonpad · 2024-12-10T15:01:32Z

waonpad
Dec 10, 2024
Author

I was able to obtain Japanese text by setting fontpath correctly.
Thank you.

PyMuPDF Pro is a very good solution.
I think if it were better documented it would be useful to more people.

As a side note, on macOS setting fontpath alone did not fix the garbled characters, because fontpath does not take precedence over the fonts loaded via fontpath_auto.

Setting fontpath_auto to False fixes this.

Noto Sans JP was used for testing.

Code using fontpath

import os

import pymupdf.pro

test_docx = "test.docx"
test_pptx = "test.pptx"
test_xlsx = "test.xlsx"


def main():
    cwd = os.getcwd()

    # Print added font paths
    os.environ["PYMUPDFPRO_FONT_PATH_VERBOSE"] = "1"

    pymupdf.pro.unlock(
        # Font files are located in the `fonts` directory
        # e.g. /path/to/fonts/SomeFont.ttf
        fontpath=f"{cwd}/fonts",
        # Disable automatic font path detection
        fontpath_auto=False,
    )

    print("==== Testing docx ====")
    docx = pymupdf.open(test_docx)
    docx_page = docx.load_page(0)
    print(docx_page.get_textpage().extractHTML())

    print("==== Testing pptx ====")
    pptx = pymupdf.open(test_pptx)
    pptx_page = pptx.load_page(0)
    print(pptx_page.get_textpage().extractHTML())

    print("==== Testing xlsx ====")
    xlsx = pymupdf.open(test_xlsx)
    xlsx_page = xlsx.load_page(0)
    print(xlsx_page.get_textpage().extractHTML())


if __name__ == "__main__":
    main()

0 replies
