How to get Japanese text from Office files using PyMuPDF Pro #4133
I am a Japanese PyMuPDF user. I wanted to extract text from Office files and found that the Pro version of PyMuPDF, which I am already familiar with, supports them. I tried it out and confirmed that English text is extracted correctly. What do I need to do to extract Japanese text correctly?

Versions
Python: 3.13.0

Sample Office files

Code

import pymupdf.pro

pymupdf.pro.unlock()

test_docx = "test.docx"
test_pptx = "test.pptx"
test_xlsx = "test.xlsx"


def main():
    # Print the first page of each Office file as HTML
    print("==== Testing docx ====")
    docx = pymupdf.open(test_docx)
    docx_page = docx.load_page(0)
    print(docx_page.get_textpage().extractHTML())

    print("==== Testing pptx ====")
    pptx = pymupdf.open(test_pptx)
    pptx_page = pptx.load_page(0)
    print(pptx_page.get_textpage().extractHTML())

    print("==== Testing xlsx ====")
    xlsx = pymupdf.open(test_xlsx)
    xlsx_page = xlsx.load_page(0)
    print(xlsx_page.get_textpage().extractHTML())


if __name__ == "__main__":
    main()

Outputs of Code
Replies: 3 comments 13 replies
Please tell us your OS. The Office document converter uses the fonts available on the system, which introduces platform-specific differences (Windows, Linux, Mac).
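Since the converter's behavior depends on the platform, it can help to include the OS details in a report. A minimal sketch using only the Python standard library follows; the helper name `describe_environment` is made up for illustration and is not part of PyMuPDF.

```python
import platform


def describe_environment():
    """Return a one-line description of the OS and Python version."""
    # platform.system() gives e.g. "Windows", "Linux" or "Darwin" (macOS)
    return (
        f"OS: {platform.system()} {platform.release()} | "
        f"Python: {platform.python_version()}"
    )


print(describe_environment())
```

The printed line can be pasted into the "Versions" part of a question.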
I was able to obtain Japanese text by setting fontpath correctly. PyMuPDF Pro is a very good solution.

As a side note, the garbled characters could not be fixed on macOS because […]. This can be fixed by setting […]. Noto Sans JP was used for testing.

Code using fontpath

import os

import pymupdf.pro

test_docx = "test.docx"
test_pptx = "test.pptx"
test_xlsx = "test.xlsx"


def main():
    cwd = os.getcwd()

    # Print added font paths
    os.environ["PYMUPDFPRO_FONT_PATH_VERBOSE"] = "1"

    pymupdf.pro.unlock(
        # Font files are located in the `fonts` directory
        # e.g. /path/to/fonts/SomeFont.ttf
        fontpath=f"{cwd}/fonts",
        # Disable automatic font path detection
        fontpath_auto=False,
    )

    print("==== Testing docx ====")
    docx = pymupdf.open(test_docx)
    docx_page = docx.load_page(0)
    print(docx_page.get_textpage().extractHTML())

    print("==== Testing pptx ====")
    pptx = pymupdf.open(test_pptx)
    pptx_page = pptx.load_page(0)
    print(pptx_page.get_textpage().extractHTML())

    print("==== Testing xlsx ====")
    xlsx = pymupdf.open(test_xlsx)
    xlsx_page = xlsx.load_page(0)
    print(xlsx_page.get_textpage().extractHTML())


if __name__ == "__main__":
    main()
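One rough way to check whether an extraction actually produced Japanese text rather than garbled output is to count characters in the Japanese Unicode blocks. The sketch below is an illustration only: the helper names and the choice of ranges (hiragana, katakana, CJK unified ideographs) are assumptions, not part of PyMuPDF.

```python
# Hypothetical helpers, not part of PyMuPDF: a rough sanity check
# on extracted text.
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # hiragana
    (0x30A0, 0x30FF),  # katakana
    (0x4E00, 0x9FFF),  # CJK unified ideographs (kanji)
)


def is_japanese_char(ch):
    """Return True if the character falls in one of the ranges above."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in JAPANESE_RANGES)


def japanese_ratio(text):
    """Return the fraction of non-whitespace characters that are Japanese."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_japanese_char(ch) for ch in chars) / len(chars)


def count_replacement_chars(text):
    """Count U+FFFD replacement characters, a common sign of garbling."""
    return text.count("\ufffd")
```

For a page from one of the test files, something like `japanese_ratio(page.get_text())` could be printed before and after changing the font settings to see whether the result improves.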
I think unlock()'s undocumented fontpath and fontpath_auto args should work even without a key; fontpath should be a list of strings or a string containing os.pathsep-separated strings. The same applies to setting os.environ['PYMUPDFPRO_FONT_PATH'] before calling unlock(), as mentioned by @JorjMcKie; this should be an os.pathsep-separated list of strings.

If you set the (undocumented) os.environ['PYMUPDFPRO_FONT_PATH_VERBOSE']='1', unlock() will output diagnostics showing which font directories it is adding. This might show some useful information.
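The environment-variable route described in this reply could be sketched as follows. The variable names PYMUPDFPRO_FONT_PATH and PYMUPDFPRO_FONT_PATH_VERBOSE come from the reply; the helper names `join_font_dirs` and `configure_font_env` and the example directories are hypothetical, and the exact behavior of these undocumented settings is not verified here.

```python
import os


def join_font_dirs(font_dirs):
    """Join font directories into one os.pathsep-separated string."""
    # os.pathsep is ":" on Linux/macOS and ";" on Windows
    return os.pathsep.join(os.fspath(d) for d in font_dirs)


def configure_font_env(font_dirs, verbose=True, environ=os.environ):
    """Set the font-path variables; call this before pymupdf.pro.unlock()."""
    environ["PYMUPDFPRO_FONT_PATH"] = join_font_dirs(font_dirs)
    if verbose:
        # Ask unlock() to print the font directories it adds
        environ["PYMUPDFPRO_FONT_PATH_VERBOSE"] = "1"
    return environ["PYMUPDFPRO_FONT_PATH"]


# Usage sketch (directories are placeholders):
#   configure_font_env(["fonts", "/path/to/more/fonts"])
#   import pymupdf.pro
#   pymupdf.pro.unlock()
```

Passing a plain dict as `environ` makes it possible to inspect the resulting values without touching the real process environment.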