Pytesseract stuff should not be inside imagetranslation.py file #18

maeriil · 2023-10-15T19:17:57Z

Currently we process the image using pytesseract right inside the imagetranslation.py file.

    pytesseract_config = r"--oem 3 --psm 5 -l jpn_vert"
    ...
        current_text = ""
        if is_lang_vertical_lang(src_lang):
            current_text = pytesseract.image_to_string(
                cropped_section, config=pytesseract_config
            )
        current_text = remove_trailing_whitespace(current_text).replace(
            " ", ""
        )

        if current_text == "":
            continue
    ...

However, this has the following issues that need to be fixed:

If the source language is not Japanese, the pytesseract_config is not valid, because it hardcodes `-l jpn_vert`.
If the source language is not a vertical language, no text is extracted from the image at all.

Furthermore, imagetranslation.py is too large because it contains logic that could be split into other modules. The pytesseract handling should therefore be moved into a separate file, textextraction.py, inside ./src/modules. That file should implement the main function:

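    import numpy as np
    import pytesseract

    def extract_text(image: np.ndarray, src_lang: str) -> str:
        """Return the text found in image, or "" if src_lang is unsupported."""
        # generate_config still has to be written; it returns (config, success).
        pytesseract_config, success = generate_config(src_lang)

        content = ""
        if success:
            content = pytesseract.image_to_string(image, config=pytesseract_config)
        return remove_trailing_whitespace(content).replace(" ", "")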

so that imagetranslation.py can call it as follows:

    ...
    from src.modules.textextraction import extract_text
    ...

    def translate(...):
        ...
            current_text = extract_text(cropped_section, src_lang)
            if current_text == "":
                continue
        ...
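
The proposal above relies on a `generate_config` helper that is not spelled out. A minimal sketch of one possible shape follows. The `src_lang` keys ("ja", "en") and the choice of `--psm 6` for horizontal text are assumptions for illustration, not something the project defines yet; `jpn_vert` and `--oem 3 --psm 5` are taken from the existing config.

```python
from typing import Tuple

# Hypothetical mapping from src_lang values to (Tesseract language, page
# segmentation mode). The keys are assumptions; adjust them to whatever
# identifiers the rest of the project passes as src_lang.
_TESSERACT_LANGS = {
    "ja": ("jpn_vert", 5),  # psm 5: single uniform block of vertical text
    "en": ("eng", 6),       # psm 6: single uniform block of text
}


def generate_config(src_lang: str) -> Tuple[str, bool]:
    """Build a pytesseract config string for src_lang.

    Returns (config, True) for a known language and ("", False) otherwise,
    matching the (pytesseract_config, success) pair used by extract_text.
    """
    entry = _TESSERACT_LANGS.get(src_lang.lower())
    if entry is None:
        return "", False
    lang, psm = entry
    return f"--oem 3 --psm {psm} -l {lang}", True
```

With a table like this, supporting another language would only mean adding an entry, and unsupported languages fall through to the empty string that `translate` already skips.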

maeriil added the bug, enhancement, help wanted, and High Priority labels Oct 15, 2023

maeriil added this to the Backend API Release milestone Oct 15, 2023
