Pytesseract stuff should not be inside imagetranslation.py file #18

Open
maeriil opened this issue Oct 15, 2023 · 0 comments
Labels: bug, enhancement, help wanted, High Priority

maeriil (Owner) commented Oct 15, 2023

Currently, we run pytesseract on the image directly inside the imagetranslation.py file.

    pytesseract_config = r"--oem 3 --psm 5 -l jpn_vert"
    ...
        current_text = ""
        if is_lang_vertical_lang(src_lang):
            current_text = pytesseract.image_to_string(
                cropped_section, config=pytesseract_config
            )
        current_text = remove_trailing_whitespace(current_text).replace(
            " ", ""
        )

        if current_text == "":
            continue
    ...

However, this has the following issues that need to be fixed:

  • If the source language is not Japanese, the pytesseract_config is not valid, because it hardcodes -l jpn_vert.
  • If the source language is not a vertical language, no text is extracted at all: current_text stays empty and the section is skipped.

Furthermore, the imagetranslation.py file is too big, since it contains logic that could be split into other modules. The pytesseract handling should therefore be moved into a separate file called textextraction.py inside ./src/modules. The file should implement the main method:

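def extract_text(image: np.ndarray, src_lang: str) -> str:
    """Return the text found in image, or "" if src_lang has no valid config."""
    # generate_config returns a (config string, success flag) pair
    pytesseract_config, success = generate_config(src_lang)

    content = ""
    if success:
        content = pytesseract.image_to_string(image, config=pytesseract_config)
    # Note: this removes every space, which suits Japanese; space-delimited
    # languages may need this step to be conditional.
    return remove_trailing_whitespace(content).replace(" ", "")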

The imagetranslation.py file would then call it as follows:

...
from src.modules.textextraction import extract_text
...

def translate(...):
  ...
        current_text = extract_text(cropped_section, src_lang)
        if current_text == "":
            continue
  ...
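
The proposal relies on a generate_config helper that is not defined anywhere in this issue. One possible shape for it is sketched below, assuming src_lang arrives as a plain language name such as "japanese"; the table name, its keys, and the choice of page segmentation mode per language are all assumptions made for illustration. The Tesseract model names (jpn_vert, eng, kor) and the meaning of --psm 5 (single block of vertically aligned text) and --psm 6 (single uniform block of text) are standard.

```python
from typing import Tuple

# Hypothetical table: the keys are assumed language names, not taken from the
# repo. Each value is (Tesseract traineddata name, page segmentation mode).
_TESSERACT_SETTINGS = {
    "japanese": ("jpn_vert", 5),  # vertical text, as in the current config
    "english": ("eng", 6),        # horizontal text block
    "korean": ("kor", 6),
}


def generate_config(src_lang: str) -> Tuple[str, bool]:
    """Return (config string, success flag) for the given source language.

    An unsupported language yields ("", False), so the caller can skip OCR
    instead of running Tesseract with a mismatched model.
    """
    settings = _TESSERACT_SETTINGS.get(src_lang.strip().lower())
    if settings is None:
        return "", False
    model, psm = settings
    return f"--oem 3 --psm {psm} -l {model}", True
```

With a lookup like this, the two problems listed above are handled in one place: every supported language gets its own model, and horizontal languages are no longer skipped.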
maeriil added the bug, enhancement, help wanted, and High Priority labels Oct 15, 2023
maeriil added this to the Backend API Release milestone Oct 15, 2023