[Figure: (a) input image; (b) output using the ocr prompt; (c) output using the markdown prompt]
- Containerizing Microsoft's Kosmos-2.5 Multimodal-LLM (MLLM) for Local OCR via a RESTful API
- Introduction
- Dependencies
- Installing & Deploying the Kosmos-2.5 Pre-Built Docker Image
- Building the Docker Image
- API Specification
- Invoke Kosmos-2.5 API - /infer endpoint
- Rebuilding the Dependencies & Container - If the Pre-Built Image & dockerfile in this Repo Fail to Work
- Running Kosmos-2.5 Uncontainerized
Introduction
Optical Character Recognition, commonly abbreviated as "OCR", is a technology for recognizing and extracting text from visual data such as images. OCR is often essential for a myriad of applications where simpler techniques fall short or fail entirely.
One such application is the extraction of text from documents, such as PDFs. Oftentimes, PDFs consist of documents obtained from a scanner, wherein each page of the generated PDF is essentially an image, and thus simply attempting to read & parse text out of the document will not work, as no text exists! Further, even in cases where text can be parsed from PDFs, the complex and varied internal structure of PDF documents often results in the extracted text being garbled (misspelled or incorrectly split/combined) and lacking the formatting integrity of the source document. For any documents containing crucial data, these issues can be serious, as the essence of the data could be lost or corrupted in such extracts. This challenge is prevalent in document formats other than PDFs too, and high-performance OCR tools are often the best way to address it.
While document-centric text-extraction has always been a popular requirement, this use case is especially in the limelight today with the surge in popularity of Large Language Models (LLMs) and, specifically, their use in RAG (Retrieval Augmented Generation) applications, wherein users upload their own documents and engage in conversations in which the LLM grounds its responses in the uploaded content, all in an effort to mitigate the pitfalls of AI-generated inaccuracies or "hallucinations." LARS - The LLM & Advanced Referencing Solution - is one such application, which additionally injects detailed citations into responses to further increase trust in LLM outputs.
The typical pipeline of such RAG applications involves extracting textual data from uploaded documents, breaking up this extract into fixed-size chunks for processing via an embedding model and, finally, storing the resulting embeddings in a vector database. Subsequently, on receiving a user query, a semantic-similarity search is carried out on the vector database, the relevant chunks of data are retrieved, and this contextual data is supplied to the LLM alongside the user's query, all of which (hopefully!) results in significantly higher-quality response generation by the LLM. Hence the term 'Retrieval Augmented Generation'!
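To make this concrete, below is a minimal, illustrative sketch of the ingest-and-retrieve flow described above. This is not LARS code: `embed()` is a toy stand-in for a real embedding model, and a plain Python list stands in for the vector database.
```
# Minimal RAG ingest & retrieve sketch - illustrative only.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy bag-of-words hash embedding so this sketch runs end-to-end;
    # swap in your actual embedding model here.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size chunks with overlap, as described above
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def ingest(document_text: str) -> list[tuple[str, np.ndarray]]:
    # "Vector database": a list of (chunk, embedding) pairs
    return [(c, embed(c)) for c in chunk(document_text)]

def retrieve(query: str, store: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    # Semantic-similarity search: rank chunks by cosine similarity to the query
    # (embeddings are normalized, so the dot product is the cosine similarity)
    q = embed(query)
    ranked = sorted(store, key=lambda item: float(np.dot(q, item[1])), reverse=True)
    return [text for text, _ in ranked[:k]]

# The retrieved chunks are then supplied to the LLM alongside the user's query.
```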
It's tempting to optimize such RAG pipelines by focusing on the LLMs & embedding models used, and on the chunking strategy (size, overlap, etc.) applied. However, if the very first step of text-extraction is not done with a high degree of precision, it risks compromising all downstream tasks, even if State-Of-The-Art (SOTA) models & techniques are deployed! It's thus very much worthwhile to expend the resources necessary to adequately optimize this first step of the RAG pipeline.
In doing so, one soon discovers that local OCR techniques based on popular tools like Tesseract or Camelot often require extensive tuning of the input image and thus fall short in RAG applications that aim to be as broadly applicable as possible, allowing the user to upload a variety of document formats, types & content. Indeed, commercial OCR services offered by cloud providers such as Azure are often a necessity, and applications might incorporate OCR via API calls as a result. This is also the approach adopted in LARS, wherein an extensive investigation of local OCR tools and transformer models clearly indicated the necessity of such cloud services.
However, with the advent of a new breed of Multimodal-LLMs (MLLMs), and more specifically vision-LLMs, the landscape for local OCR may change significantly, with high-quality text-identification & extraction provided by locally run models that natively process visual data.
Microsoft's Kosmos-2.5 is one such MLLM and in Microsoft's own words, is specifically a "literate model for machine reading of text-intensive images."
However, it's no easy feat to get this model up and running locally on your device: with a very stringent and specific set of hardware and software requirements, this model is extremely temperamental to deploy and use! Popular backends such as llama.cpp don't support it, and a very specific, non-standard, customized version of the transformers library is required to run inference with it correctly. Certain dependencies necessitate Linux, while others require very specific generations of Nvidia GPUs.
With such specific dependencies that can even hinder the deployment of other local LLMs, how can this model be made to co-exist alongside actual real-world applications and more crucially, be made to serve those applications in a useful manner?
While I cannot relax the hardware requirements, I did see an opportunity to address the software challenges: by containerizing the model and its dependencies and leveraging Flask to expose the model over a RESTful API, Kosmos-2.5 can be made available as a service, thus providing fully local & high-performance OCR capabilities by leveraging a cutting-edge MLLM!
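As a rough illustration of this approach, here is a minimal sketch of such a Flask wrapper. It is not the actual `kosmos_api.py` shipped in this repo (which handles more detail), but it shows the core idea: accept an image upload, shell out to the model's inference script, and relay its stdout/stderr as JSON (see the API Specification below for the real contract).
```
# Minimal sketch of a Flask wrapper around the Kosmos-2.5 inference script.
# Paths, flags and validation are illustrative; the actual kosmos_api.py may differ.
import subprocess
from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
ALLOWED = {"png", "jpg", "jpeg", "gif"}

@app.route("/infer", methods=["POST"])
def infer():
    # Validation mirrors the error responses documented in the API Specification
    if "image" not in request.files:
        return jsonify({"error": "No image file provided"}), 400
    image = request.files["image"]
    if image.filename == "":
        return jsonify({"error": "No selected file"}), 400
    if image.filename.rsplit(".", 1)[-1].lower() not in ALLOWED:
        return jsonify({"error": "Invalid file type"}), 400

    path = "/tmp/" + secure_filename(image.filename)
    image.save(path)

    # Shell out to the model's inference script and relay stdout/stderr as JSON
    task_flag = "--do_md" if request.form.get("task") == "md" else "--do_ocr"
    result = subprocess.run(
        ["python3", "inference.py", task_flag, "--image", path, "--ckpt", "ckpt.pt"],
        capture_output=True, text=True)
    return jsonify({"output": result.stdout, "error": result.stderr})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=25000)
```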
Dependencies
- Due to the use & requirements of Flash Attention, you must have an Nvidia GPU based on one of the below architecture families:
- Ampere: RTX 3000 GeForce, RTX A Professional series of GPUs, A100 etc.
- Hopper: H100, H200, and H800
- Ada-Lovelace: RTX 4000 GeForce or Professional series of GPUs
-
The model consumes 10GB of VRAM in my testing, which further limits it to the below GPUs (a quick programmatic check follows this list):
- RTX 3060 or RTX 3080 (might work on the 10GB variant; the 12GB variant is preferable), RTX 3080 Ti, RTX 3090, RTX 3090 Ti, and Laptop RTX 3080 (16GB VRAM variant) & RTX 3080 Ti GPUs
- RTX A800, A4000 and above Professional GPUs
- A100, H100, H200, and H800
- RTX 4070 Ti Super, RTX 4080, RTX 4080 Super, RTX 4090 and Laptop RTX 4080 and RTX 4090 GPUs
- RTX 2000 & above Ada Lovelace Professional GPUs
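As a quick sanity check against the two requirements above (assuming a CUDA-enabled PyTorch install, which is needed later anyway), you can query your GPU's compute capability and VRAM - Ampere is compute capability 8.0/8.6, Ada-Lovelace is 8.9, and Hopper is 9.0:
```
# Quick check of GPU architecture & VRAM (requires a CUDA-enabled PyTorch).
# Flash Attention needs compute capability >= 8.0 (Ampere/Ada-Lovelace/Hopper);
# Kosmos-2.5 used ~10GB of VRAM in the author's testing.
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
major, minor = torch.cuda.get_device_capability(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}, {vram_gb:.1f} GB VRAM")
print("Architecture OK:", (major, minor) >= (8, 0), "| VRAM OK (>=10GB):", vram_gb >= 10)
```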
-
Install Nvidia GPU Drivers
-
Install the Nvidia CUDA Toolkit - the Kosmos-2.5 container was built with v12.4.1
-
Verify Installation via the terminal:
```
nvcc -V
nvidia-smi
```
-
If you encounter `nvcc not found` errors on Linux, you must manually set the NVCC PATH:
confirm symlink for cuda:
```
ls -l /usr/local/cuda
ls -l /etc/alternatives/cuda
```
-
update bashrc:
```
nano ~/.bashrc
# add this line to the end of bashrc:
export PATH=/usr/local/cuda/bin:$PATH
```
-
reload bashrc:
source ~/.bashrc
-
Confirm nvcc is correctly setup:
nvcc -V
-
- While not explicitly required, some experience with Docker containers and familiarity with the concepts of containerization and virtualization are recommended!
-
Installing Docker
-
Your CPU should support virtualization and it should be enabled in your system's BIOS/UEFI
-
Download and install Docker Desktop
-
If on Windows, you may need to install the Windows Subsystem for Linux if it's not already present. To do so, open PowerShell as an Administrator and run the following:
wsl --install
-
Ensure you have WSL version 2 by running:
wsl -v # or wsl --status
Update WSL if not!
-
Ensure Docker Desktop is up and running, then open a Command Prompt / Terminal and execute the following command to ensure Docker is correctly installed and up and running:
docker ps
-
-
Windows Only - Install Ubuntu 22.04 via the Microsoft Store if it's not already installed:
-
Windows Only - Docker & WSL Integration:
-
Open a new PowerShell window and set this Ubuntu installation as the WSL default:
```
wsl --list
wsl --set-default Ubuntu-22.04   # if not already marked as Default
```
-
Navigate to `Docker Desktop -> Settings -> Resources -> WSL Integration` and check the Default & Ubuntu 22.04 integrations. Refer to the screenshot below:
-
-
In a bash shell (search for `Ubuntu` in the Start-menu in Windows), perform the following steps:
Configure the production repository:
```
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
-
Update the packages list from the repository & Install the Nvidia Container Toolkit Packages:
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
-
Configure the container runtime by using the nvidia-ctk command, which modifies the /etc/docker/daemon.json file so that Docker can use the Nvidia Container Runtime:
sudo nvidia-ctk runtime configure --runtime=docker
-
Restart the Docker daemon:
sudo systemctl restart docker
-
Installing & Deploying the Kosmos-2.5 Pre-Built Docker Image
Download the pre-built image from my Google Drive (only way I could host it for free!): https://drive.google.com/file/d/18R4XxOmH8R8wwtvhyPM0PCMiO58If2wY/view?usp=sharing
-
Import Image into Docker:
docker load -i kosmos_image.tar
-
Run the Kosmos-2.5 container!
docker run --gpus all -p 25000:25000 kosmos-2_5
-
Building the Docker Image
If for any reason you prefer to build the image yourself rather than using the pre-built image, simply clone this repo and use `docker build`:
Navigate to `<path_to_cloned_kosmos_container_repo>/kosmos-2_5-container-files`
-
Build:
```
docker build --progress=plain -t kosmos-2_5 .
# To build without using cached data:
docker build --progress=plain --no-cache -t kosmos-2_5 .
```
-
Note: A snapshot of the official model checkpoint at the time this repo was created can also be obtained from my backup here: https://drive.google.com/file/d/17RwlniqMwbLEMj5ELQd9iQ4kor749Z0e/view?usp=sharing
-
API Specification
- Endpoint: `/infer`
- Port: `25000`
- Header: `Content-Type` header set to `multipart/form-data`
- Request Body:
  - Type: `form-data`
  - key: `image`
    - Allowed file-types: `png`, `jpg`, `jpeg`, `gif`
  - key: `task`
    - `ocr` for optical-character recognition - outputs text & bounding-box co-ordinates
    - `md` for markdown - outputs text from the image in markdown format
-
-
-
- Response:
  - Format: `json`
  - Successful Response: `{'output': result.stdout, 'error': result.stderr}`
    - Note: `stderr` contains any warnings - these are not errors but rather general data and `FutureWarning` notifications (a response-handling sketch follows the client examples below)
  - Error Response:
    - If `image` key missing: Response: `{'error': 'No image file provided'}`, Code: `400`
    - If `image` key present but no file sent: Response: `{'error': 'No selected file'}`, Code: `400`
    - If invalid file-type: Response: `{'error': 'Invalid file type'}`, Code: `400`
-
-
Invoke Kosmos-2.5 API - /infer endpoint
Using the Postman Desktop Client:
- Refer to the screenshot below:
-
Open Postman
-
Create a new request: Click on the `+` tab or `New` button to create a new request
Set up the request:
-
Change the `HTTP method` to `POST` using the dropdown menu
- Enter the URL: `http://localhost:25000/infer`
-
-
Set up the request body:
-
Click on the `Body` tab
- Select `form-data`
-
You'll need to add two key-value pairs:
-
For the image:
```
Key: image
Value: Select "File" from the dropdown next to the key   # click "Select Files" and choose your image file
```
-
For the task:
```
Key: task
Value: ocr (type this as text)   # or md for markdown
```
-
-
-
Headers: Postman will automatically set the `Content-Type` header to `multipart/form-data` when you use form-data, so you don't need to set this manually
Send the request: Click the "Send" button
-
via BASH:
-
For OCR:
curl -X POST -F "image=@/path/to/local/image.jpg" -F "task=ocr" http://localhost:25000/infer
-
For markdown:
curl -X POST -F "image=@/path/to/local/image.jpg" -F "task=md" http://localhost:25000/infer
-
via Python (requests):
```
import requests
url = "http://localhost:25000/infer"
files = {"image": open("path/to/image.jpg", "rb")}
data = {"task": "ocr"} # or md for markdown
response = requests.post(url, files=files, data=data)
print(response.json())
```
via JavaScript (fetch):
```
const formdata = new FormData();
formdata.append("image", fileInput.files[0], "path/to/image.jpg");
formdata.append("task", "ocr"); # or md for markdown
const requestOptions = {
method: "POST",
body: formdata,
redirect: "follow"
};
fetch("http://localhost:25000/infer", requestOptions)
.then((response) => response.text())
.then((result) => console.log(result))
.catch((error) => console.error(error));
```
via jQuery:
```
var form = new FormData();
form.append("image", fileInput.files[0], "path/to/image.jpg");
form.append("task", "ocr"); # or md for markdown
var settings = {
"url": "http://localhost:25000/infer",
"method": "POST",
"timeout": 0,
"processData": false,
"mimeType": "multipart/form-data",
"contentType": false,
"data": form
};
$.ajax(settings).done(function (response) {
console.log(response);
});
```
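Whichever client you use, remember from the API Specification that the `error` field relays `stderr`, which typically carries warnings (e.g. `FutureWarning` notices) rather than actual failures, while genuine request errors come back with a `400` status. A small, illustrative way to handle the response in Python:
```
# Illustrative response handling for the /infer endpoint
import requests

url = "http://localhost:25000/infer"
with open("path/to/image.jpg", "rb") as f:
    resp = requests.post(url, files={"image": f}, data={"task": "ocr"})  # or "md"

if resp.status_code != 200:
    # Request-level errors (missing image key, no file, invalid file type) return HTTP 400
    print("Request failed:", resp.json().get("error"))
else:
    body = resp.json()
    print("Model output:\n", body["output"])
    if body.get("error"):
        # stderr from the model run: usually general info & FutureWarning notices, not a failure
        print("Warnings:\n", body["error"])
```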
Rebuilding the Dependencies & Container - If the Pre-Built Image & dockerfile in this Repo Fail to Work
-
If the pre-built image provided in this repository doesn't work, and neither does a fresh `docker build` with the dockerfile provided, it may be an issue with re-using the prebuilt wheels
In this case, you may elect to build these dependencies on your system via either of the options below - strap in with your favorite drink (or three) as this is going to be a long ride!
-
This method is the preferred route, as building wheels for the `flash-attention` and `xformers` libraries takes a very long time and far more hardware resources when done via the dockerfile during the container build, as compared to doing so on the host system
For instance, building the `flash-attention` library takes about an hour on my host system (Windows 11, Intel Core i9 13900KF, RTX 3090) while fitting comfortably within the 32GB of SysRAM. Within the container build though, it wasn't even half done after an hour, and an additional 100GB pagefile was necessary to augment the SysRAM!
Even on the host OS, these builds will take a while so don't be alarmed
-
If you're doing this on Windows, you must use your Ubuntu-22.04 WSL environment for building the wheels, and then transfer them to Windows to build the container
-
In a bash shell (search for `Ubuntu` in the Start-menu in Windows), perform the following steps:
-
Install flash-attention:
-
install PyTorch:
```
sudo apt install python3-pip
# modify CUDA version if required - https://pytorch.org/get-started/locally/
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
```
-
install dependencies:
```
pip install wheel==0.37.1
pip install ninja==1.11.1
pip install packaging==24.1
pip install numpy==1.22
pip install psutil==6.0.0
```
-
`git clone` and `cd` into the repo:
```
git clone -b v2.5.9.post1 https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
```
-
install from repo:
pip install . --no-build-isolation
-
test flash-attention installation (example output: 2.5.9.post1):
```
python3
# then, within the Python interpreter:
import flash_attn
print(flash_attn.__version__)
```
-
-
Install xformers:
-
Run the below command:
pip install git+https://github.com/facebookresearch/xformers.git@04de99bb28aa6de8d48fab3cdbbc9e3874c994b8
-
-
Locate and Copy Wheels:
-
Locate the wheels that were built for the above:
find ~ -name "*.whl"
-
Look for the paths similar to the below:
```
/home/<username>/.cache/pip/wheels/e1/b9/e3/5b5b849d01c0e4007af963f69ad86fb43910a0c18080ee8918/xformers-0.0.22+04de99b.d20240705-cp310-cp310-linux_x86_64.whl
/home/<username>/.cache/pip/wheels/f6/b4/f5/30df6540ed09f56a99a1138f669e1dbee729478850845504f0/flash_attn-2.5.9.post1-cp310-cp310-linux_x86_64.whl
```
-
Copy the wheels:
-
In Linux:
```
cp <path_to_flash_attn_wheel> <path_to_cloned_kosmos_container_repo>/kosmos-2_5-container-files/prebuilt_wheels
cp <path_to_xformer_wheel> <path_to_cloned_kosmos_container_repo>/kosmos-2_5-container-files/prebuilt_wheels
```
-
In Windows:
-
This will make a folder in your Windows User directory:
mkdir -p /mnt/c/Users/YourWindowsUsername/wsl_wheels
-
Transfer wheels from WSL to Windows:
```
cp <path_to_flash_attn_wheel> /mnt/c/Users/YourWindowsUsername/wsl_wheels
cp <path_to_xformer_wheel> /mnt/c/Users/YourWindowsUsername/wsl_wheels
```
-
Now copy the wheels from `C:/Users/YourWindowsUsername/wsl_wheels` to `<path_to_cloned_kosmos_container_repo>/kosmos-2_5-container-files/prebuilt_wheels`
-
-
-
-
Build Container:
-
Navigate to `<path_to_cloned_kosmos_container_repo>/kosmos-2_5-container-files`
-
Run Docker Build:
```
docker build --progress=plain -t kosmos-2_5 .
# To build without using cached data:
docker build --progress=plain --no-cache -t kosmos-2_5 .
```
-
Note: A snapshot of the official model checkpoint at the time this repo was created can also be obtained from my backup here: https://drive.google.com/file/d/17RwlniqMwbLEMj5ELQd9iQ4kor749Z0e/view?usp=sharing
-
-
WARNING: Very slow and requires significant hardware resources, particularly SysRAM!
-
For instance, building the `flash-attention` library takes about an hour on my host system (Windows 11, Intel Core i9 13900KF, RTX 3090) while fitting comfortably within the 32GB of SysRAM. Within the container build though, it wasn't even half done after an hour, and an additional 100GB pagefile was necessary to augment the SysRAM!
If you still choose this route, then clone this repo, replace the supplied `dockerfile` with the one below, and use `docker build`:
- Navigate to `kosmos-2_5-containerized/kosmos-2_5-container-files`:
```
cd kosmos-2_5-containerized/kosmos-2_5-container-files
```
-
-
Replace the existing `dockerfile` with the one below (MODIFY IT AS PER COMMENTS!):
```
# Use an official Nvidia CUDA runtime as a parent image - MODIFY CUDA VERSION AS REQUIRED
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
# Avoid interactive prompts - auto-select defaults for any prompts
ARG DEBIAN_FRONTEND=noninteractive
# Set timezone for tzdata package as it's a dependency for some packages
ENV TZ=America/Los_Angeles
# Set the working directory in the container
WORKDIR /app
# Install Python & PIP and git & wget to clone model repo and download model checkpoint - build-essential includes GCC, G++, and make; libffi-dev for Foreign Function Interface (FFI); libssl-dev for SSL support
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
software-properties-common \
build-essential \
libffi-dev \
libssl-dev \
git \
wget \
net-tools \
iproute2 \
cuda-toolkit-12-4
# Copy the current directory contents into the container at /app
COPY . /app
# Setting environment variables to maximize use of available hardware resources - MODIFY MAX_JOBS AS PER YOUR CPU LOGICAL CORE COUNT
ENV MAKEFLAGS="-j$(nproc)"
ENV MAX_JOBS=16
# OPTIONAL - Change fPIC level, and Set CUDA optimizations as per your GPU arch - `arch=compute_86,code=sm_86` is for RTX 3000 Ampere, `arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90` is for Ampere & Hopper etc
# ENV CUDA_NVCC_FLAGS="-Xcompiler -fPIC -O3 --use_fast_math -gencode arch=compute_86,code=sm_86"
# Install PyTorch Nightly Build for CUDA 12.4, dependencies for Flash Attention 2 and initial dependencies for Kosmos-2.5
RUN pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 && \
pip install -v wheel==0.37.1 ninja==1.11.1 packaging==24.1 numpy==1.22 psutil==6.0.0 && \
pip install -v tiktoken tqdm "omegaconf<=2.1.0" boto3 iopath "fairscale==0.4" "scipy==1.10" triton flask
# Clone specific flash-attn release
RUN git clone -b v2.5.9.post1 https://github.com/Dao-AILab/flash-attention.git
# Set work directory to install flash-attention
WORKDIR /app/flash-attention
RUN pip install . --no-build-isolation
# Change back to the main app directory
WORKDIR /app
# Clone model checkpoint
RUN wget -P /app/kosmos-2_5 https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt
# Install remaining dependencies for Kosmos-2.5 from custom repos
RUN pip install -v git+https://github.com/facebookresearch/xformers.git@04de99bb28aa6de8d48fab3cdbbc9e3874c994b8 && \
pip install -v git+https://github.com/Dod-o/kosmos2.5_tools.git@fairseq && \
pip install -v git+https://github.com/Dod-o/kosmos2.5_tools.git@infinibatch && \
pip install -v git+https://github.com/Dod-o/kosmos2.5_tools.git@torchscale && \
pip install -v git+https://github.com/Dod-o/kosmos2.5_tools.git@transformers
# Create image upload directory, no error if already exists
RUN mkdir -p /tmp
# Make port 25000 available to the world outside this container - MODIFY IF DESIRED
EXPOSE 25000
# Change back to the main app directory
WORKDIR /app/kosmos-2_5
# Run application
CMD ["python3", "kosmos_api.py"]
```
-
Note: A snapshot of the official model checkpoint at the time this repo was created can also be obtained from my backup here: https://drive.google.com/file/d/17RwlniqMwbLEMj5ELQd9iQ4kor749Z0e/view?usp=sharing
-
Run Docker Build:
```
docker build --progress=plain -t kosmos-2_5 .
# To build without using cached data:
docker build --progress=plain --no-cache -t kosmos-2_5 .
```
-
If you notice `killed` while building flash-attention, your system is resource-constrained and the docker process is being killed. To mitigate this, you may try to modify the CPU, RAM, and pagefile resources allocated to WSL. To do so:
Navigate to `C:\Users\<YourUsername>`
-
Create a `.wslconfig` file, or modify it if it already exists
Enter/tweak the below parameters:
```
[wsl2]
memory=24GB
processors=16
swap=80GB
```
-
Restart WSL for changes to take effect:
```
# via command-prompt / PowerShell:
wsl --shutdown
# Confirm no distros are running:
wsl --list --running   # should show "There are no running distributions."
# Restart:
wsl
```
-
Keep an eye on resource use via the Task Manager's `Performance` tab, and modify the values above as required
-
-
You may also try to build with increased verbosity as required to diagnose any other issues:
pip install -vvv . --no-build-isolation
-
Running Kosmos-2.5 Uncontainerized
If you prefer to set up Kosmos-2.5 uncontainerized, be aware that it's incredibly temperamental and has a bunch of specific requirements
-
As a result, you may wish to configure a Python virtual environment
-
Requirements (in addition to Dependencies above)
-
Linux, as `triton` is officially only supported on Linux - use your Ubuntu-22.04 WSL environment if on Windows
-
Python v3.10.x, as the custom `fairseq` lib malfunctions on 3.11.x
-
-
In a bash shell (search for `Ubuntu` in the Start-menu in Windows), perform the following steps:
Install CUDA Toolkit v12.4.1 (or appropriate version for your system):
```
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-wsl-ubuntu-12-4-local_12.4.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-4-local_12.4.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
```
-
Set NVCC PATH:
-
confirm symlink for cuda:
```
ls -l /usr/local/cuda
ls -l /etc/alternatives/cuda
```
-
update bashrc:
```
nano ~/.bashrc
# add this line to the end of bashrc:
export PATH=/usr/local/cuda/bin:$PATH
```
-
reload bashrc:
source ~/.bashrc
-
-
Confirm CUDA installation:
```
nvcc -V
nvidia-smi
```
-
Install flash-attention:
-
install PyTorch:
```
sudo apt install python3-pip
# modify CUDA version if required - https://pytorch.org/get-started/locally/
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
```
-
install dependencies:
```
pip install wheel==0.37.1
pip install ninja==1.11.1
pip install packaging==24.1
pip install numpy==1.22
pip install psutil==6.0.0
```
-
`git clone` and `cd` into the repo:
```
git clone -b v2.5.9.post1 https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
```
-
install from repo:
pip install . --no-build-isolation
-
test flash-attention installation (example output: 2.5.9.post1):
```
python3
# then, within the Python interpreter:
import flash_attn
print(flash_attn.__version__)
```
-
-
Install Kosmos-2.5!
-
PIP Requirements:
```
pip install tiktoken
pip install tqdm
pip install "omegaconf<=2.1.0"
pip install boto3
pip install iopath
pip install "fairscale==0.4"
pip install "scipy==1.10"
pip install triton
pip install git+https://github.com/facebookresearch/xformers.git@04de99bb28aa6de8d48fab3cdbbc9e3874c994b8
pip install git+https://github.com/Dod-o/kosmos2.5_tools.git@fairseq
pip install git+https://github.com/Dod-o/kosmos2.5_tools.git@infinibatch
pip install git+https://github.com/Dod-o/kosmos2.5_tools.git@torchscale
pip install git+https://github.com/Dod-o/kosmos2.5_tools.git@transformers
```
-
Clone Repo and Checkpoint:
```
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2.5/
wget https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt
```
-
Note: A snapshot of the official model checkpoint at the time this repo was created can also be obtained from my backup here: https://drive.google.com/file/d/17RwlniqMwbLEMj5ELQd9iQ4kor749Z0e/view?usp=sharing
-
Run OCR!
```
python3 inference.py --do_ocr --image assets/example/in.png --ckpt ckpt.pt
python3 inference.py --do_md --image assets/example/in.png --ckpt ckpt.pt
```
-
-