Use speech input to generate images and convert to GCode

This project was developed during our master studies at the Kempten University of Applied Sciences in cooperation with the Institute for Data-optimised Manufacturing (IDF).

Project members: Linus Göhl, Quirin Sandt, Benjamin Schober


The goal of the project was to create a pipeline that converts spoken language to GCode (e.g. for a CNC milling machine). This requires several components:

Complete Pipeline

Short information on how this pipeline works:

  • Audio is transcribed to text, which yields the prompt.
  • The prompt is used to generate images with Stable Diffusion.
  • Each image is rated by its quality and checked with object detection.
  • The selected image is preprocessed and converted to GCode.
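
The four steps above can be sketched as a chain of stages. This is only an illustration of the data flow, not the project's actual code: the class name `SpeechToGCode` and all stage names are hypothetical, and each stage is passed in as a plain callable so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SpeechToGCode:
    """Chains the four stages; every stage is an injected callable (hypothetical names)."""

    transcribe: Callable[[bytes], str]            # audio -> prompt
    generate: Callable[[str], Sequence[object]]   # prompt -> candidate images
    rate: Callable[[object, str], float]          # (image, prompt) -> score
    to_gcode: Callable[[object], str]             # selected image -> GCode text

    def run(self, audio: bytes) -> str:
        prompt = self.transcribe(audio)
        images = self.generate(prompt)
        if not images:
            raise ValueError("image generation returned no candidates")
        # Assumption: the highest-rated candidate is the one that gets converted.
        best = max(images, key=lambda image: self.rate(image, prompt))
        return self.to_gcode(best)
```

Keeping the stages as injected callables mirrors the split described here, where some parts run on a GPU cluster and the GCode step runs locally.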

Below is more detailed information about the specific pipeline parts, models and technologies used.


Most of the pipeline components are deployed within a Docker container running on a GPU cluster. The pipelines are accessed through a REST API.

Text processing

Text processing pipeline

Models and technologies used:

| Model/Technology | Description | Link |
|---|---|---|
| openai/whisper-large-v2 | Speech recognition model (ASR) | OpenAI GitHub, HuggingFace Model, [Paper] |
| Helsinki-NLP/opus-mt-de-en | Translation model | Helsinki-NLP GitHub, HuggingFace Model |
| NLTK | Natural Language Toolkit. Used for keyword/noun extraction | NLTK GitHub, NLTK Website |

Because the pipeline is accessed through a REST API, all functional parts are implemented in the class TextPipeline. When the pipeline is deployed, one instance of the class is created and the models are loaded into VRAM. The pipeline consists of multiple models and parts, so the following endpoints and functions are available:

| Endpoint | Description |
|---|---|
| /api/transcribe | Transcribes the audio file to text (executes the transcribe, translate and extract_nouns functions). |
| /api/translate | Translates the text to English (executes the translate and extract_nouns functions). |
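
A minimal sketch of how such a class and its endpoints could fit together is shown below. It is not the project's implementation: the constructor arguments, `load_default_pipeline`, `ENDPOINT_STEPS`, `handle` and the response shape are assumptions made for illustration. Only the class name, the three function names, the model identifiers and the endpoint paths come from this README.

```python
from typing import Callable, List, Tuple


class TextPipeline:
    """Holds the models; one instance is created at startup and reused per request."""

    def __init__(
        self,
        asr: Callable[[object], str],
        translator: Callable[[str], str],
        tagger: Callable[[str], List[Tuple[str, str]]],
    ) -> None:
        self._asr = asr
        self._translator = translator
        self._tagger = tagger

    def transcribe(self, audio) -> str:
        return self._asr(audio).strip()

    def translate(self, text: str) -> str:
        return self._translator(text)

    def extract_nouns(self, text: str) -> List[str]:
        # Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
        return [word for word, tag in self._tagger(text) if tag.startswith("NN")]


def load_default_pipeline(device: int = 0) -> TextPipeline:
    """Loads the models from the table above; assumes a CUDA GPU at index `device`
    and that the NLTK tokenizer and tagger data are already downloaded."""
    import nltk
    from transformers import pipeline  # imported lazily: heavy dependency

    asr = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v2", device=device)
    mt = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en", device=device)
    return TextPipeline(
        asr=lambda audio: asr(audio)["text"],
        translator=lambda text: mt(text)[0]["translation_text"],
        tagger=lambda text: nltk.pos_tag(nltk.word_tokenize(text)),
    )


# Which TextPipeline functions each endpoint executes, in order.
ENDPOINT_STEPS = {
    "/api/transcribe": ("transcribe", "translate", "extract_nouns"),
    "/api/translate": ("translate", "extract_nouns"),
}


def handle(endpoint: str, payload, pipeline) -> dict:
    """Runs the steps for one endpoint; the returned dict shape is an assumption."""
    steps = ENDPOINT_STEPS[endpoint]  # raises KeyError for unknown endpoints
    text = payload
    for name in steps[:-1]:
        text = getattr(pipeline, name)(text)
    nouns = getattr(pipeline, steps[-1])(text)
    return {"text": text, "nouns": nouns}
```

Because the models are injected into the constructor, the request handling can be exercised with lightweight stand-ins, while `load_default_pipeline` would be called once at deployment.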

Image creation and rating

| Model/Technology | Description | Link |
|---|---|---|
| stabilityai/stable-diffusion-2-1-base | Image generation model | HuggingFace Model, [Paper] |
| LAION-Aesthetics_Predictor V1 | Image rating model | GitHub, [Paper] |
| Grounding DINO | Object detection model | GitHub |

Rating pipeline
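
How the aesthetic rating and the object detection result are combined is not spelled out in this README. The sketch below shows one plausible rule, purely as an assumption: keep only the candidates in which the requested object was detected, then pick the one with the highest aesthetic score. `Candidate` and `select_image` are hypothetical names, and the scores and labels are expected to come from the models listed above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    image_id: str
    aesthetic_score: float             # e.g. output of an aesthetics predictor
    detected_labels: Tuple[str, ...]   # e.g. labels returned by object detection


def select_image(candidates: Sequence[Candidate],
                 required_label: str) -> Optional[Candidate]:
    """Returns the best-scoring candidate that contains `required_label`, or None."""
    matching = [c for c in candidates if required_label in c.detected_labels]
    if not matching:
        return None
    return max(matching, key=lambda c: c.aesthetic_score)
```

Returning `None` when no candidate contains the object lets the caller decide whether to generate a new batch of images or to fall back to the score alone.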

Image preprocessing and GCode generation

Note: This pipeline component is not deployed within a Docker container; it runs on the local machine.
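
The preprocessing steps are not described here, so the following is only a sketch of the final conversion under stated assumptions: the image has already been reduced to a contour given as a list of (x, y) points in millimetres, and the function name `polyline_to_gcode` and its parameters (`feed_rate`, `cut_depth`, `safe_height`) are hypothetical. It emits standard commands: G21 (millimetre units), G90 (absolute positioning), G0 (rapid move) and G1 (linear move at feed rate).

```python
from typing import Iterable, List, Tuple


def polyline_to_gcode(
    points: Iterable[Tuple[float, float]],
    feed_rate: float = 300.0,    # mm/min, illustrative default
    cut_depth: float = -1.0,     # mm below the workpiece surface
    safe_height: float = 5.0,    # mm above the workpiece surface
) -> List[str]:
    """Converts one contour into a list of GCode lines."""
    pts = list(points)
    if len(pts) < 2:
        raise ValueError("a contour needs at least two points")

    x0, y0 = pts[0]
    lines = [
        "G21",                                    # units: millimetres
        "G90",                                    # absolute positioning
        f"G0 Z{safe_height:.3f}",                 # lift the tool
        f"G0 X{x0:.3f} Y{y0:.3f}",                # rapid move to the start point
        f"G1 Z{cut_depth:.3f} F{feed_rate:.1f}",  # plunge to cutting depth
    ]
    for x, y in pts[1:]:
        lines.append(f"G1 X{x:.3f} Y{y:.3f} F{feed_rate:.1f}")
    lines.append(f"G0 Z{safe_height:.3f}")        # retract when done
    return lines
```

For example, `"\n".join(polyline_to_gcode([(0, 0), (10, 0), (10, 10)]))` gives a program that cuts two connected straight segments and then retracts the tool.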