
Limit realtime transcription time #117

Open
tictproject opened this issue Sep 22, 2024 · 34 comments

@tictproject

Currently, if real-time transcription continues for a long time, the sentence grows very large, so processing it takes a lot of time even with CUDA. Is it possible to set something like a maximum_audio_duration to process smaller chunks instead of the whole thing?

@KoljaB
Owner

KoljaB commented Sep 23, 2024

I can't introduce a fixed max duration because that could cut into words and result in wrong/imprecise transcriptions. Please set the post_speech_silence_duration parameter to lower values like 0.1, 0.2 or 0.3 to make it detect sentences or sentence fragments faster.
I think if transcription takes a long time (like > 1 second), the CUDA/cuDNN/torch combination probably isn't working well together with faster_whisper. If everything works well and the GPU isn't too outdated, you should end up below 1 second with the large-v2 model in most cases (at least on my old RTX 2080 it's far below 1 sec).
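
As a minimal sketch of that (assuming the usual RealtimeSTT usage pattern from the README; parameter defaults may differ between versions):

from RealtimeSTT import AudioToTextRecorder

def process_text(text):
    print(text)

if __name__ == '__main__':
    recorder = AudioToTextRecorder(
        model="large-v2",
        language="en",
        post_speech_silence_duration=0.2,  # finalize sentences/fragments after 0.2 s of silence
    )
    while True:
        recorder.text(process_text)  # blocks until the next full sentence is transcribed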

@tictproject
Author

Realtime transcription is working pretty fast, but I have issues with the time for fullSentence recognition. python -c "import torch; print(torch.cuda.is_available())" returns True for me and the CUDA version is 12.4; maybe I'm missing something and need to check something else? Thanks

@KoljaB
Owner

KoljaB commented Sep 23, 2024

It would be helpful to see your AudioToTextRecorder class constructor parameters. You can change the model parameter to a smaller model like "medium" (maybe try a distil model from Systran) and/or reduce the beam size to make it faster. You could also try raising the realtime_processing_pause parameter, or try use_main_model_for_realtime to work with only a single model.
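
For illustration, the single-model variant could look roughly like this (only a sketch; the parameter names are the ones mentioned in this thread, the values are examples):

recorder = AudioToTextRecorder(
    model="medium",                    # or a Systran distil model
    enable_realtime_transcription=True,
    use_main_model_for_realtime=True,  # reuse the main model for the realtime preview
    realtime_processing_pause=0.3,     # give the GPU a pause between realtime passes
)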

@tictproject
Author

'spinner': False,
'use_microphone': False,
'model': 'large-v2',
'language': 'en',
'silero_sensitivity': 0.4,
'webrtc_sensitivity': 2,
'post_speech_silence_duration': 0.6,
'min_length_of_recording': 0,
'min_gap_between_recordings': 0,
'enable_realtime_transcription': True,
'realtime_processing_pause': 0,
'realtime_model_type': 'large-v2',
'on_realtime_transcription_stabilized': text_detected,

I took it from the browser_client example.

use_main_model_for_realtime will result in the same thing, no? Because I have large-v2 for realtime and as the main model too.

@KoljaB
Owner

KoljaB commented Sep 23, 2024

large-v2 as realtime_model_type could be the problem; this is a big model for realtime transcriptions. With realtime_processing_pause = 0, the realtime large-v2 model transcribes nonstop and could consume too many GPU resources.
Try changing realtime_model_type to tiny.en, small.en or medium.en. Keep 'model': 'large-v2'. Set realtime_processing_pause to somewhere between 0.2 and 0.5.

use_main_model_for_realtime will result in the same thing

No, it would only use a single model then. Right now it loads large-v2 twice and does the processing in parallel, which is a difference. I recommend trying the settings above ('realtime_model_type': 'tiny.en', 'realtime_processing_pause': 0.5) without 'use_main_model_for_realtime': True first.
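
Applied to the config posted above, the suggested changes would look something like this (illustrative values; text_detected is the callback from that snippet):

recorder_config = {
    'spinner': False,
    'use_microphone': False,
    'model': 'large-v2',               # keep the big model for the final fullSentence pass
    'realtime_model_type': 'tiny.en',  # small, fast model for the live preview
    'realtime_processing_pause': 0.5,  # don't transcribe the preview nonstop
    'language': 'en',
    'silero_sensitivity': 0.4,
    'webrtc_sensitivity': 2,
    'post_speech_silence_duration': 0.6,
    'min_length_of_recording': 0,
    'min_gap_between_recordings': 0,
    'enable_realtime_transcription': True,
    'on_realtime_transcription_stabilized': text_detected,
}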

@tictproject
Author

I have tried that and it looks like nothing changed.

For example, here is the sentence "Transcribe one of those work meetings that you missed, or copy-paste the link to a YouTube documentary you're curious about. See what comes out!"

I'm receiving the fullSentence event only after almost 3 seconds, which is too much. Are you sure it's possible to receive a fullSentence transcription with large-v2 within a second?

@KoljaB
Owner

KoljaB commented Sep 25, 2024

I am absolutely sure; I'm under 300 milliseconds here for a large-v2 transcription. I'd recommend testing the faster_whisper library directly to find the issue; I feel this might be beyond the scope of RealtimeSTT.

@tictproject
Author

tictproject commented Sep 25, 2024

Can you suggest how to debug this, please? I'm not a Python developer, and this application is running on GPU instances because my Mac doesn't have NVIDIA, so it's pretty tough for me to test; locally it would be slow. I'm using the browser client example.

Also, large-v2 realtime transcription works OK; fullSentence requires 1-2 seconds after it to finalize the response.

P.S. Can you also share your system specs? Maybe I'll try to set up the same one in the cloud.

@KoljaB
Owner

KoljaB commented Sep 25, 2024

That realtime transcription with large-v2 works indicates that the general transcription time seems to be fast enough. It's hard to tell where the problem is in this setup. Please change server.py and add:

import logging
logging.basicConfig(level=logging.DEBUG)

as the first lines.

Also please add

'level': logging.DEBUG,

to the recorder config. This should give you extended logging on the server for more insights.

You might also want to hook into on_recording_start, on_recording_stop and on_transcription_start callbacks and log your own timestamps, so you can see where the time is spent.
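
A rough sketch of that kind of timing instrumentation (hypothetical variable names; the callbacks are the ones named above):

import logging
import time

logging.basicConfig(level=logging.DEBUG)

timestamps = {}

def on_recording_start(*_):
    timestamps['recording_start'] = time.time()

def on_recording_stop(*_):
    timestamps['recording_stop'] = time.time()

def on_transcription_start(*_):
    timestamps['transcription_start'] = time.time()
    print(f"recording stop -> transcription start: "
          f"{timestamps['transcription_start'] - timestamps['recording_stop']:.2f} s")

recorder_config = {
    # ... the parameters from above ...
    'level': logging.DEBUG,
    'on_recording_start': on_recording_start,
    'on_recording_stop': on_recording_stop,
    'on_transcription_start': on_transcription_start,
}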

@tictproject
Author

2024-09-25.20-25-08.mov

Here is an example of how it's working, and it's much slower than in your example. I have 2 GPUs with 15 GB each.

@KoljaB
Owner

KoljaB commented Sep 25, 2024

OK, that looks like it says cuda.is_available(): True but still does not use the GPU.

@tictproject
Author

Are you sure it's not using the GPU? I think that for large-v2, the CPU would be much slower, no?

@KoljaB
Owner

KoljaB commented Sep 25, 2024

I would say if it used the GPU it should be much faster. Verify whether GPU load goes up while transcribing and whether VRAM usage goes up while loading the model. It might be that you need to use ROCm for the PyTorch install (--index-url https://download.pytorch.org/whl/rocm5.6 and +rocm instead of +cu121), but I have no Mac, so I don't know for sure...
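
One quick check (a sketch; faster_whisper runs on CTranslate2, so what matters is whether CTranslate2 sees the GPU, not only torch):

import torch
import ctranslate2

print("torch sees CUDA:  ", torch.cuda.is_available())
print("CTranslate2 GPUs: ", ctranslate2.get_cuda_device_count())
# If the second number is 0, CTranslate2 (and therefore faster_whisper)
# cannot use the GPU even though torch reports CUDA as available.
# Also watch nvidia-smi while transcribing: GPU load and VRAM usage should go up.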

@tictproject
Author

tictproject commented Sep 25, 2024

Here is debug console
RealTimeSTT: torio._extension.utils - DEBUG - Loading FFmpeg6
DEBUG:torio._extension.utils:Failed to load FFmpeg6 extension.
Traceback (most recent call last):
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torio/_extension/utils.py", line 116, in _find_ffmpeg_extension
ext = _find_versionsed_ffmpeg_extension(ffmpeg_ver)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torio/_extension/utils.py", line 108, in _find_versionsed_ffmpeg_extension
_load_lib(lib)
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torio/_extension/utils.py", line 94, in _load_lib
torch.ops.load_library(path)
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torch/_ops.py", line 1032, in load_library
ctypes.CDLL(path)
File "/usr/lib/python3.11/ctypes/init.py", line 376, in init
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libavutil.so.58: cannot open shared object file: No such file or directory
RealTimeSTT: torio._extension.utils - DEBUG - Failed to load FFmpeg6 extension.
Traceback (most recent call last):
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torio/_extension/utils.py", line 116, in _find_ffmpeg_extension
ext = _find_versionsed_ffmpeg_extension(ffmpeg_ver)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torio/_extension/utils.py", line 108, in _find_versionsed_ffmpeg_extension
_load_lib(lib)
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torio/_extension/utils.py", line 94, in _load_lib
torch.ops.load_library(path)
File "/home/illiashkurenko/venv/lib/python3.11/site-packages/torch/_ops.py", line 1032, in load_library
ctypes.CDLL(path)
File "/usr/lib/python3.11/ctypes/init.py", line 376, in init
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libavutil.so.58: cannot open shared object file: No such file or directory
DEBUG:torio._extension.utils:Loading FFmpeg5
RealTimeSTT: torio._extension.utils - DEBUG - Loading FFmpeg5
DEBUG:torio._extension.utils:Successfully loaded FFmpeg5
RealTimeSTT: torio._extension.utils - DEBUG - Successfully loaded FFmpeg5
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/Systran/faster-whisper-large-v2/revision/main HTTP/1.1" 200 1857
RealTimeSTT: urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /api/models/Systran/faster-whisper-large-v2/revision/main HTTP/1.1" 200 1857
DEBUG:root:Silero VAD voice activity detection engine initialized successfully
RealTimeSTT: root - DEBUG - Silero VAD voice activity detection engine initialized successfully
DEBUG:root:Starting recording worker
DEBUG:root:Starting realtime worker
DEBUG:root:Waiting for main transcription model to start
RealTimeSTT: root - DEBUG - Starting recording worker
RealTimeSTT: root - DEBUG - Starting realtime worker
RealTimeSTT: root - DEBUG - Waiting for main transcription model to start
DEBUG:root:Faster_whisper main speech to text transcription model initialized successfully
DEBUG:root:Main transcription model ready
RealTimeSTT: root - DEBUG - Faster_whisper main speech to text transcription model initialized successfully
RealTimeSTT: root - DEBUG - Main transcription model ready
DEBUG:root:RealtimeSTT initialization completed successfully
RealTimeSTT: root - DEBUG - RealtimeSTT initialization completed successfully
RealtimeSTT initialized
Server started. Press Ctrl+C to stop the server.
INFO:websockets.server:server listening on 0.0.0.0:8001
RealTimeSTT: websockets.server - INFO - server listening on 0.0.0.0:8001

It looks like some issue with FFmpeg, but it's installed on the machine. Can it affect performance?

Also, nvidia-smi shows GPU usage after launching server.py.
Screenshot 2024-09-25 at 20 44 48

By the way, I'm testing on Debian remotely.

@tictproject
Author

2024-09-25.20-53-24.mov

Here is live usage. Volatile GPU-Util goes to 90-100% per transcription; is that normal?

@KoljaB
Owner

KoljaB commented Sep 25, 2024

The FFmpeg messages are totally normal; they do not affect performance. Most probably the Tesla T4 simply doesn't transcribe any faster than that. Here is another user with this GPU who says:
"I'm using the large-v2 model and the usual runtime for a 10-15s clip is around 2-5s."

@KoljaB
Owner

KoljaB commented Sep 25, 2024

Try using a distil model as I already suggested. Just by changing "large-v2" to "distil-large-v2" you should get roughly 2x faster transcription, and if you can live with the transcription quality of "distil-medium.en" you might even get 4x or 6x speed. Maybe fine-tune the beam size a bit (if the transcription quality of distil-medium.en is not good enough, increasing the beam size a bit can help without adding too much latency).
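
For example, relative to the config above (a sketch; beam_size and beam_size_realtime are recorder parameters documented in the README, if the installed version exposes them, and the values are just starting points):

recorder_config = {
    # ... other parameters unchanged ...
    'model': 'distil-large-v2',   # roughly 2x faster than large-v2
    'realtime_model_type': 'tiny.en',
    'beam_size': 3,               # smaller beam = faster final transcription, slightly less accurate
    'beam_size_realtime': 1,      # the realtime preview can afford a greedy search
}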

@tictproject
Author

Thanks for the advice. distil-large-v2 is not as accurate as large-v2, so I set it aside. I swapped my T4 for an NVIDIA V100 and it's faster by 1-1.5 seconds; now that sentence is produced in 1.5-1.9 seconds. It seems that to get your results I'd need to pay a lot for a very powerful machine.

If I add an additional GPU and pass the devices as an array like [0,1], will I get better results, or does it not matter?

@KoljaB
Owner

KoljaB commented Sep 25, 2024

Thank you for providing the videos; that helped a lot in getting an idea of what was going on. I was using an RTX 2080 for a long time. The video on the front page of this repo shows the performance with this card, so you don't need to put in the money for a 4090 or so. But I can't tell which card works best for your money regarding faster_whisper. Maybe you want to try some other CTranslate2 models like distil-large-v3 or something else.

Additional GPUs passed in like [0,1] will not speed up a single transcription. This is only used to parallelize multiple transcriptions.

@tictproject
Author

tictproject commented Sep 26, 2024

I'm stuck with this and getting frustrated. I've created instances on vast.ai with an RTX 3090 and an RTX 4090, and got the same result: 2-3 seconds for short sentences, wtf...

@tictproject
Author

Also, I think the last commit 9 hours ago broke something, and now the client is not initializing fully. The model download window stays at 100% and there is no
"RealtimeSTT initialized"
message, and also no text is printed in the console.
Screenshot 2024-09-26 at 23 50 43

Screenshot 2024-09-26 at 23 54 28

I've even tried an A100 instance, which is the most powerful, and there is no change in speed. Could it be because your app is running locally while mine is passing chunks of data via websockets to a server hosted in the cloud?

@KoljaB
Owner

KoljaB commented Sep 26, 2024

C:\Dev\Audio\RealtimeSTT\RealtimeSTT\tests>python simple_test.py
Say something...
Model large-v2 completed transcription in 0.32 seconds
Hey there, this is a test.

System: RTX 4090, Windows 11, Python 3.10.9, CUDA 12.1, cuDNN v8.9.7

@tictproject
Author

I have no idea why it's like that. I've tried a lot of instances and cloud providers, and the result is always the same. The only difference is that you are running it locally while I run it in the cloud, passing data via websockets.

Also, can you check what happened after your last commit?

@KoljaB
Owner

KoljaB commented Sep 26, 2024

Also, I think the last commit 9 hours ago broke something, and now the client is not initializing fully. The model download window stays at 100% and there is no "RealtimeSTT initialized" message

I will look into that, but not today anymore; it's late here. Please just use "pip install RealtimeSTT==0.2.41" for now if the current version causes problems. I recommend doing some tests with faster_whisper itself. I think you need to start measuring to find out what your transcription times with large-v2 really are.
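
A minimal timing sketch against faster_whisper directly (assumes a short recording saved as test.wav; model name and settings mirror the ones used in this thread):

import time
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

start = time.time()
segments, info = model.transcribe("test.wav", language="en", beam_size=5)
text = " ".join(segment.text for segment in segments)  # iterating the generator runs the transcription
print(f"Transcribed in {time.time() - start:.2f} s: {text}")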

@KoljaB
Owner

KoljaB commented Sep 26, 2024

Also, can you check what happened after your last commit?

Tons of stuff under the hood basically.

Can you post a full debug log with

import logging
logging.basicConfig(level=logging.DEBUG)

as the first lines, and AudioToTextRecorder called with level=logging.DEBUG? We might see what goes wrong then.

@KoljaB
Owner

KoljaB commented Sep 27, 2024

I tested the new version; I can start server.py here and also see the "RealtimeSTT initialized" message:

(venv) C:\Dev\Audio\RealtimeSTT\RealtimeSTT\example_browserclient>python server.py
Starting server, please wait...
Initializing RealtimeSTT...
RealtimeSTT initialized
Server started. Press Ctrl+C to stop the server.
Client connected
Sentence: Hey there, this is a little test.

Please start the server with full logging and paste it here, so hopefully we can see better what's going wrong.

@tictproject
Author

tictproject commented Sep 27, 2024

The new version on Linux doesn't work as expected; I used 0.2.41 as you suggested and it's working fine. I will provide logs a little bit later.
Also, is the time with the websocket server the same for you as with the direct microphone?

@tictproject
Author

2024-09-27.13-01-13.mov

Here are the logs with the new version. Also, I don't see the warnings regarding FFmpeg anymore.

2024-09-27.12-50-28.mov

Above is a video of the logs without websockets. So does it look like faster_whisper works slower than it should?

@KoljaB
Owner

KoljaB commented Sep 27, 2024

Hm, which one is the video where it does not work? Because in both videos the logging looks as if it works.

Here is a test of the browserclient on my system for comparison (sorry, my voice is very loud), so you can see how fast this should be in theory:

Browserclient.Test.mp4

Above is a video of the logs without websockets. So does it look like faster_whisper works slower than it should?

I don't think it's a faster_whisper issue, but it's hard to tell without knowing everything about your hardware and environment, and even then it might be difficult. You need to measure the transcription time to be sure. Maybe the browser client sends huge chunks, like recording over a longer time and then sending the chunk, idk. I'm no web developer.

@tictproject
Author

The first video. There is no "RealtimeSTT initialized" in the console and also no printing of sentences; check the end of the video. The client receives the data, but the behavior of the server is weird after the new push anyway.

Regarding speed, your transcription is also more than 1 second for short sentences. Mine is ~1.8, and in the second video's logs you can see that the faster_whisper transcription takes 1.5 seconds for 'Hello' or 'How are you', which is far from ideal, unfortunately.

@KoljaB
Owner

KoljaB commented Sep 27, 2024

The first video. There is no "RealtimeSTT initialized" in the console and also no printing of sentences,

The server log indicates that everything is okay there. Maybe the stdout routing that transfers log messages from the spawned processes back to the main process somehow fails on your system. It's hard to test here; I only have a Windows system. But it feels like it works in general, just doesn't print the results anymore.

@KoljaB
Owner

KoljaB commented Sep 27, 2024

I don't really see a reason why this would fail but maybe you can try uncommenting the stdout thread start in the AudioToTextRecorder to see if that helps:

        self.stdout_thread = threading.Thread(target=self._read_stdout)
        self.stdout_thread.daemon = True
        self.stdout_thread.start()

@KoljaB
Owner

KoljaB commented Sep 27, 2024

Please also remove or comment out this line in _transcription_worker; I feel it could maybe be messing up your main prints:
builtins['print'] = custom_print

@bilalshafim

bilalshafim commented Oct 14, 2024

Currently, if real-time transcription continues for a long time, the sentence grows very large, so processing it takes a lot of time even with CUDA. Is it possible to set something like a maximum_audio_duration to process smaller chunks instead of the whole thing?

I would agree with the original comment that processing all of it introduces delay as the session time increases.
As you can see in the logs below, at the 5-minute mark it processes all the chunks again.

Note: The logs do not show the complete section; it actually started at 00:00.000. I cannot retrieve the complete section or rerun another test at this time.

RealTimeSTT: faster_whisper - DEBUG - Processing segment at 02:39.200
RealTimeSTT: faster_whisper - DEBUG - Processing segment at 03:08.000
RealTimeSTT: faster_whisper - DEBUG - Processing segment at 03:32.560
RealTimeSTT: faster_whisper - DEBUG - Processing segment at 04:02.160
RealTimeSTT: faster_whisper - DEBUG - Processing segment at 04:26.800
RealTimeSTT: faster_whisper - DEBUG - Processing segment at 04:53.840
RealTimeSTT: faster_whisper - DEBUG - Processing segment at 05:10.480

After a minute or two of new audio, the previous audio is irrelevant and does not need to be processed again each time. Even though transcription is quite fast on my machine, this would cause a huge delay in long-running sessions.

Main model: medium
Realtime model: small

Slightly related issue (I might open a new issue for this): if I use the main model for realtime transcription (small, medium or even tiny), it does not transcribe. When not using the main model for realtime, the text received is always 'realtime', never 'fullSentence'.
