Clear the torch cuda cache after response #301
If a user is running 2.5 with int4, it can just barely fit into 8 GB of VRAM (an extremely common VRAM size) without using any shared memory. If they then switch from sampling to beam search with the default settings, the GPU will use more than 8 GB of VRAM and spill over into the shared pool, which slows generation down dramatically.
If the user then switches back to sampling mode to regain the lost speed, VRAM still holds the leftover allocations from beam search, so generation stays slow.
This PR simply purges the CUDA cache after each response, so changing between settings no longer leaves stale allocations in VRAM that keep you stuck at reduced speed.
EDIT: Updated to only run the command if device == "cuda"
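For reference, a minimal sketch of the idea (the function and variable names here are hypothetical placeholders, not the project's actual inference code; only the `torch.cuda.empty_cache()` call and the `device == "cuda"` guard reflect the change itself):

```python
import torch

def generate_response(model, inputs, device: str = "cuda"):
    # Hypothetical generation call standing in for the real inference path.
    output = model.generate(**inputs)

    # Release cached allocator blocks left over from the previous
    # sampling/beam-search configuration, so VRAM usage doesn't stay
    # inflated (and spilled into shared memory) after a settings change.
    # Guarded so the call is skipped on CPU or other non-CUDA devices.
    if device == "cuda":
        torch.cuda.empty_cache()

    return output
```

Note that `torch.cuda.empty_cache()` only frees memory PyTorch's caching allocator is holding but not using; it doesn't touch live tensors, so it's safe to run after every response.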