[$$ BOUNTY] Add Qwen 1.5 (0.5B) model to TT-Buda Model Demos #20
Comments
@Shubhamsaboo I want to have a crack at this, but I don't have any Tenstorrent hardware. Can I still do it? Alternatively, do you provide dev grants so I can test it on your hardware?
Update: I am working with @JonathanALevine to get this done for Qwen, since he has a Grayskull we can test it on.
Hey all. So I do not have a Grayskull, but I've been able to get the model to compile with decent results running with […]. PyTorch can use seemingly any tensor as a mask for […]. So the big problem is that Qwen uses the lowest finite value a 32-bit float can represent as its mask filler. The way TVM decomposes […]. So I've solved this by editing the decomposition of […].
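For readers without the thread's original snippets, here is a minimal PyTorch sketch of why that mask filler is fragile. It is illustrative only: the shapes are made up and the additive decomposition shown is an assumption, not the actual decomposition PyBuda/TVM uses.

```python
import torch

# Qwen fills masked attention positions with the lowest finite float32 value.
mask_fill = torch.finfo(torch.float32).min   # ~ -3.4028e38

scores = torch.randn(1, 2, 4, 4)                  # made-up attention scores
mask = torch.tensor([True, False, False, False])  # positions to mask out

# In float32, masked_fill behaves as expected:
masked = scores.masked_fill(mask, mask_fill)

# But the fill value does not survive a bfloat16 (Float16_b) cast:
print(torch.tensor(mask_fill).to(torch.bfloat16))   # -inf

# And if the op were decomposed additively (scores + mask * fill), an infinite
# fill value would turn unmasked positions into NaN via 0 * inf:
decomposed = scores + mask.float() * float("-inf")
print(torch.isnan(decomposed).any())                 # True
```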
I will admit that […]. The reason I […]. I've also edited the decomposition of […].
The model fully compiles and […] through net2pipe, as I've set […].
I am unsure where the model goes wrong at the end there. I'm confident it's a separate issue though. For all I know this might be a bug with devmode and it will work just fine on a Grayskull. Good luck to whoever picks this up!
👀
@JushBJJ and I took a look at this and I think this is on the right track. Some comments below on implementation and findings running this on a Grayskull e150:
Takes a while to get an output, and I get this:
During inference I think the grayskull is active as I see this in sensors:
However it is also curious because I see flags like this:
Suggesting all compute is taking place on the CPU. Any comments on this? @Lewis300 @Shubhamsaboo
Ran this again setting […].
Ah, so the log messages you are seeing involving cpu_fallback are just for a few operations near the start and end of the model which cannot execute on the device and must be run on the CPU. The vast majority of the heavy lifting is done on the TTDevice (Grayskull). However, when you use PYBUDA_DEVMODE=1 it may actually run the TTDevice code on your CPU as well; I'm unsure. It's best to go without it if you have the card. If running the exact same code with PYBUDA_DEVMODE=1 is what gave you the better answer there, then you might have to tinker with some other env variables, default data format, amp level, etc. I wonder if the Float16_b conversion of the min float32 mask filler is causing problems…
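For reference, a sketch of the kind of environment and compiler tweaks suggested above. PYBUDA_DEVMODE comes from this thread; the compiler-config attribute names (default_df_override, amp_level) follow the pattern used elsewhere in the tt-buda demos and are assumptions that may differ between releases.

```python
import os
import pybuda

# Running on the actual card: make sure PYBUDA_DEVMODE is not set.
os.environ.pop("PYBUDA_DEVMODE", None)

# Assumed knobs for default data format and AMP level, mirroring the style of
# other TT-Buda demos; verify these names against your pybuda version.
compiler_cfg = pybuda.config._get_global_compiler_config()
compiler_cfg.default_df_override = pybuda._C.DataFormat.Float16_b
compiler_cfg.amp_level = 1
```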
Removed this: […]. Seems to have fixed the output. This is what I get: […]
Updating PR and adding new patches.
@Lewis300 thanks for that.
@JonathanALevine You were able to get that output without PYBUDA_DEVMODE?
That's correct!
@JushBJJ @JonathanALevine Addressing your slowness concerns... CPU fallback is necessary for performing the embeddings […].
Hopefully this helps! PS: Look for the […]
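As an illustration of the front-of-model fallback described above, here is a plain PyTorch sketch of the embedding lookup that stays on CPU. The sizes are only approximately Qwen 1.5 (0.5B) and the variable names are made up.

```python
import torch

vocab_size, hidden = 151936, 1024            # roughly Qwen 1.5 0.5B dimensions (assumed)
embed_tokens = torch.nn.Embedding(vocab_size, hidden)

input_ids = torch.tensor([[100, 200, 300]])  # arbitrary token ids

# The embedding is an integer gather over a large table; this is the kind of
# front-of-model op that shows up as cpu_fallback in the logs, while the
# transformer blocks that consume hidden_states run on the TTDevice.
hidden_states = embed_tokens(input_ids)      # shape: (1, 3, 1024)
print(hidden_states.shape)
```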
For me, I've noticed that moving the result between the TT section of the model and the tail-end CPU fallback section takes some time. Moreover, the tail-end CPU fallback is a matmul and a reshape, both of which could be run on the Grayskull. They get put on the CPU because the weights of that matmul are the embedding weights. I see how you'd want them on the same device for training purposes, but for inference it seems to me that it would be alright to just clone them onto the device. Go to […]. This ends up removing the tail-end CPU fallback entirely for this model and, with that, the delay. Maybe adding shared-weight operations to fallback could be made a compiler configuration option. I guess that would be for TT to figure out.
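To show why that tail-end matmul shares the embedding weights in the first place, here is a plain PyTorch weight-tying sketch. It is illustrative only and is not the compiler change described above; the dimensions are assumed.

```python
import torch

vocab_size, hidden = 151936, 1024            # roughly Qwen 1.5 0.5B dimensions (assumed)
embed_tokens = torch.nn.Embedding(vocab_size, hidden)
lm_head = torch.nn.Linear(hidden, vocab_size, bias=False)

# Tie the output projection to the embedding table. Because the same tensor is
# used by the (CPU-resident) embedding op, the final matmul gets grouped with
# it and falls back to CPU unless the weights are cloned onto the device.
lm_head.weight = embed_tokens.weight

hidden_states = torch.randn(1, 3, hidden)
logits = lm_head(hidden_states)              # the tail-end matmul discussed above
print(logits.shape)                          # (1, 3, 151936)
```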
To summarize a bit what was discovered today: (1) […]
@JushBJJ can you link your new PR for adding Qwen 0.5 to this issue?
PR for Qwen: #37
Claimed by @JushBJJ. It will be closed once it's merged into main. Congrats Jush!
Closing this one as merged to main. Congrats again @JushBJJ. |
Background:
TT-Buda Model Demos, developed by Tenstorrent, is a growing collection of demos showcasing the capabilities of AI models running on Tenstorrent hardware. These demonstrations cover a wide range of applications and aim to provide insight and inspiration for developers and researchers interested in advanced AI implementations.
Bounty Objective:
We are excited to announce a bounty for contributing a new AI model demonstration to the TT-Buda repository. This is an opportunity for AI enthusiasts, researchers, and developers to showcase their skills, contribute to cutting-edge AI research, and earn rewards.
Task Details:
Integrate Qwen-1.5 (0.5B) into the TT-Buda model demonstrations.
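As a minimal starting-point sketch for the integration, the snippet below loads the model with Hugging Face Transformers and wraps it for PyBuda. The checkpoint id Qwen/Qwen1.5-0.5B and the PyTorchModule/run_inference pattern are assumptions borrowed from other demos in the repository; verify both against the current code and API.

```python
import pybuda
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face checkpoint id for Qwen 1.5 (0.5B).
checkpoint = "Qwen/Qwen1.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("My name is Thomas and my main", return_tensors="pt")

# Wrap-and-run pattern borrowed from existing TT-Buda model demos; the exact
# pybuda API surface may differ between releases.
tt_model = pybuda.PyTorchModule("pt_qwen1_5_causal_lm", model)
output_q = pybuda.run_inference(
    tt_model, inputs=[(inputs["input_ids"], inputs["attention_mask"])]
)
output = output_q.get()
print(output)
```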
Requirements:
Contribution Guidelines:
- Add your model demo to the model_demos folder, following the naming convention: model_yourModelName.
- Follow the guidelines in the CONTRIBUTING.md file.
Evaluation Criteria:
Rewards:
Contributions will be evaluated by the Tenstorrent team, and the best contribution will be eligible for the $500 bounty.
Get Started with Grayskull DevKit
Dive into AI development with the Grayskull DevKit, your gateway to exploring Tenstorrent's hardware. Paired with the TT-Buda and TT-Metalium software approaches, it offers a solid foundation for AI experimentation. Secure your kit here.
Connect on Discord
Join our Discord to talk AI, share your journey, and get support from the Tenstorrent community and team. Let's innovate together!