TinyModel addition #31804
Comments
It would be quite nice to add this using the new model adder that @ArthurZucker has contributed; @ArthurZucker, when back from leave (next week), do you mind sharing with @noanabeshima how to get this done the best way?
Hey! Sorry for the delay! Yep, my recommendation is to use the #30868 tool to isolate the changes as much as possible 🤗
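For illustration, and assuming the tool works in the "derive the new model from an existing one and override only the differences" spirit, a hypothetical sketch could reuse GPT-2's block and strip its layernorms. The file layout and class names below are assumptions, not the tool's actual output:

```python
# Hypothetical sketch only: reuse GPT-2's block and remove its layernorms.
from torch import nn
from transformers import GPT2Config
from transformers.models.gpt2.modeling_gpt2 import GPT2Block


class TinyModelBlock(GPT2Block):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx=layer_idx)
        # TinyModel has no layernorms, so replace GPT-2's with identity ops.
        self.ln_1 = nn.Identity()
        self.ln_2 = nn.Identity()


# Quick check with placeholder hyperparameters.
block = TinyModelBlock(GPT2Config(n_embd=64, n_head=4, n_inner=256), layer_idx=0)
```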
Hi @ArthurZucker, I am new to open-source contribution and I would like to contribute by adding this new model to the transformers library. Could you please point me to any references or previous PRs that were similar to this?
#29622 or #31659 are quite similar, and there is also https://huggingface.co/docs/transformers/en/add_new_model, which should help!
FYI @LysandreJik @ArthurZucker Hi, I have been trying to work on this issue and have created this model architecture from the source code at noanabeshima/tinymodel. Before proceeding further, I would like to confirm that this is the way forward and that we want to add this model.
See the implementation below:

```python
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.attention_head_size
        self.all_head_size = config.num_attention_heads * config.attention_head_size
        self.Q = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.K = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.V = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.O = nn.Linear(self.all_head_size, self.hidden_size, bias=False)  # TODO: Remove bias

    def forward(self, hidden_states):
        # hidden_states.shape (batch_size, seq_len, hidden_size)
        q, k, v = self.Q(hidden_states), self.K(hidden_states), self.V(hidden_states)
        # q.shape (batch_size, seq_len, all_head_size)
        q = q.reshape(*q.shape[:-1], self.num_attention_heads, self.attention_head_size)
        k = k.reshape(*k.shape[:-1], self.num_attention_heads, self.attention_head_size)
        v = v.reshape(*v.shape[:-1], self.num_attention_heads, self.attention_head_size)
        # q.shape (batch_size, seq_len, num_attention_heads, attention_head_size)
        q = q.transpose(-2, -3)
        k = k.transpose(-2, -3)
        v = v.transpose(-2, -3)
        # q.shape (batch_size, num_attention_heads, seq_len, attention_head_size)
        head_writeouts = F.scaled_dot_product_attention(
            q, k, v,
            is_causal=True,
        )
        # head_writeouts.shape (batch_size, num_attention_heads, seq_len, attention_head_size)
        head_writeouts = head_writeouts.transpose(-2, -3)
        # head_writeouts.shape (batch_size, seq_len, num_attention_heads, attention_head_size)
        head_writeouts = head_writeouts.reshape(*head_writeouts.shape[:-2], self.all_head_size)
        # head_writeouts.shape (batch_size, seq_len, all_head_size)
        attn_out = self.O(head_writeouts)
        # attn_out.shape (batch_size, seq_len, hidden_size)
        return attn_out


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.intermediate_size),
            nn.ReLU(),
            nn.Linear(self.intermediate_size, self.hidden_size),
        )

    def forward(self, hidden_states):
        # hidden_states.shape (batch_size, seq_len, hidden_size)
        return self.mlp(hidden_states)


class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = Attention(config)
        self.mlp = MLP(config)

    def forward(self, hidden_states):
        attention_out = self.attention(hidden_states)
        mlp_out = self.mlp(attention_out)
        return mlp_out
```
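For a quick sanity check of the draft above, here is a minimal smoke test; the config values are arbitrary placeholders, not TinyModel's actual hyperparameters:

```python
# Smoke test for the draft classes above; the config is an ad-hoc stand-in.
from types import SimpleNamespace

import torch

config = SimpleNamespace(
    hidden_size=64,
    num_attention_heads=4,
    attention_head_size=16,
    intermediate_size=256,
)

block = TransformerBlock(config)
hidden_states = torch.randn(2, 10, config.hidden_size)  # (batch_size, seq_len, hidden_size)
out = block(hidden_states)
print(out.shape)  # torch.Size([2, 10, 64])
```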
Hey!
One thing that is, however, fairly important is to follow the […]. Having a minimal implementation would be nice! I can help you by reviewing the PR for sure! You need to take a little bit of inspiration from […]. If your model is super small, for example, it makes sense not to have past key values. 🤗 Hope we can merge this and have a great example of a TinyModel to set good standards! 🤗
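To make that concrete, here is a rough sketch of how the draft modules above could be wrapped in transformers' PretrainedConfig/PreTrainedModel conventions, with a forward that simply omits past_key_values. All names and default values below are placeholders rather than the final API:

```python
# Rough sketch only: positional embeddings, the LM head, etc. are omitted, and
# every name/default here is a placeholder.
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import BaseModelOutput


class TinyModelConfig(PretrainedConfig):
    model_type = "tinymodel"  # assumed identifier

    def __init__(
        self,
        vocab_size=10_000,        # placeholder
        hidden_size=64,           # placeholder
        num_hidden_layers=4,      # placeholder
        num_attention_heads=4,    # placeholder
        attention_head_size=16,   # placeholder
        intermediate_size=256,    # placeholder
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.attention_head_size = attention_head_size
        self.intermediate_size = intermediate_size
        super().__init__(**kwargs)


class TinyModelModel(PreTrainedModel):
    config_class = TinyModelConfig

    def __init__(self, config):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        # TransformerBlock is the draft class from the comment above.
        self.blocks = nn.ModuleList(
            [TransformerBlock(config) for _ in range(config.num_hidden_layers)]
        )
        self.post_init()

    def forward(self, input_ids, **kwargs):
        # No past_key_values / cache handling, per the suggestion above.
        hidden_states = self.embed(input_ids)
        for block in self.blocks:
            hidden_states = block(hidden_states)
        return BaseModelOutput(last_hidden_state=hidden_states)
```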
Model description
https://github.com/noanabeshima/tiny_model
It's a small language model trained on TinyStories for interpretability, with sparse autoencoders and transcoders added. It has no layernorms (this helps with interpretability), which means it doesn't fit any existing model architecture in the transformers library. Its architecture is essentially GPT-2's, except that it doesn't have layernorms and it has an untied embed/deembed.
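For clarity, "untied embed/deembed" just means the unembedding is its own parameter matrix rather than a reuse of the embedding weights (GPT-2 ties them). A minimal illustrative snippet with placeholder sizes:

```python
# Minimal illustration of untied embed/deembed; sizes are placeholders.
import torch
from torch import nn

hidden_size, vocab_size = 64, 10_000
embed = nn.Embedding(vocab_size, hidden_size)
deembed = nn.Linear(hidden_size, vocab_size, bias=False)  # untied: independent weights

tokens = torch.randint(0, vocab_size, (2, 10))
logits = deembed(embed(tokens))  # no layernorm anywhere on this path
print(embed.weight.data_ptr() == deembed.weight.data_ptr())  # False -> untied
```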
Open source status
Provide useful links for the implementation
The implementation is here:
https://github.com/noanabeshima/tiny_model/blob/main/tiny_model/lm.py
The weights are here:
https://huggingface.co/noanabeshima/tiny_model/blob/main/tiny_model.pt
The default config corresponding to the weights is: […]
I am the author.