TinyModel addition #31804

Open · 2 tasks done
noanabeshima opened this issue Jul 5, 2024 · 6 comments
Comments

@noanabeshima commented Jul 5, 2024

Model description

https://github.com/noanabeshima/tiny_model

It's a small language model trained on TinyStories for interpretability research, with sparse autoencoders and transcoders added. It has no layernorms (which helps with interpretability), so it doesn't fit any existing model architecture in the transformers library. Its architecture is essentially GPT-2's, except that it has no layernorms and it has untied embed/deembed weights.

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

The implementation is here:
https://github.com/noanabeshima/tiny_model/blob/main/tiny_model/lm.py

The weights are here:
https://huggingface.co/noanabeshima/tiny_model/blob/main/tiny_model.pt

The default config corresponding to the weights is:

    d_model=768,
    n_layers=4,
    n_heads=16,
    max_seq_len=256,
    vocab_size=10_000

I am the author.
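
For reference, a minimal sketch of pulling that checkpoint down from the Hub (the repo and filename above are the only givens; the state-dict key names follow tiny_model's own lm.py, not any transformers convention):

import torch
from huggingface_hub import hf_hub_download

# Download the released weights; torch.load returns whatever object was saved
# (assuming a plain state dict here).
ckpt_path = hf_hub_download("noanabeshima/tiny_model", "tiny_model.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
print(list(state_dict)[:5])  # peek at the first few parameter names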

@LysandreJik (Member)

It would be quite nice to add this using the new model adder that @ArthurZucker has contributed; @ArthurZucker, when back from leave (next week), do you mind sharing with @noanabeshima how to get this done the best way?

@ArthurZucker (Collaborator)

Hey! Sorry for the delay! Yep, my recommendation is to use the #30868 tool to isolate the changes as much as possible 🤗

@vishwas-sharma2480

Hi @ArthurZucker, I am new to open-source contribution and I would like to help add this new model to the transformers library. Could you please point me to any references or previous PRs that were similar to this?

@ArthurZucker (Collaborator)

#29622 and #31659 are quite similar; there is also https://huggingface.co/docs/transformers/en/add_new_model which should help!

@geetu040 commented Oct 4, 2024

FYI @LysandreJik @ArthurZucker

Hi, I have been trying to work on this issue and have created this model architecture from the source code: noanabeshima/tinymodel

Before proceeding further, I would like to confirm that this is the way forward and that we want to add this model:

  • The model implementation looks really simple and straightforward. I am not sure there are any models this simple in the library.
  • Although the model was trained on data from TinyStories, there is no literature backing this model architecture specifically. @noanabeshima can confirm this.

See the implementation below

import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.attention_head_size
        self.all_head_size = config.num_attention_heads * config.attention_head_size

        self.Q = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.K = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.V = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.O = nn.Linear(self.all_head_size, self.hidden_size, bias=False) # TODO: Remove bias


    def forward(self, hidden_states):
        # hidden_states.shape (batch_size, seq_len, hidden_size)

        q, k, v = self.Q(hidden_states), self.K(hidden_states), self.V(hidden_states)
        # q.shape (batch_size, seq_len, all_head_size)

        q = q.reshape(*q.shape[:-1], self.num_attention_heads, self.attention_head_size)
        k = k.reshape(*k.shape[:-1], self.num_attention_heads, self.attention_head_size)
        v = v.reshape(*v.shape[:-1], self.num_attention_heads, self.attention_head_size)
        # q.shape (batch_size, seq_len, num_attention_heads, attention_head_size)

        q = q.transpose(-2, -3)
        k = k.transpose(-2, -3)
        v = v.transpose(-2, -3)
        # q.shape (batch_size, num_attention_heads, seq_len, attention_head_size)

        head_writeouts = F.scaled_dot_product_attention(
            q, k, v,
            is_causal=True,
        )
        # head_writeouts.shape (batch_size, num_attention_heads, seq_len, attention_head_size)

        head_writeouts = head_writeouts.transpose(-2, -3)
        # head_writeouts.shape (batch_size, seq_len, num_attention_heads, attention_head_size)

        head_writeouts = head_writeouts.reshape(*head_writeouts.shape[:-2], self.all_head_size)
        # head_writeouts.shape (batch_size, seq_len, all_head_size)

        attn_out = self.O(head_writeouts)
        # attn_out.shape (batch_size, seq_len, hidden_size)

        return attn_out


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size

        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.intermediate_size),
            nn.ReLU(),
            nn.Linear(self.intermediate_size, self.hidden_size),
        )

    def forward(self, hidden_states):
        # hidden_states.shape (batch_size, seq_len, hidden_size)
        return self.mlp(hidden_states)


class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.attention = Attention(config)
        self.mlp = MLP(config)

    def forward(self, hidden_states):
        # Residual connections around attention and MLP (GPT-2 style, just without layernorms).
        hidden_states = hidden_states + self.attention(hidden_states)
        hidden_states = hidden_states + self.mlp(hidden_states)
        return hidden_states
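
For context, a rough sketch of how these blocks could be wired into a full LM matching the description above (untied embed/deembed; learned positional embeddings are an assumption carried over from GPT-2, and the class/config field names are placeholders rather than a final transformers API):

class TinyModelLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Token embeddings plus (assumed) GPT-2-style learned positional embeddings.
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.pos_embed = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.blocks = nn.ModuleList(TransformerBlock(config) for _ in range(config.num_hidden_layers))
        # Untied unembedding, per the issue description (no weight sharing with self.embed).
        self.unembed = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids):
        # input_ids.shape (batch_size, seq_len)
        positions = torch.arange(input_ids.shape[-1], device=input_ids.device)
        hidden_states = self.embed(input_ids) + self.pos_embed(positions)
        for block in self.blocks:
            hidden_states = block(hidden_states)
        logits = self.unembed(hidden_states)
        # logits.shape (batch_size, seq_len, vocab_size)
        return logits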

@ArthurZucker (Collaborator)

Hey!

  1. Indeed, this is the "simplest" model there is! It would be a nice addition IMO!
  2. Not really a problem; we can link to a GitHub repo or simply a blog post!

One thing that is fairly important, however, is to follow the transformers API.
Most important is from_pretrained.

Having a minimal implementation would be nice! I can help you by reviewing the PR for sure!

You need to take a little bit of inspiration from modeling_llama.py but remove the extras.

If your model is super small, for example, it makes sense not to have past key values.
I would also ask whether the model supports padding or not!
It comes down to trimming things down to the essentials, which I am very much in favor of!

🤗 hope we can merge this and have a great example of a TinyModel to set good standards! 🤗
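
To make the from_pretrained point concrete, here is a rough sketch of the config side following the usual PretrainedConfig pattern (the class name, model_type string, and the intermediate_size default are placeholders, not a merged API):

from transformers import PretrainedConfig

class TinyModelConfig(PretrainedConfig):
    model_type = "tiny_model"  # placeholder; the real PR would register this properly

    def __init__(
        self,
        vocab_size=10_000,
        hidden_size=768,
        num_hidden_layers=4,
        num_attention_heads=16,
        max_position_embeddings=256,
        intermediate_size=3072,  # assumption: 4 * hidden_size, the GPT-2 default
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.max_position_embeddings = max_position_embeddings
        self.intermediate_size = intermediate_size
        super().__init__(**kwargs)

Pairing a config like this with a PreTrainedModel subclass for the modeling code is what makes from_pretrained/save_pretrained work out of the box.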
