
Performance on Apple M1 Max #11

Open
certik opened this issue Mar 4, 2023 · 3 comments

certik commented Mar 4, 2023

I am using the latest main (409c640) plus the following patch, which makes both PyTorch and fast_gpt2 run exactly the same model and prompt, generate 20 tokens each, and use no CUDA in either:

diff --git a/src/lib.rs b/src/lib.rs
index 367e2ca..9eb9347 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -87,7 +87,7 @@ pub async fn run() -> Result<(), Gpt2Error> {
     #[cfg(not(feature = "dfdx"))]
     let gpt2 = Gpt2::from_tensors(&tensors, num_heads);
 
-    let string = "My name is";
+    let string = "Alan Turing theorized that computers would one day become very powerful, but even he could not imagine";
 
     let encoded = tokenizer.encode(string, false).unwrap();
     println!("Loaded & encoded {:?}", start.elapsed());
@@ -101,7 +101,7 @@ pub async fn run() -> Result<(), Gpt2Error> {
     let mut current_ids = ids.clone();
     #[cfg(feature = "cuda")]
     profiler_start()?;
-    for _i in 0..10 {
+    for _i in 0..20 {
         // println!("-------------");
         let start = std::time::Instant::now();
         let new_id = gpt2.forward(&current_ids, &mut past_key_values);
diff --git a/test.py b/test.py
index 608b4cf..5405733 100644
--- a/test.py
+++ b/test.py
@@ -4,7 +4,7 @@ start = datetime.datetime.now()
 import torch
 
 print(f"Loaded torch {datetime.datetime.now() - start}")
-torch.zeros((2, 2)).cuda()
+torch.zeros((2, 2))
 print(f"Loaded torch (cuda) {datetime.datetime.now() - start}")
 
 
@@ -13,12 +13,12 @@ from transformers import pipeline
 print(f"Loaded transformers {datetime.datetime.now() - start}")
 
 
-pipe = pipeline(task="text-generation", model="gpt2-large", do_sample=False, device=0)
-pipe.model.config.max_length = None
+pipe = pipeline(task="text-generation", model="gpt2", do_sample=False)
+#pipe.model.config.max_length = None
 print(f"Loaded in {datetime.datetime.now() - start}")
 inf_start = datetime.datetime.now()
-new_tokens = 10
-out = pipe("My name is", max_length=3 + new_tokens)
+new_tokens = 20
+out = pipe("Alan Turing theorized that computers would one day become very powerful, but even he could not imagine", max_new_tokens=new_tokens)
 print(f"Tokens: {(datetime.datetime.now() - inf_start)/new_tokens}/tokens")
 print(f"Inference took: {(datetime.datetime.now() - inf_start)}")
 print(out)

Here is what I got for fast_gpt2:

$ cargo run --example run --release    
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/run`
Safetensors 1.86ms
Tokenizer 31.226958ms
Loaded & encoded 461.879041ms
Loop in 156.600333ms
Loop in 80.137333ms
Loop in 80.596916ms
Loop in 81.4075ms
Loop in 79.844708ms
Loop in 81.373583ms
Loop in 82.741458ms
Loop in 107.9175ms
Loop in 83.611083ms
Loop in 80.898125ms
Loop in 84.577875ms
Loop in 84.253166ms
Loop in 84.087083ms
Loop in 85.110708ms
Loop in 85.1405ms
Loop in 84.291708ms
Loop in 84.722125ms
Loop in 84.515916ms
Loop in 84.030916ms
Loop in 84.704333ms
Result Ok("Alan Turing theorized that computers would one day become very powerful, but even he could not imagine how they would be able to do so.\n\n\"I think that the most important thing is")
Total Inference 2.222943541s

And PyTorch (installed from conda-forge):

$ TRANSFORMERS_OFFLINE=1 python test.py
Loaded torch 0:00:00.359938
Loaded torch (cuda) 0:00:00.360043
Loaded transformers 0:00:02.340165
Loaded in 0:00:04.140099
/Users/ondrej/mambaforge/envs/pico/lib/python3.9/site-packages/transformers/generation/utils.py:1186: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Tokens: 0:00:00.040217/tokens
Inference took: 0:00:00.804370
[{'generated_text': 'Alan Turing theorized that computers would one day become very powerful, but even he could not imagine how they would be able to do so.\n\n"I think that the most important thing is'}]
Ran in 0:00:04.944507

So fast_gpt2 runs in 2.2s, and PyTorch in 0.8s.

In order to speed up fast_gpt2, we can use the fast matrix-matrix multiply from the Accelerate library, as shown in #10 (comment).
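
For anyone curious what that looks like, here is a minimal sketch of routing a matmul through Accelerate's CBLAS on macOS. This is not fast_gpt2's actual build.rs or matmul code; the matmul helper is hypothetical, the link line is the usual way to pull an Apple framework into a build script, and the extern declaration follows the standard cblas_sgemm signature.

// build.rs (sketch): link Apple's Accelerate framework, which exposes CBLAS.
fn main() {
    #[cfg(target_os = "macos")]
    println!("cargo:rustc-link-lib=framework=Accelerate");
}

// Sketch of a row-major C = A * B using the standard cblas_sgemm symbol.
use std::os::raw::{c_float, c_int};

extern "C" {
    // Standard CBLAS single-precision GEMM (enums passed as their integer values).
    fn cblas_sgemm(
        order: c_int,    // 101 = CblasRowMajor
        trans_a: c_int,  // 111 = CblasNoTrans
        trans_b: c_int,
        m: c_int, n: c_int, k: c_int,
        alpha: c_float, a: *const c_float, lda: c_int,
        b: *const c_float, ldb: c_int,
        beta: c_float, c: *mut c_float, ldc: c_int,
    );
}

/// C (m x n) = A (m x k) * B (k x n), all row-major and contiguous.
pub fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    unsafe {
        cblas_sgemm(
            101, 111, 111,
            m as c_int, n as c_int, k as c_int,
            1.0, a.as_ptr(), k as c_int, // lda = row stride of A
            b.as_ptr(), n as c_int,      // ldb = row stride of B
            0.0, c.as_mut_ptr(), n as c_int,
        );
    }
}

The cblas feature used in the next comment presumably does something along these lines; the point is simply that the heavy lifting goes to Accelerate's tuned GEMM instead of a naive loop.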

certik commented Mar 4, 2023

With cblas I get:

$ cargo run --example run --release --features cblas
  Downloaded cblas-sys v0.1.4
  Downloaded 1 crate (4.6 KB) in 1.06s
   Compiling fast_gpt2 v0.1.0 (/Users/ondrej/repos/fast_gpt2)
   Compiling cblas-sys v0.1.4
warning: variants `Static` and `Dynamic` are never constructed
 --> build.rs:6:5
  |
5 | enum Library {
  |      ------- variants in this enum
6 |     Static,
  |     ^^^^^^
7 |     Dynamic,
  |     ^^^^^^^
  |
  = note: `#[warn(dead_code)]` on by default

warning: `fast_gpt2` (build script) generated 1 warning
    Finished release [optimized] target(s) in 29.62s
     Running `target/release/examples/run`
Safetensors 855.333µs
Tokenizer 29.599416ms
Loaded & encoded 452.78475ms
Loop in 47.89675ms
Loop in 18.986791ms
Loop in 14.83675ms
Loop in 14.925958ms
Loop in 14.872541ms
Loop in 14.980875ms
Loop in 14.907541ms
Loop in 14.824458ms
Loop in 14.939708ms
Loop in 15.042333ms
Loop in 14.968833ms
Loop in 14.993291ms
Loop in 14.957375ms
Loop in 14.921791ms
Loop in 15.023041ms
Loop in 15.006166ms
Loop in 15.037291ms
Loop in 15.01775ms
Loop in 15.131791ms
Loop in 15.20075ms
Result Ok("Alan Turing theorized that computers would one day become very powerful, but even he could not imagine how they would be able to do so.\n\n\"I think that the most important thing is")
Total Inference 789.519958ms

So with cblas enabled, fast_gpt2 runs in 0.79s, on par with PyTorch's 0.8s.

certik commented Mar 4, 2023

@Narsil is the main speedup coming from caching the past key values in forward:

pub fn forward(&self, ids: &[u32], past: &mut PastKeyValues) -> usize

Or are there other optimizations?
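
For context, here is a schematic of what a past-key-value cache buys during generation; the types below are made up for illustration and are not fast_gpt2's actual PastKeyValues.

// Schematic KV cache for one attention layer (hypothetical types).
struct LayerKvCache {
    keys: Vec<Vec<f32>>,   // one key vector per token seen so far
    values: Vec<Vec<f32>>, // one value vector per token seen so far
}

impl LayerKvCache {
    // Each generation step only projects the NEW token and appends its K/V,
    // instead of recomputing keys/values for the whole prefix.
    fn append(&mut self, k_new: Vec<f32>, v_new: Vec<f32>) {
        self.keys.push(k_new);
        self.values.push(v_new);
    }

    // The new token's query attends over all cached keys/values, so step t
    // costs O(t * d) per layer rather than re-running the full prefix.
    // (Scaling by 1/sqrt(d) is omitted; query/key/value share dimension d.)
    fn attend(&self, q_new: &[f32]) -> Vec<f32> {
        let scores: Vec<f32> = self
            .keys
            .iter()
            .map(|k| k.iter().zip(q_new).map(|(a, b)| a * b).sum())
            .collect();
        // softmax over the scores
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let total: f32 = weights.iter().sum();
        // weighted sum of the cached values
        let mut out = vec![0.0; q_new.len()];
        for (w, v) in weights.iter().zip(&self.values) {
            for (o, x) in out.iter_mut().zip(v) {
                *o += (w / total) * x;
            }
        }
        out
    }
}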

Narsil commented Mar 4, 2023

Both Python and this crate are using the past key values.

The main "optimization" is using a single kernel for the gelu (which you can optimize easily by wrapping the new_gelu function in transformers with @torch.jit.script.

However, this crate is still faster; CPU overhead is a real thing.
Also, the main focus of this part is load speed, not really runtime inference.
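
To make the single-kernel gelu point concrete, here is a rough Rust sketch (not necessarily fast_gpt2's actual code) of the tanh-approximation GELU that transformers' new_gelu computes, done in one pass over the activations; eager PyTorch evaluates the same formula as several separate tensor ops with intermediate allocations, which is what @torch.jit.script fuses away.

// Fused tanh-approximation GELU, applied in a single pass over the buffer.
fn gelu_inplace(x: &mut [f32]) {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56; // sqrt(2/pi)
    for v in x.iter_mut() {
        let x3 = *v * *v * *v;
        *v = 0.5 * *v * (1.0 + (SQRT_2_OVER_PI * (*v + 0.044715 * x3)).tanh());
    }
}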
