ggml : synchronize threads using barriers #7993
Conversation
Is there a reason it is not using mutexes?
Since these waits are often very short, spinlocks resulted in better performance. OpenMP is also likely spinning for a while before sleeping.
It's very hard to get faster than most mutex implementations, since they use staged strategies: most will spin for a while and then switch to something else... was this actually measured?
There were several PRs to change the synchronization mechanism in the past, but they always resulted in worse performance.
The fallback seems to perform ok on the M3 Max.
The server thread sanitizer run failed the first time; I don't understand the error and cannot reproduce it locally, and the second run passed. It may have been a fluke, but it's something to look at if it continues failing intermittently after this is merged.
With MSVC:
I was surprised that the performance with MSVC had fallen so much behind, and I found that the reason is llamafile. MSVC without llamafile:
I noticed that the thread sanitizer workflow also failed last week, and it reported a data race in that case. So maybe it's not necessarily related to the changes from this PR.
That one seems to be caused by using OpenMP. Thread sanitizer is not directly compatible with OpenMP and produces false positives. I changed the workflow to avoid using OpenMP with thread sanitizer, so that shouldn't happen anymore. It was already done in build.yml in the initial PR, but I missed the server CI.
Results on M1 Pro and M2 Ultra:
LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 ./scripts/compare-commits.sh master sl/test-omp-sync -m models/tinyllama-1b-chat/ggml-model-f16.gguf -m models/tinyllama-1b-chat/ggml-model-q8_0.gguf -m models/tinyllama-1b-chat/ggml-model-q4_0.gguf -p 128 -n 64 -t 4,8
M2 Ultra seems to respond better to the changes.
Does it help if the
After removing
The results with the M1 Pro seem to vary quite a bit between runs; in some cases the difference between master in the previous run and this one is larger than the difference between master and the PR, even though it is the same test.
Hm, I might be doing something wrong in the benchmarks. I now ran:
LLAMA_NO_METAL=1 LLAMA_NO_ACCELERATE=1 LLAMA_NO_LLAMAFILE=1 make -j && ./llama-bench -m models/tinyllama-1b-chat/ggml-model-f16.gguf -m models/tinyllama-1b-chat/ggml-model-q8_0.gguf -m models/tinyllama-1b-chat/ggml-model-q4_0.gguf -p 128 -n 64 -t 4,8
build: 4d8a0c2 (3185)
I very often have a lot of trouble getting reproducible results on the M3 Max, and I think it is due to thermal throttling. Increasing the number of repetitions with
Indeed, increasing the repetitions to -r 20 makes the results consistent:
| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/test-omp-sync | Speedup |
|---|---|---|---|---|---|---|---|
| M1 Pro | llama 1B F16 | 2.05 | 4 | pp128 | 113.01 | 112.20 | 0.99 |
| M1 Pro | llama 1B F16 | 2.05 | 4 | tg64 | 45.01 | 45.32 | 1.01 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | pp128 | 176.56 | 180.62 | 1.02 |
| M1 Pro | llama 1B F16 | 2.05 | 8 | tg64 | 38.65 | 49.41 | 1.28 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | pp128 | 132.44 | 131.86 | 1.00 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 4 | tg64 | 91.38 | 92.86 | 1.02 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | pp128 | 210.17 | 212.74 | 1.01 |
| M1 Pro | llama 1B Q4_0 | 0.59 | 8 | tg64 | 118.79 | 125.84 | 1.06 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | pp128 | 151.81 | 152.35 | 1.00 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 4 | tg64 | 71.58 | 73.54 | 1.03 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | pp128 | 246.67 | 249.50 | 1.01 |
| M1 Pro | llama 1B Q8_0 | 1.09 | 8 | tg64 | 56.88 | 83.92 | 1.48 |
Initially I removed the updates of
It's quite obsolete at this point, so it's fine to remove it. PR welcome.
I will remove it as part of a refactor to remove the initialize and finalize tasks in favor of using barriers in the ops.
Since it happened again in #8018, I am trying to understand what this failure means.
Yes, it seems to me that the
Replaces the spinlocks in the CPU backend with barriers, using OpenMP if possible.