M2 Ultra 192GB ties H100x2 at inference on the DeepSeek R1 1.5B quantized version

tonyunreal · posted 2025-1-30 18:47

The numbers come from the author of llama.cpp:
https://github.com/ggerganov/llama.cpp/issues/11474

M2 Ultra: 13.88 token/s
H100x2: 11.53 token/s

小妻水亚美 · posted 2025-1-30 18:55

Tim Cook must be overjoyed


niubility · posted 2025-1-30 19:03

So it really comes down to the memory capacity advantage.


qqks · posted 2025-1-30 20:54

H100x2:

Eg - 137.66 tok / s for prompt processing and 10.69 tok / s for decoding:

The prompt processing reported by llama-bench is only 23t/s which is quite low, but the Metal backend is very poorly optimized for MoE, so maybe it can be improved a bit. Also, currently we have to disable FA because of the unusual shapes of the tensors in the attention which can also be improved.

That is a sixfold gap in prompt processing speed.
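
A quick check of the "sixfold" figure, using only numbers quoted in this thread: 137.66 tok/s prompt processing on H100x2, about 23 t/s on the M2 Ultra, and the decoding speeds from the first post. The variable names are mine and purely illustrative.

```python
# Throughput figures quoted in this thread, in tokens per second.
h100_prompt = 137.66  # H100x2 prompt processing
m2_prompt = 23.0      # M2 Ultra prompt processing per llama-bench ("only 23t/s")
m2_decode = 13.88     # M2 Ultra decoding, from the first post
h100_decode = 11.53   # H100x2 decoding, from the first post

prompt_ratio = h100_prompt / m2_prompt
decode_ratio = m2_decode / h100_decode

print(f"prompt processing: H100x2 is {prompt_ratio:.1f}x faster")
print(f"decoding: M2 Ultra is {decode_ratio:.2f}x faster")
```

So the two machines are close on decoding, while the gap sits almost entirely in prompt processing.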

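wuuuuuud · posted 2025-1-30 22:57

The processing speed is actually not too bad.

I saw someone on Bilibili run the full-size R1 on 8x H100: output was 24 token/s, but GPU utilization was only around 15%.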

宏. · posted 2025-1-30 23:34

It can even run straight off an SSD

I have tested it also 1.73bit (158GB):
NVIDIA GeForce RTX 3090 + AMD Ryzen 9 5900X + 64GB ram (DDR4 3600 XMP)
llama_perf_sampler_print: sampling time = 33,60 ms / 512 runs ( 0,07 ms per token, 15236,28 tokens per second)
llama_perf_context_print: load time = 122508,11 ms
llama_perf_context_print: prompt eval time = 5295,91 ms / 10 tokens ( 529,59 ms per token, 1,89 tokens per second)
llama_perf_context_print: eval time = 355534,51 ms / 501 runs ( 709,65 ms per token, 1,41 tokens per second)
llama_perf_context_print: total time = 360931,55 ms / 511 tokens
It's amazing !!! running DeepSeek-R1-UD-IQ1_M, a 671B with 24GB VRAM.
EDIT: 7 layers offloaded.


https://old.reddit.com/r/LocalLL ... xxs_200gb_from_ssd/
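
For anyone who wants to sanity-check the quoted log, here is a minimal sketch that recomputes tokens per second from the elapsed time and token count on each line. The regular expression and the helper name `tokens_per_second` are my own, and it assumes the comma is a decimal separator, as the log's locale suggests.

```python
import re

# Two timing lines copied from the quoted log (note the comma decimal separators).
log = """\
llama_perf_context_print: prompt eval time = 5295,91 ms / 10 tokens ( 529,59 ms per token, 1,89 tokens per second)
llama_perf_context_print: eval time = 355534,51 ms / 501 runs ( 709,65 ms per token, 1,41 tokens per second)
"""

# Capture the label, the elapsed milliseconds and the token/run count.
pattern = re.compile(r"print:\s+(.+?) time = ([\d,]+) ms / (\d+) (?:tokens|runs)")

def tokens_per_second(text):
    """Return {label: tokens per second} recomputed from time and count."""
    rates = {}
    for label, ms, count in pattern.findall(text):
        seconds = float(ms.replace(",", ".")) / 1000.0
        rates[label] = int(count) / seconds
    return rates

rates = tokens_per_second(log)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f} tokens/s")
```

The recomputed rates agree with the figures printed in the log: roughly 1.89 tokens/s for prompt evaluation and 1.41 tokens/s for generation.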

0WHan0 · posted 2025-1-31 00:01

Only 1.5b? What about the bigger ones?


魔法师lain · posted 2025-1-31 00:30

Waiting for Strix Halo


jeokeo · posted 2025-1-31 12:23

宏. posted 2025-1-30 23:34
It can even run straight off an SSD

And there they are, those ** four digits: 3090

断片集 · posted 2025-1-31 13:35

0WHan0 posted 2025-1-31 00:01
Only 1.5b? What about the bigger ones?


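It's the 1.5-bit quantized version, not the 1.5B-parameter distilled version.
I wonder whether we'll get an M4 Ultra with 512GB or even 1TB of memory this year. Hopefully Cook wises up; if it really happens, it would crush Nvidia on price-performance, and deploying unquantized hundred-billion-parameter models locally on a single machine would no longer be a dream.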
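
To put rough numbers on the memory sizes being discussed, here is a back-of-the-envelope sketch of weight storage as parameters times bits per weight. It assumes a single uniform bit width and counts only the weights (no KV cache or runtime overhead), so treat it as an estimate; the function name `weight_size_gb` is hypothetical.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

r1_params = 671e9  # DeepSeek R1 parameter count, as quoted earlier in the thread

for bits in (1.5, 8, 16):
    print(f"671B at {bits} bits/weight: ~{weight_size_gb(r1_params, bits):.0f} GB")

# A 100B-parameter model stored at 16 bits per weight.
print(f"100B at 16 bits/weight: ~{weight_size_gb(100e9, 16):.0f} GB")
```

Under these assumptions a 1.5-bit 671B model needs on the order of 126 GB, which is why it fits in 192GB, while a 100B model at 16 bits needs about 200 GB. Real files can come out larger than the uniform estimate: uniform 1.73 bits would give about 145 GB, whereas the 1.73-bit file quoted above is 158GB, presumably because not every tensor is stored at the same bit width.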

mimighost · posted 2025-1-31 17:51

Can the output quality still be any good after quantization like this?

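yanjunle · posted 2025-2-1 15:48

mimighost posted 2025-1-31 17:51
Can the output quality still be any good after quantization like this?

I asked it two or three questions, and it feels like a very slow QwQ (30 t/s → 2 t/s on the same machine). The brain teasers QwQ gets wrong, it still gets wrong.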

龙骑士尹志平 · posted 2025-2-1 16:31

8x 3090 has enough VRAM; you could rent a GPU cloud server and try it.
