M2 Ultra 192GB running the 1.5-bit quantized DeepSeek R1 matches H100x2 for inference
Numbers are from the llama.cpp author: https://github.com/ggerganov/llama.cpp/issues/11474
M2 Ultra: 13.88 token/s
H100x2: 11.53 token/s. Tim Cook must be thrilled.
—— from S1Next-鹅版 v2.5.2 on OPPO PGFM10, Android 14

It's just the memory-capacity advantage, isn't it?
—— from 鹅球 v3.3.96-alpha

h100*2:
Eg - 137.66 tok/s for prompt processing and 10.69 tok/s for decoding:
The prompt processing reported by llama-bench is only 23t/s which is quite low, but the Metal backend is very poorly optimized for MoE, so maybe it can be improved a bit. Also, currently we have to disable FA because of the unusual shapes of the tensors in the attention which can also be improved.
Prompt processing speed differs by about six times.
Decoding speed is actually fine though.
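For anyone wanting to reproduce numbers like these, they typically come from llama.cpp's llama-bench tool; a minimal sketch, assuming a local GGUF file (the model path and sizes here are placeholders, not the exact command from the issue):

    llama-bench -m DeepSeek-R1-UD-IQ1_M.gguf -p 512 -n 128 -fa 0

Here -p 512 measures prompt processing (the pp512 row, i.e. the ~23 t/s figure on the M2 Ultra), -n 128 measures token generation (the tg128 row, i.e. decoding), and -fa 0 keeps flash attention disabled, as mentioned above for this model's attention tensor shapes.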
I saw someone on Bilibili running the full-size R1 on 8×H100, getting 24 token/s output, but GPU utilization was only around 15%.

It can even run straight off an SSD:
I have also tested it at 1.73-bit (158GB):
NVIDIA GeForce RTX 3090 + AMD Ryzen 9 5900X + 64GB RAM (DDR4 3600 XMP)
llama_perf_sampler_print: sampling time = 33.60 ms / 512 runs (0.07 ms per token, 15236.28 tokens per second)
llama_perf_context_print: load time = 122508.11 ms
llama_perf_context_print: prompt eval time = 5295.91 ms / 10 tokens (529.59 ms per token, 1.89 tokens per second)
llama_perf_context_print: eval time = 355534.51 ms / 501 runs (709.65 ms per token, 1.41 tokens per second)
llama_perf_context_print: total time = 360931.55 ms / 511 tokens
It's amazing!!! Running DeepSeek-R1-UD-IQ1_M, a 671B model, with 24GB of VRAM.
EDIT: 7 layers offloaded.
https://old.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/
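For context, the "7 layers offloaded" above is the usual llama.cpp partial-offload setup via the --n-gpu-layers (-ngl) flag; a rough sketch, assuming the quant file named in the Reddit post (file name, context size, and prompt are placeholders, not the poster's exact command):

    llama-cli -m DeepSeek-R1-UD-IQ1_M.gguf -ngl 7 -c 2048 -p "..."

With -ngl 7, only seven transformer layers live in the 3090's 24GB of VRAM; the remaining weights stay in system RAM and are memory-mapped from disk, which is why this kind of setup can run straight off an SSD, at the cost of the roughly 1-2 t/s speeds shown above.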
Only 1.5B? What about the bigger ones?
—— from 鹅球 v3.3.96

Waiting for Strix Halo.
—— from 鹅球 v3.3.96

宏. posted on 2025-1-30 23:34:
It can even run straight off an SSD
Saw those ** four digits, "3090".

0WHan0 posted on 2025-1-31 00:01:
Only 1.5B? What about the bigger ones?
—— from 鹅球 v3.3.96
It's the 1.5-bit quantized version, not the 1.5B-parameter distilled model.
I wonder if we'll get a 512GB or even 1TB RAM version of the M4 Ultra this year. Hope Cook wises up; if that happens, the price/performance would blow Nvidia away, and deploying an unquantized 100B-parameter model locally on a single machine would no longer be a dream.

Is the output quality still any good after this much quantization?

mimighost posted on 2025-1-31 17:51:
Is the output quality still any good after this much quantization?
I asked it two or three questions. It feels like a much slower QwQ (30 t/s down to 2 t/s on the same machine), and the brain teasers QwQ gets wrong it still gets wrong.

8×3090 would have enough VRAM; you could rent a GPU cloud server to try it.