M2 Ultra 192GB running the 1.5-bit quantized DeepSeek R1 matches H100x2 for inference
Numbers are from the llama.cpp author: https://github.com/ggerganov/llama.cpp/issues/11474
M2 Ultra: 13.88 token/s
H100x2: 11.53 token/s. Tim Cook must be thrilled.
—— from S1Next-鹅版 v2.5.2 on OPPO PGFM10, Android 14

It's just the memory-capacity advantage, isn't it?
—— from 鹅球 v3.3.96-alpha

h100*2:
Eg - 137.66 tok/s for prompt processing and 10.69 tok/s for decoding:
The prompt processing reported by llama-bench is only 23t/s which is quite low, but the Metal backend is very poorly optimized for MoE, so maybe it can be improved a bit. Also, currently we have to disable FA because of the unusual shapes of the tensors in the attention which can also be improved.
Prompt processing speed differs by about six times.
Decoding speed is actually fine though.
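For anyone wanting to reproduce numbers like these, they typically come from llama.cpp's llama-bench tool; a minimal sketch, assuming a local GGUF file (the model path and sizes here are placeholders, not the exact command from the issue):

    llama-bench -m DeepSeek-R1-UD-IQ1_M.gguf -p 512 -n 128 -fa 0

Here -p 512 measures prompt processing (the pp512 row, i.e. the ~23 t/s figure on the M2 Ultra), -n 128 measures token generation (the tg128 row, i.e. decoding), and -fa 0 keeps flash attention disabled, as mentioned above for this model's attention tensor shapes.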
I saw someone on Bilibili running the full-size R1 on 8×H100, getting 24 token/s output, but GPU utilization was only around 15%.

It can even run straight off an SSD:
I have also tested it at 1.73-bit (158GB):
NVIDIA GeForce RTX 3090 + AMD Ryzen 9 5900X + 64GB RAM (DDR4 3600 XMP)
llama_perf_sampler_print: sampling time = 33.60 ms / 512 runs (0.07 ms per token, 15236.28 tokens per second)
llama_perf_context_print: load time = 122508.11 ms
llama_perf_context_print: prompt eval time = 5295.91 ms / 10 tokens (529.59 ms per token, 1.89 tokens per second)
llama_perf_context_print: eval time = 355534.51 ms / 501 runs (709.65 ms per token, 1.41 tokens per second)
llama_perf_context_print: total time = 360931.55 ms / 511 tokens
It's amazing!!! Running DeepSeek-R1-UD-IQ1_M, a 671B model, with 24GB of VRAM.
EDIT: 7 layers offloaded.
https://old.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/
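For context, the "7 layers offloaded" above is the usual llama.cpp partial-offload setup via the --n-gpu-layers (-ngl) flag; a rough sketch, assuming the quant file named in the Reddit post (file name, context size, and prompt are placeholders, not the poster's exact command):

    llama-cli -m DeepSeek-R1-UD-IQ1_M.gguf -ngl 7 -c 2048 -p "..."

With -ngl 7, only seven transformer layers live in the 3090's 24GB of VRAM; the remaining weights stay in system RAM and are memory-mapped from disk, which is why this kind of setup can run straight off an SSD, at the cost of the roughly 1-2 t/s speeds shown above.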
Only 1.5B? What about the bigger ones?
—— from 鹅球 v3.3.96

Waiting for Strix Halo.
—— from 鹅球 v3.3.96

宏. posted on 2025-1-30 23:34:
It can even run straight off an SSD
Saw those ** four digits, "3090".

0WHan0 posted on 2025-1-31 00:01:
Only 1.5B? What about the bigger ones?
—— from 鹅球 v3.3.96
It's the 1.5-bit quantized version, not the 1.5B-parameter distilled model.
I wonder if we'll get a 512GB or even 1TB RAM version of the M4 Ultra this year. Hope Cook wises up; if that happens, the price/performance would blow Nvidia away, and deploying an unquantized 100B-parameter model locally on a single machine would no longer be a dream.

Is the output quality still any good after this much quantization?

mimighost posted on 2025-1-31 17:51:
Is the output quality still any good after this much quantization?
I asked it two or three questions. It feels like a much slower QwQ (30 t/s down to 2 t/s on the same machine), and the brain teasers QwQ gets wrong it still gets wrong.

8×3090 would have enough VRAM; you could rent a GPU cloud server to try it.