TL;DR: 1 Ampere TF = 0.72 Turing TF, or 30 TF (Ampere) = 21.6 TF (Turing)
From Nvidia's Reddit Q&A:
To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
A reminder from the Turing whitepaper:
First, the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath. In previous generations, executing these instructions would have blocked floating-point instructions from issuing.
So a Turing SM can execute 64 INT32 + 64 FP32 ops per clock.
An Ampere SM can execute either 64 INT32 + 64 FP32 or 128 FP32 ops per clock.
Which means that if a game executes 0 (zero) INT32 instructions, then Ampere = 2x Turing.
And if a game executes a 50/50 mix of INT32 and FP32, then Ampere = Turing exactly.
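To make those two extremes concrete, here is a minimal throughput model (my sketch, not anything from Nvidia; the per-clock issue widths are taken from the whitepaper quotes above):

```python
def ampere_vs_turing_speedup(int_frac: float) -> float:
    """Idealized per-SM throughput of Ampere relative to Turing for a
    stream with the given fraction of INT32 instructions.

    Turing: 64 FP32 + 64 INT32 slots per clock.
    Ampere: 128 FP32 per clock, but every 64 INT32 issued in a clock
    displaces 64 of those FP32 slots.
    """
    fp_frac = 1.0 - int_frac
    # Turing clocks per instruction: whichever pipe is the bottleneck.
    turing_clocks = max(fp_frac, int_frac) / 64
    # Ampere: INT32 runs on the shared datapath (64/clk) alongside
    # 64 FP32/clk; the leftover FP32 then runs at the full 128/clk.
    ampere_clocks = int_frac / 64 + max(fp_frac - int_frac, 0) / 128
    return turing_clocks / ampere_clocks

print(ampere_vs_turing_speedup(0.0))   # 2.0   -> Ampere = 2x Turing
print(ampere_vs_turing_speedup(0.5))   # 1.0   -> Ampere = Turing exactly
print(ampere_vs_turing_speedup(0.26))  # ~1.48 -> the analytical bound derived below
```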
So how many INT32 are there on average?
According to Nvidia:
we typically see about 36 additional integer pipe instructions for every 100 floating point instructions
Some math: 36 / (100 + 36) ≈ 26%, i.e. in an average game's instruction stream about 26% of instructions are INT32.
So we can now calculate what will happen to both Ampere and Turing on a 26% INT32 + 74% FP32 instruction stream.
I have written a small program to do that. But you can calculate an analytical upper bound easily: Turing's clocks scale with the 74% FP32 share, while Ampere needs 26% worth of mixed clocks plus 48%/2 = 24% for the remaining FP32, i.e. 50% in total. So 74% / 50% = 1.48, or +48%.
My program shows a slightly smaller number, +44% (because of edge cases where you cannot distribute the last INT32 ops in a batch evenly, as only one pipeline can issue INT32 per each block of 16 cores).
So the theoretical absolute max is +48%; in practice the achievable max is +44%.
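For reference, here is a toy cycle-level reconstruction of that kind of calculation. This is my sketch, not the author's actual program: the batch sizes and the greedy scheduling policy are assumptions, so exactly how far the result lands below +48% will vary with them.

```python
def simulate(n_int: int, n_fp: int) -> float:
    """Greedy cycle-level issue model for one SM partition.

    Turing partition: 16 FP32 lanes + 16 INT32 lanes per clock.
    Ampere partition: either 32 FP32, or 16 FP32 + 16 INT32 per clock.
    The second Ampere datapath switches as a whole, which is where the
    quantization loss below the +48% bound comes from.
    """
    # Turing: the two pipes drain independently (ceiling division).
    turing = max(-(-n_fp // 16), -(-n_int // 16))

    fp, ints, ampere = n_fp, n_int, 0
    while fp > 0 or ints > 0:
        if ints > 0:
            # Mixed cycle: the shared datapath is spent on INT32 even if
            # fewer than 16 INT32 ops remain (the leftover lanes idle).
            ints -= min(ints, 16)
            fp -= min(fp, 16)
        else:
            # Pure-FP32 cycle: both datapaths execute FP32.
            fp -= min(fp, 32)
        ampere += 1
    return turing / ampere

# Nvidia's average mix: 36 INT32 per 100 FP32 instructions.
print(simulate(36, 100))      # 1.40  -> small batches lose more to the edge cases
print(simulate(3600, 10000))  # ~1.47 -> large batches approach the +48% bound
```

The small-batch case shows the edge effect described above: a cycle that issues fewer than 16 INT32 ops still ties up the whole shared datapath.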
Thus every 2 TF of Ampere is worth only 1.44 TF of Turing performance, i.e. 1 Ampere TF = 0.72 Turing TF.
3080 = 30 TF (Ampere) = 21.6 TF (Turing) = 2.14x a 2080 (10.07 TF Turing)
Nvidia is even more conservative than that and gives us: 3080 = 2x 2080
3070 = 20.4 TF (Ampere) = 14.7 TF (Turing) = 1.86x a 2070 (7.88 TF Turing)
Nvidia is massively more conservative here, giving us: 3070 = 1.6x 2070
Actually, if we average the two max numbers that Nvidia gives us (they explicitly say "up to"), we get an even lower effective ratio of 1 Ampere TF = 0.65 Turing TF.
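For convenience, a tiny converter (again my sketch) that reproduces the numbers above; 0.72 is the factor from the +44% derivation, 0.65 the one implied by averaging Nvidia's "up to" claims:

```python
def turing_equiv_tf(ampere_tf: float, factor: float = 0.72) -> float:
    """Convert an Ampere FP32 TF figure into 'Turing-equivalent' TF."""
    return ampere_tf * factor

# 3080: 30 Ampere TF
print(turing_equiv_tf(30))          # 21.6 Turing TF -> ~2.14x a 2080 (10.07 TF)
print(turing_equiv_tf(30, 0.65))    # 19.5 Turing TF with the Nvidia-implied factor
# 3070: 20.4 Ampere TF
print(turing_equiv_tf(20.4))        # ~14.7 Turing TF -> ~1.86x a 2070 (7.88 TF)
print(turing_equiv_tf(20.4, 0.65))  # ~13.3 Turing TF
```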
We do know that Turing had reduced register file access for INT32 (64 vs 256 for FP32). If it's the same on Ampere (and everything suggests that Ampere is just a Turing facelift), then obviously not all FP32 instruction sequences can run on these pipelines.
Anyway, a TF table:
| GPU | Ampere TF | Turing TF | Turing TF (NV) |
|---|---|---|---|
| 3080 | 30 | 21.6 | 19.5 |
| 3070 | 20.4 | 14.7 | 13.3 |
| 2080Ti | 18.75 (me) or 20.7 (NV) | 13.5 | 13.5 |
| 2080 | 14 (me) or 15.5 (NV) | 10.1 | 10.1 |
| 2070 | 10.4 (me) or 11.5 (NV) | 7.5 | 7.5 |
Bonus round: RDNA1 TF
RDNA1 has no separate INT32 pipeline; all INT32 instructions are handled in the main stream. Thus it's essentially almost exactly the same as Ampere, but without the skew from the last INT32 ops in a batch, so the +48% theoretical max applies here (about +2.3% over Ampere).
| GPU | Ampere TF | Turing TF | Turing TF (NV) |
|---|---|---|---|
| 5700XT (RDNA1.0) | 10.1 | 7.2 | ? |
Amusingly enough, the 5700XT's actual performance is pretty similar to the 2070's, and these adjusted TF numbers show exactly that (10 TF vs 10-11 TF).
If you don't want to read the original post, just look at the table comparisons.