
> Are there workloads where the AMD suffers due to its l3 design?

Databases, particularly any database which benefits from more than 16MB of L3 cache.

> On my 3900x L3 latency is 10.4ns when local.

And L3 latency is >100ns when off-die. Remember, to keep memory coherent, only one L3 cache can "own" a given line at a time. You have to wait for the other core's cache to give up the data before you can load it into YOUR L3 cache and start writing to it.

It's clear that AMD has a very good cache-coherence system to mitigate the problem (aka Infinity Fabric), but you can't get around the fundamental fact that a core only really has 16MB of L3 cache.

Intel systems present all of their L3 cache to all of their cores as one unified pool, which greatly benefits database applications.

---------

AMD Zen (and Zen2) is designed for cloud servers, where those "independent" bits of L3 cache are not really a big problem. Intel Xeons are designed for big servers which need to scale up.

With that being said, cloud-server VMs are the dominant architecture today, so AMD really did innovate here. But it doesn't change the fact that their systems have the "split L3" problem, which affects databases and some other applications.




> Databases, particularly any database which benefits from more than 16MB of L3 cache.

Yes, but have you actually seen this measured as a net performance problem for AMD compared to Intel? I understand the theoretical concern.


https://www.phoronix.com/scan.php?page=article&item=amd-epyc...

Older (Zen 1), but you can see how even an AMD EPYC 7601 (32-core) is far slower than an Intel Xeon Gold 6138 (20-core) in Postgres.

Apparently Java benchmarks are also L3-cache-heavy or something, because the Xeon Gold is faster in Java as well (at least in whatever Java benchmark Phoronix was running).


Thanks, perfect! I'll keep an eye on these to see how the new epycs do.

What I see there is that the EPYC 7601 (first graph, second from the bottom) is much faster than the Xeon 6138 -- it's only slower than /two/ Xeons ("the much more expensive dual Xeon Gold 6138 configuration"). The 32-core EPYC scores 30% more than the 20-core Xeon.

There's a lot of different benchmarks there.

Look at PostgreSQL, where the split-L3 cache hampers the EPYC 7601's design.

As I stated earlier: in many workloads, the split cache of EPYC seems to be a benefit. But in DATABASES, which are a major workload for any modern business, EPYC loses to a much weaker system.


Are their L3 slices MOESI like their L2s are (or at least were)? That'd let you have multiple copies in different slices as long as you weren't mutating them.

AMD is using MDOEFSI, according to page 15 of: https://www.hotchips.org/wp-content/uploads/hc_archives/hc29...

However, I can't find any information on what MDOEFSI is. I'm assuming:

* Modified
* Dirty
* Owned
* Exclusive
* Forwarding
* Shared
* Invalid

Any information I look up runs into an NDA firewall pretty quickly (be it performance counters or hardware-level documentation). It seems like AMD is highly protective of their coherency algorithm.

> That'd let you have multiple copies in different slices as long as you weren't mutating them.

Seems like the D(irty) state allows multiple copies to be mutated, actually. But it's still a "multiple copies" methodology. As any particular core comes up against the 8MB (Zen) or 16MB (Zen2) limit, that's all it gets. There's no way to have a singular dataset in 32MB of cache on Zen or Zen2.

