> They both involve flushing cache to backing stores, and waiting for confirmation of the write.
No they don't. A fence only imposes ordering. It's instant. It can increase the chance of a stall when it forbids certain optimizations, but it won't cause a stall by itself.
CLWB is a small flush, but as tanelpoder explained, more recent CPUs don't need CLWB.
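For anyone curious what that small flush looks like in code, here's a rough sketch (untested; assumes a CLWB-capable part and compilation with -mclwb, and the function name is just for illustration):

    #include <immintrin.h>   // _mm_clwb, _mm_sfence

    // Write a dirty line back to memory without evicting it, then order
    // the flush before any later stores (the usual persistent-memory pattern).
    void flush_line(void* p) {
        _mm_clwb(p);
        _mm_sfence();
    }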
I just wanted to clarify something about flushing caches: fences do not flush the caches in any way. Inside the CPU there is a data structure called the load/store queue. It keeps track of pending loads and stores, of which there can be many. This is done so that the processor can run ahead and request things from the caches, or request that lines be populated into the caches, without having to stop dead the moment it has to wait for any one access. Memory fencing influences how entries in the load/store queue are allowed to provide values to the rest of the CPU execution units. On weakly ordered processors like ARM, the load/store queue is allowed to forward values to the execution pipelines as soon as they are available from the caches, except when a store and a load are to the same address. x86 only allows load values to enter the pipeline in program order. It can start operations early, but if it detects that a store comes in for a load that is not the oldest, it has to throw away the work done based on the speculated load.
Stores are a little special in that the CPU can declare a store as complete without actually writing data to the cache system. So the stores go into a store buffer while the target cache line is still being acquired. Loads have to check the store buffer. On x86 the store buffer releases values to the cache in order, while on ARM the store buffer drains in any order. However, both architectures allow loads to read values from the store buffer without them being in the cache and without the normal load-queue ordering. They also allow loads to different addresses to complete before earlier stores. So on x86 a store followed by a load can execute as the load first, then the store.
Fences logically force the store buffer to flush and the load queue to resolve values from the cache. So everything before the fence is in the caching subsystem, where standard coherency ensures they're visible when requested. Then new operations start filling the load store queue, but they are known to be later than operations before the fence.
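A minimal sketch of that last point in C++ (untested; variable names are made up): this is the classic store-buffering litmus test, and the seq_cst fences are what logically force each core's store buffer to drain before its load.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        std::thread a([] {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst); // drain store buffer
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join(); b.join();
        // Without the fences, r1 == 0 && r2 == 0 is a legal outcome
        // (each store still sitting in its core's store buffer).
        assert(r1 == 1 || r2 == 1);
    }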
> If you try that on, say, an 80-core ARM CPU [1], CPU #53 might not see a 1 for a long time, if ever.[2] You're no longer guaranteed that cache changes propagate unasked.
What do you mean by this? ARM requires cache coherency for data caches, which requires that every store is eventually made visible to all cores. Fences are necessary to impose additional requirements on the order which writes to different addresses are made visible, but AFAIK not necessary to ensure that a single write is eventually visible.
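To make the distinction concrete, a tiny sketch (C++, untested): even with relaxed ordering and no fences, the reader eventually sees the store on coherent hardware; fences only come into play once you care about ordering relative to other locations.

    #include <atomic>
    #include <thread>

    std::atomic<bool> flag{false};

    int main() {
        std::thread reader([] {
            // No fence is needed just to *see* the store eventually;
            // coherency guarantees it propagates.
            while (!flag.load(std::memory_order_relaxed)) { /* spin */ }
        });
        flag.store(true, std::memory_order_relaxed);
        reader.join();
    }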
> You can't trust the hardware coherency protocol to do much for you until you follow the platform-specific rules to tell the CPU to make something act coherent.
If software has to perform some operation to maintain coherency, then your hardware is not cache coherent. Some hardware isn't, and in particular some subsets of operations aren't (instruction cache vs. data operations, or CPU vs. DMA). But for plain load/store operations by CPUs, they're all coherent.
Barrier/fence instructions are about enforcing ordering on when memory operations to more than one location can be performed or become visible, with respect to one another.
Careful, memory barriers/fences have nothing to do with caches; they're for establishing an observed order when the processor is using out-of-order execution. The cache still has to be strongly coherent at all times, or else there's nothing a memory barrier can do to save it.
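For example (a sketch in C++, untested, names invented): coherency alone keeps the caches consistent, but the release/acquire pair is what orders the two different locations relative to each other.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    int main() {
        std::thread producer([] {
            data = 42;                                     // plain store
            ready.store(true, std::memory_order_release);  // orders it before the flag
        });
        std::thread consumer([] {
            while (!ready.load(std::memory_order_acquire)) { /* spin */ }
            assert(data == 42);  // the acquire/release pairing makes this guaranteed
        });
        producer.join();
        consumer.join();
    }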
> There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of.
Good to know. I've seen enough of your other posts to trust you at your word.
BTW: I'll have you know that AMD GPUs implement "__threadfence()" as "buffer_wbinvl1_vol", which is documented as:
> buffer_wbinvl1_vol -- Write back and invalidate the shader L1 only for lines that are marked volatile. Returns ACK to shader.
So I'm not completely making things up here. But yes, I'm currently looking at some GPU assembly which formed the basis of my assumption. So you can mark at least ONE obscure architecture that pushes data out of the L1 cache on a fence / memory barrier!
Memory barriers don't force a flush of all CPU cache. They will enforce the ordering of memory operations issued before and after the barrier instruction, preserving the contents of the CPU's various caches.
> Because we're talking about cross-core guarantees of concurrency. So you actually do care about what another, or all cores are doing.
This is mostly about the load/store order of an individual core: how an individual core decides to order its reads and writes to memory.
> Weaker CPU's (Like say POWER7) that do batching writes (you accumulate 256bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write to a pointer, but until that CPU does a batched write the other cores aren't aware. You have to do a fence to flush this buffer (in the other cores).
You can actually do the same on x86 by using non-temporal stores. Although then you're not talking about store ordering, but about visibility to other cores. A store won't ever be visible to other cores until it at least hits the L1 cache controller.
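Rough sketch of the non-temporal case (untested, x86 intrinsics, function and variable names made up): streamed stores bypass the normal x86 store ordering, so you need the SFENCE before publishing.

    #include <emmintrin.h>   // _mm_stream_si32 (SSE2); also pulls in _mm_sfence

    // Fill a buffer with non-temporal stores, then publish it.
    void publish(int* buf, int n, volatile int* flag) {
        for (int i = 0; i < n; ++i)
            _mm_stream_si32(&buf[i], i);  // bypasses the cache, goes via WC buffers
        _mm_sfence();                     // order the streamed stores before the flag
        *flag = 1;                        // in real code, make this an atomic store
    }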
> There are some scenarios where the same situation can arise on x64 but its rarer.
Yup, that's right. That's why x86 (and x64) got mfence and sfence instructions.
> This is why x64 uses a MESIF-esque cache protocol [1][2]. It can tell when data is Owned/Forwarded/Shared between cores.
Reordering happens before the cache controller. By the time the cache controller is involved, the store is already in progress.
> *...you just mean that the O state doesn't ping-pong between cores?*
Correct. My terminology was sloppy, sorry. I have no experience of this "context switch" you speak of; my processes are pinned and don't do system calls or get interrupted. :-)
The writing core retains ownership of the cache lines, so no expensive negotiation with other caches occurs. When the writer writes, corresponding cache lines for other cores get invalidated, and when a reader reads, that core's cache requests a current copy of the writer's cache line.
The reader can poll its own cache line as frequently as it likes without generating bus congestion, because no invalidation comes in until a write happens.
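Something like this sketch (untested, names made up) is the pattern being described; the load spins on the reader's own cached copy and generates no traffic until the writer invalidates the line.

    #include <atomic>
    #include <cstdint>
    #include <immintrin.h>   // _mm_pause

    std::atomic<std::uint64_t> seq{0};

    // Spin until the writer bumps the sequence number. The load hits the
    // reader's own cached copy; no bus traffic until the line is invalidated.
    std::uint64_t wait_for_next(std::uint64_t last_seen) {
        std::uint64_t cur;
        while ((cur = seq.load(std::memory_order_acquire)) == last_seen)
            _mm_pause();  // be polite to the sibling hyperthread
        return cur;
    }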
There is a newish Intel instruction that just sleeps waiting for a cache line to be invalidated, that doesn't compete for execution resources with the other hyperthread, or burn watts. Of course compilers don't know it. I don't know if AMD has adopted it, but would like to know; I don't recall its name.
> I do wonder to what degree we could hard-partition caches, such that speculative prefetches go straight into CPU-specific caches, and doesn't get to go into shared caches (e.g. L3) until it stops being speculative.
This is already sort of possible. TLB flushing can take advantage of the PCID to determine, based on the process, whether TLB entries must be flushed - this provides process-level isolation of the TLB.
I believe recent CPUs are increasing the size of some PCID related components since it's becoming increasingly important post-kPTI.
Please do not resort to personal attacks, we are just having a technical discussion.
In Intel parlance, 'serializing instruction' is a Word of Power; please see Vol. 3A, under serializing instructions. CPUID is listed there.
In the same section, LFENCE, MFENCE, SFENCE are listed as 'memory ordering instructions' and the documentation explicitly says that:
> The following instructions are memory-ordering instructions, not serializing instructions. These drain the data memory subsystem. They do not serialize the instruction execution stream.
Handling self modifying code requires serializing the instruction stream, so the memory fences are neither necessary nor sufficient[1].
Going back to the original argument, the memory fence instructions (and the various LOCK prefixed operations) do indeed serialize memory operations to guarantee ordering, which I don't dispute. But that's nothing to do with synchronizing caches like you claimed.
[1] IIRC MFENCE is actually also a serializing instruction in practice, but Intel has not yet committed to documenting this behavior as architectural.
The thing with cache is that it's a performance cliff. The binary difference between stalling and not stalling can be massive. What looks like a tiny pedantic change can take you across that boundary and help you make efficient use of hardware resources and electrical power.
Pipeline control certainly does affect caches. That's where the memory accesses being controlled go. I mean, it's tempting to try to understand them as orthogonal things, but in fact memory hierarchies and CPU pipelines (and other stuff like store-forwarding buffers) are tied at the hip, and it's pointless to try to "understand" them in isolation.
At best it's an exercise in specification pedantry. The only reason instruction barriers, fences, and serialization instructions seem to make sense in the docs is that someone on the architecture side sat down and wrote a memory and execution model that can be abstracted by whatever API it is they present to the user.
And someone doing performance or exploitation work at the level of the linked article really needs to understand that model, not the API. (Though you're absolutely right that this particular article seems to be a bit mixed up about both)
Sure, but generally a cache-line miss will quickly stall the pipeline. You might have a few non-dependent instructions in flight, but for a CPU running at 3+ GHz, waiting 70 ns is an eternity. Doubly so when you can execute multiple instructions per cycle.
A sync, assuming it is your typical memory barrier, is not bound by the L3 latency. You pay (in first approximation) the L3 cost when you touch a contended cache line, whether you are doing a plain write or a full atomic CAS.
Separately, fences and atomic RMWs are slower than plain reads/writes, but that's because of the (partially) serialising effects they have on the CPU pipeline, and very little to do with L3 (or any memory) latency.
Case in point: a CAS on Intel is 20-ish cycles, while the L3 latency is 30-40 cycles or more. On the other hand, you can have multiple L3 misses outstanding, but CAS hardly pipelines.
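For reference, this is the kind of operation I mean (a sketch, untested): each CAS is a full read-modify-write that hardly overlaps with the next one, unlike independent cache-missing loads.

    #include <atomic>

    std::atomic<long> counter{0};

    // Each compare_exchange is a full RMW on the line; successive CASes
    // serialise instead of overlapping the way plain loads can.
    void add(long delta) {
        long old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + delta,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
            // `old` was refreshed with the current value; retry
        }
    }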
> capped by the line fill buffer and queuing delay kicks in spiking your cache miss
Could you point me to a little reading material on this? I know what an LFB is, more or less, but what queueing delay do you mean, and how does that relate to cache misses? Thanks.
"Even from highly experienced technologists I often hear talk about how certain operations cause a CPU cache to "flush"."
Ok.
"This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place."
That sounds like a flush to me. Modified data is written back (or flushed) to main memory.
"I think we can safely say that we never "flush" the CPU cache within our programs."
Maybe not explicitly, but a write-back is triggered by "certain operations," (see the first quotation above).
So it sounds like the real "fallacy" the article is discussing is the idea that a cache flush is something that a program explicitly does. That would indeed be a fallacy, but I have never heard anyone claim this.
On the upside, the article does give a lot of really nice details about the memory hierarchy of modern architectures (this stuff falls out of date quickly). I had no idea the Memory Order Buffers could hold 64 concurrent loads and 36 concurrent stores.
> Unfortunately the amount of time it takes to get a feature like this into an Intel CPU is a bit mind boggling.
I think this is an unfair feature comparison. Zeroing L1 cache is way simpler operation than TM which has been designed to support two modes of operation [legacy -- which only speeds up traditional LOCK-based synchronization -- and true TM], must support transaction aborts and restarts, etc. Also, 10 years ago, TM was still a very active research area -- i.e., people had no clue about which ideas were performant and scalable and, not the least, feasible to implement in HW.
Thanks for correcting my hand-wavy-ness on that. The thrust of my comment was that blog posts like this need to be handled with some care: this is just enough detail to be dangerous without enough context to be truly helpful.
Getting a bunch of people excited to go start relaxing their loads and stores with mostly-truths like "The CPU's store buffer must be flushed to L1 cache to make the write visible to other threads through cache coherency." is just leaving loaded guns lying around. I'm only mostly sure that some uarchs will read from the store buffer without engaging the cache-coherency stuff at least in an HT/ILP world, but that is kind of the point: playing intrinsics jazz is a very asymmetrical bet. You can win big if you absolutely nail it for your particular chip, but you usually lose it all when it's not correct.
For a slightly more detailed treatment I like this SO answer: https://stackoverflow.com/a/62480523/19734375. It sketches out a better intuition about how the store buffer and speculative stores play out in practice and is very well footnoted.
For those that want to go a level deeper I can't recommend Chips and Cheese highly enough. This treatment of Golden Cove is typical of their rigor in understanding what really pushes architecture-specific performance on cutting-edge gear: https://chipsandcheese.com/2021/12/02/popping-the-hood-on-go....
Edit: Even this comment is self-contradictory. I pulled that SO answer out of my `x86_64 arcana` bookmark folder, and on re-reading it cites a primary source that store buffers are partitioned amongst logical cores by the spec. This stuff is tricky!
Indeed L3 being shared and also often working as the MOESI directory works out to the interthread latency being the same order of magnitude as the L3 latency.
My point is that sync has nothing to do with caches. Caches are coherent all the time and do not need barriers. In particular, I don't think the git pull/push analogy maps well to MOESI, as MOESI is an optimistic protocol and only requires transferring, opportunistically and on demand, what is actually needed by a remote core, not conservatively everything that has changed as in git.
The explicit sync model is more representative of non coherent caches, which are not really common as they are hard to use.
Memory barriers are, for typical CPUs, purely an internal matter of the core: they synchronize internal queues with L1. In a simple x86 model, where the only visible source of reordering is StoreLoad, a memory barrier simply stalls the pipeline until all preceding stores in program order are flushed out of the write buffer into L1.
In practice, things are more complex these days, and a fence doesn't fully stall the pipeline; it potentially only needs to visibly prevent loads from completing.
Other more relaxed CPUs also need to synchronise load queues, but still for the most part fences are a core local matter.
Some architectures have indeed remote fences (even x86 is apparently getting some in the near future) but these are more exotic and, AFAIK, do not usually map to default c++11 atomics.