63 Cores Blocked by Seven Instructions (randomascii.wordpress.com)
480 points by nikbackm on 2019-10-21 | 86 comments




It's good to see how features turned on by default (System Restore) can have such a bad impact on performance. Thank you for doing the profiling!

System Restore is commonly disabled for test systems or anyone who has a good automated deployment system. Now I’m wondering how likely it is that many engineers at Microsoft disabled it to save space or conserve every available IOP, especially in the era before large SSDs were widely available.

Do you have any good list of such actions?

Windows has never been about absolute best performance, just "good enough", which sometimes isn't. Otherwise a default install wouldn't have so much stuff running.

That was a good read. In-depth but understandable. Thanks for sharing.

Edit: Never mind... I completely missed the word "empty" when reading the critical sentence. :(

As per the article: "It is unclear why this code misbehaved so badly on this particular machine. I assume that it is something to do with the layout of the almost-empty 2 TB disk."

Why do the sample counts cluster so heavily on the jne, as opposed to the other instructions in the loop?

The article links to another article which discusses this. In order to do stack sampling, the CPU doing the work must be periodically interrupted for the sampler to run and collect a stack. Modern CPUs are deeply speculative and often issue instructions out of strict program order. When you interrupt such a CPU it has to walk some of that back and decide where to leave off on all of the speculative execution; what the CPU was doing at the time is thus not completely captured by the stack trace alone.

I talked to the author of that article. He hasn't done testing on AMD processors but his guess was:

- micro-op fusion means the seven-instruction loop is actually five micro-ops
- Zen 2 processors can retire five instructions per cycle
- therefore the loop runs at one iteration per cycle (wow!)
- the cmp [r8] instruction occasionally has cache misses
- this means the seven instructions get synchronized such that the cmp [r8] instruction is the last of them to get retired in a seven-instruction block
- therefore the next instruction is usually the jne

TL;DR - the jne gets most of the samples because the cmp [r8] instruction is the most expensive.


Yeah I believe jne gets most of the samples because cmp [r8] is the most expensive, but there could be two separate reasons for that:

Perhaps ETW shows you the precise instruction (i.e., "zero skid") that is slow to retire - this is not like a normal interrupt as described in the article but is available with some performance profiling events like 'cycles:ppp' on Linux perf (in particular, using the zero-skid PEBS events).

In that case, the samples show up on the jne, not the cmp, likely because cmp/jne have fused, so they basically get sampled as a single instruction and the samples end up pointing to the jump.

The other scenario is that ETW shows you "skid 1" instructions, i.e., the instructions generally after the slow-to-retire ones (as described in the article), and cmp/jne didn't fuse (perhaps because a cmp with a memory source argument can't fuse on AMD?), and so it again points to the jne.

I haven't looked at many ETW traces, so I couldn't tell you offhand - but for those who have, do the samples usually show on the expensive instructions (things like div and loads that miss are a giveaway), or on the one after?

Added: Per Agner, I guess the "fusion + no skid" is the most likely (from the Ryzen section of microarchitecture.pdf):

> A CMP or TEST instruction immediately followed by a conditional jump can be fused into a single µop. This applies to all versions of the CMP and TEST instructions and all conditional jumps, except if the CMP or TEST instruction has a rip-relative address or both a displacement and an immediate operand.

That also lines up with the cmp having exactly 0 samples, unlike any other instruction of the 7: that's a common indication of fusion.


It's the same off-by-one issue that often affects the address of the crashing instruction.

The report is incorrect; the vast majority of the time is taken by the previous instruction, cmp dword ptr [r8],ebp. It's the only one accessing RAM, and accessing a cache line shared across cores is very expensive, even more so than a cache miss.


What off-by-one on crashing instructions are you referring to? Some stack walkers intentionally do off-by-one byte offsets when doing symbol/source lookup on parent functions, but not on the crashing instruction.

Crashing is (with the exception of floating-point exceptions) precise. A particular instruction crashes, the exception record points there, the instructions afterwards are discarded with no side effects. This is necessary to support things like restarting execution.

Sampling, on the other hand, is not even well defined. There are hundreds of instructions in flight, many completing simultaneously, and when an external interrupt happens the CPU has to decide which ones to commit and which to discard. The linked article gives many more thoughts about how the CPU draws the line in the silicon.


> What off-by-one on crashing instructions are you referring to?

Here’s an example for GCC on ARM Linux: https://github.com/dotnet/corert/issues/7826 I think I have observed similar symptoms on Windows, too.

> Sampling, on the other hand, is not even well defined

A crash by e.g. RAM access violation, and interrupt generated by CPU to collect sample for a profiler, are pretty similar, IMO.


Access violations are generated internally by a single instruction, and all mainstream CPUs guarantee precise exceptions. PMU interrupts for sampling are external, so the CPU picks wherever it wants to stop in the program.

> all mainstream CPUs guarantee precise exceptions

Yes, and precise interrupts, too.

> CPU picks wherever it wants to stop in the program.

I've used profilers quite a lot, and based on my observations they're quite accurate, to exact instruction.


> Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer stored in each sample designates the place where the program was interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e., where it was at the end of the sampling period. In some case, the distance between those two points may be several dozen instructions or more if there were taken branches.

https://perf.wiki.kernel.org/index.php/Tutorial#Event-based_...


But what does that even mean? Seriously. An interrupt fires, on a particular clock tick. At that point there are, let's say, 130 instructions in flight. In the case of a loop like this one there may be seven instructions being retired per clock-cycle.

So, you end up with patterns. I linked to some detailed reverse engineering of which instructions are likely to end up being the victim. One common pattern is that the instruction after an expensive one will have the samples assigned to it, but there's more to it than that - I recommend reading it.

TL;DR - I'm not saying you're wrong, it's just that you're not saying anything specific enough for right/wrong to apply. "Accurate, to the exact instruction" has not been meaningful for sampling profilers for more than 2.5 decades.


I don't think that's the case here - see my other longer comment, but I think ETW uses the PEBS sampling events, so the instruction is usually the slow-to-retire one, not the subsequent one.

I believe cmp/jne macro fuse on this CPU, so you actually will never get any samples on the first of the two fused instructions: rather they all show on the second one. You see this same effect on Linux when sampling with the cycles:ppp event.


> macro fuse on this CPU, so you actually will never get any samples on the first of the two fused instructions

Haven’t thought about µ-ops fusion in this case. Yes, that explanation is very plausible.


I have seen no signs of samples hitting the most expensive instruction in ETW traces. Across the other cases I have looked at, the samples tend to cluster after expensive instructions, not on them.

I guess it's possible that all of the cases I looked at were distorted by macro fusion but I don't think so.


Interesting, so I guess ETW is using "normal" interrupts which have skid, and the CPU just has enough retire bandwidth to retire everything in one cycle.

Technically this wasn't caused by those instructions but by the spinlocks waiting for the lock to be released. Also, "blocked by seven instructions" sounds a bit click-baity... you can lock up the CPU or power off the computer with fewer instructions than that :-)

Or break it, depending on how old it is:

  f0 0f c7 c8: lock cmpxchg8b eax

Does this bug still bite anywhere? Embedded computing with Pentiums?

It's an invalid instruction, so a compiler wouldn't (shouldn't) ever generate it.

You can make gcc generate it, and versions of Linux without no-execute support will actually run it. Like so:

  int main = 0xc8c70ff0;   /* the bytes f0 0f c7 c8, little-endian */


Ah, but that's something you do on purpose. What I meant is that it's not a Meltdown-like problem.

I used a Pentium with that bug as a router up until 2009.

Yeah, I used to have a similar setup then. That was 10 years ago, though.

In this case I'm OK with it. It pulled me in long enough to be entertained.

I think the point is that you have this beast of a computer and the lion's share of its power is going to these 7 instructions. It's a bit of a theatrical way of putting it, but in the article he doesn't actually blame the 7 instructions for the root problem.

Did you read the post? There were no spin locks waiting for the lock to be released. The ~100 threads waiting for the lock were all waiting in an idle state, quite efficiently.

The problem was that the thread that owned the lock was spinning in a seven-instruction loop.


The cause is obvious: they were building on Microsoft Windows, using the NTFS filesystem. Even Microsoft doesn't try to build on NTFS.

Changing any single detail gives better results. Use a Samba share from a Linux filesystem. Run Mingw on a Linux system. Run MSVS in Wine on a Linux system.

Windows is an execution environment for applications. There is no need for, and no value in, actually performing builds in your target execution environment. Use a system designed from the ground up for builds.


That's the first I've heard that MS doesn't try to build Windows on Windows with NTFS.

I don't have first-hand experience, but I know some people at MS who work on Xbox (which is a modified version of Windows+HyperV underneath).

From what I understood from them, they do not use NTFS (they use SMB from a clustered filesystem) for their builders, but they _do_ use a heavily modified version of Windows; incidentally, that modified version went on to become "Windows Nano". What the actual "Windows" team does is a mystery to me, though; I would assume it is similar or the same.


That's super interesting. I wonder if they have moved to use SMBv3?

I really liked the direction with Nano in 2016, but I guess it makes more sense as a container OS. Still, the latest version is what a lot of people wish they could start an operating system with. NT kernel, no WMI, no servicing, no activation.


As a developer, you do some builds on your local box, so it's really up to you what filesystem you use.

All the Windows devs I've met built locally on their machines for their own day-to-day dev work and testing. I don't know anyone that used a filesystem other than NTFS. But definitely fast NVMe SSDs and disabling Defender for the local repo checkout.

The build machines are a different story, and I don't know the specifics of those.


I don't think this deserves the downvotes it's receiving, but:

> no need for, and no value in, actually performing builds in your target execution environment.

.. is completely antithetical to using Visual Studio, where the convenience of building, running and debugging on the desktop is very handy.

NTFS is horribly slow for certain file operations though. Giant batch deletes can wedge the UI while the system catches up. I have in the past benefited from putting %TEMP% on a RAMdisk, although this is a pain to set up.

I'd love to see Microsoft building a VFS driver for, say, ext2/3. It's not impossible, all the APIs are there to add plugin filesystems to Windows, and the OSS-friendly Microsoft shouldn't have any objection in principle to linking against the GPL'd kernel implementation..


There is WinExt2[0], and also WinBTRFS and WinMD. I've not been able to test them myself. Microsoft briefly allowed Windows Pro users to create ReFS volumes (they can still be read), but has since reneged on that, and I can find no Linux implementation. It looks like WinBTRFS gets usable performance.

[0] https://sourceforge.net/projects/ext2fsd/


A lot of the issues people see aren't NTFS per se; it's how the VFS is connected to the IFS drivers and the cache manager. Essentially, on Windows it goes:

  1) user space
  2) cache manager
  3) filesystem driver
  4) disk device driver
whereas on sane platforms it's:

  1) userspace
  2) filesystem driver
  3) cache manager
  4) disk device driver
So on Windows you, as the filesystem, have to cache your own metadata reads, because only you know when they're invalidated and the caching layer above you is focused on the data plane. On other platforms, with the caching layer below you, you can just pretend you're hitting disk for any access (data or metadata), and it's probably already cached.

Just throwing ext2/3 in there wouldn't help because as I said above, it's mainly the system architecture that's screwy.


> Even Microsoft doesn't try to build on NTFS.

Citation needed. I haven't worked at Microsoft for a while but when I did we built on Windows using NTFS. When I found a correctness bug in NTFS last year I was told that it had been affecting Windows builds, which means they were using NTFS as recently as February 2018.

https://randomascii.wordpress.com/2018/02/25/compiler-bug-li...

The biggest problem I usually encounter with building on Windows is slow process creation.


Bruce Dawson does some of Microsoft's most valuable work for Windows. Doesn't even work for them.

He worked for Microsoft years ago. Still an MVP all this time later.

I worked with the man for years in the Xbox Advanced Technology Group. Amazing individual. When he left the team, I conducted my own exit interview so I could learn from him, and walked away with pages and pages of insights on growing my own career and becoming a subject matter expert.

He was on my interview loop at ATG, and I count it as my favorite interview of all time. He pointed to a circuit diagram poster, and said to me "You have to write a game for that, what design considerations should you be aware of?"

It looked something like this (can't find the actual poster, it's been a decade): https://qph.fs.quoracdn.net/main-qimg-9cdbc7bf35ef8126755175...

A bit out of my league, but I identified the important aspects (multicore/hyperthreaded design, small L0/L1 cache and impacts to mispredictions, etc.) and spoke to what I could and where my uncertainties lay. Afterwards he gave up the rest of the time to let me ask questions about the team.

One XFest he stood on stage giving a PowerPoint presentation on debugging and multithreaded concerns. An animation was slow, and he broke into it and started debugging PowerPoint live to demonstrate some of his techniques. A legend.

A huge loss to Microsoft when he stepped away. I did and do wish him the best!


Wow. This was such a valuable post.

Can you describe some of the insights you learned from the exit interview -- about growing your career and becoming a subject matter expert? I'm new to the field and I feel like that'd be immensely valuable to me and many others.


Sure! It's been a decade, but the biggest thing that stuck out was to pick something that needs an SME. Fill a gap on the team, even if it's something you may not have a huge interest in. Find the interest in it, and _own it_.

But most importantly, give talks and brownbags about the technology. Understand that there's going to be someone in the room that knows more than you... but they aren't giving the talk and helping everyone else, you are. They will chime in, and that's OK. You are the one putting yourself out there educating yourself and others. This helped me so much when I gave talks at GDC... even if I'm helping ONE person, it makes the event worth it (and the talk serves as my unique perspective / take on the industry).

Pore over source materials. Bruce read the 600+ page CPU documentation front-to-back, twice over. He said the second time, he gleaned so much more insight.

The engineers didn't realize just how much knowledge they were trying to distill, so you might read a comment that says "Of course, the second parameter determines XYZ." The first read-through, you might gloss over that. The second read-through, you realize the instruction they're documenting is doing double-duty elsewhere, and the comment is an important indicator of how that interaction plays out on the die.

Good luck!!


Wow. Seems like a really cool person to work with. I hope that during some point in my career I’ll get the chance to work with someone like that :)

>One XFest he stood on stage giving a PowerPoint presentation on debugging and multithreaded concerns. An animation was slow, and he broke into it and started debugging PowerPoint live to demonstrate some of his techniques. A legend.

Is there a video of this somewhere? Sounds amazing.


To be clear, the profiling of PowerPoint in the middle of the presentation was a stunt, planned in advance. I just didn't tell anybody I was going to do that.

At that point I was an ex-Microsoft person giving a talk at a Microsoft conference using Microsoft tools to profile Microsoft's presentation software. It may have been a cheeky thing to do, but it was so much fun.

I'm not aware of any publicly available video but I did a writeup of the issue: https://randomascii.wordpress.com/2011/08/29/powerpoint-poor...


What's ATG?

Advanced Technology Group.

We were the firefighters when a game studio's experts couldn't figure out what was going wrong. They might provide a code snippet, or in rare cases the full game, and we could debug with the console's OS/driver source code. We even had access to the processor layouts for figuring out hardware bugs. We'd get copies of the Red Disc and Green Disc masters used for duplication, before the game was published (helped figure out a few 0-day patch bugs that way).

The other half of the job was proactively figuring out what problems studios would run into with new APIs and new SDKs. How would they want to use them together, and the challenges that posed.

Finally, we were the developer representatives, advocating on their behalf as the platform progressed.

Was an amazing job. I only left because I just couldn't pass up my dream job (reworking the telemetry/stats pipeline for Halo 5, and getting to play with TB of data).


Sounds amazing. Are people typically familiar with the acronym?

"loop running in the system process while holding a vital NTFS lock"

It's not about the seven instructions. It's the lock that's been held while doing a busy loop.


For each input source file, cl.exe creates at least 7 temporary files (with suffixes "gl", "sy", "ex", "in", "db", "md", "lk"). The churn of creating and deleting those, coupled with the slowness of performing checkpointing on a huge empty drive, seems to be the root cause here.

This appears somewhat related to this bug report: https://developercommunity.visualstudio.com/content/problem/...

Marking the temporary files as FILE_ATTRIBUTE_TEMPORARY could improve things, without having to go into significant Windows kernel changes.
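
For what it's worth, here's a minimal sketch of what that would look like (my own illustration, not what cl.exe actually does); the temporary attribute plus delete-on-close hints to the cache manager that it can avoid ever writing the data to disk if memory allows:

  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      /* Hypothetical compiler-style scratch file. */
      HANDLE h = CreateFileW(L"scratch.tmp",
                             GENERIC_READ | GENERIC_WRITE,
                             0, NULL, CREATE_ALWAYS,
                             FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                             NULL);
      if (h == INVALID_HANDLE_VALUE) {
          fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
          return 1;
      }
      DWORD written;
      WriteFile(h, "obj data", 8, &written, NULL);
      CloseHandle(h);   /* the file vanishes here, ideally without touching disk */
      return 0;
  }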


> Marking the temporary files as FILE_ATTRIBUTE_TEMPORARY could improve things, without having to go into significant Windows kernel changes.

Having used Cygwin and MinGW (and, to a lesser extent, WSL), I'd say NTFS is probably the main factor here. Especially when using Cygwin, program compilation is very slow, not just due to process creation but also file access, potentially on disk.

I could see checkpointing contributing; having used backups/File History on Windows Server, you will see CPU-use irregularities that I think are similar in cause to this, but not as bad as described in the article.

The pathological behavior of NTFS with many files is easy to prove once you encounter it.[0] This would be exposed to the kernel as well and is likely holding up NtfsCheckpointVolume. At least in my experience the problem goes deeper, and how NTFS structures are handled also contributes; for example, trying to enumerate and copy files in certain ways is extremely slow even if the files are easily enumerable.

You can say that if disabling checkpointing gets rid of noticeable slowdown it is the right thing to do, but there are people who will rightly not want to disable it.

[0] https://stackoverflow.com/questions/197162/ntfs-performance-...


Time the difference between an npm install, with all its thousands of tiny files, on NTFS and ext4. It's excruciating.

Just copying Emacs for Windows from one folder to another takes a surprisingly long time :)

I've definitely noticed this behaviour. It's excruciating.

There's a ZFS driver for Windows now. I haven't tried it yet, but I think I will. Can't be any worse.


> Moving %TEMP% to a RAM drive made the builds noticeably faster and also left the system responsive, even during two concurrent 20-way parallel builds (40 compile jobs in parallel, no loss of system responsiveness).

This matters more than anything else. You can play all you want with FILE_ATTRIBUTE_TEMPORARY or whatever other markings, but the OS just won't care about them enough.


Does creating a RAM drive in Windows still require third-party software/drivers or is it natively supported by Microsoft these days? I haven't tried to do it since Windows 7 I believe.

AFAIK, it still requires 3rd party software.

However, having a Samsung SSD and enabling RAPID mode in Samsung Magician (SSD management software), which effectively uses an invisible RAM disk, accelerated my games' startup times by a factor of at least 3x. Some games even start 5x faster.

I do recognise that games have a much different workload than compiler jobs of course; but invisibly utilising a RAM disk might help a lot regardless.


That whole Linux integration into Windows marketing campaign could have you believe that Linux things just work on Windows now and you can just "mount -t tmpfs none /path/to/my/mount -o size=64G".

I doubt that is going to happen. File system mounts do not run on marketing fuel.


The latest variant of WSL uses a VM. So that will work, but you won't be seeing that mount inside Windows.

Well, you can access it over the \\wsl$ shared drive. But that share can only do about 4k IOPS.

> still require third-party software/drivers

Back in the 3.1 and 9x days it didn't require 3rd party drivers, thanks to ramdrive.sys


Isn't FILE_ATTRIBUTE_TEMPORARY the same thing as putting it in the RAM drive (minus needing a special mount and what part of the disk gets written when you run out of RAM)?

I've often wondered about the viability of putting all my build agent's folders in a ram disk to speed up build times.

Granted, IT won't be thrilled that I'll need ~4-8x the RAM, but the devs would certainly love the speed.

....now if only I could make fastlane take less than 2 hours....


As long as you have the RAM available and your OS has reasonable file system caching behavior, you should already get that effect.

On Linux, I can run a build repeatedly that touches several GB of files and there is basically zero IO during the build, because everything is in the page cache.

There is some IO at some point after the build as the dirty files (e.g., .o files) get written out, but if you delete them fast enough, even that doesn't happen.


Is ninja Python or C based? I can't remember.

Is "-pipe" in their C compiler's make.conf? Would that even matter when using ninja as the compiler?

I'm curious, and trying to think towards a solution.


Excerpt: "...I mean, how often do you have one thread spinning for several seconds in a seven-instruction loop while holding a lock that stops sixty-three other processors from running. That’s just awesome, in a horrible sort of way."

I respectfully disagree.

That's because everything in the universe that is perceived as negative turns out to have a positive use-case somewhere, sometime, in some context...

In this case, I think the ability for one core to stop 63 other processor cores is purely awesome, because think of the possible use-cases! A debugger comes to mind immediately, but how about another: let's say there are 63 nasty self-resurrecting virus threads running on my PC? Or what if you were doing some kind of esoteric OS testing where you needed to return to something like Unix's runlevel 1 (single user), but you'd rather freeze most of the machine rather than destroy the context of everything else that was previously running?

Oh, here's the best one I can think of -- don't just do a postmortem, everything's dead core dump when something fails -- do a full (frozen!) "live" dump of a system that can be replayed infinitely, from that state!

Now, just because I take a contradictory position doesn't mean we're not friends, or that I don't acknowledge your technical brilliance! Your article was absolutely great, and you are absolutely correct that for your use-case, "That's just awesome, in a horrible sort of way."

But for my use-cases, it's absolutely awesome, in the most awesome sort of way! <g>


63 cores blocked on a single mutex is not at all like any of the scenarios you're describing. That's almost like describing the Notre Dame fire as having a positive use-case because what if you want to do a controlled demolition of a large building.

I suppose it's awesome for that capability to exist, though it wasn't even close to what should have happened here. Awesomely horrible.

Making all your processes wait on a lock is multithreading 101. The horrible part is the specific way this lock is getting held, which is not useful.

And there are simpler ways to prevent all access to a drive.


So, one busy process performs a file operation that triggers a system restore checkpoint, and the OS locks the entire drive during this file operation? Sounds strange to me.

Is the problem that the checkpointing critical section has the same duration as the triggering file operation?

I get that there must be some sort of critical section for setting a checkpoint, but I don't understand why it takes so long, and why it would be affected by how busy the userspace process that triggered it is.

I would expect it to have a short barrier-style critical section; drain all outstanding writes, record some checksum or counter from a kernel data structure, and then release all writers again.

In my mind this should be kernel code only, entirely unaffected by userspace, and if designed nicely, quite fast.
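
Something like this toy model is roughly what I have in mind (purely illustrative, nothing like the real NTFS/VolSnap internals): writers take a shared barrier around each write, and the checkpoint takes it exclusively just long enough to drain in-flight writes and record a cheap marker:

  #include <pthread.h>
  #include <stdint.h>

  static pthread_rwlock_t io_barrier = PTHREAD_RWLOCK_INITIALIZER;
  static uint64_t checkpoint_seq;

  void do_write(void)                       /* every writer wraps its I/O like this */
  {
      pthread_rwlock_rdlock(&io_barrier);
      /* ... issue the actual write ... */
      pthread_rwlock_unlock(&io_barrier);
  }

  void take_checkpoint(void)                /* should block writers only briefly */
  {
      pthread_rwlock_wrlock(&io_barrier);   /* waits for in-flight writes to drain */
      checkpoint_seq++;                     /* record a cheap marker */
      pthread_rwlock_unlock(&io_barrier);
  }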

So I guess I don't get what is going on here.


My guess would be that the system restore checkpointing functionality ends up holding a lock by necessity while it manipulates internal state. It would be really hard to make that sort of algorithm lock-free and preserve data integrity guarantees - I certainly wouldn't want to be responsible for writing it.

Obviously the lock shouldn't be held so long and so often though...


My understanding is that the system restore checkpoints happen every five seconds. They hold a lock, which seems reasonable.

The problem is that for some reason on this machine the checkpoint process was taking a really long time. I also don't understand why it was taking so long. It normally doesn't. Something went terribly wrong.

> and if designed nicely, quite fast.

Yep, should be. But it wasn't. If everything worked as it should then I'd never get to write any blog posts!


It looks like this is a case where a process is holding lock A while waiting on lock B; and every other process is waiting on lock A. That's normal enough, though it seems like there are two mistakes:

First: Never spin waiting on a lock for 3 seconds. If you expect a lock to be released very quickly, you spin K times and then, if you still don't have the lock, try something heavier that can deschedule your process. K should be small enough that your time slice is unlikely to expire while spinning; otherwise it just causes confusion and wasted work, because it looks like your process is doing work when it's not.
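
A minimal sketch of that spin-then-block pattern (illustrative only; a real implementation would fall back to a proper kernel wait such as a futex or keyed event rather than just yielding):

  #include <stdatomic.h>
  #include <sched.h>

  #define SPIN_LIMIT 100   /* K: small enough that we won't burn a whole time slice */

  void lock_acquire(atomic_int *lock)
  {
      for (;;) {
          for (int i = 0; i < SPIN_LIMIT; i++) {
              int expected = 0;
              if (atomic_compare_exchange_weak(lock, &expected, 1))
                  return;                /* got the lock while spinning */
          }
          sched_yield();                 /* heavier path: deschedule ourselves */
      }
  }

  void lock_release(atomic_int *lock)
  {
      atomic_store(lock, 0);
  }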

Second: It seems dubious that using a feature like system restore causes all Write calls to wait for a lock held by a process in the middle of I/O. I'm sure there are some cases where that must happen (like if out of buffer space to hold the writes), but I would think it would be harder to hit.

EDIT: Rephrased my comment in terms of two problems rather than just the first one.


Look again - that tight loop in RtlFindNextForwardRunClear isn't spinning on a lock - it's scanning forward through memory, 4 bytes at a time, looking for 4 bytes not equal to the pattern in %ebp.

So it looks more like "process is holding lock A while doing a very long scan through memory". That would fit with the name of the function, too.
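
For anyone who hasn't looked at the disassembly, here's roughly what that kind of scan looks like in C (my reconstruction of the shape of the loop, not the actual NTFS source; the name and signature are guesses):

  /* Walk a bitmap one 32-bit word at a time until a word differs from
     the pattern that was sitting in ebp. */
  const unsigned int *find_next_clear_run(const unsigned int *p,
                                          const unsigned int *end,
                                          unsigned int pattern)
  {
      while (p < end && *p == pattern)   /* the cmp [r8],ebp / jne pair */
          p++;
      return p;                          /* first word that doesn't match */
  }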


Correct. Nobody was spinning on a lock. Everybody was waiting politely.

The problem was that the system process held the lock for too long, due to some inefficiency in system restore (root cause not yet understood by me).


> Second: It seems dubious that using a feature like system restore causes all Write calls to wait for a lock

I agree that it seems dubious, but it is indisputably what was happening, repeatedly.


So when did you first realize he was discussing Windows, reading this?

The "of course everyone is a straight white male" attitude that the OS need not be stated, so often seen in Windows posts, gave it away for me. However, my biases threw me for way too long: the level of sophistication meant this must be Linux, right? I should have recognized the graphics style in the screen grabs. Certainly not MacOS, but Linux can be all over the map stylistically. Does Windows really still look like that? Wow.


I caught on when I realized this just wouldn't be happening anywhere else.

Microsoft has, single-handedly, got two generations of people used to computers working badly, convinced that it's not just unavoidable, but normal. If cars worked as badly, we would all see multiple explosions every day (and think it was awesome).


These posts by Dawson are always interesting. Now, if only he would investigate and remediate the performance deficiencies of other complex systems, such as ... Chrome?
