The Intel Xeon E7-8800 v3 Review: The POWER8 Killer? (anandtech.com)
58 points by rbanffy | 2015-05-13 06:30:20 | 68 comments




POWER8 was alive?

I know there are niche markets out there, but I'm still surprised that anyone would use anything but Intel for the very high end.


Read the article. Their conclusion is that 24-core POWER8 systems can beat 36-core Xeons in some important benchmarks, both in raw performance and in performance-per-dollar (but not performance-per-watt). If you want to see the reason, just look at the memory bandwidth: POWER8 has about double that of the latest and greatest Xeon. That's very important, especially for fixed-grid computational tasks (e.g. weather, climate) that traditionally have very low arithmetic intensity, i.e. what really counts is the memory bandwidth.
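
To put a rough number on "low arithmetic intensity", here's a back-of-the-envelope roofline sketch in Python (the per-socket bandwidth figures are illustrative assumptions, not the article's measured numbers):

    # STREAM-style triad a[i] = b[i] + s*c[i], typical of fixed-grid stencil codes
    # (assumed double precision, no cache reuse).
    flops_per_elem = 2                    # one multiply + one add per element
    bytes_per_elem = 3 * 8                # read b and c, write a, 8 bytes each
    intensity = flops_per_elem / bytes_per_elem   # ~0.083 FLOP/byte

    # Illustrative per-socket bandwidths (assumptions):
    for name, bw_gb_s in [("Xeon E7 v3 at ~100 GB/s", 100), ("POWER8 at ~200 GB/s", 200)]:
        print(f"{name}: sustained ceiling ~{bw_gb_s * intensity:.1f} GFLOP/s, far below peak FLOPs")

At that intensity the sustained rate is capped by bandwidth, so doubling the memory bandwidth roughly doubles the achievable throughput, regardless of how many FLOPs the cores can theoretically issue.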

Also, that code tends to be highly SIMD; code with a lot of data-dependent branches benefits very little from higher memory bandwidth when the fetch latency from RAM is on the order of hundreds or even thousands of cycles.

Yes, that's a good point. So it's mostly low-arithmetic-intensity, SIMD-heavy workloads that benefit from POWER.

I did read the article. The performance-per-dollar winners are the smaller 2S Xeons, btw, not the massive 36-core Xeons.

However, the memory-bandwidth argument makes sense to me. So I'll accept that as a good reason to go Power8.


> The performance-per-dollar winners are the smaller 2S Xeons, btw, not the massive 36-core Xeons.

Could you give a quote? All I could find about performance-per-dollar was this excerpt from [1]:

> Ultimately, the POWER8 is able to offer slightly higher raw performance than the Intel CPUs, however it just won't be able to do so at the same performance/watt. Meanwhile the reasonable pricing of the POWER8 chips should result in third party servers that are strongly competitive with the Xeon on a performance-per-dollar basis.

This was in the conclusion to the E5-vs-POWER8 comparison. For some reason they didn't use the E7; however, this shouldn't change the outcome, since according to [2] the E7 is basically the same chip with the same number of cores, just with a few more features enabled (4- and 8-socket systems are possible, plus support for more memory). So AFAIK the E5 shouldn't be looked at as a "small" version of the E7: taken in isolation, for a given benchmark that doesn't have extremely high memory demands, it should deliver the same performance. Whether you need an E7 or not should be decided in terms of the architecture you want to build, not the performance you aim for.

[1] http://anandtech.com/show/9193/the-xeon-e78800-v3-review/7

[2] http://anandtech.com/show/9193/the-xeon-e78800-v3-review/3


Why would you have such limitations? You can install Linux on anything. This allows you to leverage better processors or get better deals.

Except Intel is significantly cheaper. I do like the other answers stated so far, but let's be frank: the typical Linux cluster of 2S Xeons is going to be much cheaper and more scalable than massive 4S 144-thread beasts.

The Xeon in the article is more expensive than POWER8; every other Xeon is cheaper. But if you need multiple terabytes of RAM and a corresponding amount of CPU on a single motherboard, POWER8 is cheaper than Xeon.

Third-party OpenPOWER motherboards are still only sampling for POWER8, so the ecosystem hasn't fully arrived yet.

You can throw a lot of CPU power at your WebSphere apps and do things like lease capacity on demand. It's useful in some business models. POWER8 cut the power/cooling requirements dramatically.

If your employer historically likes IBM, it makes sense. I don't think they're acquiring many new customers, though.


This topic and POWER8 have come up before. I think it boils down to what companies have already invested in. If you have a strong AIX team already supporting POWER/IBM infrastructure, this is great: you have one of the fastest system platforms out there.

However, if your application/workload can be distributed over multiple machines, Intel is going to eat your lunch on cost, as you can buy several of these new Xeon systems and crush it.

Not all of us have the luxury of control over the software we use (many enterprise shops are held hostage by vendor software which hasn't scaled out and continues to depend on scaling up, because distributed is hard).


totes banter

The x86 cores are still internally a lot more CISC-y than POWER cores, especially with things like uop fusion, so it's no surprise that they have higher IPC.

This caught my attention: http://images.anandtech.com/doci/9193/MemoryConfigBiosE7v3.p...

I wonder whose BIOS that is, or if it's just a simulated screenshot. All the ones I've seen look a bit nicer: http://www.overclock3d.net/gfx/articles/2009/08/10133320867l...


Intel writes their own BIOS. That's probably from an Intel reference box.

Me want.

And that's just my born-in-Computer-Shopper Pavlovian response to articles about new, more powerful CPUs. I have no idea what I would do with eight fifteen-core CPUs and six terabytes of RAM other than write checks to Alabama Power and think of ways to mention it in casual conversation. I mean, a 911 Turbo? At least I could use it to pick up hamburger buns.

At a deeper level, I always wonder: what [besides mining Bitcoin] would other people hack up with a monstrous amount of computing power on the order of a data center in a container?


Easy. Any involved computation like raytracing, molecule interaction, protein folding.

Raytracing alone can load absolutely any system to the max, because it scales so well.
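
As a toy illustration of why it scales (the scene and "shading" below are made up, nothing like a real renderer): every pixel is independent, so you can hand rows out to as many cores as you have.

    # Embarrassingly parallel per-pixel work: one worker per core by default.
    from multiprocessing import Pool

    WIDTH, HEIGHT = 640, 480

    def shade_row(y):
        # Stand-in for tracing one ray per pixel in row y of a hypothetical scene.
        return [(x * y) % 256 for x in range(WIDTH)]

    if __name__ == "__main__":
        with Pool() as pool:
            image = pool.map(shade_row, range(HEIGHT))
        print(f"rendered {len(image)} rows of {WIDTH} pixels")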


Render 3D graphics really fast? That's all I do that would begin to leverage that sort of power. Typically the two taxing things I do are playing around with games and doing 3D graphics/animation in Cinema 4D and After Effects. Games aren't built to make use of that many threads/cores or that much RAM, but for graphics work it would be like having my own little render farm in one workstation.

Well, actually making games requires that much power. I work at a games studio and run an 8-core (16-thread) Xeon + 64GB of RAM + 1TB SSD, and everything I do daily is just sloooow. Running the game in debug mode uses up all my RAM, compilation times are between 20 and 40 minutes for the whole project even when using distributed build systems, and that 1TB SSD is not helping much when it's nearly full all the time. If we could have our workstations upgraded to something like 16-core Xeons + 256GB of RAM, it would be a godsend.

Talk to your management. Your time wasted is definitely more expensive than 256GB of RAM (around $3K).

Show them the numbers, make it an easy decision for them.


You probably already know this - but these CPUs would be absolutely horrible at mining bitcoins.

But pretty awesome at mining the CPU-friendly Cuckoo Cycle...

https://github.com/tromp/cuckoo


That's why you mine altcoins and trade them for bitcoins.

http://www.coinwarz.com/cryptocurrency


Depends on what you are doing. For any programmer working on a bigger project, where compilation time is greater than 2s, (let's say) 4x more power is superb.

If your code compiles in ~16 seconds, reducing that to 4 would add comfort.

If your code compiles in 60 seconds, reducing that to 15 would help you not lose concentration.

If your code compiles in 60 minutes, reducing that to 15 minutes would actually make it feasible to get any work done on it (not that it's impossible otherwise, but it's really not pleasant).


And luckily, code compilation tends to be one of those things that can actually be effectively parallelized.

Another thing useful for a programmer is being able to benchmark things - with a multicore machine you can tie off one core (or multiple cores!) exclusively for a benchmark, which can really help reproducibility.
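
For example, on Linux you can pin the benchmark process to a spare core from Python (a minimal sketch; the core number and the toy workload are made up):

    # Restrict this process to a single core so other work doesn't perturb the timing.
    import os, time

    os.sched_setaffinity(0, {3})   # Linux-only; pid 0 means "this process", core 3 chosen arbitrarily

    def workload():
        return sum(i * i for i in range(5_000_000))

    t0 = time.perf_counter()
    workload()
    print(f"elapsed {time.perf_counter() - t0:.3f}s on cores {os.sched_getaffinity(0)}")

Combine that with the kernel's isolcpus boot option if you want the core kept free of other tasks entirely.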


I don't compile much code, but having my full test suite run in 10 seconds instead of 10 minutes would do wonders for my productivity.

That's assuming it's sufficiently parallelizable. If you only have a single thread, an ordinary Intel desktop CPU is still faster than any Xeon: https://www.cpubenchmark.net/singleThread.html

I thought about 4-core vs. newer 8-core desktop CPUs when upgrading, but decided that I more often do work on fewer than 4 cores (and with 1 core you get the full "Turbo" headroom). The only multi-threaded program I use is PyCharm, which can use up all those cores when recalculating its static type checking of Python code (which it seems to do quite excessively).


> At a deeper level, I always wonder, what [besides mining bitCoin] would other people hack up with a monstrous amount of computing power on the order of a data center in a container?

I wouldn't call it "hacked up", but large-area surveillance radar systems like AEGIS and JSTARS use very high-density systems like this, because the size, power, and cooling available for the signal and display processing are severely limited. The same can be said for systems like the mobile Doppler radars the National Weather Service and various universities use for studying severe weather.


You can build Android/CyanogenMod or Chrome from scratch -- it easily uses dozens of cores and takes hours on a slow machine. A lot of other build systems won't saturate the machine because they don't parallelize; IIRC, the OpenWrt build doesn't parallelize.

Another idea: you can also do some big linear algebra problems. I'm doing this now at work, and it's kind of interesting that a single cloud machine is like a supercomputer from a decade or so ago (32 cores, 128 GB of RAM). I guess that is obvious, but once you start using the whole machine it really sinks in.

I actually bought a big Dell machine to do secure builds... thinking of getting something like this though :)


Along with the other answers here, things like distributed processing projects tend to be embarrassingly parallel. I was running distributed.net OGR for a while, and that runs quite a bit better on many modest cores. Almost all my other programs bottleneck on disk / RAM speed, or RAM size... The OGR project in particular is also far more efficient on CPUs than on GPUs, since it is an entirely integer workload.

Some high-end workstations are sold with Xeon CPUs. I had one years ago, and while overall it 'felt' like a high-quality machine, it didn't really feel like it was that much better than other machines for software development (C++).

So, does anyone have experience with today's Xeon-based workstations? Are they worth the $1-2k (or more, if you go crazy...) premium?


Over standard i7s of a similar architecture, you mean?

They support ECC RAM, which is pretty important for some people. I believe the prefetchers are more aggressive on the Xeons as well, to make more use of the memory bandwidth available with multiple QPI connections.


ECC memory, for one, even if it's just psychological reassurance in my use case.

My last home server was a Core i7 with 16GB of non-ECC memory. I noticed some PostgreSQL corruption, which was odd to me. Then I noticed that most of my compressed backups failed to decompress all the way, due to some corruption. It would make a backup just fine, but the backups couldn't be verified. It had been going on silently for weeks or maybe months before I noticed. Turned out to be a bad memory chip. I have a Xeon with ECC now.

I have a Xeon-based workstation at home, to compensate for my relatively weak Mac Mini, mostly for machine learning.

I'd say that normally you probably don't need it, unless you want ECC memory, large cache sizes, large amounts of memory, etc. (compared to a Core i7).

However, they can be very price-effective. Many companies buy workstations for e.g. CAD and replace their machines every 2 or 3 years. Since they are then sold en masse on eBay, you basically get a machine that is still high-end for a bargain price.

E.g., my Xeon workstation was ~400 Euro, has a Xeon E3-1270 that is ~4 times more powerful than my relatively new Mac Mini, comes with an acceptable CUDA-capable GPU, and has 12GB of memory.


It depends on what the core is. There are a number of Xeons which are the same core as their i7 counterparts; in that case, I'm pretty sure you're not buying anything extra (I think they even cost the same). You probably had one of these if your computer was a standard tower. You do get ECC memory, but for most people that's not a huge concern.

The system they're testing is a beast designed for enterprise use, and hugely expensive: you get perhaps 144 hardware threads, hundreds of gigabytes of RAM, terabytes of storage, and a price tag of >$50,000. It's designed for the limits of vertical scaling (a job I imagine it does well).

Even if you were to dump this kind of cash, you'd get terrible performance in practice due to the design decisions behind all of this (the memory system is tuned for throughput, not latency, for example) and poor software support (e.g. an OS like OS X would do a shoddy job of scheduling threads, especially across NUMA domains; Linux was terrible too until a few years ago -- it wouldn't manage to schedule more than about 30 threads simultaneously no matter the load).

I'd go so far as to say that you'd expect better performance from a standard tower than you would from this.

In general you'd expect a desktop processor to give you about as good performance as you can get for developing C++. All of the editing tools are tuned for desktop processors, and there's not amazing parallelism to be had in most compilations (and incremental compilations won't see it at all). Note the benchmarks are for 'total number of kernel compilations' and not 'time for a single compilation' :)


> All of the editing tools are tuned for desktop processors, and there's not amazing parallelism to be had in most compilations (and incremental compilations won't see it at all).

I regularly run 'make -j32' on a machine; compilation of well-structured projects is embarrassingly parallel. Incremental compilation is a fair point, but it may not always work in template-heavy codebases that are restricted to C++03 (no extern template).


It depends on what you're doing and which Xeon you get. If you're doing number crunching, then it will definitely be worth it. I have some Xeons with 20MB of cache that cut my processing time by ~40%, so they paid for themselves the first time I ran something on them.

> If you're doing number crunching, then it will definitely be worth it.

Depends on the kind of number crunching. If you are doing expensive matrix computations, a high-end GTX GPU or a Tesla stream processor may buy you much more for the same price.

Also, if the choice is between cache size and architecture, a newer architecture may bring more. E.g. my Core i7 MacBook royally beats some of our Xeon machines with 25MB cache, by virtue of having AVX with 256-bit-wide SIMD registers.

Of course, if you can have it all, take it all :).


I have an Intel Core i7-4930K with simple auto-overclocking; it's pretty sweet. For software development, I would suggest sticking with the high-end Core i7s, which are usually priced around $500-800 per CPU.

If I was to buy a new CPU today, I'd buy this one:

https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-5930K...

I also have quad-channel overclocked memory on my motherboard; the quad-channel part is actually significant in terms of compile speed (and the OC part isn't, I found out).


Just installed that CPU into my work computer a couple months ago, and it's indeed quite a fantastic deal overall. It's really easy to see the benefit from having a fast workstation, even if you're not developing extremely resource-intensive applications.

XICA DA, XICA DA, XICA DA, XICA DA SILVA, the black woman.

Amazon needs to offer these quad-CPU motherboards with four Intel Xeon E7-8800 v3 chips so I can start using these 144-thread machines in our render farm...

http://www.supermicro.com/products/motherboard/Xeon/C600/X10...


Isn't it more efficient to use their GPU instances for rendering?

Depends.

If it costs 15 full-time developers 6 months to port your x86 rendering code to OpenCL (or CUDA), then maybe spending $100,000 on hardware is cheaper than spending $700,000+ on developers.


Let me qualify, this isn't really my area, so I'm a bit out of my element.

If we are talking rasterization, isn't an OpenCL/CUDA version pretty much always going to stomp an x86 implementation? While on the ray tracing side, there's little difference?

I'm just curious because my first reaction was that "sure the dev cost is initially much higher, but how much time is saved on rendering in the long run?".


GPU rendering is maturing as we speak. It isn't yet used on the high end, though, because CPU-based render machines often have 32GB+ of RAM to store the scenes, whereas high-end GPUs may have 12GB of RAM.

But you are correct that GPUs are the future of rendering. :)


> If we are talking rasterization, isn't an OpenCL/CUDA version pretty much always going to stomp an x86 implementation? While on the ray tracing side, there's little difference?

GPU will stomp CPU on memory bandwidth alone (GDDR5 RAM is much, much faster than DDR3... and when DDR4 catches up in the coming years, GPUs will switch to HBM and still retain a massive memory bandwidth advantage).

CPU wins due to legacy code and access to huge memory.

In practice, RAM requirements mean that, at best, it'd be an OpenCL/CUDA version PLUS a CPU path to handle the large scenes. I'm sure the OpenCL/CUDA version will be cheaper from a hardware perspective, but once again... developers are very expensive. A decent software development team runs into the millions of dollars per year in costs, let alone a team of GPU-specialist high-performance programmers.

It's things like this that give Intel's Knights Landing chip (which runs x86) a valid market case. Sure, Knights Landing is slower, weaker, and more expensive than the competition, but that x86 compatibility is a major feature that saves developers' time.

Also, when you start doing GPU + CPU computations, the PCIe latency starts to become a major issue. Tasks may complete faster on the CPU alone than on CPU+GPU, because the data is already in the CPU's L1 or L2 cache.

Whereas if you do CPU+GPU, you push the data out over PCIe and then synchronize the CPU and GPU... then the GPU starts pulling the data into its cache. When the GPU is done with the task, the result is pushed over PCIe AGAIN.

So it's a bit of a tricky task to balance the memory traffic and the cooperation across cores.
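
A crude model of that trade-off (every rate and latency below is an assumption for illustration, not a measurement):

    # When does the PCIe round trip swamp the GPU's raw compute advantage?
    pcie_bw  = 12e9      # ~12 GB/s effective over PCIe 3.0 x16 (assumption)
    pcie_lat = 10e-6     # ~10 us launch/sync overhead per transfer (assumption)
    cpu_rate = 50e9      # bytes/s a cache-friendly CPU kernel sustains (assumption)
    gpu_rate = 300e9     # bytes/s the GPU kernel sustains (assumption)

    n = 100_000_000      # bytes in the working set
    for passes in (1, 5, 50):   # how many times the kernel re-reads the data
        cpu_t = passes * n / cpu_rate
        gpu_t = 2 * (pcie_lat + n / pcie_bw) + passes * n / gpu_rate  # copy in, compute, copy out
        print(f"{passes:>3} passes: CPU {cpu_t*1e3:7.1f} ms vs GPU+PCIe {gpu_t*1e3:7.1f} ms")

With little data reuse the copies dominate and the CPU wins outright; only when the kernel reuses the data many times does the GPU's bandwidth advantage pay for the trip across the bus.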


Just be careful about Knights Landing; it could be cancelled at any moment, as I think it never got adoption outside of a few supercomputer installations.

It hasn't been released yet, has it?

Take-up should be much better, as it supports much more memory and the full x86 instruction set.


I just want to support one of your points: for some applications, the PCIe latency kills the potential of using GPUs. It doesn't matter how fast the GPU is at a computation if the time to transfer the data is longer than it would take to do the calculation on the host CPU.

I published a paper with this conclusion a while back ("Evaluation of Streaming Aggregation on Parallel Hardware Architectures", http://www.scott-a-s.com/files/debs2010.pdf). My paper was basically, "GPUs are awful for our problem, so let's compare to CPUs and Cell". A more recent paper from NSDI 2015, from a group at CMU with Anuj Kalia as the first author, also looks into this: "Raising the Bar for Using GPUs in Software Packet Processing", http://www.cs.cmu.edu/~akalia/doc/nsdi15/gopt_nsdi.pdf. They look at just the GPU and CPU part of the spectrum, which is probably more relevant to most people, as Cell is a dead technology now. (I think my conclusions will matter more when on-chip GPUs become more prevalent, but we're not there yet.) Their paper starts with an excellent analysis of the latencies for transferring data to GPUs, and compares end-to-end performance across several different algorithms, varying the amount of data reuse.

My consistent gripe when talking about improving application performance by using accelerators is that one must take data copying costs into consideration, which means we must use application end-to-end performance as our metric. There's a temptation to focus on just the computational kernel performance. The Kalia et al. paper is the best treatment of this subject that I have seen.


Indeed, which is why I've been keeping up with AMD's HSA news. Their GPUs that share L2 cache with the CPU intrigue me. After all, if you're piping data back and forth between the GPU and CPU, all sorts of applications open up.

Unfortunately, there's no real scaling. The APUs are mostly consumer products; the biggest APUs are in the Xbox One and PS4, and those are unavailable to the public.

Still, it's nice to see a company tackle the latency issue.


For feature film work, the CPU still largely rules the roost, thanks to insane memory requirements and the fact that properly SSE'd/AVX'd code isn't THAT much slower than GPUs for highly parallel stuff like raytracing when you've got multiple CPU sockets.

GPUs still only have quite small amounts of memory, and that greatly limits the type of work they can do.

High-end GPUs can have up to 12GB of RAM now, which I have to admit is starting to get reasonable. Just a few more doublings and you've got what high-end render machines have, which I understand is in the range of 64GB to 128GB. Although I've heard that some simulation-oriented machines have 256GB to 512GB.

A lot of the time, yes, but many of the high-end renderers used in production are still CPU-based (DreamWorks, ILM, Disney, and Pixar are all using CPU-based renderers for the most part).

It is a lot of work to convert the high-quality (not special-case) global illumination calculations to run fast on GPUs. That is the main issue.

We support real-time high-end GPU-based rendering as well:

https://clara.io/view/b43f3215-c9ef-488f-a55f-1bf2a7d74f3f/w... https://clara.io/view/193070f2-e8af-4afc-a531-9d82338b5288/w... https://clara.io/view/a558dca2-8c2f-432c-ab34-9135c3066010/w...


This reads like native advertising for Intel. Intel must be worried that there is something about the new OpenPOWER movement (such as Google and Nvidia starting to design their own OpenPOWER CPUs) that threatens them, which is why they are pushing this whole "Intel vs. OpenPOWER" narrative now through "exclusive" articles with certain websites.

In case some people don't get it yet, the way these exclusive in-depth pieces work is not that AnandTech goes to Intel and says "we would like to do an article about your chips and IBM's". It's more like Intel going to AnandTech and saying "we have some great material to give you -- would you like to write an article around it?"

And that's how Intel's side of the story gets pushed into the market.


Also known as a “submarine” article:

http://www.paulgraham.com/submarine.html


That's not quite how I read it. AnandTech raises valid concerns about the power consumption of POWER8, but other than that they are quite supportive and show POWER8 beating these Xeons in benchmarks [1]:

> Ultimately, the POWER8 is able to offer slightly higher raw performance than the Intel CPUs, however it just won't be able to do so at the same performance/watt. Meanwhile the reasonable pricing of the POWER8 chips should result in third party servers that are strongly competitive with the Xeon on a performance-per-dollar basis.

I think the comparison is fair; however, I also have to say that repeating all the benchmarks in the rest of the article with POWER8 as well would have been better. It seems to me the POWER8 comparison was a bit of an afterthought.

[1] http://anandtech.com/show/9193/the-xeon-e78800-v3-review/7


Did you read far enough to get to the benchmarks? The POWER8 chips do very well, and the discussion says good things about them -- it does not read like advertising from Intel.

Disclaimer: I work for IBM (but not on hardware).


The thing is, if you're going for a POWER server these days you aren't going for the overall performance of the machine, you're going for the per-core performance.

It has been pretty easy to get an x86 box with better total performance than a POWER box for years. The problem is that you're going to go bankrupt trying to license all those cores for the kind of software you're getting from IBM, which is the reason POWER is even in the equation in the first place...

Most IBM software (if not all) is licensed as PVUs (processor value units). This hurts multi-core machines badly.


Is it easy to get an x86 that fast? I thought x86 was limited to only 4 sockets, way less than the biggest POWER machines.

The 8 in the 8800 means it can do 8 sockets.

http://ark.intel.com/products/84685/Intel-Xeon-Processor-E7-...

> Package Specifications > Max CPU Configuration: 8

So 18 cores * 8 = 144 cores / 288 threads per board. Not too shabby.


When I last checked, IBM sold x86 servers (I can't recall what the model line was called) that you could stack and make behave as a single NUMA box. These compared very favorably to POWER servers in the same price range (32 cores on x86 vs. 6 cores on POWER7).

I don't remember the maximum scalability of these, but I doubt they would reach the biggest POWER machines. Problem is: if you need the biggest POWER machines, you're probably running something that isn't even portable to x86, so the comparison becomes academic.


This has got to be the worst-written article I've seen all year. I've seen high school essays written with more skill.

Can anyone recommend a pre-built server with high-end specs like this that is more of a turnkey experience? For example, I bought a Dell desktop, installed Ubuntu on it, and everything just worked (the only thing that didn't was my networking).

I guess I want something that distros test on -- I would have said Debian, but Ubuntu does seem to have better driver support.

Price is not a big consideration; I just want to see what something "easy" is like and how much it costs.

From page 8, http://anandtech.com/show/9193/the-xeon-e78800-v3-review/8 :

> As far as reliability is concerned, while we have little reason to doubt that the quad Xeon OEM systems out there are the pinnacle of reliability, our initial experience with the Xeon E7 v3 has not been as rosy. Our updated and upgraded quad Xeon Brickland system was only finally stable after many firmware updates, with its issues sorted out just a few hours before the launch of the Xeon E7 v3. Unfortunately this means our time testing the stable Xeon E7 v3 was a bit more limited than we would have liked.

