"However, the experimental setting does not seem fair. The version of Stockfish used was not the last one but, more importantly, it was run in its released version run on a normal PC, while AlphaZero was ran using considerable higher processing power. For example, in the TCEC competition engines play against each other using the same processor."
> Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons. Instead, the company prefers to let customers try out the CS-1 using their own neural networks and data.
I suspect that despite Cerebras being a massive technical achievement (wafer-scale computing!) it performs worse than standard GPUs, which is why they won't release benchmarks on standard models (e.g. resnet/transformers/etc)
There is also an article about how they did submit a score from one of these Chinese "exaflop" systems for a different benchmark, and it turns out it can only achieve the claimed performance at half precision:
It's hard to reconcile the performance across the two tests; perhaps they were set up or tuned differently. I wish they had published their methodology - I'd have loved to benchmark my long-in-the-tooth RX 580 and rocm-tensorflow against their numbers
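For reference, the kind of check I mean is just crude throughput on a stock model. A minimal sketch, assuming a working rocm-tensorflow install; the model, batch size, and input shape are arbitrary choices on my part, not anything from their write-up:

```python
# Crude images/sec measurement on whatever device TensorFlow picks up
# (a ROCm build exposes an RX 580 the same way a CUDA build exposes an
# NVIDIA card). Random weights are fine for a throughput number.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)
batch = np.random.rand(32, 224, 224, 3).astype("float32")

model.predict(batch, verbose=0)  # warm-up: graph build + device allocation

runs = 20
start = time.perf_counter()
for _ in range(runs):
    model.predict(batch, verbose=0)
elapsed = time.perf_counter() - start

print(f"{runs * len(batch) / elapsed:.1f} images/sec")
```

Without their exact model, precision, and batch sizes, though, numbers like this are only loosely comparable, which is the whole complaint.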
This title seems like an exaggeration of what is claimed in the article. The article states that they benchmarked their solver in a biased way that made it look better than it actually performed, not that they faked performance data altogether.
The article mentions "See Longbottom’s extensive tests and comparisons article here." and [1]. This was already mentioned in a snapshot of 18 Jan 2024 [2] so it wasn't added after your criticism.
> Microbenchmark results don't linearly scale to everything else.
Certainly they don't. But when evaluating something like this, it is up to the reader to apply critical thinking and realistic expectations about the level of experimental design behind an admittedly alpha implementation published on a GitHub wiki, versus something published in a peer-reviewed journal.
Wasn't there a thing about the mistake of using various tricks and techniques to beat benchmarks, where in the end the product was only good at getting benchmark scores, and nothing beats raw computation for general-purpose use?
Those benchmarks don't include energy efficiency, which was a primary design goal of LZFSE. I also don't see LZFSE on that page anyway, which makes it kind of hard to compare.
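Measuring that would mean reporting something like joules per byte alongside ratio and speed. A rough sketch of the method on Linux with a readable Intel RAPL counter; zlib stands in for LZFSE here since there's no LZFSE binding in the Python standard library, so only the measurement approach carries over, not the numbers:

```python
# Energy-per-byte via the package-level RAPL counter exposed by the Linux
# powercap interface (may require root; the counter wraps, ignored here).
import os
import zlib

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj():
    with open(RAPL) as f:
        return int(f.read())

# Mix of incompressible and highly compressible input, ~17 MB total.
data = os.urandom(1 << 20) + b"A" * (1 << 24)

before = read_energy_uj()
compressed = zlib.compress(data, level=6)
after = read_energy_uj()

joules = (after - before) / 1e6  # counter is in microjoules
print(f"ratio {len(data) / len(compressed):.2f}, "
      f"{joules:.2f} J, {joules / len(data) * 1e9:.1f} nJ/byte")
```

RAPL counts the whole package, so a serious version would run many iterations and subtract idle draw, but it gives a feel for the dimension those benchmark pages leave out.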
Note: since Alex is a fairly gender neutral name I'm going to use they/them pronouns.
I went through and read it, and the submitter is incredibly confrontational and not at all open to feedback on the correctness of their benchmark. Also, when presented with contradictory evidence (their own benchmark, where the results show io_uring is faster than epoll on other machines), they essentially dismiss it and say the other users ran the benchmark wrong, or that they can't reproduce the results on their machine, or that the other users used Boost and their results are therefore invalid. So not an entirely reliable criticism coming from them, in my opinion.
See my above comment - the choice of metrics in table 3.2 doesn't really make any sense.
Also, they revived some ancient 1998 IBM chip and reported on that, for no clear reason. They call it a 2004 benchmark, but what actually happened is that in 2004 someone made some synthetic variants and put them into a dataset. This is not some widely used benchmark in the field. Given their highly questionable choice of metrics, I would not be surprised if there was some serious cherry-picking on the benchmark side as well.
Not a retraction, just that many of the claims about supercomputer performance were easily shown to be less than accurate, making any claims about supremacy less exciting.
Also, stuff like this makes it hard to take the results seriously:
* To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.
* All inference frameworks are configured to use FP16 compute paths. Enabling FP8 compute is left for future work.
They did everything they could to make sure AMD came out faster.
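To make the first point concrete: multiplying a measured throughput by 2 assumes tensor parallelism scales perfectly, which it doesn't in practice because of collective-communication overhead. A toy calculation where every number is made up (none of this comes from the article):

```python
# What "extrapolate throughput by 2" assumes vs. sub-linear scaling.
measured = 1000.0              # tokens/sec actually measured (hypothetical)
extrapolated = measured * 2    # the report's linear extrapolation

efficiency = 0.85              # assumed scaling efficiency, for illustration only
more_realistic = measured * 2 * efficiency

print(f"extrapolated: {extrapolated:.0f} tok/s")
print(f"at {efficiency:.0%} scaling efficiency: {more_realistic:.0f} tok/s")
```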
> also, this comparison would not register as a proper “benchmark” as it’s not even close to how you would perform a proper benchmark. it’s more of a data point.
I would prefer to not have to argue about that in court...
Well, given that in their benchmark Go ends up being almost an order of magnitude faster than Rust, I wouldn't trust their benchmarking methodology too much.
I agree AlphaZero had fancier hardware and so it wasn't really a fair fight.