Hacker Read

drewg123 · 2018-11-27 12:44:40+00:00

For every good experience like that, there is an equal and opposite bad one.

I have an X399 Taichi for my TR2. I'm using 2 older U.2 NVMe drives that have a legacy boot option ROM that causes the board to hang 9/10 times at POST. The only option is to disable CSM, so it will only try to use the UEFI option ROM. This works fine, EXCEPT that the CSM disable is lost every time the board loses power. I.e., that configuration setting is not properly saved to non-volatile storage.

When I reported this to ASRock (via email), I was told in very broken English to "re install windows". This is great advice, considering I'm running FreeBSD, not to mention that the entire issue happens at POST. Sigh.


loeg | karma 21759 | avg karma 2.57 · | 2018-01-04 06:24:40

What bugs have you observed in the X399 Taichi? Coincidentally, I have the same board (since TR launch) and haven't even updated the BIOS -- it just works fine for me.

Arie | karma 81 | avg karma 2.53 · | 2018-11-27 16:29:40

That's very interesting; it just made me realize my X299 ITX board has the exact same issue with CSM when trying to boot into OS X (so also BSD-ish, but a hackintosh). There is a good chance of hanging on boot unless I disable CSM, and a power loss event clears this CSM setting.

ComputerGuru | karma 29702 | avg karma 4.11 · | 2018-01-06 17:36:04+00:00

Version 2.0 exposes options for the following on two or more different pages that are not kept in sync and there’s no indication which is followed:

* power status after AC power loss
* SATA AHCI mode enabled
* DDR speed
* advanced RAM timings
* enabled NICs
* CPU overvoltage enabled
* IOMMU enabled status

And others I’m forgetting.

X399 IOMMU support is completely broken, as well as ASPM. These options must be disabled to even successfully install Windows 10.

The X399 Taichi bios shows me tons of options for hardware/controllers that aren’t even physically present.

These motherboards cannot run RAM at their rated speeds due to the Infinity Fabric design. Look at the compatibility list ASRock published. They cannot achieve 2666MHz without jumping up to RAM rated for 3200MHz, if I recall correctly. That’s because X399 is built as two separate sockets that happen to share a heatspreader and LGA connection, and the interconnect operates at the same speed as the DDR bus; it just can’t keep up with tight timings.

After updating to firmware v2, 4 out of 5 times the BIOS says “no connection” when trying to check for online updates (despite wired connections in both ports plus a 10GbE PCIe card), and a full power cycle is needed to try again.

This is the first chipset I’ve seen that won’t let me run a stick of RAM at a slower speed (it let me run other RAM at slower speeds, so I’m not sure why); it fails to POST if I manually set my RAM to less than the XMP 2 setting.

The bios is built for gamers and overclockers. The options to enable OC are “auto” or “manual” and there is no “stock” or “disabled” option.

This wasn’t a platform built for stability. Motherboard manufacturers are treating ThreadRipper as if it were Ryzen, as if we were all gamers and this wasn’t a workstation chipset.

Intel lets you run Core i7 on its server boards. AMD won’t let me use a motherboard sold for Epyc with my ThreadRipper CPU. Nor should they have to.. except my current options are all crap.

My biggest problem of all is that X399 (not just the Taichi) randomly stops recognizing certain PCI-E cards. My PCI-E 10GbE X520-DA2 NIC worked OK for a couple of weeks, then suddenly wasn’t seen any more. I thought it was the card and bought a new one. Same problem. I thought it was the OS so I installed another. Same problem. I thought it was the BIOS so I reset it. Same problem. I manually toggled each option pertaining to PCI in the BIOS and rebooted to test between each one. Same problem.

Took a look on NewEgg. All the x399 boards have 3 star reviews at best, and are filled with similar PCI-E complaints and “worked for a month then stopped working” across all the manufacturers and models. Which isn’t saying much, since there are only a dozen or so, really.


Arie | karma 81 | avg karma 2.53 · | 2018-11-27 12:19:48+00:00

I recently encountered a bug in an Asrock UEFI (CPU power limits ignored in the X299 ITX motherboard) and contacted Asrock about it. Got a fixed version emailed to me within a couple of days. Perhaps try that instead of the forums.

treprinum | karma 1055 | avg karma 1.76 · | 2023-08-19 09:31:52

My x299 board refused to boot from a 2TB WD SSD drive with a single memory chip (forgot which color it was)... No issues with Crucial, Samsung nor Intel.

vetinari | karma 8298 | avg karma 1.82 · | 2019-09-27 08:32:26+00:00

There was a breaking issue, but it didn't manifest automatically.

It happened when 2nd-gen TR arrived. It used the same mainboards, so all the manufacturers issued BIOS updates.

Unfortunately, these updates claimed to support SEV (Secure Encrypted Virtualization). Linux of course tried to initialize it at boot/module load time and the entire thing hung, because TR CPUs do not support SEV, only EPYCs do.

So there were the following fixes:

1) downgrade BIOS back to pre-TR2 version,

2) blacklist the ccp module, which would make kvm_amd non-functional,

3) wait for a fix in Linux kernel, which initializes SEV with a timeout.

So it wasn't that tragic an issue if you had first-gen TR.
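
As a rough illustration of option 2, the sketch below checks whether the ccp and kvm_amd modules are currently loaded by parsing /proc/modules, and prints the modprobe.d blacklist directive for ccp. The helper names (loaded_modules, blacklist_lines) are hypothetical, and where the directive should live (typically a file under /etc/modprobe.d/) depends on the distribution. Treat it as a sketch under those assumptions, not the exact fix anyone in this thread applied.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each line of /proc/modules starts with the module name, followed by
    size, reference count, dependants, state, and load address.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def blacklist_lines(modules):
    """Build modprobe.d 'blacklist' directives (hypothetical helper)."""
    return [f"blacklist {name}" for name in sorted(modules)]


if __name__ == "__main__":
    try:
        with open("/proc/modules") as fh:
            present = loaded_modules(fh.read())
    except OSError:
        # Not on Linux, or /proc is unavailable: report nothing as loaded.
        present = set()

    for name in ("ccp", "kvm_amd"):
        print(name, "loaded" if name in present else "not loaded")

    # Directive to place in a modprobe.d file (path is an assumption).
    print("\n".join(blacklist_lines(["ccp"])))
```

The same check would apply to the nouveau workaround mentioned elsewhere in the thread by swapping the module name.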


shock | karma 1711 | avg karma 2.35 · | 2020-10-18 20:17:42+00:00

Thank you for the link, I was not aware of that problem. I suspected something UEFI related.

franzb | karma 2180 | avg karma 10.43 · | 2020-02-21 14:30:37+00:00

OP here. I wish I had! Unfortunately that's my work machine and I don't have the time to somehow get my hands on a new fairly rare, very expensive motherboard, disassemble my workstation and rebuild it just to see if my motherboard is the culprit. From the experience of other owners of the 3970X, this problem happens with other TRX40 motherboards.

tytso | karma 4696 | avg karma 8.54 · | 2015-08-31 13:58:15+00:00

Well, the question is kind of moot, because I haven't dared to re-enable UEFI boot since. Supposedly newer BIOSes have the bug fixed, but the value to me of risking another round of motherboard replacements is just not worth it. I suppose if I cared about UEFI it might be a good idea to try it before the maintenance contract runs out, but as a kernel developer, I'm constantly replacing the kernel, so using UEFI is a PITA anyway.

tarruda | karma 2401 | avg karma 4.66 · | 2017-08-05 10:15:32+00:00

I had a similar problem with my Ryzen 1700/ASRock X370 Taichi: every time I left the computer idle, it was frozen when I came back. I didn't try pinging to see if it responded.

What solved the problem for me completely was blacklisting the nouveau module for NVIDIA (about two weeks without a single freeze). In my case it was an option because I use an AMD GPU for Linux and an NVIDIA one for passthrough, so I have no need for the nouveau module to be loaded. BTW this is where I got the hint: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085

What I got from this is that it is not a nouveau driver bug, but rather a hardware bug that has a greater chance of being triggered if the module is loaded. The same thread suggests that disabling ASLR is also a valid workaround.


Grimm665 | karma 323 | avg karma 2.36 · | 2019-08-27 12:47:52+00:00

My Strix X99 just kicked the bucket a week ago...is this possibly why? Shit, I've had it on XMP since I built it 3 years ago.

I've already gone through one CPU, about 4 months after I built it. Now the whole machine is just dead with no signs of life. I bought a new board assuming the board was bad but maybe I need to test the CPU now...

Agreed though, one of the worst boards I've ever bought and seriously rethinking all Asus purchases in the future.


seritools | karma 526 | avg karma 5.26 · | 2022-10-17 09:38:25

First-gen products often have issues like this. My old X370 board (zen1 era) turns off and on again twice before actually booting; a similar zen3 system doesn't need that and boots faster in general.

merb | karma 3754 | avg karma 1.43 · | 2018-04-19 21:45:13

well my mainboard is shaky with linux support also... somehow the chipset does not love it :( (b350m)

p_l | karma 8319 | avg karma 1.99 · | 2020-02-06 13:43:28+00:00

That's more a motherboard problem, as TR itself supports registered memory and the same goes for its firmware.

washadjeffmad | karma 3162 | avg karma 2.25 · | 2020-05-05 17:46:02+00:00

I've repurposed a R5 1600AE in a FreeBSD storage server, and I was frustrated by near-daily lockups after running it stably for the past few years on Linux.

After a BIOS update, an option for power supply idle control appeared, and selecting the "typical" over "low" option resolved the hangups. No C-state or C'n'Q changes required.

I did have to RMA the CPU initially when it wouldn't post with DIMMs in 2/4 slots, but later revisions seem to have quickly overcome the memory controller issues.


ensignavenger | karma 4082 | avg karma 2.16 · | 2018-01-04 14:35:36

I had to update my Taichi to support my RAM at full speed, but other than that, it has generally worked well. My system sometimes does not wake back up after going into power save, but I am not sure whether that is the MB or something else.

mauricemir | karma 237 | avg karma 1.18 · | 2015-04-30 10:55:58+00:00

This isn't a problem for most boards; some of the clones have problems with FTDI drivers and Windows 8.

ValdikSS | karma 341 | avg karma 4.55 · | 2023-06-15 06:33:25

Uuh…

I have an old VIA-based 32-bit x86 machine (VIA Eden Esther 1 GHz from 2006), and it hangs at different times, but I managed to create a reproducer which hangs the system not long after boot. About 1 in 20 boots is unsuccessful.

I noticed that verbose booting reduces the chance of hanging compared to quiet boot, but does not eliminate it completely.

A similar issue was present even on Dell servers back in 2008-2009, which are based on more recent x86_64 VIA CPUs. Here's an attempt to bisect the issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=507845#84 The CPU seems to enter an endless loop, as the machine becomes quite hot, as if it's running at full speed.

All these years I believed this was a hardware implementation issue, related either to context switching or to the SSE/SSE2 blocks, as running a pentium-mmx-compiled OS seems to work fine, and no other x86 system hangs the way VIA does.

However, after this post and all the LKML discussions, the ticks/jiffies/HZ mentions, and how it is less of an issue on Intel, I'm not so sure: the issue mentioned is related to time and printk, I also associate my problem with how chatty the kernel log is (at least partially), and the person in the Debian bug tracker above also bisected to code related to printf, although in libc. It could be another software bug in the kernel. If that's the case, it has been present since at least the 2.6 days.

I would appreciate any suggestions to try, any workarounds to apply, any advice on debugging. If anyone has spare time and interest, I can set up a dedicated machine over SSH for testing. I have a bunch of VIA hardware which is being reused for my new non-commercial project, and I struggle to run these machines 100% stable.
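
One practical point when testing workarounds against an intermittent hang like the roughly 1-in-20 boot failure described above: a handful of clean boots proves very little. The sketch below estimates how many consecutive clean boots are needed before a candidate fix can be trusted. It assumes each boot fails independently with a constant probability, which may not hold on real hardware, and the function name clean_boots_needed is made up for illustration.

```python
import math


def clean_boots_needed(failure_rate, confidence=0.95):
    """Smallest n such that n clean boots in a row would occur with
    probability below (1 - confidence) if the bug were still present.

    Assumes independent boots with a constant per-boot failure rate.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    # About 1 in 20 boots hangs, as reported for the VIA machine.
    print(clean_boots_needed(1 / 20))
```

Under these assumptions, a 1-in-20 failure rate calls for 59 clean boots at 95% confidence, which is why an automated reboot loop tends to be more useful than manual testing for this kind of bug.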


davemp | karma 2089 | avg karma 3.43 · | 2018-09-24 16:10:51+00:00

I've encountered issues that must have been BIOS bugs, such as hanging before the BIOS screen with a corrupted Linux drive plugged in but working just fine with it disconnected. Trust nothing.