Why Linux developers do not fix reported issues or ignore bug reports completely (linux-regtracking.leemhuis.info)
5 points by jiripospisil | 2024-05-29 16:49:04 | 82 comments




What the article says is true.

It's also a great list of why most people will never report bugs, and why many of the people who do come away from the experience unhappy about doing it.


You mean many don't want to spend time bisecting and building the kernel with proposed fixes to test if they work?

If the bug really bugs them, they'll probably overcome the reluctance to do the above. Not sure if it's many, but it's usually enough in case it's not some obscure component of the kernel but something more widely used like amdgpu for example.

The more obscure the use case though, the worse it gets.
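
For readers unfamiliar with the term, bisecting is a binary search over the commit history between a known-good and a known-bad version. A minimal Python sketch of the idea follows. The `history` list and the `fake_test` callback are hypothetical stand-ins for real kernel builds and tests, and this only illustrates the search; it is not how `git bisect` is implemented.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and once a commit is bad all later ones are too.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid      # regression is at mid or earlier
        else:
            good = mid     # regression is after mid
    return commits[bad]


# Hypothetical history: the regression was introduced in "c6".
history = [f"c{i}" for i in range(10)]
tested = []


def fake_test(commit: str) -> bool:
    tested.append(commit)          # each call stands for one build-and-boot cycle
    return int(commit[1:]) >= 6


culprit = first_bad_commit(history, fake_test)
```

Each call to the test callback corresponds to one full build-and-test cycle. The number of cycles grows only logarithmically with the number of commits, but for a kernel each cycle can still take a long time, which is the cost being discussed here.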


One of the phenomena the industry identified back in the Windows days was subconscious bug avoidance.

Some tech savvy people have a sense of what interactions with the OS are fraught and they steer themselves around them often without even realizing they are doing so. So you have one class of users who believe the system is much more stable than it really is, and an underclass who seem to bounce off every step as they tumble down the proverbial stairwell.

So the bugs that cannot be avoided get fixed and the ones that are avoided by the “elites” may escape their attention if not in fact their knowledge.


I feel this. I recently ran into a problem with a kernel update breaking Bluetooth on resume from suspend. My Logitech keyboard and mouse that I use at my desk stopped working after each resume, and I had to go to the laptop keyboard and manually stop and start Bluetooth. On each resume.

Look, I'd love to contribute to the kernel, but the amount of time I have to play around with these things is now measured in minutes per day. So no, sorry, I don't have time to rebuild my kernel from mainline, and then bisect, and then cherry pick patches, and then test out proposed updates, and then help shepherd it through my distribution's patching mechanism.

So I just gave up on Bluetooth on Linux and plugged in Logitech's proprietary Bolt dongle, and re-paired my devices with it, and haven't had trouble since. I'd prefer to use standard protocols, but this isn't the first time a kernel update broke my Bluetooth setup, so I think I'll stick with what just works.



Not really applicable to Linux kernel bugs, but I have a story of it being much harder than one would expect to convince my ISP (a generally cool company, as opposed to some behemoth telecom) that their payment system would allow me to pay them essentially nothing, but their system recorded it as a payment in full.

I pay my ISP using a method where I use an app on my phone to scan a QR code and I can then pay with my credit card (and get associated rewards).

For reasons of maximising my rewards, I decided one month to try to split the payment and pay a small part with one credit card, and the remainder with a different one.

I opened the app, scanned the QR code, changed the default payment amount to about 10%. A moment later their system says “Thank you, paid in full!”.

I send customer service an email explaining that there is a flaw in their system which has allowed me to underpay.

Them: “Don’t worry, we can see you’ve paid us.”

I reply: “I really haven’t, you need to escalate this to someone technical over there to investigate this further.”

Them: “I can confirm you’ve paid.”

Fortunately for my ISP, I’m a good person and they have a presence on a local forum. I PM’d one of the reps on there who, based on their posts, was possibly one of the founders.

And he responds saying something to the effect of “thank you very much, I have no idea why customer service didn’t escalate this” and goes on to explain that the payment app evidently did not respect a flag in the QR code which should prevent me from editing the payment amount and he had fixed their code to actually check the incoming amount.
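
The fix described here amounts to checking the received amount on the server instead of trusting the client app. A minimal sketch in Python, assuming a hypothetical invoice record and payment handler (none of these names come from the ISP's actual system):

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Invoice:
    # Hypothetical invoice record; field names are illustrative only.
    invoice_id: str
    amount_due: Decimal
    amount_paid: Decimal = Decimal("0")

    @property
    def paid_in_full(self) -> bool:
        return self.amount_paid >= self.amount_due


def record_payment(invoice: Invoice, received: Decimal) -> bool:
    """Apply a payment and report whether the invoice is now settled.

    The amount used is what was actually received, not what the QR code
    originally asked for.
    """
    if received <= 0:
        raise ValueError("payment amount must be positive")
    invoice.amount_paid += received
    return invoice.paid_in_full


invoice = Invoice("2024-05", Decimal("50.00"))
partial = record_payment(invoice, Decimal("5.00"))    # about 10% of the bill
settled = record_payment(invoice, Decimal("45.00"))   # the remainder
```

With a check like this, the 10% payment from the story would leave the invoice open rather than marking it as paid in full.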

The same rep, a few years later, was able to resolve an unstable Spotify streaming issue I reported by getting Akamai to fix their DNS resolution to use the PoP on my ISP's network as opposed to the one on a different ISP's network, which was throttling their peering link.


I once worked at a mom-and-pop ISP back in the late 1990s. The previous sysadmin overloaded the UNIX password field in /etc/shadow to be used for account suspensions: when somebody wasn't paying their internet bill, the custom "Web UI" he built would allow the business owner to lock out the account by changing the shadow password field to "*", thus denying the user dialup internet access.

At least one customer figured out that he could call the girl at reception, tell her he'd forgotten his password, and she'd reset the password for him, which restored his access without paying his bill for another month. He started to do this on a monthly basis.

The punchlines:

1) the front desk girl was the girlfriend of the business owner; you'd think they'd communicate about a customer who was getting away with this, and how they were doing it

2) the sysadmin who built this system went on to become a well known project leader for a major open source encryption project that is in wide use today


FWIW using * or ! in shadow to indicate lockout was a common thing; it's even mentioned in the man page for shadow.

There's a special meaning to x in the crypt field for passwd, too.

See pwconv.
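
As a rough illustration of the convention mentioned here, the sketch below checks whether the password field of a shadow-style line starts with `!` or `*`. The sample entries are made up, and real tools apply more rules than this (account expiry, for example), so treat it as a simplification.

```python
def password_login_disabled(shadow_line: str) -> bool:
    """Return True if the entry looks locked.

    Simplified assumption: a password field beginning with '!' or '*' is
    treated as locked. Fields in a shadow entry are separated by ':' and
    the password is the second one.
    """
    fields = shadow_line.rstrip("\n").split(":")
    if len(fields) < 2:
        raise ValueError("not a shadow-style entry")
    password_field = fields[1]
    return password_field.startswith(("!", "*"))


# Made-up entries for illustration.
locked = password_login_disabled("alice:*:19000:0:99999:7:::")
locked_with_hash = password_login_disabled("bob:!$6$salt$hash:19000:0:99999:7:::")
active = password_login_disabled("carol:$6$salt$hash:19000:0:99999:7:::")
```

Because a password reset overwrites this same field, a lock stored there is lost on reset, which is how the suspension in the story above kept getting undone.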


I had some experiences with emacs that were like that. After 2 or 3 of those, I stopped bothering.

Are there any good guides on compiling the Linux kernel from source and testing it on a VM? Many resources seem to assume that you want to install it next to your existing kernel, but I'd prefer not to risk my filesystem or hardware if something goes wrong.

you could mount and chroot the vm image, and install into that.

The easiest thing is just to follow said guide in a VM. Setting up a full system VM is the simpler path since it's such a common task that all the VM tooling is already built around the idea you'll want to do that. If you absolutely must compile outside the VM for whatever reason you can just transfer the resulting files over.

It's a bit of a dead end if your kernel issue has to do with your hardware of course. I once had to patch a bug about specific NVMe drives I was working with so there was no choice but to do that with the actual hardware.


You can direct boot a local kernel with qemu:

  qemu-kvm -kernel /path/to/your/bzImage ...
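
Direct kernel boot usually needs a few more options than `-kernel` alone. The Python helper below only assembles an argument list and does not run anything. The binary name `qemu-system-x86_64`, the memory size, and the serial-console settings are assumptions for an x86-64 guest, not something taken from the command above.

```python
from typing import List, Optional


def qemu_direct_boot_argv(
    kernel: str,
    initrd: Optional[str] = None,
    cmdline: str = "console=ttyS0",
    memory_mb: int = 1024,
    binary: str = "qemu-system-x86_64",  # assumption: x86-64 guest
) -> List[str]:
    """Build (but do not run) a QEMU command line that boots a kernel image directly."""
    argv = [
        binary,
        "-enable-kvm",            # use KVM acceleration; drop if unavailable
        "-m", str(memory_mb),
        "-nographic",             # no graphical window; serial console on the terminal
        "-kernel", kernel,
        "-append", cmdline,       # kernel command line
    ]
    if initrd is not None:
        argv += ["-initrd", initrd]
    return argv


argv = qemu_direct_boot_argv("/path/to/your/bzImage")
```

The resulting list can be handed to `subprocess.run(argv)`. Without an initrd or a root disk the kernel will typically stop when it cannot mount a root filesystem, so a real setup would add one of those.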

Don't forget the bajillion arch specific command line flags for qemu!

We have wrapper scripts around qemu that we use heavily for CI and development.

https://github.com/ClangBuiltLinux/boot-utils

https://docs.kernel.org/kbuild/llvm.html


The WireGuard test suite comes with a pretty good userland for testing kernel facilities (obviously WireGuard itself). It's been integrated into the kernel tree, I can't remember where, but it's great to use for experimentation.

Not a "Linux" issue, but I once had to do a presentation to all developers / testers for a major product.

It's quite important to set clear standards for what is, and isn't, in a bug report. Otherwise, far too much time is spent chasing down information that the reporter (unintentionally) neglected to provide.

It's also important to realize that whoever is fixing a bug isn't responsible for onboarding the submitter in writing a good bug report. That also wastes time.


Both of these ideas also cut down on the number of bug reports you have to handle.

I'm not sure if you're trying to be snarky or not.

In our case, "the parties that be" realized they had to hold a higher bar at triage. It didn't cut down on the number of "bug reports," but instead cut down on the amount of back-and-forths on confusing bug reports.


There's something I consider the three necessary components of a problem. You need: What you did, What you expected, and What you got.

If you don't have those three, you have a complaint, not a problem. "Don't work" is not a problem, it's a complaint. "When I run the report with THESE PARAMETERS, I get X, but it should be Y" is a problem.
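
One way to make the three components concrete is a small check at submission time. This is only a sketch in Python: the `BugReport` class and its field names are invented for illustration and do not belong to any real bug tracker.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BugReport:
    # The three components named above; field names are illustrative.
    what_i_did: str = ""
    what_i_expected: str = ""
    what_i_got: str = ""

    def missing_parts(self) -> List[str]:
        """List the components that are empty or whitespace only."""
        return [
            name
            for name, value in vars(self).items()
            if not value.strip()
        ]

    def is_actionable(self) -> bool:
        return not self.missing_parts()


complaint = BugReport(what_i_got="Don't work")
problem = BugReport(
    what_i_did="Ran the report with THESE PARAMETERS",
    what_i_expected="Y",
    what_i_got="X",
)
```

Here `complaint.missing_parts()` names the two fields a triager would have to ask for, while `problem.is_actionable()` is true.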

Sometimes, parts are implied, but they should be explicit when there's any chance of variability. Sometimes people would tell me, "this person isn't in the system". And I'd check the database, and they'd be there. So I would say "Yes, they are". Then, the back and forth to narrow down that what they meant was that this person wasn't showing up in a certain part of the application that they expected them to.

Which is a different scenario altogether.

It's also a little related to the "XY problem", as people like suggesting solutions rather than describing problems. Then there's also the case where the problem is the expectation itself rather than anything technical.


If I tell you what I did and tell you what I got, you have the information you need, I think.

‘I selected spell check and the application crashed’ is sufficient in many cases - and is a bug report.


Yes, those 3 exist, and are important. But let's not pretend they are present in every single bug. Insisting on non-existent information is how you alienate bug reporters.

Often enough "the application doesn't work" is a complete description of the problem. Even though many people would prefer the shibboleth of "the application has a fatal crash at startup".


> ‘I selected spell check and the application crashed’ is sufficient in many cases - and is a bug report.

Ooooh, that really depends on context.

If that came from someone who isn't part of the team, who isn't knowledgeable in the information needed to diagnose the bug, ok. If that's actually how to reproduce, ok.

What really happens is that the crash depends on running in an XYZ environment with a document that's exactly 200 letters long and the 3rd word needs to be red. In this case, ‘I selected spell check and the application crashed’ is quite unprofessional. QA is responsible for figuring out what combination of inputs creates the crash, or if that's not possible, documenting that it's really ‘I select spell check, the application crashes 15% of the time, and I don't know why’

Otherwise, someone providing that as a "bug report" isn't a team player, and it's up to management to train that person. (Or fire them if they refuse to do their part.)


Agree. Details about the surrounding environment might not be too important most of the time, but they could be the key to solving nasty bugs.

Example:

> the application crashes 15% of the time, and I don't know why

It's because the user was using a stylus instead of a mouse, and was accidentally dragging the button when trying to click it, triggering the "drag" event instead of the "click" event, and this control wasn't prepared to deal with that.

The remaining 85% of times it doesn't happen is because the click is done properly. And the developers can't reproduce it no matter what they do because they all use a mouse so it's very difficult to accidentally drag the button.

You only found the issue because the user told you it only happened when they had Photoshop open (because it's the only time they use a stylus instead of a mouse), and by chance someone from the design team offered to give it a try because they heard you talking about it during lunch, and they could reproduce it.

> Ooooh, that really depends on context.

Another example. For me, my Steam Deck tends to freeze in two circumstances, requiring me to force-shutdown:

1. When I have a certain website open in Firefox for a few hours. The key information in this case is which tabs I usually have open when this happens.

2. When I'm playing Stellaris, and the Steam application decides to start eating all my RAM in the background. The key information here is that I'm playing on Steam Deck, which is a very small part of the user base. There have been other bugs specific to Steam Deck as well.


In your case, “what you expected” is implied by telling what you did.

And I did say some parts may be implied, but to be explicit where it could be ambiguous.


> Then, the back and forth to narrow down that what they meant was that this person wasn't showing up in a certain part of the application

In my case I said that developers shouldn't spend more than 90 minutes reproducing a problem. If it took longer than that, the submitter wasn't correctly communicating the problem.

Everyone learned how to write a bug report after that.


Oh, no, it's a well-known approach to handling bug reports.

Years ago, I worked with several of IBM's AIX kernel team; they talked about a well-defined, 3-level triage process where the third level was the actual developers. Unfortunately, they were still getting too many bug reports which was impacting productivity. IBM added a fourth level between the first level, call-center tech support, and the second level (I can't remember their term for this, basically people who would try to reproduce the bug).

I got to experience this system a little later, working as a sysadmin for a CS department. I made a clear bug report, including code to reproduce it, showing a denial of service attack against AIX 3.2.5 (and possibly earlier). The first layer of support read me the relevant manual page and pointed out I was using "undocumented behavior". I said, "Yes, but DOS attack." The bug report was closed at the second level as user error according to the documentation. I still feel bad that I didn't forward my code to the BUGTRAQ mailing list.

Tl;dr: The more barriers you put into place, the fewer problems you have to actually handle.


Which is why the problem arose: Far too many people believe anecdotes like yours, without understanding the difference between a barrier and common sense. (A four-step process to triage a bug only serves to protect fiefdoms.)

We ultimately had the leads (managers and most experienced team members) triage in a small group 4x a week. It kept the BS (and barriers) to a minimum, and standards high.

In your case, we generally didn't "close bugs in isolation" like you encountered. That being said, we did have a few "security" bugs raised by people who didn't understand the use case or deliberate tradeoffs. These were closed with a careful explanation of the tradeoffs or misunderstanding.

In your case, I would have re-submitted the bug, and/or reopened it.


Sharing error or crash-based telemetry is actually a very efficient way to produce valid bug reports. You would be able to catch many hard bugs automatically.

But people don't like it for privacy reasons.


There's a middle ground in the form of "report error" windows that pop up when the application crashes.

It depends on how many clicks it will take and whether it needs some additional manual input.

If it is just one click, then most might actually do it. But I would not hold my breath.


> If it is just one click, then most might actually do it. But I would not hold my breath.

*exactly*.

"You must tell us what you were doing at the time of the crash before submitting".

Uhh, I woke the computer up, logged back in and was greeted by this prompt.


I think one of the most important responsibilities of a QA dept is to translate user complaints into an actionable bug report.

The systematic erasure of QA from the “modern” development process is pure arrogance and I hope we grow the fuck up before I retire. I’ll take early retirement over continuing as we are.


The QA team are chilling in the break room with the DBA team.

I don’t like what you said but I agree anyway.

Who knew adding indexes could be so simple but also so neglected?

I wonder if the neglect of basic efficiency practices is largely a function of cheap money and unlimited cloud budgets. Now that the new tech shine is fading from the cloud and companies are grappling with higher interest and operating costs, maybe we'll see more of an effort towards efficiency.

On the other hand, who am I kidding. Engineering and business leaders are far more likely to push the focus towards AI solutions and other shiny trends, regardless of how effective they are.


>The systematic erasure of QA from the “modern” development process...

Is that a thing? Which industry/country are you in?


You’ve heard that Microsoft got rid of their software test teams, right? It’s not as if GP’s experience is an isolated incident in one small niche of the software world.

This is totally a thing. In the late 90s when I worked at Microsoft and then at startups, QA was made up of full time employees whose leadership had input into the product process.

Today at my large tech company, QA is mostly contract employees validating test plans that the normal engineers author. Zero autonomy or ownership offered.


Also, there's a huge difference between teams where QA is seen as a stepping stone on the path to graduating into "real" development, versus teams where QA is viewed as a critically important role that's valuable in itself. There needs to be a role for senior QA engineers, because if all your junior QA folks are looking to "graduate" to development you won't get expert QA folks who really care about QA.

I've seen what great senior QA engineers can do. Proactive approaches to testing; integrating new approaches to testing; influence on architecture and design to make codebases more robust; optimizing testsuites so they can run more often; better capture of long-tail errors from production; design and implementation of scratch infrastructure to test more things before production...


Yeah I completely agree. In the era where people could build a career in this area, they developed skills and brought insight that made the whole product better. Automated testing and SRE/DevOps reliability that I see focused on today do not fill in the gaps.

QA people are very bimodally distributed. Excluding managers, 90% of the best and the worst coworkers I’ve ever had have been QA. If you have drive and focus you can be amazing. If you have neither then you’re an albatross around the team’s neck.

It’s a very passive aggressive way to root out a problem in an organization by just removing it.

To your comment about being a path to coding: if you can code well and test well, you should skip entirely over being a software dev and go into security consulting. Instead of a 40% pay bump you could be looking at an 80% pay bump. What is a red team member but a coder with the suspicious mind of a QA person?


Which industry/country are you in, that it's not your experience?

And are you hiring?

I'm not in Silicon Valley, perhaps that's the difference. Most programmers I know have about 1 QA (of the old school type described) per 5-10 programmers.

There is a bit of a self-reinforcing effect going on: If I spend the 2-3 hours required to write a good bug report, and I see no indication that any human even looked at that work, I'll be less likely to put that effort in. After a few iterations, my bug reports, if I'm still filing them, are probably going to be quite bad, which isn't going to make it more likely that they'll be picked up...

> and I see no indication that any human even looked at that work

You'll get plenty of indication that a human looked at the ticket when you see notifications in the ticketing system that the bug was triaged and then you're the one verifying that the bug was fixed.


Yes, I would consider that indication that a human looked at it. Unfortunately, for many bugs (not kernel related) I have never seen such an indication.

> The tone threw people off.

That a group that contains Linus Torvalds is policing tone is hilariously tone deaf.


Everyone cares about tone. Especially those so chippy they pretend not to.

The positions of two parties are different, they are not symmetric.

A person who is asking someone else to spend time on his problem should be aware of his tone. Unless the bug reporter sends the report to Red Hat or some other company he has a business relationship with, it matters.


Linus never shouted at random people reporting bugs, or anything like that. He occasionally shouted (or still shouts?) at people breaking stuff they really shouldn't have broken, and who really should have known better.

What you think of that is up to you, but the suggestion that Linus is somehow a complete asshole and rude to everyone and everything is just not true, and has never been true.

Besides, "hey, can you help me?" and "I gave you responsibility over this part and you broke stuff for users you numpty" are fundamentally different positions to start with.


The last time I filed a Linux kernel bug report, it was to report that you could plug in a Firewire cable and directly read or write memory, which is a security bug. But only the first 4GB of memory, because the bug that allowed this was from the 32-bit days. There are registers which set the upper and lower limits of the accessible area, and they were set for a range of 0..0xffffffff.
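
The 4GB figure follows directly from the register range quoted above: an inclusive range of 0..0xffffffff covers 2^32 byte addresses. A quick check in Python, where the helper name is hypothetical and not part of any FireWire API:

```python
UPPER_BOUND = 0xFFFFFFFF          # inclusive upper limit from the comment above
GIB = 1024 ** 3

window_bytes = UPPER_BOUND + 1    # number of addressable bytes, 0 through the bound
window_gib = window_bytes / GIB


def in_window(phys_addr: int) -> bool:
    """True if a physical address falls inside the 32-bit window."""
    return 0 <= phys_addr <= UPPER_BOUND


below_limit = in_window(0xFFFFFFFF)       # last address inside the window
above_limit = in_window(0x1_0000_0000)    # first address past 4 GiB
```

Anything a machine keeps in physical memory above that boundary would fall outside the window described in the report.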

Loud complaints from someone who had a debugger which used that exploit. Wasn't fixed.


That sounds insane; can you share details or link to a report?

Windows and OS X had the same/similar problems. On consumer hardware at that time you probably couldn't make firewire secure.

Yep, same problem exists over Thunderbolt connections afaik. By emulating PCIe devices you can basically coerce the host into giving you whatever you want from memory, as long as it's configured to conveniently do so by default: https://thunderspy.io/

my understanding is there are so many vulnerabilities when someone has physical device access that the security community essentially suggests you assume any device is compromised if you lost physical access

That used to be the advice. But it doesn't fly anymore when your OS is running on an easily-stolen device like a phone.

Or a user wants to plug into someone else's charging port. (Don't tell them not to do it; they are going to anyway.)

Phones don't have Firewire, and no one is charging their device through Firewire.

My understanding is that these bugs should not exist in 2024, even less so once they have been published.

Do you lose physical access if someone connects a thunderbolt/usb/whatever to your computer? No, this is not losing physical access.


This has not been true for years for consumer electronics.

How so?

The threat model for a modern smartphone includes people with physical access to the device.

This is begging the question, because the cavalier attitude described in the anecdote above illustrates how that advice arises.

Did this bug have anything to do with firewire providing direct memory access to peripherals, bypassing the cpu?

Yes. There was at one point a software hack for Macs where you could plug a FireWire cable between two of them and write directly from one of them to the video output of the other.

I use this facility on FreeBSD (dcons) to do debugging. Works great. There's a knob to explicitly turn on DMA, though. It looks like Linux has something similar. Is that not sufficient?

That user might be the one from that old XKCD: https://xkcd.com/1172/

And I would like to add another reason:

* Fixing the bug will not add value to the various Fortune 500 Companies that control Linux.


Tell us more

Not sure what more you need; it is no secret that kernel code is written primarily by and for the benefit of corporate entities, and naturally they have priorities.

[I am not saying this as a bad thing; in fact, the GPL licenses are designed around this "fact of life" aiming to maximize the benefit for (all) the users and minimize control of a single powerful entity. I'd say the current state of affairs is amazing, considering the historical alternatives.]


Large companies hire people to work on open source. Some use their clout to influence standards.

As a developer I like that companies hire for open source. As a user the quality increases with full time employees. What we don't notice is many projects morph to fit corporate needs because they become important stakeholders.

Money and power corrupt by design.


It seems like a reasonable complaint, but a subset of

"Your bug is neither a regression nor a severe issue."

Since the paid developers, just like literally everyone else who contributes, do fix the most serious bugs in their areas of expertise/responsibility, but triage some reports as not the most valuable use of their time, or, if applicable, their company's resources.


The them-and-us mindset on display here is faulty. If you’re reporting a Linux kernel bug, then you are now a Linux kernel developer. Doesn’t matter that your name ain’t in MAINTAINERS or whatever, you still just joined the club. Please add your stone to the cauldron.

If you want to get something fixed whine a lot and bother everyone who can't run away from you. Also what a git I was... https://bugzilla.kernel.org/show_bug.cgi?id=101061

I will fix your bug if I find it interesting. I will fix your bug if you pay me. I will fix your bug if I get assigned it at work, or if I know you personally. Otherwise, fix it yourself. No one owes you anything, and if someone does, they can fix it for you.

I like it when you create bug reports. However submitting a bug report to an open source project is not issuing a work order. If you have any urgency to your bug then you should pay to have it fixed.

However this article is not about open source devs being treated as free labor. This article is all about how to file a good bug report. Having a bug report being politely submitted with all the correct details and in the appropriate channel is a skill many users are lacking. We should really be promoting this sort of document.

