It’s the spec bugs that kill you (blog.foretellix.com)
67 points by yoav_hollander | 2015-08-11 12:46:46+00:00 | 39 comments




> Somebody simply did not consider one specific implication of some top-level level requirement (“Don’t harm the ground operator”) on some lower-level specs.

While I agree with the general point about specifications, we are talking about a weapon of war here. Interrupting a firing procedure for maintenance and then not checking the co-ordinates is exactly the kind of careless mistake I'd expect to get someone killed. At least it defaulted to the location of the receiver rather than to a semi-random target, which prevented even worse errors.


This is certainly an extreme example, but the number of bugs I've seen closed as "Works As Intended", where it may have been what the spec intended but certainly wasn't what the user expected, is...well, it's way larger than it should be.

I've seen both engineers and product managers use this as a crutch. And, frankly, as a PM I can say the blame generally lies with the product manager. This is our job: to understand how this technology will be put into practice by users and to make sure we are properly distilling and prioritizing the needs of those users into requirements for the engineers.


> This is certainly an extreme example, but the number of bugs I've seen closed as "Works As Intended", where it may have been what the spec intended but certainly wasn't what the user expected, is...well, it's way larger than it should be.

Sounds like just about every law or regulation written since the dawn of civilization.

Not sure if it's a solvable problem. Heck, even nature has crap like this popping up (a human embryo grows a functional tail at one point, and sometimes it stays around).


What's disturbing is the mindset in the friendly-fire disaster that the OP links to:

> Nonetheless, the official said the incident shows that the Air Force and the Army have a serious training problem that needs to be corrected. "We need to know how our equipment works; when the battery is changed, it defaults to his own location," the official said. "We've got to make sure our people understand this."

I hope that's not the mentality of the military today, that special forces operators in the field need to remember UI/UX details of the software they use...rather than the software developer adding in relatively simple safeguards, such as confirmation boxes.

It's also a lesson in how software development and engineering can go awry without close feedback from the audience. There are on-the-field realities -- such as the battery going low -- where, even if the developers are aware of them, they can't easily predict how the intended user will react in those scenarios...with tragic results, in this case.


While I agree with everything you've said, this could very easily have been the opposite problem had they implemented something like that. A squad is killed by enemy fire and we recover the GPS unit with the "Are you sure? Y/N" box still flashing, and a newspaper writes an article about how our military needs push-button solutions and not dialog boxes.

Yes, I was thinking that too...there really is no "perfect" solution, just a thoughtful consideration of the tradeoffs...my inclination is that this particular "feature" hadn't been discussed enough. To be fair, this is a product that's probably hard to user test, due to the limited number of users and the black box nature of most proprietary military software.

Another thing I wonder about: this happened in 2001, an era in which, IIRC, portable devices were few and sleep/power-down modes were not as well-executed as they are today. Yes, today, if I were to power off my mobile device, then replace its battery, and power it back on...I would expect things to be more or less how I left them (which is a little unrealistic, depending on the application)...but back then, I could see how a developer would think: "Well, of course everyone knows system state gets reset when the device is powered down for any reason."


It's a bit of both... yes, these bugs ought to be fixed in software whenever possible. However, the military can't wait until everything's perfect to deploy, and the users of the technology need to be trained on all the sharp pointy bits in the code. Some of these will be bugs, and some of these will just be that this is military code and it's built to have sharp pointy bits, some of which inevitably can be aimed at oneself. As someone else points out, firing on one's own position is a valid use case in the military.

If you can't wrap your brain around that statement... go thank a soldier.

Again let me emphasize I'm not saying the software is OK as is. It should be fixed, somehow. But at the same time, the military can't afford to just, say, stop using the unit entirely until then, and wait until all equivalently serious bugs are fixed. Not having the software 'cause it's in the shop can be fatal too.


Right. Cutting-edge military systems are, indeed, risky business, and one has to balance (as you say) more testing against not having those systems when you need them.

This puts a high premium on the efficiency of bug-finding, especially spec bugs of new systems. My intuition is that this could improve by a lot.


I'm not sure this tragic example is actually a good example. If I replace the battery in my alarm clock, I check whether the alarm is still set, because I know I can't rely on it. If you set coordinates for something vastly more important, like an airstrike, shouldn't it be par for the course that you double-check the coordinates if you replace the battery between setting the coordinates and pressing 'fire'?

Think deeper: what is the use case for a GPS unit setting airstrike coordinates to its own location?

A guy here recently got quite an important medal for doing exactly that.

From Brian Thacker's Medal of Honor citation [1]: "Then, in an act of supreme courage, he called for friendly artillery fire on his own position to allow his comrades more time to withdraw safely from the area and, at the same time, inflict even greater casualties on the enemy forces."

John R Fox's citation [2]: "As the Germans continued to press the attack towards the area that Lieutenant Fox occupied, he adjusted the artillery fire closer to his position. Finally he was warned that the next adjustment would bring the deadly artillery right on top of his position. After acknowledging the danger, Lieutenant Fox insisted that the last adjustment be fired as this was the only way to defeat the attacking soldiers. Later, when a counterattack retook the position from the Germans, Lieutenant Fox's body was found with the bodies of approximately 100 German soldiers."

Pretty sure you can find some more if you look.

[1] - http://www.cmohs.org/recipient-detail/3431/thacker-brian-mil...

[2] - http://www.cmohs.org/recipient-detail/2744/fox-john-r.php


Ok, what is the use case for it doing that by default?

How does one enter coordinates on that device? I'd guess that entering an offset from your location ("200m north of my position") might be the easiest way. If that's the case, then starting with a zero offset would be the logical choice.
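
Roughly, that kind of offset entry might look like the sketch below (Python, flat-earth approximation; the function and the sample coordinates are hypothetical, just to illustrate why a zero offset maps straight back to your own position):

    import math

    EARTH_RADIUS_M = 6371000  # mean Earth radius

    def offset_to_target(own_lat, own_lon, north_m, east_m):
        # Apply a small north/east offset (in metres) to our own GPS fix.
        # Good enough for the few hundred metres a forward observer would enter.
        # Note: a zero offset returns our own coordinates unchanged.
        dlat = math.degrees(north_m / EARTH_RADIUS_M)
        dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(own_lat))))
        return own_lat + dlat, own_lon + dlon

    # "200m north of my position"
    print(offset_to_target(31.6, 65.7, north_m=200, east_m=0))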

You'd hope they would add some sort of safety to prevent this sort of thing, though.


I would say that no default is the right choice.

It could actually be very simple. At power on the coordinates are zeroed, but "fire" is not allowed until coordinates are adjusted again. This prevents the original situation, but allows a tactical decision like "fire on own position" to be executed with at least as much speed as any other targeting decision. Defaulting the target at power-on implicitly optimizes the UX for firing at the default position, and there is no reason for any target to be favored in this case.
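
A minimal sketch of that behaviour (hypothetical Python; the names are illustrative, not from any real system): the target is cleared at power-on, and fire() refuses to do anything until a target has been explicitly entered again.

    class TargetingUnit:
        # Toy model: no default target after power-on, and fire is interlocked on it.

        def __init__(self):
            self.target = None  # battery change / power-on: target cleared, not defaulted

        def set_target(self, lat, lon):
            # Firing on your own position stays possible, but only deliberately.
            self.target = (lat, lon)

        def fire(self):
            if self.target is None:
                raise RuntimeError("no target set since power-on; re-enter coordinates")
            return self.target  # hand coordinates to the fire-control link

    unit = TargetingUnit()
    try:
        unit.fire()
    except RuntimeError as e:
        print("refused:", e)  # fails safe instead of silently firing on our own position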

"logical" choices that are divorced from situational reality, are bad choices.

Sheesh, guy asks for a use case, I give him two real-world examples, and get downvoted for it. Nobody likes examples?

(I know, I know, don't complain about the downvoting.)


> I hope that's not the mentality of the military today, that special forces operators in the field need to remember UI/UX details of the software they use...rather than the software developer adding in relatively simple safeguards, such as confirmation boxes.

Those aren't mutually exclusive alternatives. Special forces operators (and soldiers more generally) need to understand the operational characteristics of the equipment they actually have.

This does not mean that suboptimal UI/UX in military equipment shouldn't be addressed; it should.


Can anyone find more information about the friendly fire incident cited in the article? I found a scan of a newspaper printing: https://news.google.com/newspapers?nid=1876&dat=20020324&id=...

I did not find anything in the official Washington Post archive: http://www.washingtonpost.com/wp-adv/archives/front.htm

This search shows other articles by that author on the Washington Post site during that time, but not this specific one: https://www.google.com/search?q=%22vernon+loeb%22+%22kandaha...


The archived Washington Post article is still here: https://web.archive.org/web/20130624110018/http:/www.gpsnavi...

I wonder why that article does not appear in the official Washington Post archive. I am interested in a better source than a third party site.

This is especially confusing because "GPS Navigator Magazine" has the wrong date on the article.

It looks like this happened in December 2001 and the speculation of it being caused by the battery change was released in March 2002. Christian Science Monitor [1] reported on it in December when the cause was unknown. This PDF [2] has some more information on the Washington Post article.

1: http://www.csmonitor.com/2001/1207/p2s1-usmi.html

2: http://wwwhomes.uni-bielefeld.de/cgoeker/SysSafe/WiSe%2011-1...


Dec 6, 2001. Bomb Kills Three U.S. Soldiers; 20 Are Injured 'Friendly Fire'

http://pqasb.pqarchiver.com/washingtonpost/doc/409215373.htm...

Feb 2, 2002. U.S. Soldiers Recount Smart Bomb's Blunder

http://pqasb.pqarchiver.com/washingtonpost/doc/409315969.htm...

Mar 24, 2002. 'Friendly Fire' Deaths Traced to Dead Battery; Taliban Targeted, but U.S. Forces Killed

http://pqasb.pqarchiver.com/washingtonpost/doc/409245838.htm...


To Matthew Wilkes: It is true that in military situations the tradeoffs of safety vs. utility are quite different.

However, I think this is one of those (many) cases where it is an "operator error" which a better design could have prevented.

In other words, I think it was a plain bug. Somewhere in the fire() function there should have been some check for:

     distance(current_gps_coordinates, target_coordinates) < min_safety_distance
but there was not.

Thing is, you actually don't want to put that line of code in there, because sometimes you DO need to fire on yourself. If you build a tool that fails that case, because some idiot hard-coded a rule with no override without understanding every single case where the tool might need to be used, then no amount of training can fix the problem. Sometimes you have to assume the user is smart enough to use your tool. I think this is the right idea, but you absolutely want to let the user shoot themselves in the foot if they want to; you just have to make sure they know they are going to shoot themselves in the foot.

Absolutely. What I meant was that there should have been some "are you sure" confirmation in this case.
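
Something like this minimal sketch, say (hypothetical names; the threshold and the helper functions are assumptions, not the actual device's API): the check does not forbid firing on or near your own position, it just forces an explicit, hard-to-miss acknowledgement.

    import math

    MIN_SAFETY_DISTANCE_M = 600  # assumed "danger close" threshold, purely illustrative

    def distance_m(a, b):
        # Equirectangular approximation between two (lat, lon) pairs, in metres.
        lat1, lon1 = map(math.radians, a)
        lat2, lon2 = map(math.radians, b)
        x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
        y = lat2 - lat1
        return math.hypot(x, y) * 6371000

    def fire(current_gps_coordinates, target_coordinates, confirm, send_fire_mission):
        # confirm() is whatever explicit acknowledgement the UI provides;
        # send_fire_mission() is the existing downstream call. Both are stand-ins.
        if distance_m(current_gps_coordinates, target_coordinates) < MIN_SAFETY_DISTANCE_M:
            msg = "Target is within %dm of YOUR OWN POSITION. Fire anyway?" % MIN_SAFETY_DISTANCE_M
            if not confirm(msg):
                return False  # fail safe: no strike unless the operator explicitly owns the decision
        send_fire_mission(target_coordinates)
        return True

Whether that acknowledgement is a dialog box, a second key-turn, or something out-of-band entirely is the UX argument further down the thread.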

It seems reasonable that this device may not only be used to drop bombs; it could be replenishing munitions or dropping food on a target. Point being, a simple "are you too close?" check does the complexity a disservice.

One thing the article touched on that I think is dead on: unexpected interaction between complex interdependent components is a really common root cause of failure. Something veils the flaw, and then one day the circumstances change and it's revealed. I think monolithic systems which undergo incremental improvement are particularly susceptible to this. Loosely-coupled systems make it much easier to precisely specify and test the desired functionality.

As an example, consider the Therac-25 incident. It was a radiotherapy machine designed to operate in two modes - direct exposure to a low-power electron beam, or firing a high-power beam at a set of targets to produce X-rays. The predecessors used a loosely-coupled system - when the targets were not in place the high-power mode was electrically disconnected, or in other words the exposure system was separate from the control system. This was switched to a tightly coupled system, where the computer also served as the safety interlock. A particular sequence of inputs combined with a race condition could result in the targets (and sensors) being unlocked and removed but the beam firing at high power. Since the predecessors had hardware interlocks, this didn't result in any exposure. But they re-used the software going forward, and this bug surfaced.

https://en.wikipedia.org/wiki/Therac-25
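
Not the actual Therac-25 logic, of course, but a toy Python sketch of the check-then-act hazard when software is the only interlock: the safety check and the actual exposure are separated in time, and a state change in between defeats the check.

    import threading, time

    class Machine:
        def __init__(self):
            self.targets_in_place = True
            self.beam_power = "low"

        def operator_edit(self):
            # Operator re-edits the prescription while the previous command
            # is still being carried out (fast editing triggered the real bug).
            time.sleep(0.1)
            self.targets_in_place = False
            self.beam_power = "high"

        def fire(self):
            # Software-only interlock: check, then act. Nothing re-checks at the
            # moment of exposure, and there is no hardware to stop it either.
            if self.beam_power == "high" and not self.targets_in_place:
                raise RuntimeError("interlock")  # never reached in this interleaving
            time.sleep(0.2)  # setup delay between the check and the exposure
            print("firing:", self.beam_power, "targets in place:", self.targets_in_place)

    m = Machine()
    threading.Thread(target=m.operator_edit).start()
    m.fire()  # prints the unsafe combination: high power, no targets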

No direct relation to this incident, of course. The controller should have a specific warning message and confirmation dialog before firing within a minimum safe distance. Probably wasn't in the specs though.


Whenever I hear about confirmation dialogs and warning messages, I find myself reminded of how many times I have seen or heard about people going "next next next ok" while operating Windows. Throw enough dialogs and such at anyone, and they will develop the instinct to hit OK without reading.

To me, the really-interesting question here is "How can we find such issues in the design before they hit".

In the article, I explain why this is going to be a bigger problem in the near future, and speculate that some simulation-based approach could, perhaps, do the trick. But I am not sure.

Any thoughts about that?


I find it interesting that the OP has had essentially zero feedback on anything other than confirmation dialog boxes - especially since I don't think that was part of the OP's speculation.

Having some form of 'interactive spec' where stakeholders can 'play with' the system to verify intended behaviour is a really interesting idea. Of course it is used in software development all the time to varying levels of fidelity (wireframes, mockups, interactive prototypes and so on). But I think here is the rub... the better the quality of the simulation, the greater the amount of effort, until it becomes approximately equal to the cost of just building the real thing.

Maybe with enough automation and tooling...


I think you are absolutely correct. It is the price/performance of creating these "interactive specs" that will determine whether this idea is any good.

From what I have seen (I looked a bit at what people are doing in UAVs, autonomous vehicles and so on - see e.g. http://blog.foretellix.com/2015/07/03/my-impressions-from-th...), I think a lot can be done to improve both price and performance.

In other words, I think it is possible to invent new tools / methodologies to make simulation (especially high-level simulation) easier, and especially to get a lot more out of it, at all stages of design / verification / maintenance.


I see at least a couple of posters in this thread are suggesting "there should be a confirmation dialog".

Stop.

Confirmation dialogs are dead UX. Users have been trained by internet browsers with pop-ups, and by other poorly designed software with poorly worded dialog boxes, to completely ignore dialog boxes. I've tested this several times. Users don't even consciously register that the dialog box ever appeared. You can watch them, right over their shoulder, and as soon as they close a dialog, they will turn to you and ask "what do I do now?" You will ask, "what did the pop-up box say to do?", and they will respond "what pop-up box?" Walk them through the process again, don't pre-warn them about when the dialog appears, and they will do it again.

This is also why crapware is so easy to install on any system that has wizard-based installers.

Do not use dialog boxes. It's better to hire two testing teams and set them against each other to break the software. Test and test and test again, then disallow bad behavior. The "plugger" example should not have allowed issuing a fire command on its own location unless some explicit, completely out-of-band, one-time-use option enabled it. Actually, better yet, it should not re-initialize the target location on start-up with the device's location. Just clear out the target; when the user tries to fire, they will see the lack of target and probably know "oh, must be because I changed the batteries". There is a reason they are called "fail-safes". You create failure modes that err on the side of safety. It's better if the device fails to fire than fires at the wrong target.

I don't care if you can't afford it. If you cannot afford a large testing infrastructure, then you can't afford to make safety-critical software at all. You cannot solve this problem with software.

These situations come about when you have a management infrastructure that cares more about feature lists than correctness. I can hear them now, "we need to move faster", and "if it works, don't fix it."


If the dialog never appeared before, people will notice it.

If the dialog has OWN POSITION in big red blinking letters, they will understand.

Never firing on your position can cost lives too.


The point is that the vast majority of people reading this thread aren't working on weapons systems. They're working on web app software to create ad space they can sell to leverage their userbase as a source of income. Dialog boxes are terrible UI.

I agree that "are you sure you want to proceed?" dialog boxes on a word processor are absolutely horrible, but not all dialog boxes are so bad - Firefox's unacceptable-certificate page, for example, is a decent solution for its problem.

There are many actions that could be very harmful in some situations but occasionally required (e.g. dealing with websites that get a certificate for `www.mysite.net` instead of `mysite.net`). Much of safety engineering is about ensuring these are not encountered in the day-to-day routine, but such safeguards are never perfect, and they fare much worse in open, dynamic environments (like the military setting in which this case occurred).

Militaries do typically have radio protocols to reduce the risk of artillery targeting unintended locations. I agree that good testers should have caught this bug, but there are hundreds of corner cases and you will always miss some of them. Domestic electric equipment is designed to prevent live wires from being exposed, but RCDs (residual-current devices) are still a thing.


All software has spec bugs. Normally these can be ironed out in extensive testing, if your software has many users. This sort of military system has few users and probably is not used that much. A testing jig should be designed and used for this sort of system. However, it will not iron out all the bugs.
