Sunday, April 10, 2005

When Software Kills



The Therac-25, a computer-based radiation therapy machine, massively overdosed patients at least six times between June 1985 and January 1987. Each overdose exposed a patient to several times the normal therapeutic dose and resulted in the patient's severe injury and, in some cases, death. The overdoses occurred primarily because of errors in the data validation routines contained within the Therac-25 software.

For example, a normal therapeutic dose of radiation might be around 200 rad. Physicists believe that the Therac-25 exposed patients to 15,000 rad or more.

How could such a thing happen?



Poor design and implementation of a multi-tasking application was the primary culprit. If the operator of the Therac-25 edited the treatment data at just the wrong moment, variables shared between the keyboard-handling routine and other tasks could become corrupted. These other tasks included verification that the machine's settings were correct.

The upper collimator, on the other hand, is set to the position dictated by the low-order byte of MEOS by another concurrently running task (Hand) and can therefore be inconsistent with the parameters set in accordance with the information in the high-order byte of MEOS. The software appears to include no checks to detect such an incompatibility.


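Basically, aside from the poor design and implementation, there were no paranoia checks: nothing in the software verified that the machine's physical state matched the settings it was about to act on.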
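To make "paranoia check" concrete, here is a minimal sketch of the kind of last-moment consistency test the excerpt says was missing. Everything in it is hypothetical: the names MachineState, REQUIRED_POSITION, and check_before_beam_on are mine, and the two modes are simplified stand-ins rather than the Therac-25's actual data structures.

```python
from dataclasses import dataclass


class InconsistentSetup(Exception):
    """Raised when the requested mode and the hardware position disagree."""


@dataclass
class MachineState:
    requested_mode: str       # what the operator's latest entry asks for
    collimator_position: str  # where the hardware was last reported to be


# Hypothetical table: the position each mode requires.
REQUIRED_POSITION = {"xray": "xray", "electron": "electron"}


def check_before_beam_on(state: MachineState) -> None:
    """Refuse to proceed unless the two independently set values agree."""
    required = REQUIRED_POSITION.get(state.requested_mode)
    if required is None:
        raise InconsistentSetup(f"unknown mode: {state.requested_mode!r}")
    if state.collimator_position != required:
        raise InconsistentSetup(
            f"mode {state.requested_mode!r} needs position {required!r}, "
            f"but hardware reports {state.collimator_position!r}"
        )


# Usage: a stale hardware position is caught instead of silently accepted.
try:
    check_before_beam_on(MachineState("xray", "electron"))
except InconsistentSetup as err:
    print("setup rejected:", err)
```

The point of the design is that the check compares two values written by different tasks at the last possible moment, so it does not matter which task got out of step.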

During machine setup, Set-Up Test will be executed several hundred times since it reschedules itself waiting for other events to occur. In the code, the Class3 variable is incremented by one in each pass through Set-Up Test. Since the Class3 variable is 1 byte, it can only contain a maximum value of 255 decimal. Thus, on every 256th pass through the Set-Up Test code, the variable overflows and has a zero value. That means that on every 256th pass through Set-Up Test, the upper collimator will not be checked and an upper collimator fault will not be detected.

The overexposure occurred when the operator hit the "set" button at the precise moment that Class3 rolled over to zero. Thus Chkcol was not executed, and F$mal was not set to indicate the upper collimator was still in field-light position. The software turned on the full 25 MeV without the target in place and without scanning.
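The rollover described above is easy to reproduce. The sketch below is my own illustration, not the Therac-25 code: it emulates a 1-byte counter in Python by masking with 0xFF, and the function names are made up. The second function shows one possible remedy, which is to assign a fixed nonzero value instead of incrementing.

```python
def skipped_checks_with_counter(num_passes: int) -> list[int]:
    """Return the passes on which a 1-byte counter wraps to zero.

    A zero value is treated as "no check needed", as in the description
    of the Class3 variable above.
    """
    class3 = 0
    skipped = []
    for pass_number in range(1, num_passes + 1):
        class3 = (class3 + 1) & 0xFF  # emulate 8-bit wraparound
        if class3 == 0:               # the check would be skipped here
            skipped.append(pass_number)
    return skipped


def skipped_checks_with_flag(num_passes: int) -> list[int]:
    """Same loop, but the variable is set to a fixed nonzero value."""
    skipped = []
    for pass_number in range(1, num_passes + 1):
        class3 = 1                    # never overflows, never reads as zero
        if class3 == 0:
            skipped.append(pass_number)
    return skipped


print(skipped_checks_with_counter(1000))  # [256, 512, 768]
print(skipped_checks_with_flag(1000))     # []
```

A variable used as a yes/no flag should be assigned, not incremented. The counter version works 255 times out of 256, which is exactly why ordinary testing is unlikely to catch it.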


Subsequent studies of the software and the processes around the events in question led to recommendations for basic "best practices". Most were obvious: documentation, processes, and standards should have been established - and never were. Even formal testing and rigorous stress tests never took place.

But one recommendation, in particular, is near and dear to my heart:

* Ways to get information about errors -- for example, software audit trails -- should be designed into the software from the beginning.
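As a rough illustration of what "designed in from the beginning" might look like, here is a minimal audit trail sketch. The AuditTrail class, its record method, and the event names are all hypothetical, and a real system would write to durable, tamper-resistant storage rather than an in-memory buffer.

```python
import io
import json
import time
from typing import Any, TextIO


class AuditTrail:
    """Append-only log: one JSON object per line, numbered in order."""

    def __init__(self, sink: TextIO) -> None:
        self._sink = sink
        self._seq = 0

    def record(self, event: str, **details: Any) -> dict:
        """Write one entry and return it, so callers can inspect it."""
        self._seq += 1
        entry = {
            "seq": self._seq,      # gaps in seq reveal missing entries
            "time": time.time(),   # wall-clock timestamp
            "event": event,
            "details": details,
        }
        self._sink.write(json.dumps(entry, sort_keys=True) + "\n")
        self._sink.flush()         # do not hold entries in a buffer
        return entry


# Usage: log operator actions and errors as they happen.
buffer = io.StringIO()
trail = AuditTrail(buffer)
trail.record("operator_edit", field="mode", old="xray", new="electron")
trail.record("error", code="hypothetical-54", action="treatment paused")
print(buffer.getvalue())
```

Sequence numbers and immediate flushing are the two choices that matter here: after a failure, you want to know what happened, in what order, and whether anything is missing.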


One of my personal heroes -- Dan Bricklin, the co-inventor of the spreadsheet -- made a similar point a while back. And I blogged about it last year. It's a point worth considering - again.

Because if you write software for a living, you have a responsibility to be dead serious about your code's quality. You never know when someone will borrow, reuse or transplant your code into another package, device, or system. And your code could end up in another system like a Therac-25, where lives hang in the balance.
 
