Newly stringent FAA tests spur a fundamental software redesign of Boeing’s 737 MAX flight controls
Aug. 1, 2019 at 11:18 am Updated Aug. 1, 2019 at 9:45 pm
By Dominic Gates Seattle Times aerospace reporter
While conducting newly stringent tests on the Boeing 737 MAX flight control system, the Federal Aviation Administration (FAA) in June uncovered a potential flaw that now has spurred Boeing to make a fundamental software-design change.
Boeing is changing the software of the MAX’s automated flight-control system so that it will take input from both flight-control computers at once instead of using only one on each flight. That might seem simple and obvious, but in the architecture that has been in place on the 737 for decades, the automated systems take input from only one computer on a flight, switching to use the other computer on the next flight.
Boeing believes the changes can be accomplished in time to win new regulatory approval for the MAX to fly again by October. Significant slipping of that schedule could lead to a temporary halt in production at its Renton plant where 10,000 workers assemble the 737.
After two deadly crashes of Boeing’s 737 MAX and the ensuing heavy criticism of the FAA for its limited oversight of the jet’s original certification, the agency has been reevaluating and recertifying Boeing’s updated flight-control systems.
It has specifically rejected Boeing’s assumption that the plane’s pilots can be relied upon as the backstop safeguard in scenarios such as the uncommanded movement of the horizontal tail involved in both the Indonesian and Ethiopian crashes. That notion was ruled out by FAA pilots in June when, during testing of the effect of a glitch in the computer hardware, one out of three pilots in a simulation failed to save the aircraft.
The thoroughness of the ongoing review of the MAX flight controls in light of the two crashes is apparent in how a new potential fault with a microprocessor in the flight-control computer was discovered during the June testing. Details of that fault not previously reported were confirmed both by an FAA official and by a person at Boeing familiar with the tests.
In response to finding that new glitch, Boeing developed the plan to fundamentally change the software architecture of the MAX flight-control system and take input simultaneously from the two flight-control computers that are standard on the 737.
“This is a huge deal,” Peter Lemme, a former flight-controls engineer at Boeing and avionics expert, said about the change. Lemme said the proposed software architecture switch to a “fail-safe,” two-channel system, with each of the computers operating from an independent set of sensors, will not only address the new microprocessor issue but will also make the flawed Maneuvering Characteristics Augmentation System (MCAS) that went haywire on the two crash flights more reliable and safe.
“I’m overjoyed to hear Boeing is doing this,” Lemme said. “It’s absolutely the right thing to do.”
According to a third person familiar with the details, Boeing expects to have this new software architecture ready for testing toward the end of September. Meanwhile, it will continue certification activities in parallel so that it can stick to its announced schedule and hope for clearance from the FAA and other regulators in October.
When Boeing announced June 26 that a new potential flaw had been discovered on the MAX — this time in a microprocessor in the jet’s flight-control computer — it even caught Boeing CEO Dennis Muilenburg by surprise.
Speaking at a conference in Aspen, Colorado, that morning, Muilenburg reiterated a prior projection that the MAX could be carrying passengers again by “the end of summer.” Later that day, Boeing announced the problem in a Securities and Exchange Commission filing, and soon after projected that the issue could add a further three months’ delay.
What the FAA was testing when it discovered this new vulnerability was esoteric and remote. According to the person familiar with the details, who asked for anonymity because of the sensitivity of the ongoing investigations, the specific fault that showed up has “never happened in 200 million flight hours on this same flight-control computer in [older model] 737 NGs.”
In a Boeing flight simulator in Seattle, two FAA engineering test pilots, typically ex-military test pilots, and a pilot from the FAA’s Flight Standards Aircraft Evaluation Group (AEG), typically an ex-airline pilot, set up a session to test 33 different scenarios that might be sparked by a rare, random microprocessor fault in the jet’s flight-control computer.
This was standard testing that’s typically done in certifying an airplane, but this time it was deliberately set up to produce specific effects similar to what happened on the Lion Air and Ethiopian flights.
The fault occurs when bits inside the microprocessor are randomly flipped from 0 to 1 or vice versa. This is a known phenomenon that can happen due to cosmic rays striking the circuitry. Electronics inside aircraft are particularly vulnerable to such radiation because they fly at high altitudes and high geographic latitudes where the rays are more intense.
A neutron hitting a cell on a microprocessor can change the cell’s electrical charge, flipping its binary state from 0 to 1 or from 1 to 0. The result is that although the software code is right and the inputs to the computer are correct, the output is corrupted by this one wrong bit.
So for example, a value of 1 on a single bit might indicate that the jet’s wing flaps are up, while a 0 would mean they are down. A value of 1 on a different bit might tell the computer that the MAX’s problematic flight-control system called MCAS is engaged, while a 0 would indicate it is not.
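To make the idea concrete, here is a minimal sketch in Python of how one flipped bit changes what a computer believes. The bit positions and names (`FLAPS_UP`, `MCAS_ENGAGED`) are hypothetical, chosen only for illustration; they do not reflect the actual layout inside the 737's flight-control computer.

```python
# Hypothetical bit positions, chosen only for illustration.
FLAPS_UP = 1 << 0      # bit 0: 1 = flaps up, 0 = flaps down
MCAS_ENGAGED = 1 << 1  # bit 1: 1 = MCAS engaged, 0 = not engaged


def flip_bit(word: int, mask: int) -> int:
    """Return the word with the masked bit inverted, as a stray particle might."""
    return word ^ mask


def describe(word: int) -> dict:
    """Decode the two illustrative flags from a status word."""
    return {
        "flaps_up": bool(word & FLAPS_UP),
        "mcas_engaged": bool(word & MCAS_ENGAGED),
    }


status = FLAPS_UP                           # flaps up, MCAS not engaged
corrupted = flip_bit(status, MCAS_ENGAGED)  # one bit flips in flight

print(describe(status))     # {'flaps_up': True, 'mcas_engaged': False}
print(describe(corrupted))  # {'flaps_up': True, 'mcas_engaged': True}
```

The software and the inputs are unchanged in this sketch; only one stored bit differs, yet the decoded state now reports MCAS as engaged.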
This isn’t as alarming as it may sound. There are standard ways to protect against such bit flips having any dangerous impact on an airplane system, and FAA regulations require that this possibility be accounted for in the design of all critical electronics on board aircraft. The simulator sessions in June were designed to test for any such vulnerability.
During the tests, 33 different scenarios were artificially induced by deliberately flipping five bits on the microprocessor, an error rate determined appropriate by prior analysis. For all five bits, each 1 became a 0 and each 0 became a 1. This is considered a single fault, on the assumption that some cause, whether cosmic rays or something else, might flip all five bits at once.
For these simulations, the five bits flipped were chosen in light of the two deadly crashes to create the worst possible combinations of failures to test if the pilots could cope.
In one scenario, the bits chosen first told the computer that MCAS was engaged when it wasn’t. This had the effect of disabling the cut-off switches inside the pilot-control column, which normally stop any uncommanded movement of the horizontal tail if the pilot pulls in the opposite direction. MCAS cannot work with those cut-off switches active and so the computer, fooled into thinking MCAS was operating, disabled them.
Since MCAS exists only on the MAX, not on earlier 737 models, this potential failure applies only to the MAX.
A second bit was chosen to make the horizontal tail, also known as the stabilizer, swivel upward uncommanded by the pilot, which has the effect of pitching the plane’s nose down. Other bits were flipped to add three more complications.
Even though the MCAS system that pushed the nose down on the two crash flights had not been activated, these changes in essence gave the FAA test pilots in the simulator an emergency situation similar to what transpired on those flights. This was deliberate. The FAA demanded, with knowledge about the crashes, that this scenario be rigorously reexamined in a new System Safety Analysis of the MAX’s flight controls.
“We were deliberately emulating some aspects of MCAS in a theoretical failure mode,” the person familiar with the tests said.
This person emphasized how extremely improbable it is that five single bits on the microprocessor would flip at once and that the random bits would make these specific critical changes to the aircraft’s systems.
“While it’s a theoretical failure mode that has never been known to occur, we cannot prove it can’t happen,” he said. “So we have to account for it in the design.”
He added that early published accounts of the fault suggesting that the microprocessor had been overwhelmed and its data-processing speed slowed, causing the pilot-control column thumb switches that move the stabilizer to respond slowly, were inaccurate.
Lemme said he was happy to learn this because those accounts hadn’t made sense technically. And he said that the description of the fault and the chosen combination of random bit flips represent “a terribly worst-case condition that I cannot imagine happening in reality.”
Dwight Schaeffer, a former senior manager at Boeing Commercial Electronics, the company’s one-time in-house avionics division, agreed. “Five independent bit flips is really an extremely improbable event,” he said.
A crash in the simulator
What happened in the initial simulated run of this fault scenario in June is that the FAA test pilots handled the emergency using the standard procedure for a “runaway stabilizer” and recovered the aircraft. But they felt it took too long and that a less attentive pilot caught by surprise might have had a worse outcome.
FAA guidelines say that if an emergency arises on a plane flying by autopilot, the assumption is that a pilot will begin to respond within 3 seconds. If the plane is being flown manually, the assumption is 1 second.
That may seem a very short response time, but it’s not dissimilar to what a driver would be expected to do if, for example, a car skidded on ice or a tire blew. Still, not every driver and not every pilot is attentive.
“It took too long to recover,” said the FAA official familiar with the tests, who also asked for anonymity because of the sensitivity of ongoing investigations. “An important aspect of these simulations is to capture how a representative airline pilot would respond to the situation.”
So again in light of what happened in the crashes, the FAA pilots took a further step. They flew the same fault scenario again, this time deliberately allowing the fault to run for some time before responding. This time, one of the three pilots didn’t manage to recover and lost the aircraft.
Reclassified as “catastrophic”
In testimony Wednesday before a U.S. Senate Appropriations Subcommittee hearing on FAA oversight, Ali Bahrami, associate FAA administrator for aviation safety, confirmed this.
Describing what was tested in June as “a particular failure that was extremely remote,” Bahrami said “several of our pilots were able to recover. But there was one or so that could not recover successfully.”
According to a second FAA source, it was the AEG pilot, representing a typical U.S. airline captain, who failed to recover the jet.
That outcome changed everything for Boeing.
Prior to that, Boeing had classified this failure mode as a “major fault,” a category that can be mitigated by flight-crew action. The one pilot’s failure to recover immediately changed the classification to “catastrophic,” and FAA regulations require that no single fault can be permitted to lead to a catastrophic outcome. That meant Boeing must fix it and eliminate the possibility.
“There are active means of protecting against bit flips,” said retired Boeing electronics manager Schaeffer. “We always built it into our own software.”
One standard way to fix such a problem is to have the second independent microprocessor inside the same flight-control computer check the output of the first. If the second processor output disagrees with that of the first processor for some specific automated flight control, then no automated action is initiated and the pilot must fly manually.
“Now it takes two processors to fail to get the bad result,” the person familiar with the tests said. “You are no longer in the realm of a single point failure.”
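The cross-check described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Boeing's implementation: the function name `cross_checked_command`, the use of a single numeric command and the tolerance parameter are all hypothetical.

```python
from typing import Optional


def cross_checked_command(
    primary: float, monitor: float, tolerance: float = 0.0
) -> Optional[float]:
    """Pass the primary processor's command through only if the monitor agrees.

    Returns None when the two outputs disagree, meaning no automated
    action is initiated and the pilot flies manually.
    """
    if abs(primary - monitor) > tolerance:
        return None
    return primary


# Both processors compute the same stabilizer command: it is allowed through.
print(cross_checked_command(-0.5, -0.5))  # -0.5

# A bit flip corrupts one processor's output: the command is suppressed.
print(cross_checked_command(-0.5, 0.0))   # None
```

With this arrangement a bad command reaches the flight controls only if both processors produce the same wrong answer.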
A radical redesign
Boeing could have just rewritten the software governing what functions are monitored within the flight-control computer to eliminate this failure scenario.
Instead, it’s decided to make a much more radical software redesign, one that will not only fix this problem but make the MAX’s entire flight-control system — including MCAS — more reliable, according to three sources.
This change means the flight-control system will take input from both of the airplane’s flight computers and compare their outputs. This goes beyond what Boeing had previously decided to do, which was to adjust the MCAS software so that it took input from two angle of attack sensors instead of one.
The problem with that earlier approach is that if something serious goes wrong with the single flight computer receiving this input — whether it’s the bit flipping issue, or a memory corruption or a chip failure of any kind — then the computer output to the flight controls could be wrong even if both angle of attack sensors are working correctly.
For the MAX, the new MCAS was simply added to an existing 737 flight control system called the Speed Trim System, which was introduced with this one-channel computer architecture on the older model 737-300 in the 1980s.
With the proposed dual-channel configuration, both computers will be used to activate the automated flight controls. They will each take input from a wholly independent set of sensors (air speed, angle of attack, altitude and so on) and compare outputs. If the outputs disagree, indicating a computer fault, the computers will initiate no action and just let the pilot fly manually.
In other words, the new system will detect not only any disagreement between the sensors but also check for any processing error in interpreting the information from the sensors.
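A rough Python sketch of that dual-channel comparison follows. Everything in it is an assumption made for illustration: the `SensorSet` fields, the placeholder `toy_trim_law` with its 14-degree threshold, and the tolerance value are invented and are not Boeing's control laws.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SensorSet:
    """One channel's independent sensor readings (illustrative fields)."""
    airspeed_kt: float
    angle_of_attack_deg: float
    altitude_ft: float


def toy_trim_law(s: SensorSet) -> float:
    """Placeholder control law: trim nose-down above a made-up AoA threshold."""
    return -0.5 if s.angle_of_attack_deg > 14.0 else 0.0


def dual_channel_command(
    sensors_a: SensorSet,
    sensors_b: SensorSet,
    law_a: Callable[[SensorSet], float] = toy_trim_law,
    law_b: Callable[[SensorSet], float] = toy_trim_law,
    tolerance: float = 0.1,
) -> Optional[float]:
    """Each computer processes its own sensors; act only if the outputs agree."""
    out_a = law_a(sensors_a)
    out_b = law_b(sensors_b)
    if abs(out_a - out_b) > tolerance:
        return None  # disagreement: no automated action, pilot flies manually
    return out_a


good = SensorSet(airspeed_kt=250.0, angle_of_attack_deg=5.0, altitude_ft=10000.0)
bad_aoa = SensorSet(airspeed_kt=250.0, angle_of_attack_deg=40.0, altitude_ft=10000.0)

# Healthy sensors and healthy computers agree on "no trim".
print(dual_channel_command(good, good))  # 0.0

# One angle of attack sensor reads wildly high: outputs disagree.
print(dual_channel_command(good, bad_aoa))  # None

# Sensors agree, but one computer's processing is corrupted.
print(dual_channel_command(good, good, law_b=lambda s: -0.5))  # None
```

The last two cases show why comparing outputs covers both failure types: a sensor disagreement and a processing error each lead to differing outputs, and either one causes the automated action to be withheld.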
“This is a really good solution,” said Lemme, adding that “it should have been designed this way” from the beginning of the flight control system in the 1980s.
This raises the separate question of why the potential microprocessor fault discovered in June wasn’t caught in the original System Safety Analysis when the MAX was certified.
That original System Safety Analysis, as The Seattle Times reported in March, was performed by Boeing, and FAA technical staff felt pressure from managers to sign off on it. And as reported in May, there was also pressure from Boeing managers on the engineers conducting the work to limit safety testing during the analysis.
The person familiar with the testing said the new tests in June were informed by the knowledge of what had happened in the crashes, especially the erroneous activation of MCAS that pushed down the nose of the aircraft on both flights.
“It was a reassessment in light of everything else going on in the world with MCAS,” he said. “It’s a different set of eyes, asking a different set of questions.”
David Hinds, a retired Boeing flight controls and autopilot expert, said that clearly “something got missed” in the original MAX certification, first with MCAS and now with this microprocessor fault.
“I’d like to think you’d catch this on first pass,” said Hinds. “They should have looked harder at some of this.”