Multiple failures caused state network crash
Multiple failures contributed to the crash of the state's entire computer network in Salem last month, according to the results of a post-incident review.
The crash on Saturday, June 9, paralyzed web and phone services at dozens of state agencies for anywhere from eight hours to nearly 48 hours, depending on the agency.
Officials at the state data center and Cisco, the vendor that provides the state's computer hardware, determined that two of the state network's four core routers simultaneously had backup failures when PGE turned off power for scheduled construction at the state Capitol. The failure of the router at the state Public Service Building was insufficient to trigger a takeover by a redundant piece of hardware at the Department of Revenue, and the DOR router failed to communicate the crash to the other two routers at the state data center, said Sandy Wheeler, administrator for the state data center.
The four-router system is designed to prevent any interruption in online access. Each router has a battery and a generator backup and two motherboards. The second motherboard is meant to take over if the other one fails.
"When the outage happened, one of the things we did not understand was why all of a sudden within the state data center things stopped communicating," Wheeler said. "We could understand two pieces failed and couldn't get information to the data center, but what we couldn't figure out was why those other things also stopped communicating.
"There were key network files sitting on those two pieces of equipment at Revenue and the Public Services Building that were not on the pieces of equipment in the data center."
On June 20, technicians copied those files to the two pieces of equipment at the state data center.
"By putting that there — if we were to have the same kind of experience, the state data center would keep routing traffic," Wheeler said. "Anything dependent on internet would continue working."
The state data center also plans to change the way it tests the network. Previously, tests examined equipment at one location at a time. Now the tests will focus on how the four routers work together, Wheeler said.
Crews were able to replace defective equipment in the system in a matter of hours, but it took nearly 48 hours for some agencies, including the Oregon Department of Transportation, to get websites back in operation.
State data center experts are still investigating why ODOT took longer than other agencies to recover.
"We take this and we learn from it and we continue to modify our procedures and processes," Wheeler said. "Hopefully, we won't have this kind of incident again because we have made technical changes as well as procedural changes."