Router Failure and Software Crash UW-Madison Network

Madison, Wis. – A network crash last week left many University of Wisconsin-Madison students, faculty and staff wondering just what exactly was wrong with their e-mail. The Division of Information Technology, which monitors and sustains the UW’s networks, experienced a system-wide crash last Friday that left thousands of people without e-mail or Internet access.
According to Brian Rust, spokesperson for DoIT, the problem started after new code was installed last Thursday as part of backbone maintenance. On Friday, two network crashes occurred, one within a core router and the other on a platform switch, which affected access to DoIT’s platform. The problem seemingly originated with a failed router card in the primary core router. The router was rebooted but loaded without its configurations, which is indicative of a memory problem. The card was replaced, but two redundant switches then failed to synchronize. In what DoIT described as a “poorly timed hardware failure,” the department reverted to the code that was in use before the upgrade in order to restore the stability of the network.
“We tried to redirect traffic and some people were able to get in and check their e-mail. But it’s similar to an accident on the interstate—even with traffic rerouted around it or off the road, it’s still choked,” Rust said.
The system was down from 10:30 a.m. to 4:00 p.m. on Friday.
According to Rust, the upgrade that caused the crash was part of the 21st Century Network Upgrade program, which is designed to boost Internet speed and capacity at UW-Madison. Part of this upgrade is HP OpenView, new software that monitors all network traffic. The crash was OpenView’s first large-scale test on the network, and the software performed well by helping to isolate the problem and increase efficiency.
DoIT has taken a proactive stance to prevent future network outages. The department recently installed software that will allow systems administrators to contact other departments with outgoing phone messages to notify them of a system-wide crash.
DoIT is working with its router vendor, Cisco, to further investigate the failure, but the long-term effects of the problems have yet to be determined.
________
Kristin V. Johnson is a Madison-based writer. She can be reached at kristin@wistechnology.com.