16 Sep Cagigal: No flirting with disaster
Editor’s note: Before the record June 2008 floods, Alliant Energy had been using its Cedar Rapids, Iowa, data center as a redundant backup for Madison, but had not set up the reverse. Fortunately, Alliant had a full year of data replication work under its belt, and was able to make do when its tower facility in Cedar Rapids was flooded.
WTN interviewed David Cagigal, Alliant’s chief information technology officer, for an August feature, and presents more details in this Q & A format. Here are more excerpts, including a fuller accounting of what the company faced the day after.
WTN: Without that two-way backup capability, what would you have faced?
Cagigal: [Laughs] Outages, long, long outages in the state of Iowa that would have been detrimental to assisting our customers.
WTN: Can you summarize what you faced on the first day you reported for work after the flooding?
Cagigal: I’ll give you some background leading up to the day. We have just a little over 30 people that work there [in Cedar Rapids]. They are network staff, server, and workstation. And then there is a group of application developers that supports billing. Those are the four groups that are there in addition to records management, so there are five groups, and the sum total of those groups is about 30 people.
So on Wednesday night, June 11, we [Madison] were alerted, and we got on a conference call, and this is the Crisis Management Team (CMT). We have members on the Crisis Management Team, when it’s invoked, and everyone has a role to play. Although the CMT was not invoked, we got on the phone on Wednesday night in anticipation that there may be an issue of service [in Cedar Rapids]. This was about five o’clock, four or five o’clock Wednesday night. Weather predictions were pretty dire, water levels were rising, and we were informed that although the water was rising, we had no imminent danger. We were assured that it would crest and then it would pass without fanfare.
We were on the call for maybe an hour. I was talking to the select few that were still in the building after six o’clock, and they said the water was rising rapidly. We remained in the building in Madison until about eight or nine o’clock. My staff and my leadership team, we were here, let’s say it’s about eight o’clock, and then we went out to dinner over at Texas Roadhouse, the three of us, myself and two managers.
We would get updates by cell phone from people who were in the building, in the [Cedar Rapids] tower. It’s a 21-story building. So we went to dinner and the three of us said that we need to send somebody there. So we sent Dee Dee Crowley. She left by 10 o’clock at night, got there about midnight, and the water was continually rising. When she got there, she rendered an assessment that was inconsistent with the assurances that we got.
Wednesday night, several people were sandbagging the building, and they had completed the perimeter of two to three feet, and if you look at some of the photos you can see the sandbagging that went around the building. So they had completed that task by midnight and Dee Dee got there and said that the water was coming out of the riverbank and onto the street and toward the tower.
By midnight, we had ordered a number of servers to be pulled from the rack and sent to Madison for e-mail and voice mail and network shared drive-related tapes. So we loaded up a couple of carts and a couple of trucks and we brought the equipment back to Madison Thursday afternoon. By Thursday afternoon, we were the last people out of the building because they then shut the building down, they had lost power, lost access, and by that time they had seven feet of water in the lobby – late afternoon on Thursday, June 12.
We were the last ones out of the building with the select servers that we needed, with everything else being replicated. They were well on their way to Madison and arrived Thursday evening with the equipment, which we then turned around and installed Thursday evening through Friday with a few people not going to bed, working through the evening, and being there Friday morning. So we had everything up and running late Thursday night, Friday morning.
Parallel to that, the officers were able to contract with Kirkwood Training & Outreach Services (KTOS). By Thursday night, they had established the EOC, the Emergency Operations Center at KTOS. In doing so, they had no equipment and they had no network connectivity, and that’s where we came in. We were able to dispatch laptops and other equipment to KTOS by Thursday night, so while we were bringing equipment back Thursday, we were shipping equipment back Thursday to the Emergency Operations Center, the EOC, and so by Thursday night and into Friday, we had a number of people already equipped, and we had MediaCom drop in a cable modem for connectivity and VPN to the corporate network Thursday and Friday.
I drove out there myself on Sunday. All the water was gone. However, no one was allowed into the building and power had not been restored to the Cedar Rapids area. We commenced generation power on Sunday. As long as the diesel generator tank was filled with fuel, it would run, and it ran for three weeks. I think we started it up on Sunday, and on Sunday we were able to, with the water gone and the generator working and people working on restoring commercial power for three weeks, we were resigned to the fact that we were going to have to run on generated power for as long as we could, not knowing when commercial power would be restored.
So our staff went back into the data center and the network operating center with the diesel generator running on Sunday, and we were able to reconnect people. So fortunately, this happened over the weekend. On Monday, we were limping along with many people in temporary facilities and on Virtual Private Network connections from home. The week of June 16 to 20, we were orchestrating other temporary facilities. In total, I think we had five other facilities that we dispatched people to through that week, not knowing when we would get commercial power. Estimates ranged between 30 days and 60 days. In reality, it was only three weeks.
But for every temporary facility, we had to drop in equipment and network connectivity.
WTN: How much of this was flying by the seat of your pants, and how much was predicated on your business continuity plan?
Cagigal: Our BCP, regarding the processes and the people, was headed up by the CMT and EOC, and we practice this drill. We’ve got well-defined roles and responsibilities and people’s names on an organizational chart that says this is what you’re supposed to do, as we invoke calling trees, the ability to contact people, and delegate tasks. All that worked very well because it was documented, rehearsed, and organized.
What we didn’t do in those plans is identify the temporary locations. I think we deliberately chose not to do that because along with that comes contracts and hot site obligations and invoices to be paid, et cetera. We take it so far and then make it up as we go.
WTN: Would you change that now and go ahead and contract for temp locations?
Cagigal: It’s in the category of lessons learned. We will evaluate the risk of not having contracted sites available to us versus the way it was handled. We’ll evaluate the risk. Many will say we were fortunate that it worked out as smoothly as it did. Others may say that the risk is too great, that we should contract with those facilities for their availability.
WTN: Do you have a new answer now for the following question: As CIO, what keeps you up at night?
Cagigal: Yes, today it’s my business continuity plan. The business continuity plan is being evaluated. We’re organizing some events here to discuss exactly as we have here the timeline of events, our ability to respond, the communities that we live in, and the vendor support that we’ve got. All those dimensions are going to be re-evaluated and calculated in the sense that can we predict response rates similar to those that we experienced? If not, then we are going to have to take on some additional precautions to ensure that we can respond to our gas and electric customers as they are accustomed.
WTN: Given the plan, what role was information technology supposed to play in disaster recovery, and then what role did it play?
Cagigal: I think it’s very apparent to everyone who endured the flood how much day-to-day activity is driven by technology. Let’s just say they were very smart and they had a contract to go over to KTOS. They walked into the building and they could do nothing. They could make a telephone call; that’s it. What they need to know is where are all the diagrams? Where are all the inventories? Where are all the people? Who is going to get paid? And how is all that going to happen when they got nothing?
What I mean by nothing is they don’t have a computer, they don’t have a printer, they don’t have network connectivity. They basically have realized the importance of technology in their day-to-day lives. So the urgency was, number one, other than breathing the air, they needed a computer that was connected to the network so that they could talk to the other members of the CMT and the other members of the EOC.
They firmly understood that they needed copiers, cell phones, landlines, computers, and network connectivity. All that needed to be delivered at a moment’s notice to secure the network as well. We could not give people network access unless it was secured.
WTN: The flood summary mentioned several fortunate circumstances. Without a disaster recovery plan, how much residual luck would you have experienced?
Cagigal: Before data replication, we had a hot site in Atlanta that was a contracted service. We had been going down there with testing. None of us was ever happy with testing going on down there. We thought about data replications, moving in a direction where Cedar Rapids would be a backup and we would put in an OC-48 in the event Madison goes down, and we would terminate our hot site in Atlanta.
We were a few months away from completing it, and it enabled us to recover our data center and continue – this is a very important point here – continue to bill our customers and continue to receive revenue from our meters.
WTN: What would you glean from that, the need for a better plan and process?
Cagigal: We became favored people. We became valued contributors. We became the go-to people for technology. We rose to a level of importance that none of us has ever enjoyed in the day-to-day activities of contributing technology to Alliant Energy.
We were like dial tone on a phone, running water from the tap – it was just expected. So what we want to do is build upon our value to the organization day in, day out. We’re struggling with that now. We’re trying to form a set of behaviors to understand the business and that we are valued by the business and we are treated with that kind of acknowledgement, not just because of the flood but because that’s a meaningful contribution.
We’d like to be considered prominent in the business model of Alliant because of what tech contributes, not just to reduce the cost but also to enable business processes to be productive. We’re going to explore business process transformation. We’re a lean Six Sigma shop and we routinely scrutinize our processes.
We also suspended policies and controls, and we were able to trust one another to expedite with a sense of urgency to get things done. Now we’ve reverted back to policy and control. We want to strike a balance between getting things done and adhering to sensible policies and controls. We’re scratching our heads about this, and it was said several times by the officers: How come we can get so many things done during the flood and it takes us so long to do so otherwise?
That will challenge us to evaluate policies and controls to see if they are equally balanced in terms of getting results.
WTN: In what areas were controls abandoned?
Cagigal: Buying supplies. We just went to the store, we did not fill out a purchase order. We paid more than we would have if we went through procurement. To fill out all the temp facilities, we paid premium prices for office supplies.
We also worked with vendors like AT&T and Paetec. They behaved in a manner I never would have expected because when you purchase a T1 line, it takes 30, 45 days if not 60 days to get it provisioned and get it tested and turned on. We had those up in days. They circumvented some of their controls for the sake of getting us service because they knew customers were depending on us to be up and working.
WTN: Do you have any technology or process-related takeaways, for other CIOs, from this experience?
Cagigal: I guess my key take-away here is always – always – assess the risk of a reactive plan versus a contracted plan. Cedar Rapids is a very small town with a lot of intimate relationships, so we were able to get KTOS, we were able to get Paetec, we were able to get a lot of vendors, the city government, to help us do things.
In a larger city, with many companies suffering the same outage, we would be queued up and taking on a significant amount of risk of not being able to run our business. You need to know the town you live in, you need to know your environment, you need to know your vendor dependencies, and you need to understand your risk.