Let’s talk about Cisco bugs, outsourcing of critical resources and Disaster Recovery

I am commenting on “Cisco’s Network Bugs Are Front and Center in Bankruptcy Fight”

http://www.bloomberg.com/news/articles/2016-09-08/cisco-s-network-bugs-are-front-and-center-in-bankruptcy-fight

The short version is this: Peak Web, a platform-as-a-service (PaaS) provider, had a catastrophic outage that it blamed on a Cisco bug. Peak Web’s customer, Machine Zone, felt the brunt of it; none of Machine Zone’s customers could reach its gaming platform, so much so that the dispute has landed in a bankruptcy fight.

This article fascinated the network engineer in me, as I have dealt with plenty of bugs bringing down network systems. In this case the end customer felt the edge of the sword, and the fallout ended in a bankruptcy fight. The article references the Cisco Nexus 3000. Cisco Nexus switches have had their fair share of bugs since coming onto the market in 2008, and I, along with many of my peers, have experienced several of them first hand. And although some of those bugs have been catastrophic, no – this is not a Brad Reese-style article written to bash Cisco.

I think the weak link here was not the bug, but the relationship/contract Machine Zone had with Peak Web, along with Peak Web’s apparent lack of any Disaster Recovery plan. Outsourcing to the Cloud is almost a no-brainer these days, with all of the low-cost horsepower that is out there. Even so, the information security engineer in me has been skeptical of “the Cloud” ever since I started watching companies take their critical systems and applications and put them on computers they don’t own, run by people they didn’t hire. Don’t get me wrong: the Cloud has its place for some things, but in most cases not for the entirety of your critical infrastructure. The case referenced here supports my argument. Even from an InfoSec standpoint, Peak Web’s outage compromised the Availability leg of the Confidentiality/Integrity/Availability (CIA) triad for Machine Zone’s data.

An alternative for Machine Zone, if they were determined to use the Cloud, would have been to keep a second PaaS provider as a ‘warm standby,’ with their application – the ‘game stack’ – containerized (Docker or similar) and ready to be brought up within a two-to-four-hour, or even eight-hour, window. Fire up the backup game stack containers, make some public DNS changes, and voilà – back in business. Even if that hypothetical backup could only have handled 70% of the normal traffic load, that is still 70% of their customers being served. A “warm site” would have killed two birds with one stone: first, all of their eggs are no longer in one basket (no matter how much the Cloud provider preaches its own redundancy); second, they now have a Disaster Recovery plan. I understand that these alternative solutions are expensive, especially for a smaller organization, but what is the cost of not having DR? Bankruptcy.
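To make that “warm standby plus DNS cutover” idea concrete, here is a minimal sketch of what the final failover step could look like, assuming (purely for illustration) that the public DNS lived in AWS Route 53. The zone ID, record name, and standby IP below are hypothetical placeholders, and a real game platform would need far more than a single A-record flip – the point is just how small the cutover action becomes once a warm site actually exists.

```python
# dr_failover.py - hypothetical sketch of a DNS cutover to a warm-standby site.
# Assumes the public DNS zone is hosted in AWS Route 53 and the standby
# "game stack" is already running at STANDBY_IP; all values are placeholders.
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # placeholder Route 53 zone ID
RECORD_NAME = "game.example.com."    # placeholder public record
STANDBY_IP = "203.0.113.10"          # placeholder warm-standby front end


def fail_over_to_standby():
    """Point the public A record at the warm-standby site with a short TTL."""
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "DR cutover: primary PaaS provider is down",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": RECORD_NAME,
                        "Type": "A",
                        "TTL": 60,  # low TTL so clients re-resolve quickly
                        "ResourceRecords": [{"Value": STANDBY_IP}],
                    },
                }
            ],
        },
    )


if __name__ == "__main__":
    fail_over_to_standby()
```

The specific API is not the point; the expensive part of DR is keeping the standby stack warm and tested, while the cutover itself can be a sixty-second script you have already rehearsed.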

Another what-if scenario: what if Machine Zone had in-sourced their network and application platform into a data center they owned and controlled, with their own talent to manage it? I can’t promise they would not have had a 10-hour outage, but my opinion as a seasoned network engineer is that the likelihood of their network being down that long – even with a bug like that – would have been much lower, assuming they followed standard redundant architecture in building their infrastructure. (Again, I know this is expensive.) They would also have owned the relationship with Cisco and could have leveraged Cisco TAC (probably one of the best technical support organizations on the planet) to help them resolve the issue, rather than dealing with a middleman and a contract that tied their hands and kept them from doing anything to improve their situation.

My heart goes out to Machine Zone. I don’t mean to be an armchair critic and just point out what they did wrong, but this makes for an interesting lesson learned and a personal takeaway. That is especially true in an era where most businesses don’t think twice about the implications of moving critical infrastructure to “the Cloud”. We see the cheap price, we hear about all of the redundant data centers, and we buy in. So I am not bashing anyone, just challenging you to think about all the implications of Cloud-sourcing your infrastructure and/or applications to another organization whose policies you do not control.

Fun fact: the term ‘Cloud’ is derived from old telco schematics, where a cloud symbol was used to represent the ambiguity of the entire carrier network in an otherwise detailed depiction of your own network’s connectivity. To me, “the Cloud” is just another way of saying other people’s computers that I don’t own or manage.

Stay safe! Stay Secure! Chris out.
