Disaster Recovery Site Overnight, Yes It Happened!
I want to share our experience at Emind from last week.
We are all aware of the damage caused by Hurricane Sandy; most if not all New York and New Jersey data centers were impacted. Some were down, some were partially up, some were running on diesel generators, some were fighting water leaks, and all were struggling with BGP routing changes. The bottom line: something close to the worst case had happened.
On Monday night (IL time, 12 hours before the storm hit NY) I got a call from Amit Schnitzer, Head of Production Operations at TGS Systems, one of the biggest players in the online travel industry. He told me their CEO was concerned about the potential impact of Hurricane Sandy and that they had decided to immediately set up a recovery site on AWS as a replacement for their cage in a NJ data center. And that is exactly what happened.
That same night, Amit and I did all the preparations, ad hoc designs, and definitions. The next morning a team of 10 superhumans (TGS's application engineers, DBAs, and infrastructure experts, plus the Emind team: an architect (myself) backed by two Emind engineers) got into their war room and started executing the deployment plan.
To save time, work was done in parallel on both the public cloud and the VPC…
Actions that were taken:
- Pulled data from the existing data center, including databases and storage, and moved it to the AWS region (a few TB, it takes time…)
- Prepared base server images with their tools and settings
- Set up a VPC and connected it via a site-to-site VPN to their existing cage in the NJ data center
- Set up an SSL VPN to allow direct access to the new VPC-based data center
- Adjusted the VPC settings to match their Active Directory (AD) settings
- Brought up the first domain controller and connected it to their existing AD domain
- Set up a suitable MSSQL server configuration to match the storage requirements (size, RAID, performance…)
- Recovered the data to the newly created DBs
- Established replication between the live DB in the NJ data center and the new DB in the VPC
- In parallel, 85 web, application, and cache servers were started and configured
- 16 load balancers were started and configured with their SSL certificates
- DNS records were prepared for all the new resources
- Adjustments, testing, adjustments, testing …
- 36 hours after the first call, we had a FULL system up and running, ready for the worst to happen!
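For readers curious what the VPC and site-to-site VPN steps above look like in practice, here is a minimal sketch using today's AWS CLI. This is purely illustrative: all IDs, CIDR blocks, the customer gateway IP, and the BGP ASN are placeholders, not TGS's actual values, and the real deployment involved many more moving parts (subnets per tier, route tables, security groups, and so on).

```shell
# Hypothetical sketch of the VPC + site-to-site VPN setup.
# All IDs, CIDR blocks, and IPs below are placeholders.

# 1. Create the VPC that will host the recovery site
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# 2. Create a subnet for the application tier
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.1.0/24

# 3. Create a virtual private gateway and attach it to the VPC
aws ec2 create-vpn-gateway --type ipsec.1
aws ec2 attach-vpn-gateway --vpn-gateway-id vgw-xxxxxxxx --vpc-id vpc-xxxxxxxx

# 4. Register a customer gateway pointing at the on-premises
#    cage's public IP (placeholder address and ASN)
aws ec2 create-customer-gateway --type ipsec.1 \
    --public-ip 203.0.113.10 --bgp-asn 65000

# 5. Create the site-to-site VPN connection between the two
aws ec2 create-vpn-connection --type ipsec.1 \
    --customer-gateway-id cgw-xxxxxxxx --vpn-gateway-id vgw-xxxxxxxx
```

Once the VPN tunnels come up, on-premises routes are propagated (or added statically) so the replication and AD traffic described above can flow privately between the NJ cage and the VPC.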
Ohh, this was an interesting week, but I feel we made it for our customer, and they are now safe!
I’m proud to be part of the team that made it happen. Thank you so much, TGS team!
Will always stand for you,
On Behalf of the Emind Team, Lahav
Some interesting links reporting the actual impact:
- I’m sure there are a lot more, but this is enough for this time.