Recent outages and the road map to preventing them in the future

Spark

The last 10 months have been incredibly frustrating.

Our primary service provider and I have spent many man-hours (at considerable expense) trying to track this down.

We've moved to better hardware. That hasn't helped.

We've implemented more efficient table designs and indexing for the search. That hasn't helped either.

We've changed OSes and swapped various daemons, optimizers, and the like. No improvement.

Those of you who've been around for a while know my fondness for old-timey aphorisms. There's a perfect one for this situation: "When you hear hoofbeats, you think horses, not zebras."

At this point we've pretty much exhausted what we can do on our end, so we're looking at problems coming from upstream.

Surprise! Remember the outage from this weekend?

Analysis of Outage on 08-09-09

Overview: A perfect storm of events and connectivity was required for the outage to occur. Five separate conditions had to be present; if any one of them had been absent, the outage could not have occurred. (A sketch of the key mechanism follows the list.)

1) Redundant layer 2 connectivity from an HA pair of SonicWall firewalls to a redundant pair of Cisco routers running HSRP.
2) Per-destination load balancing implemented on the upstream provider's redundant routers.
3) Active VPN connectivity on the SonicWall HA pair.
4) A failure in the SonicWall licensing database that caused the two paired firewalls to think they were not a licensed HA pair.
5) The VPN ISAKMP daemon on the HA pair's primary firewall failing, causing an HA failover.
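
For anyone who wants to see the mechanism rather than just read it, here's a minimal Python sketch of how conditions 1 and 4 interact. Everything in it (the class, the MAC addresses, the licensing flag) is invented for illustration; this is not SonicWall's actual logic, just the idea that a properly licensed HA pair presents one shared virtual MAC, while a mis-licensed unit invents its own:

```python
# Illustrative sketch only -- names and MACs are made up, not SonicWall's code.
SHARED_VMAC = "00:17:c5:aa:aa:01"  # the VMAC both units are supposed to present

class Firewall:
    def __init__(self, serial: int, licensed_as_ha_pair: bool):
        self.serial = serial
        self.licensed_as_ha_pair = licensed_as_ha_pair

    def active_mac(self) -> str:
        # Licensed as an HA pair: present the shared VMAC, so a failover is
        # invisible at layer 2. Mis-licensed (condition 4): still configured
        # for HA, so it still uses *a* VMAC, but it generates its own.
        if self.licensed_as_ha_pair:
            return SHARED_VMAC
        return f"02:00:00:00:00:{self.serial:02x}"  # self-generated VMAC

# Healthy case: license database intact, both units share one VMAC, so
# upstream ARP tables stay valid across a failover.
a, b = Firewall(1, True), Firewall(2, True)
assert a.active_mac() == b.active_mac()

# The 08-09-09 case: the license database has disassociated the pair, so a
# promoted secondary answers on a brand-new MAC (step 6 of the sequence below).
a, b = Firewall(1, False), Firewall(2, False)
assert a.active_mac() != b.active_mac()
```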

Here's the sequence of events that had to occur for the outage:

1) The SonicWall licensing database somehow disassociates the HA pair of firewalls; the database no longer shows the two firewalls as linked in an HA pair. For them ever to have been a pair, the database had to have been correct at some point, so this is a database failure or corruption. This database lives at SonicWall's datacenter.
2) The HA pair of firewalls updates its licensing from SonicWall.
3) The pair continues to function as an HA pair physically and configuration-wise, but per the license database it is logically no longer an HA pair.
4) The VPN ISAKMP daemon fails in the primary firewall at 7:46:49 PM EST on 08-09-09.
5) The failure causes an HA failover to the secondary firewall, which is promoted to primary; the old primary reboots and comes up as secondary.
6) Because of the licensing error, the new primary does not use the virtual MAC address the pair is supposed to share and instead generates its own new virtual MAC address (it's still configured for HA, so it uses a VMAC).
7) The primary router (primary via HSRP) learns the new MAC address and throws out its old ARP table entries.
8) The secondary router does not see this traffic and hence retains its stale ARP entry.
9) Traffic coming from the Internet over multiple redundant paths arrives at either of the two routers to be forwarded to the SonicWall firewalls.
10) Because of per-destination load balancing, some Internet addresses go to the primary router and some to the secondary, and once that load-balancing decision is made, all forwarding for a given address follows the same path. Per-packet load balancing would have allowed all traffic to pass, though slowly, due to a high error rate. (See the simulation after this list.)
11) All traffic passing through the primary router is forwarded correctly to the new MAC address.
12) All traffic passing through the secondary router is forwarded to the old MAC address and is thrown away.
13) Clearing the ARP table in the secondary router clears the issue at approximately 10:51 PM EST on 08-09-09.
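
To see how steps 7 through 12 blackhole a chunk of visitors, here's a small Python simulation. All addresses and MACs are invented, and the hash just stands in for the upstream routers' per-destination (source/destination hash) load-balancing decision:

```python
# Toy simulation of steps 7-12; addresses and MACs are invented for illustration.
OLD_VMAC = "00:17:c5:aa:aa:01"   # firewall VMAC before the failover
NEW_VMAC = "02:00:00:00:00:02"   # VMAC invented by the mis-licensed new primary

# Steps 7-8: the primary router relearned the firewall's MAC; the secondary
# never saw that traffic and keeps its stale entry.
arp = {
    "router-1": NEW_VMAC,  # primary (via HSRP): up to date
    "router-2": OLD_VMAC,  # secondary: stale
}

def pick_router(src_ip: str) -> str:
    # Step 10: per-destination load balancing hashes the address, so every
    # packet from a given remote address is pinned to the same router.
    return "router-1" if hash(src_ip) % 2 == 0 else "router-2"

def reaches_firewall(src_ip: str) -> bool:
    # Step 12: frames sent to the old VMAC are addressed to a MAC nobody owns
    # anymore, so the firewall never sees them and they are silently dropped.
    return arp[pick_router(src_ip)] == NEW_VMAC

visitors = [f"198.51.100.{i}" for i in range(1, 101)]
print(sum(reaches_firewall(ip) for ip in visitors), "of 100 visitors get through")

# Step 13: clearing the stale ARP entry on the secondary router fixes everyone.
arp["router-2"] = NEW_VMAC
assert all(reaches_firewall(ip) for ip in visitors)
```

Because the path choice is pinned per address, an affected visitor stays blackholed until that stale ARP entry is cleared, which is why the outage persisted for roughly three hours rather than clearing itself.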

Corrective actions taken:

1) The SonicWall license database has been corrected. All future failovers should properly share the same virtual MAC address.
2) A software patch will be applied to correct the bug in the VPN ISAKMP daemon that caused the failover.
3) Planned infrastructure upgrades will remove the need for HSRP at the upstream provider and remove all layer 2 redundancy; only layer 3 redundancy will remain. (A sketch of why that helps follows.)
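
For what it's worth, here's a toy Python sketch of why corrective action 3 helps: with layer-3-only redundancy there's no shared virtual MAC and no layer 2 ARP entry left to go stale, because a dead path is simply withdrawn from the routing table. The router names and addresses are made up:

```python
# With layer-3-only redundancy, each router reaches the firewalls over its own
# routed link, so a failure is handled by routing, not by ARP/MAC tricks.
routes = {
    "router-1": {"next_hop": "10.0.1.2", "up": True},  # routed link A
    "router-2": {"next_hop": "10.0.2.2", "up": True},  # routed link B
}

def usable_paths() -> list[str]:
    # A routing protocol notices a dead link and withdraws that route;
    # traffic simply converges onto the surviving path.
    return [r for r, info in routes.items() if info["up"]]

routes["router-1"]["up"] = False       # simulate a link/device failure
assert usable_paths() == ["router-2"]  # traffic re-routes; nothing blackholes
```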

The current hypothesis is that the router issues have caused the bulk of the problems overall.

After an hour-long phone call today, we've come up with the following:

Due to (and I quote) "artifacts" from bug-fix patching on the SonicWall firewalls over the last few months, a cumulative, recurring problem has arisen, necessitating another maintenance outage to fix things. See below:
Project 1 – configuration scrub and patching
August 16th, 2:00 AM EST
1) The secondary firewall will be removed from service, factory defaulted, patched, reloaded with its configuration, and relicensed.
2) The primary firewall will be removed from service, factory defaulted, patched, reloaded with its configuration, and relicensed.
3) The primary will be brought back into service and tested.
4) The secondary firewall will be brought back into service.

There will be an outage during step 2 (~20 minutes) and a minor outage during step 4 (a minute or two). I'd like to schedule 2:00 AM until 3:00 AM just to be on the safe side.

Following this, around the 24th-30th, Project 2 will move our network infrastructure to a different configuration, one that hasn't been experiencing any of the issues we (and apparently other hosted sites) have been seeing.

Should that fail as well, Project 3 will be "unass this hosting provider and move to another". That's the last resort, because I've invested a considerable amount in hardware with these guys, but I'm at the end of my rope.

Thanks for your patience. I hope this explanation helps you understand what I've been going through.
 
That sounds very tedious. Thank you for your hard work and dedication in this matter.
 
There have been a few hiccups these past weeks, but the service really did seem to improve after the hardware upgrade. Thanks for your efforts, Spark!
 
You go, Spark! You don't get nearly enough appreciation, in any form, for all you do to keep this wonderful site going.
 
Thank you for the concise update. That's some technical stuff. No wonder it's so hard to diagnose. :confused:

Looking forward to the improvements. :)

Edit: Then this post took 45 seconds to go through, FYI. It just sat there loading....

Coop
 
It's a good thing you threw in the hoofbeats/zebras aphorism, or I would have absolutely no idea what all that means.
I do believe you are doing all you can, and that's all you can do :thumbup:
 
So no change in lags or anything then?
 
I do still get lags...but not anywhere near as bad as a couple weeks ago...and not as long (just the occasional 5-10 second delay...whereas before it would be 20-30 seconds easily, about one quarter of the time I was posting).
 
I still get some lags after I log on. It seems best for me to log on and then let it sit for a minute before doing anything on the forums. But it has improved quite a bit.
 
Seems fine right now, but the lags I had experienced were intermittent and ranged from a delay of as little as 5-10 seconds to a "can't connect" message after a minute or more.
 
Spark,
I just had a lag of a few minutes trying to load a page. After approximately 3 minutes of waiting with no apparent movement, I stopped trying to load it, then made another attempt, and it loaded normally. I don't know if info like this will help or not, but thought you probably would want to know.
 
Thank you for your efforts and dedication to making this as smooth-running a site as possible :)
 
I've also had the same <20 second lags, which I have no problem with. What really made me mad was when I was trying to look something up and it was out for 5 minutes, came back long enough to load one page, and then was out for another two or three minutes so I couldn't see anything else.
 
3 min. lag after post. 1 min. lag changing forums. Between 22:30 & 22:42 8/15/09.
 
A few short lags tonight. Earlier today, I tried to post something and it took so long that a message popped up saying BF's was taking too long to respond.
 