The last 10 months have been incredibly frustrating.
Our primary service provider & I have spent many man-hours (at considerable expense) trying to track this down.
We've moved hardware to better hardware. That hasn't helped.
We've implemented more efficient table designs and indexing for the search. Also not helping.
We've changed OSes and various daemons & optimizers and such. No improvement.
Those of you who've been around for a while know my fondness for old-timey aphorisms. There's a perfect one for this situation: "When you hear hoofbeats, you think horses, not zebras."
At this point, we've pretty much exhausted what we can do on our end, so we're looking towards problems coming from upstream.
Surprise! Remember the outage from this weekend?
Analysis of Outage on 08-09-09
Overview: A perfect storm of events and connectivity was required for the outage to occur. Five separate conditions had to be present; if any one of them had been removed, the outage could not have occurred.
1) Redundant layer 2 connectivity from an HA pair of Sonicwall firewalls to redundant Cisco routers, using the HSRP protocol for redundancy.
2) Per-destination load balancing implemented on the redundant routers at the upstream provider (there's a short sketch of what that means right after this list).
3) Utilization of VPN connectivity on the Sonicwall HA Pair.
4) A failure in the Sonicwall Licensing database that caused the two paired HA firewalls to think they were not an HA licensed pair.
5) The VPN ISAKMP process daemon in the Sonicwall HA pair's primary firewall failing, causing an HA failover.
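To make condition 2 concrete, here's a minimal Python sketch of what per-destination load balancing means. This is my own illustration, not anything running on the routers; the router names and addresses are made up, and real routers hash on more fields than just the destination. The key property is the same: the path is picked by hashing the destination, so every packet for a given destination keeps taking the same path.

```python
# Toy illustration of per-destination load balancing (condition 2).
import hashlib

ROUTERS = ["primary-router", "secondary-router"]  # hypothetical names

def pick_router(dst_ip: str) -> str:
    """Hash the destination address and pick a forwarding path."""
    digest = hashlib.md5(dst_ip.encode()).digest()
    return ROUTERS[digest[0] % len(ROUTERS)]

for dst in ["203.0.113.10", "203.0.113.11", "198.51.100.7"]:
    # Each destination maps to the same router, packet after packet.
    print(dst, "->", pick_router(dst))
```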
Here's the sequence of events as they had to occur for the outage:
1) The Sonicwall licensing database somehow disassociates the HA pair of firewalls. Basically, the database doesn't show the two firewalls as linked as an HA pair. For them ever to have been a pair, the database had to have been correct at some point, so this is a database failure or corruption. This database lives at Sonicwall's datacenter.
2) The HA pair of firewalls update their licensing from Sonicwall.
3) The HA pair continue to function as an HA pair physically and configuration-wise, but logically, per the license database, they are no longer an HA pair.
4) The VPN ISAKMP daemon fails in the primary firewall at 7:46:49 PM EST on 08-09-09.
5) The failure causes an HA failover to the secondary firewall, which is promoted to primary and the old primary reboots and comes up as secondary.
6) Because of the licensing error, the new primary does not use the virtual MAC address the pair is supposed to share, and instead generates its own new virtual MAC address (it's configured for HA, so it uses a VMAC).
7) The primary router (primary via HSRP) learns the new MAC address and throws out its ARP table entries.
8) The secondary router does not see this traffic and hence retains its ARP table.
9) Traffic coming from the Internet through multiple redundant paths comes to either of the two routers to be forwarded to the Sonicwall firewalls.
10) Because of per-destination load balancing, some Internet addresses go to the primary router and some go to the secondary router, and all forwarding follows the same path once that load-balance decision is made (a toy sketch of the resulting blackholing follows this list). Per-packet load balancing would have allowed all traffic to pass, though slowly due to a high error rate.
11) All traffic passing through the primary router is forwarded correctly to the new MAC address.
12) All traffic passing through the secondary router is forwarded to the old MAC address and is thrown away.
13) Clearing the ARP table in the secondary router clears the issue at approximately 10:51 PM EST on 08-09-09.
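Here's a toy Python sketch of steps 6 through 12 (again my own illustration; the MAC addresses and router names are invented). After the failover, the firewall answers only on a new virtual MAC; the primary router re-learns it, but the secondary router never sees that traffic and keeps forwarding to a MAC that no longer exists, so every destination hashed to the secondary router is blackholed until its ARP table is cleared.

```python
# Toy simulation of the stale-ARP failure mode behind the outage (steps 6-12).
# MAC addresses and router names are invented for illustration.
OLD_VMAC = "00:17:c5:00:00:01"   # the shared virtual MAC before the failover
NEW_VMAC = "00:17:c5:00:00:99"   # the MAC the new primary generated instead

# Each router keeps its own ARP cache entry for the firewall's IP.
arp_cache = {
    "primary-router":   {"firewall": OLD_VMAC},
    "secondary-router": {"firewall": OLD_VMAC},
}

def failover():
    """HA failover: only the primary router sees traffic with the new MAC."""
    arp_cache["primary-router"]["firewall"] = NEW_VMAC
    # The secondary router never sees that traffic, so its entry stays stale.

def forward(router: str) -> str:
    """Frames sent to the old MAC are thrown away; only the new MAC is live."""
    mac = arp_cache[router]["firewall"]
    return "delivered" if mac == NEW_VMAC else "dropped (stale ARP entry)"

failover()
for router in arp_cache:
    print(router, "->", forward(router))

# Clearing the secondary router's ARP table (step 13) is what restored service.
arp_cache["secondary-router"]["firewall"] = NEW_VMAC
print("after ARP clear:", forward("secondary-router"))
```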
Corrective actions taken:
1) The Sonicwall license database has been corrected. All future failovers should properly share the same virtual MAC address.
2) A software patch will be applied to correct the bug in the VPN ISAKMP daemon that caused the failover.
3) Current upgrade plans for the infrastructure will remove the need for HSRP at the upstream provider and remove all layer 2 redundancy; only layer 3 redundancy will exist.
The current hypothesis is that the router issues have caused the bulk of the problems overall.
After an hour-long phone call today, we've come up with the following:
Due to (and I quote) "artifacts" from bug-fix patching on the Sonicwall firewalls over the last few months, a cumulative recurring problem has arisen, necessitating another outage to fix things. See below:
Project 1: configuration scrub and patching
August 16th, 2:00 AM EST
1) The secondary firewall will be removed from service, factory defaulted, patched, configuration reloaded, and relicensed.
2) The primary firewall will be removed from service, factory defaulted, patched, configuration reloaded, and relicensed.
3) The primary firewall will be brought back into service and tested.
4) The secondary firewall will be brought back into service.
There will be an outage during step 2 (~20 minutes) and a minor outage in step 4 (a minute or two). I would like to schedule 2:00 AM until 3:00 AM just to be on the safe side.
Following this, around the 24th-30th, we're going to change our network infrastructure to a different configuration, Project 2, one that hasn't been experiencing any of the issues we (and apparently other hosted sites) have been seeing.
Should this also fail to work, Project 3 will be "unass this hosting provider and move to another". That is the last resort, because I've invested a considerable amount in hardware with these guys, but I'm at the end of my rope.
Thanks for your patience. I hope this explanation helps you understand what I've been going through.