Resolved - 10/03/08: GNAX network outage
7:56 PM Central - UPDATE - RESOLVED
All servers and sites are back up as of this time. External monitoring showed routing returned after 7:36 PM. The issue is resolved and service has been restored to normal. Thank you for your patience! ##
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7:25 PM Central
At 6:58 PM Central, our servers at the GNAX data center became unreachable. This is a network outage affecting all of our servers hosted at the GNAX data center; it is not limited to one or two servers.
It does appear to be an issue at the data center, as traceroutes are reaching the border routers.
We have notified the data center and will update you as soon as we know more.
Impact: This affects only some dedicated server customers who are hosted at GNAX. It does not affect our dedicated or shared servers hosted in Denver.
##
FIXED: DNS issue with server ‘svensbluff’
UPDATE: 7:35 AM: Great news! :) The DNS issue on svensbluff has been resolved. No accounts had to be moved. Rebooting the server appeared to resolve the problem, which would suggest there was a tmp file corruption that needed to be flushed out of the system via a power cycle.
If you notice any irregularities with your account, please let us know. We were not made aware of the issue until 3:30 AM Central time. We feel terrible that we weren’t aware of it sooner.
Unfortunately our monitoring system did not pick up on it because it monitors by IP address, and asks each service for a response at the IP. The services were all up, even DNS, and responding … but DNS wasn’t processing requests correctly. Monitoring just asks “are you responding?” — it doesn’t ask and analyze custom requests. That’s why our monitoring system didn’t pick up on this problem. :( I am very sorry for the trouble this caused you. We are going to look into our options for monitoring, in order to be able to better detect situations like this.
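To illustrate the difference between “are you responding?” and “are you answering correctly?”, here is a minimal sketch of a functional DNS check. It is not our monitoring system; the function names, the query ID, and the choice to require an authoritative answer are all assumptions made for illustration. It sends a real A-record query over UDP and passes only if the reply is a successful response that actually contains an answer.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID chosen for this sketch


def build_query(hostname: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (plain non-recursive query), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def evaluate_response(response: bytes, query_id: int = QUERY_ID,
                      require_authoritative: bool = True) -> bool:
    """Return True only if the reply is a successful response with an answer."""
    if len(response) < 12:
        return False  # too short to hold a DNS header
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)    # QR bit
    authoritative = bool(flags & 0x0400)  # AA bit
    rcode = flags & 0x000F                # 0 means no error
    if require_authoritative and not authoritative:
        return False
    return rid == query_id and is_response and rcode == 0 and ancount > 0


def dns_answers_correctly(server_ip: str, hostname: str,
                          timeout: float = 3.0) -> bool:
    """Send a real query to server_ip and check the content of the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server_ip, 53))
            response, _addr = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    return evaluate_response(response)
```

A check like `dns_answers_correctly(nameserver_ip, "a-hosted-domain.example")` would fail when the DNS service is up and listening but returns errors or empty answers, which is the situation a plain port check misses.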
We always love to hear from you. We’d rather receive a false report than no report at all. :) So if you see anything weird, please let us know by opening a support ticket right away. We want your service to be top notch, A-#1 !!!
Thank you for all the feedback as we worked on this. It was very helpful and helped us iron out what was and was not going on. ##
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Good morning,
Around 3:30 AM Central time, an as-yet unidentified DNS issue caused websites on the server ‘svensbluff’ to become inaccessible.
We have been troubleshooting this issue since that time.
At this point we have been unable to find the exact cause.
We will restore DNS service to down websites as quickly as possible. We have a list of nameservers that we know are not working, and we will make sure each of them is answering authoritatively, as it is supposed to, as soon as possible.
Thank you for your patience! We will post updates here as soon as they are available.
DCSN Team
Server ‘pilotisland’ decommissioned, ‘evensonfarm’ live
Due to repeated problems with the server ‘pilotisland’ in the past several weeks, we have decided to replace it with a new server, ‘evensonfarm’.
There were two ‘pilotisland’ servers. The first was a small AMD x2 3600 with 2 GB RAM, intended as a little workhorse administrative server. It was replaced by an AMD x2 4400 with 4 GB RAM, which we commissioned as a new shared server. Both were fraught with random load-spike issues and mystery crashes. These boxes were located at AtlantaNAP, and we want to give the staff at WireSix a hearty “thank you” for their tireless and gracious help troubleshooting the issues we were experiencing. It wasn’t their fault. We have many AMD x2’s (of various sizes) and the only server that ever gave us trouble was ‘pilotisland’ … maybe it’s the server name that’s jinxed. *wink*
‘evensonfarm’ is a new Intel e2140 Dual-Core Core2Duo 1.6 GHz with 2 GB RAM, located in Denver, CO at Handy Networks. It will be a shared server.
All sites on ‘pilotisland’ were migrated to ‘evensonfarm’ on Thursday evening.
‘evensonfarm’ has the same software configuration as ‘pilotisland’ had, including libraries, components, package versions, firewalls and settings, etc. The server IPs are different, of course, and clients can find their new IP by logging in to their account’s cPanel and looking under “Site IP”. The change in IP address has absolutely no bearing on site operations. Everything is running as before, and there was no downtime with the move to the new server.
If you have any questions about the move, or the new server, please do let us know. :)
Thank you!
The DCSN Team
Shared server ‘pilotisland’ outage, recovery
This morning we encountered a combination of issues with our shared server ‘pilotisland’ which resulted in a lengthy (approx. 6 hours) outage. Now that the server is back up, we are able to determine what happened.
A site on the server experienced a large spike in distributed traffic (similar to the “slashdot” effect, but it wasn’t slashdot). The traffic hit a dynamic PHP/MySQL script, which put an unsustainable load on the server, and the server went down.
We attempted to reboot the server, but kept running into issues both with access (due to network traffic) and also with the hard drives requiring an FSCK before they would come back up. The automatic FSCK would not complete, requiring a manual FSCK, which was very time-intensive due to the size of the drives.
The FSCK has now been completed, the server has been rebooted, traffic for the busy site has been remediated (through both server/script settings and some traffic sharing) and all server operations came back to normal at approximately 1:15 PM Central Time.
We apologize for the frustration that this has caused. We realize that our clients do not want their sites to be down. We don’t want your sites to be down either! This particular server has been up 100% for six months, and showed no signs of trouble prior to this incident. This was an external traffic problem which then became coupled with normal drive maintenance operations for recovery. We are very sorry for the inconvenience and frustration that this has caused, and will continue to do everything in our power to ensure uninterrupted service on this server moving forward. ##
Server svensbluff down, being rebooted; Update: Recovered
UPDATE, 1:37 PM: ‘svensbluff’ is back online. Thank you for your patience! :)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1:29 PM: Our shared server ‘svensbluff’ is currently down, after having emergency maintenance performed. Our on-site technicians are working on the server as we speak and will have it back online as soon as possible.
We apologize for the inconvenience!
Uptime stats: December 2007
Shared servers
- svensbluff - 100%
- pilotisland - 99.997%
Networks
- Handy Networks - Denver - 100%
- AtlantaNAP - Atlanta - 99.997%
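For a sense of scale, an uptime percentage can be converted into minutes of downtime. The sketch below assumes the percentage is measured over the full calendar month, which these stats do not state; the function name is our own for illustration.

```python
def downtime_minutes(uptime_percent: float, days_in_month: int) -> float:
    """Convert a monthly uptime percentage into minutes of downtime.

    Assumes the percentage covers every minute of the month.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0
```

Under that assumption, `downtime_minutes(99.997, 31)` works out to roughly 1.3 minutes for December (31 days, 44,640 minutes in total).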
Shared servers rebooted Monday morning
Our shared servers ‘svensbluff’ and ‘pilotisland’ were both rebooted around 2:00 AM Central time Monday, 12/31/07 to complete kernel upgrades. This was routine security work and caused no change in functionality on either server. Both servers were offline for about 2 minutes during the course of the reboots, and have been online and running at 100% since then.
If you have any questions please open a support ticket so we may assist you. Thank you very much!
DCSN Support Team ##
Reboot on svensbluff tonight
Hello,
We will be rebooting the shared server svensbluff tonight at approximately 10:30 PM to upgrade to a new kernel. The expected downtime is under 10 minutes. If you have any questions, please let us know.
Thank you!
cPanel update breaks SSL certs
UPDATE, RESOLVED - 11:28 AM: Apache has successfully recompiled, and we have verified that the SSL certificates are working again without error. We recompiled the same libraries into Apache as before. However, if your site is experiencing any problems due to missing libraries or modules, please contact us right away and we will be happy to assist you with it. Thank you and have a great weekend! ##
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We’ve learned that last night’s cPanel update has broken the SSL certificates on the svensbluff shared server.
Our technicians are currently working on repairing the problem. The repair includes recompiling Apache (the web server), so if your site goes offline for a few minutes, that’s why. :) Please do not worry, we are aware of the issue and are working on repairing it as quickly as possible.
Please accept our apologies for the trouble! We are also working with cPanel to identify what caused this in the first place, and to try to get them to not break things this way again. :)
DCSN Team
