I am in the process of implementing a few actions so that we can keep you better informed of a backup forum should this one ever go down again for a lengthy period, as it has over the last couple of days.
Hopefully, with Network54 shifting their service to another ISP, it will mean better performance from them in the future.
An e-mail group has been set up at this address which will be used solely to inform you of a new address should this forum go AWOL for any period of time again.
First of all, let me personally apologize for the recent downtimes Network54 has experienced. We know that the forums that you provide are critical to your business, and reflect on you. We strive to have great uptime, great forums, and great service. However, we have failed you.
Here is what happened and what we are doing about it.
If you have not been keeping track recently (oh, so unlikely), we have had a rash of problems with our ISP colocation provider. They were a wonderful group of people with whom we had become friends. Our necessary exit from their facility has broken several friendships, adding to the costs of the move (which include paying two ISPs for February, loss of ad and premier membership revenues, and the high cost of angering all the people that use our service).
The loss of A/C was the driving factor for our leaving. Several times the temperature of the room went to 85-95 degrees. And that is the air with which we were supposed to *cool* our machines. On Wednesday it not only crashed our servers, but it actually blew up a part of the main RAID array that stores all the messages. Eventually all data was lost, as well as some of the hardware.
It would take a while to figure this out, at which point we decided to just shut everything down and let things cool while we found a new home. Thankfully, I had been in conversations with several ISPs last week and this week. We had to winnow the list to those very close by in order to keep downtime low (in retrospect, the move time was immaterial), and those that could take our bandwidth requirements on just hours of notice, and one that we could afford. We found one and moved. In December, we will likely move again. That move will be planned and smooth, and be to such a place that we will never, ever, move again.
Now, after the move, we had to rebuild the hardware, first finding and discarding failed parts. We created the array as RAID 10 this time to protect against disk failures (while still keeping it fast). Before, a failure of a single disk, SCSI channel, card, or cable could ruin the array. The new system protects against all that, and would have shortened downtime by some 30 hours, we have learned. Sadly, this was a planned change that we were going to make when we did maintenance with the arrival of the new hardware we had previously mentioned and ordered. Ironically, 90% of that hardware has arrived over the course of the downtime and would have gone in this weekend.
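To illustrate why RAID 10 tolerates a single-disk failure, here is a minimal sketch. It assumes a four-disk array made of two mirrored pairs with data striped across the pairs; the disk names and layout are hypothetical and are not Network54's actual configuration. The array stays readable as long as every mirrored pair keeps at least one working disk.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical disk names. Each tuple is one mirrored pair, and data is
# striped across the pairs.
MIRROR_PAIRS: Sequence[Tuple[str, str]] = [("disk0", "disk1"), ("disk2", "disk3")]


def raid10_survives(pairs: Sequence[Tuple[str, str]], failed: Iterable[str]) -> bool:
    """Return True if every mirrored pair still has at least one working disk."""
    failed_set = set(failed)
    return all(a not in failed_set or b not in failed_set for a, b in pairs)


if __name__ == "__main__":
    # One failed disk: its mirror partner still holds the data.
    print(raid10_survives(MIRROR_PAIRS, ["disk0"]))           # True
    # Two failures in different pairs: still recoverable.
    print(raid10_survives(MIRROR_PAIRS, ["disk0", "disk3"]))  # True
    # Both halves of one pair lost: that stripe is gone.
    print(raid10_survives(MIRROR_PAIRS, ["disk0", "disk1"]))  # False
```

Protection against a failed SCSI channel, card, or cable would additionally require the two halves of each pair to sit on separate channels, which this sketch does not model.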
One of the new machines and new RAID arrays is going to be a live backup server. I've been mentioning this for a while. This one would replicate everything, and in case the main server went down, this server would kick in in read-only mode (ad-free for the free forum users, and without charges for the premier). This would help in the case of minor glitches as well as maintenance. And we get a big benefit in that we can take this server down to do a baseline backup. Usually, we do this once a month and it requires an hour or two to copy all the files. Previously, we would have to take down the whole site. Soon, we will only have to take down this backup server and nobody will notice. In an emergency, this server could become the main server. In December, we will try to host a third backup server in our present (then to be old) colocation location. Then in the case of massive failure (complete building destruction), we will still have a live, up-to-date backup with which we will not have to do a recovery operation.
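The read-only failover described above can be sketched roughly as follows. This is only an illustration under simple assumptions: the class and method names are hypothetical, replication is modeled as a synchronous in-memory copy, and nothing here reflects how Network54's software is actually written.

```python
class ReadOnlyError(Exception):
    """Raised when a post is attempted while only the backup is available."""


class Server:
    def __init__(self) -> None:
        self.messages: list[str] = []
        self.up = True


class Forum:
    """A main server plus a live backup that serves reads if the main fails."""

    def __init__(self) -> None:
        self.main = Server()
        self.backup = Server()

    def post(self, text: str) -> None:
        if not self.main.up:
            raise ReadOnlyError("forum is in read-only mode")
        self.main.messages.append(text)
        if self.backup.up:
            # Replicate every change to the live backup.
            self.backup.messages.append(text)

    def read(self) -> list[str]:
        if self.main.up:
            return list(self.main.messages)
        if self.backup.up:
            return list(self.backup.messages)
        raise RuntimeError("no server available")

    def resync_backup(self) -> None:
        # After the backup was offline (e.g. for a baseline backup),
        # bring it up and copy over what it missed.
        self.backup.messages = list(self.main.messages)
        self.backup.up = True


if __name__ == "__main__":
    forum = Forum()
    forum.post("first message")
    forum.main.up = False   # simulate a main-server outage
    print(forum.read())     # reads are still served by the backup
```

Taking the backup offline for a baseline backup does not interrupt posting in this model; `resync_backup` stands in for whatever catch-up step a real replica would need afterwards.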
Our data recovery operation took quite a bit longer than we could have dreamed. A recovery requires using a baseline backup and applying the changes that are logged in special files. Due to the heat damage, all the backups on site were no good. It would take a while to discover this, as we applied the changes file to a backup and had additional failures. Luckily, we also walk in there and do backups onto a portable Macintosh. And we bring that home and make another backup to another computer. Murphy's law hit: the Mac with MacOS X failed and went straight into console mode. Fast forward a few hours and we are happily doing the recovery. We go home. Only to discover that the recovery operation had problems. So back across town. And another recovery. Which worked. Finally. Stats for the last month are gone, but everything else seems to be on par.
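The recovery procedure mentioned above (start from a baseline backup, then apply the logged changes in order) can be sketched like this. The log format, with "put" and "delete" entries keyed by a message id, is an assumption made purely for illustration; the letter does not describe the real file format.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical log entry: (operation, message id, message text or None).
LogEntry = Tuple[str, str, Optional[str]]


def recover(baseline: Dict[str, str], change_log: List[LogEntry]) -> Dict[str, str]:
    """Rebuild current state by replaying the change log over a baseline copy."""
    state = dict(baseline)  # never modify the baseline backup itself
    for op, key, value in change_log:
        if op == "put":
            if value is None:
                raise ValueError(f"put for {key!r} has no value")
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            # A damaged or unknown entry stops the recovery instead of
            # silently producing wrong data.
            raise ValueError(f"unknown log operation: {op!r}")
    return state


if __name__ == "__main__":
    baseline = {"msg1": "hello", "msg2": "old text"}
    log: List[LogEntry] = [
        ("put", "msg2", "edited text"),
        ("put", "msg3", "new post"),
        ("delete", "msg1", None),
    ]
    print(recover(baseline, log))
```

Because the replay starts from a copy, a failed attempt leaves the baseline untouched, so the recovery can be retried from the same backup.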
And, unfortunately, we have some more downtime ahead. This weekend, we will be installing an additional UPS and the new hardware. And we will make another baseline backup. God willing, that will be the last downtime we will have until December (that move will still have the site running in read-only mode while the servers are physically moved).
We know that you are angry. So are we. (We are also incomprehensibly tired.) We have taken drastic action (both the move and the new equipment) to make things good. And we will not stop making things better.
We know many either have left or will leave us. All the rest will get more stable and even faster service.