Goplan down temporarily [Solved]

Fred Oliveira on June 27, 2007

A couple of hours ago the Goplan servers at Amazon had a hiccup and Goplan is down. Our team is working on getting everything back, so bear with us for a while. We’ll keep this blog post updated with news until everything is back as it should be.

Update, 1pm GMT (5am PST): Apparently this isn’t a problem with Goplan itself but with a few of Amazon’s servers. They seem to have had a hardware failure and are restoring things. We’re doing the same on our end by bringing up new servers for Goplan.

Update, 3:30pm GMT (7:30am PST): We’re almost done restoring backups from earlier this morning and ensuring there was no data loss. Goplan should be back up within an hour. We’re also taking active measures to make sure that future problems with Amazon EC2 (which we use as our infrastructure) don’t affect you guys.

Update, 4:30pm GMT (8:30am PST): We’re back up! Again, sorry about the inconvenience this downtime may have caused - as said previously, we’re taking the necessary steps to guarantee this doesn’t happen again.

Comments:

We were going over our various options for online project management yesterday and were all set to sign up today, but then we saw this. We like the product, but I’m sure you can understand our trepidation with having seen your system go down.

What I’m suggesting is that you be completely open about the steps being taken to make sure this doesn’t happen again. Obviously, there needs to be redundancy. So, what steps are going to be taken and when do you expect to have the redundancy in place to avoid future outages?

Good question, Christian. It is only fair to say, though, that we did our best to keep everyone in the loop for the 4 hours the system was actually down. That being said, and to answer your question directly, we are redundant already - allow me to explain:

Goplan runs on Amazon’s EC2 platform, meaning it can scale horizontally indefinitely. The number of servers scales up automatically as the load goes up, so that the speed and quality of the service are maintained. The problem today was more serious than a load spike: Amazon’s EC2 platform itself failed, with physical machines going down, which left our own system inaccessible.

We would have been able to bring Goplan back up instantly had the affected server not been the DB server, which we don’t replicate (and which, for stability and security, is also not accessible from the outside world). Getting Goplan back up was a matter of getting new instances for the database and restoring from the data backup (which happens every 5 minutes).

Now that the problem has been detailed, here’s the solution: in addition to replicating the application servers, we will start replicating the DB server as well. What this means is that, should there ever be any problem with a database server, we’ll have another one ready, so a failure like today’s should not cause downtime.

We can never (as I’m sure you’ll imagine and understand) guarantee 100% uptime, but this was 4 hours of downtime in over 8 months of operation - which is 99.93% uptime. We have always been diligent in making sure the quality of our service is the best possible, and today we were just as diligent in making sure that something like this will not happen again and that no user comes out losing :-)
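For anyone who wants to check the 99.93% figure, here is a rough sketch of the arithmetic. It assumes 30-day months and exactly 8 months of operation (both are simplifications made for the example; the helper name `uptime_percent` is likewise made up for illustration). Since the service has been running for a bit over 8 months, the real number would be slightly higher.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a month is treated as 30 days


def uptime_percent(downtime_hours: float, total_hours: float) -> float:
    """Return the share of time the service was available, as a percentage."""
    return 100.0 * (1 - downtime_hours / total_hours)


# 8 months of operation, expressed in hours (8 * 30 * 24 = 5760)
total_hours = 8 * DAYS_PER_MONTH * HOURS_PER_DAY

# 4 hours of downtime out of 5760 hours
print(round(uptime_percent(4, total_hours), 2))  # 99.93
```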

All this being said, I’m sure you can ask other people who’ve emailed our support addresses with questions about how passionate we are about making everybody happy - we’ll do the same with you, to the best of our abilities :-)

Hey Fred,

Thanks for your candor. It’s great to see you respond not only quickly, but so thoroughly. We’ll be signing up shortly and giving it a run.

Regards,
Christian
