contegix: beyond managed hosting

Archive for June, 2008

We’re often asked at Contegix, “Do you perform automatic upgrades of Application XYZ?”, and our answer is always an emphatic “No”. This tends to spark some debate, since we do perform RHEL updates automatically. First, let’s define “automatic”, because obviously we’re not shutting down instances or servers without explicit permission from you or your team. For standard RHEL updates, we inform you once the updates have passed a rigorous round of testing and have both the Red Hat and Contegix internal “go-ahead”, at which point we need to perform the updates on your servers. We consider these mandatory for security reasons. Red Hat doesn’t push superfluous updates down the pipe to your servers. They’re generally provided for very specific purposes, and the number one reason is security. We can push these updates because 99% of the time the end user (you) won’t even notice the difference. On rare occasions an update may have an odd effect, but I’d like to stress that this is exceedingly rare.

Let’s compare this to… well, -any- web application you’re running right now. First off, keeping up with what’s running on every customer’s server is a massive chore in and of itself. Maintaining that list, and checking that every web application is running the newest version, is a research nightmare. Obviously, we’re aware of the big applications (aka our managed applications), such as the Atlassian suite of applications, WordPress, Jive’s suite of applications, and so on. Keeping tabs on all the various web applications in use, and their version numbers, is a bit rough, but it’s something we plan on tackling in the future. The real problem, however, lies in the following question: “Do you really want to upgrade?”

The problem is that many applications have introduced the wondrous world of plugins. Honestly, from our side of the fence, plugins create a lot of havoc. For one, they’re not always supported by the main developers of the application in question, which restricts the level of support we can offer for a product using them. Secondly, they make application upgrades comparable to a roller coaster where the cars may or may not come unhinged from the track, sending you careening into a brick wall. That’s not to say we don’t like plugins, because we love plugins. For instance, the WordPress Automatic Upgrade Plugin turns WordPress upgrades into a quick five-minute job. No need to ask us to upgrade your WordPress, take backups, and hope that we catch any theme changes that need to be made in the process. Instead, with a few button clicks this plugin will complete the upgrade in no time flat, bringing you to the latest version of WordPress. I’ve used it on my personal blog a couple of times now, and it worked flawlessly. Obviously, your mileage may vary, but if nothing else it performs backups before it does anything, so if the upgrade fails, reverting is a snap.

Why on Earth would a WordPress upgrade fail though? Plugins. It’s the same reason we have upgrade problems with any application we work with: plugins inherently create issues for upgrade procedures because they introduce new quirks that may fail when the core application is upgraded. Depending on how integral that plugin is to your application instance, this could turn an upgrade into a complete failure. A default instance of Confluence/JIRA/Crowd upgrades smoothly, with no problems to worry about. An instance of Confluence with a bunch of plugins, theme changes, and so on, however, tends to be a bit more interesting. It’s not really Confluence’s fault; in fact, it’s quite likely that the weird plugin you were skeptical about installing is breaking something internally, thus causing the upgrade to fail. More often than not, though, Confluence upgrades fail due to heavy edits to themes, generally via the Theme Builder plugin. Theme anomalies appear when the Theme Builder plugin is out of date or not functioning properly, and changes in Confluence between versions contribute as well, such as in 2.8 when the theme was prettied up quite a bit (nice job Atlassian!). All of a sudden, what should have been an easy, smooth ride results in an extra half hour of downtime as we scramble to fix the problems. Then we have to decide whether to roll back or work through the issues.

This is why we generally frown on automatic upgrades: plugins throw a significant curve ball into the mix that we can’t foresee. If keeping up with every web application is a documentation job of epic proportions, imagine trying to track plugin compatibility, the plugins installed, and the ones not installed on all customer Confluence instances! We like to keep downtime to an absolute minimum, which we hope is half the reason you’re with us, and that’s why we avoid automatic upgrades. Instead we encourage staging instances, scheduled tasks, and taking each upgrade on a case-by-case basis. Do you want us to merely say “Confluence 2.8.1 is out, and we’ll be upgrading you on MM/DD/YYYY at 00:00”? We believe it’s in everyone’s best interest for you to decide when to upgrade, and to let us know. We’ll work through the process with you, check compatibility/dependency issues, and schedule the event for a time that suits your needs best. If you’d like to see it staged first, that’s fine too. We’re more than happy to set up a small staging instance for the upgrade when necessary, assuming it’s not detrimental to the overall health of the server. We want to work with you as much as we work for you and your company. If you have any thoughts or suggestions on our upgrade procedures, feel free to drop them in the comment box!
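To make the compatibility check mentioned above a little more concrete, here is a minimal Python sketch of the kind of pre-upgrade question we ask: which installed plugins are not known to support the target version? The plugin names, the `max_supported` table, and both function names are hypothetical and exist only for illustration; this is not a real Confluence or Atlassian API, and a real check would draw on each plugin’s own compatibility notes.

```python
def parse_version(text):
    """Turn '2.8.1' into (2, 8, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def upgrade_blockers(installed_plugins, max_supported, target_version):
    """List installed plugins that are not known to support the target version."""
    target = parse_version(target_version)
    blockers = []
    for name in installed_plugins:
        supported = max_supported.get(name)
        if supported is None:
            # No compatibility record at all: treat as a blocker to investigate.
            blockers.append((name, "compatibility unknown"))
        elif parse_version(supported) < target:
            blockers.append((name, f"supported only up to {supported}"))
    return blockers


# Hypothetical data for illustration only.
installed = ["theme-builder", "calendar", "custom-macro"]
max_supported = {"theme-builder": "2.7.3", "calendar": "2.8.1"}
blockers = upgrade_blockers(installed, max_supported, "2.8.1")
```

An empty list would suggest the upgrade is a reasonable candidate for scheduling; anything else is a reason to stage the upgrade first and talk it over.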

We’ve spoken in the past about Hyperic monitoring, and the rollout of this application to our managed customers. I felt that Hyperic is so slick it deserved a closer look. Some of our customers have already been given access to our monitoring system, and from the feedback we’ve received they’re quite ecstatic with it. That’s not to say that there aren’t some kinks, because there are, but they’re small enough to be almost unnoticeable. Hyperic is always improving though, and we’re doing our best to exploit the very best of this application to better serve your infrastructure here at Contegix. The servers that have Hyperic configured on them have a wide range of monitoring options such as:

  • CPU Monitoring
  • Load Averages
  • Filesystem Usage
  • Database Monitoring (MySQL, PostgreSQL, EnterpriseDB, etc)
  • HTTP Checks
  • Zimbra 4.X
  • IMAP, POP3, SMTP (on any obscure ports imaginable!)
  • Memcached
  • Tomcat
  • Resin
  • Apache HTTP
  • And so many more options you would die reading the list

We receive well over a thousand emails a day from our monitoring system, letting us know when your servers leave acceptable levels in a wide variety of categories. This allows us to be proactive about your server’s health and attack trouble areas before services are impacted. For instance, if we see the load on your server climbing above the typically acceptable level of 5 and staying there, we know to investigate before services are affected.
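As a rough illustration of the “climbing above 5, and staying high” rule, here is a minimal Python sketch. It is not how Hyperic implements its alerts; the class name and the three-sample window are assumptions made for the example, and only the threshold of 5 comes from the text above.

```python
from collections import deque


class SustainedLoadAlert:
    """Flag a server only when load stays above a threshold for several samples."""

    def __init__(self, threshold=5.0, window=3):
        self.threshold = threshold  # load level mentioned in the text above
        self.window = window        # assumption: consecutive samples required
        self.samples = deque(maxlen=window)

    def observe(self, load):
        """Record one sample; return True once a full window is entirely high."""
        self.samples.append(load)
        return (len(self.samples) == self.window
                and all(s > self.threshold for s in self.samples))


alert = SustainedLoadAlert()
readings = [2.1, 6.3, 4.0, 5.5, 7.2, 8.9]  # made-up load averages
results = [alert.observe(r) for r in readings]
```

Only the final reading trips the alert, because it is the first point at which three consecutive samples all sit above the threshold; a single spike such as the 6.3 is ignored. On a Unix host, `os.getloadavg()[0]` could supply the one-minute load average as the sample.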

Most importantly though, if you’re monitored by our system, you don’t have to deal with the awkward situation of your website’s visitors telling you your site is down. If Hyperic is monitoring your site, the site is checked every 5 minutes to make sure it responds and that the response contains a search string that should appear on your site. If the monitor fails, we’re alerted immediately and respond to the situation. If you have special instructions for us, we make every effort to follow them to a tee, and if you don’t, we’ll handle the situation the best way we know how to return your site to working order. For instance, on typical Java applications, we’ll thread dump the instance, restart it, and notify you of the maintenance that was performed.
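The check described above boils down to two questions: did the site respond, and does the page contain the expected string? Here is a minimal Python sketch of that idea using only the standard library. It is not Hyperic’s implementation; the function names and the 10-second timeout are assumptions for the example, and a real monitor would call `check_site` from a scheduler every `CHECK_INTERVAL_SECONDS`.

```python
import urllib.error
import urllib.request

CHECK_INTERVAL_SECONDS = 5 * 60  # the five-minute interval mentioned above


def evaluate_response(status, body, search_string):
    """Return (ok, reason) for one check result."""
    if status != 200:
        return False, f"unexpected status {status}"
    if search_string not in body:
        return False, "search string not found"
    return True, "ok"


def check_site(url, search_string, timeout=10):
    """Fetch the page once; any network error counts as a failed check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body, search_string)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return False, f"unexpected status {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections, and timeouts.
        return False, f"no response: {exc}"
```

Checking for a search string matters because a server can happily return a page that is really an error message; a bare “did it respond?” test would miss that case.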

I do admit though, as much as we strive to be, we’re not always perfect. At times we require assistance from you and your team to help us be the best that we can be. While many servers at Contegix follow the Contegix way of doing things, not everything matches exactly what we’re accustomed to. That’s okay, we don’t mind; after all, these are your servers! However, for us to monitor your services to our fullest potential, we encourage you to let us know what needs to be monitored. Even if you don’t have a special setup, we don’t mind you checking with us on what’s being monitored. In fact, I encourage that too! We want you to feel comfortable here, so if you’d like to double check that everything you need monitored is monitored, drop us a line. There’s absolutely no harm in that, as it ensures that nothing is missed and that we’re serving you to the best of our ability. Please keep in mind though that the Hyperic agent is a Java application, so running it on your server requires a small amount of memory. If you already have a heavily taxed server, throwing the Hyperic agent into the mix may not be a good idea, but I believe this to be a very rare situation.

Finally, maybe the coolest part of Hyperic is that we can give you access to the system as well! This lets you see the metrics that the monitoring system produces for your servers. The access granted to you is read-only, so you can’t create sensors, but you can always ask for new ones (again, it’s encouraged!). This ability has already helped a few of our customers by giving them insight into how their services were behaving, allowing them to clean up trouble spots in their applications and infrastructure. All you need to do to gain access is drop a line to support@contegix.com, and we’ll be happy to get it set up for you. Let’s take a look at Atlassian for a perfect use-case scenario in which Hyperic can be of great assistance.

Their documentation has a section on monitoring critical production systems. If you visit that section you’ll notice the power of Hyperic on display in the images shown. In that article they go on to describe one particular scenario in which the graphs enabled them to catch a critical issue with one of their instances, which gave them a nudge in the right direction towards correcting the problem. Furthermore, Hyperic themselves noticed Atlassian’s documentation, and hinted at a potential pair of plugins for monitoring Confluence and JIRA in particular! Just remember, we’re here to help you improve in any way possible. Drop us a line, and get more from your hosting environment with Hyperic access!