This post details the recent problems on the server sh2.us.thiswebhost.com, the steps we took to investigate them, and the resolution we have put in place to prevent them from recurring.
On October 27th, the server sh2.us lost connectivity to services, impacting its ability to serve websites, e-mail, FTP and other services. Our monitoring detected this within minutes and alerted us to the problem. Initially this looked like a false report, as we were able to ping the server and bring up a remote console to the machine. However, we could not properly access services, and even a login at the remote console would hang. We then spent a great deal of time investigating network connectivity to ensure that the network was not at fault. Finally, we decided to initiate a “hard reboot” (essentially the same as hitting the ‘reset’ button on your PC) to see if this would help. The server then came back online successfully and connectivity was restored.
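The symptom above, a host that answers ping while its services hang, is why a service-level check is useful alongside a plain ping. The following is a minimal illustrative sketch, not a description of our actual monitoring. The helper name `service_responds` is hypothetical, and the sketch assumes the service sends a greeting banner on connect (as FTP, SMTP and SSH typically do), so it would not suit HTTP without modification.

```python
import socket


def service_responds(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True only if the service accepts a TCP connection AND sends data.

    A host can answer ping while its services are hung, so this check
    waits for an actual greeting banner instead of trusting reachability.
    Assumption: the service greets first (FTP, SMTP, SSH style).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(128)  # blocks until data arrives or timeout
            return len(banner) > 0
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

A check like this could be run against, say, port 21 or 25 of a server; a `False` result while ping still succeeds points at the services or the operating system, not at the network.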
The following morning (October 28th), at exactly the same time, the server lost connectivity again. We went through the same steps as before, but also contacted our server/network provider again to ask whether any maintenance or other network issue was occurring. After receiving word that there were no network issues, we again initiated a hard reset and services were restored.
On the morning of October 29th, at exactly the same time again, the server lost connectivity once more. Having ruled out network problems over the previous two days, and given that this was happening at exactly the same time each day, we decided to investigate the possibility that a scheduled task or “cron job” was running at this time and effectively bringing the server down. We looked through all of the cron jobs running at that time and could see nothing out of the ordinary. All of them appeared to be standard, legitimate jobs, and no customer jobs that might cause problems were running at that time. We then began going through the list of cron jobs and executing each command manually, hoping to reproduce the issue. We were fortunate and found the particular task causing the problem: cloudlinux-summary.
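The first step described above, narrowing down which cron jobs fire at a given time of day, can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure we actually ran: the function names are hypothetical, the input is assumed to be in user-crontab format (five time fields followed by the command), only numeric minute and hour fields are checked, and shortcuts such as `@daily` and the day or month fields are ignored.

```python
def field_matches(field: str, value: int, low: int, high: int) -> bool:
    """Check one numeric cron field; supports '*', lists, ranges and '/step'."""
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            start_text, end_text = rng.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = int(rng)
            # "5/10" runs from 5 up to the maximum; a plain "5" matches only 5.
            end = high if step_text else start
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def jobs_at_time(crontab_text: str, hour: int, minute: int) -> list[str]:
    """Return commands whose minute and hour fields match the given time."""
    candidates = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # environment assignments and malformed lines
        minute_field, hour_field, command = parts[0], parts[1], parts[5]
        try:
            if (field_matches(minute_field, minute, 0, 59)
                    and field_matches(hour_field, hour, 0, 23)):
                candidates.append(command)
        except (ValueError, ZeroDivisionError):
            continue  # non-numeric or invalid fields are outside this sketch
    return candidates
```

Feeding it the output of `crontab -l` for each user, together with the hour and minute of the outage, gives a shortlist of commands to run by hand one at a time. It only narrows the search; as in our case, the offending job may look perfectly ordinary in the list.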
CloudLinux is the OS/kernel we use on all of our shared servers, as it allows us to restrict the maximum resources (CPU, RAM, I/O) each customer can use. This prevents a single user from exhausting all of the available resources and slowing other sites to a crawl or bringing them down completely. We’ve been using CloudLinux for many years and are extremely happy with the product, which is used by thousands (possibly tens of thousands or more) of hosting providers worldwide. We were unfamiliar with the cloudlinux-summary task, and on further investigation it appears to have been added towards the end of October via an automatic update. Strangely, the same task was added to our other servers that run CloudLinux, but it was not causing any issues there. At this point we opened a ticket with CloudLinux to report the problem and see if they could offer any advice. In the meantime, we disabled the cron job and added a parameter to the server’s sysctl.conf file to prevent CloudLinux from adding the cron job again through an update. This resolved, albeit temporarily, the problem of the server losing connectivity at a specific time every day.
Disabling the task was not a “fix” in our minds. While it prevented the connectivity issues, we still didn’t know the exact cause of the problem, nor whether it was a sign of a greater issue elsewhere that might appear again in the future. We continued to work with CloudLinux, who asked us for diagnostic and debug information from the server so they could try to reproduce the problem on their side, in the hope of ultimately producing a software fix. The difficulty was that we needed to replicate the environment and situation in order to gather the necessary information. Over the course of the next couple of days (concluding today) we manually executed the cron job to gather more information. This did cause sporadic and brief downtime for our customers, and we sincerely apologise for this. Despite our best efforts, and with CloudLinux’s assistance, we were unable to generate any meaningful diagnostic data that could pinpoint the problem.
While performing some other routine tasks on sh2.us, we discovered that one of the server-side CloudLinux reporting tools, which we use perhaps once a month, was not working correctly. When a specific combination of options was selected, the server would lose all connectivity, just as it had when the cron job was executed. We concluded that there was an issue with the CloudLinux installation on the server: either it had been corrupted, or it was otherwise unstable when running specific tasks, despite appearing to function fine elsewhere. This is something we haven’t seen before in almost 9 years of using CloudLinux, and CloudLinux had not seen such behaviour either. Finally, we chose to re-install CloudLinux from scratch to ensure that we had a fresh and working deployment. This is what caused the significant downtime today, and again we are incredibly sorry for the trouble and inconvenience that this may have caused.
Re-installing CloudLinux has resolved all of the problems we previously had, both with the cron job and with the reporting tool. Everything now appears to be running correctly and without issue. We will, of course, continue to work with CloudLinux to try to determine what happened here. While it’s not the perfect resolution we had hoped for, we strongly believe that this is one of those extremely rare things that happen from time to time, and there is no indication that it is likely to happen again.
We do not anticipate any further disruption to services, and once again are incredibly sorry for this situation and the downtime that was experienced by customers on this server. We are committed to providing you with a quality service and hope that this blog post provides some insight into the problem as well as our eagerness to keep you informed of issues such as this.
As of this blog post, sh2.us is online and should be operating perfectly. If any customers are experiencing issues with services on sh2.us following this blog post, please don’t hesitate to open a support ticket from within our client area and we’ll investigate as soon as we can.
Thank you for choosing ThisWebHost.