In my previous blog post I explained how we made the decision to roll out OnApp as our underlying cloud platform rather than continuing the DIY approach. In this post I will look at how we familiarised ourselves with OnApp and the types of testing we went through before rolling it out.
When integrating new software or systems, it is absolutely necessary to do major testing in stages – even with an ‘out of the box’ solution, as the testing gives you the following benefits:
- Gets the staff who will be supporting this system fully familiar with the ins and outs of the system
- Lets you understand the ‘limits’ of what the system can and can’t do
- Gives you an opportunity to document and modify your business practices to account for the new system
By achieving the above, you can better support your customers thereby giving them a much better experience. Given one of our values is to be customer centric, this really was critical to the success of our Blaze Cloud VPS platform.
Before we dived in head first, we came up with some key areas that we needed to focus on during our testing campaign:
- Environment Setup
- Performance and Limitations
- Failure and Restoration
Environment Setup
This initial stage of learning the system is critical. This part of the testing was all about learning how to install OnApp on our infrastructure and ensuring it was optimised and fully compatible. This involved installing it, breaking it, installing it again, breaking it and installing it again, to the point where you can do it with your eyes closed or simply by reading the detailed documentation that we’ve prepared along the way.
This also included upgrades, ensuring we understood what needed to happen to upgrade OnApp to a new platform.
Further, it was not just about how to install the software but also how to configure all of OnApp’s many options. So we tested them all, even if we were positive we would not be using them in our production environment. For all we know, one seemingly unrelated option can have a knock-on impact on another, so it is best to know them all!
For example, OnApp has advanced security and, as a result, has very granular permissions for user access. There are hundreds of user permissions, so we need to be familiar with the majority of them to ensure we can troubleshoot any issues that may arise.
Ultimately, this part of the testing phase is always a good opportunity to get familiar with the software’s documentation.
Performance and Limitations
Every company installing a new system should do this. Full stop. OnApp and external research provide insights into what the recommended system requirements will support in terms of running the OnApp software; however, you always need to remember that every setup is different. Different hardware, different network, different optimisations, so you need to ensure your ‘configuration combination’ has what it takes to handle the real world.
For us, this was all about understanding where the bottlenecks were in both the OnApp software and the Xen virtualisation software that OnApp communicated with.
We did all sorts of tests, including but not limited to:
- Seeing how many 128MB memory based CentOS VMs we could deploy to a 144GB memory based Blade just to see what would happen
- Queueing up hundreds of tasks in the OnApp system to see how long those tasks would take, and which tasks took longer than others
- Enabling OnApp backups on 500 VMs, and letting the system try to process them. Does it struggle, or does it perform well?
- Deploying OS templates from different storage locations to see how that impacted deployment times. What is quickest?
- Running 50 VMs on every Blade, all utilising 100% of their network input, just to confirm the bandwidth usage statistics worked, and to see what our network utilisation looked like
- Checking control panel performance when supporting 500 VMs across multiple hypervisors. Would it slow down?
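Before running a density test like the first one in the list above, it can help to work out the theoretical ceiling so you know roughly what number to expect. The following is a minimal sketch in Python, not something taken from OnApp or Xen: the function name and the `reserved_mb` parameter (memory held back for the hypervisor itself) are assumptions made for illustration, and memory is only one of several limits a real deployment will hit.

```python
def max_vms_by_memory(host_memory_gb: int, vm_memory_mb: int, reserved_mb: int = 0) -> int:
    """Upper bound on the number of VMs a host can hold, judged by memory alone.

    reserved_mb is a hypothetical allowance for the hypervisor and management
    domain; the real figure depends on your own configuration.
    """
    if vm_memory_mb <= 0:
        raise ValueError("vm_memory_mb must be positive")
    usable_mb = host_memory_gb * 1024 - reserved_mb
    if usable_mb <= 0:
        return 0
    # Whole VMs only, so use integer division.
    return usable_mb // vm_memory_mb


if __name__ == "__main__":
    # 144GB Blade, 128MB VMs, nothing held back: the pure arithmetic ceiling.
    print(max_vms_by_memory(144, 128))                    # 1152
    # Same Blade with an assumed 4GB held back for the hypervisor.
    print(max_vms_by_memory(144, 128, reserved_mb=4096))  # 1120
```

The gap between a figure like this and what the Blade actually sustains in testing is the interesting part, as it points at the CPU, storage or software bottlenecks that memory arithmetic cannot show.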
Additionally, as we are running the SolidFire storage system with our OnApp system, we had to do additional testing around the OnApp + SolidFire integration.
Failure and Restoration
When we talk cloud, we talk about uptime. Simply put, things can break and in some cases they do. So before bringing our OnApp cloud online we went through a detailed list of failure scenarios, so that we understood how to resolve them and could estimate how long it would take to restore service should such an event occur.
- Storage failures
- Blade failures
- Network link failures
- Network infrastructure failures
- Storage volume and VM recovery
- Software failures
We went down to the detail of knowing almost exactly how long a single VM will take to ‘start back up’ once an issue was fixed. This is important information if you are ever starting hundreds of VMs.
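To show why the per-VM figure matters, here is a rough sketch in Python of how it scales up to a full restore estimate. The numbers in the example are purely illustrative and are not measurements from our testing. The function name and the `parallel_starts` parameter are assumptions, and the model is deliberately simple: VMs start in equal batches and every batch takes one per-VM start time.

```python
import math


def estimate_restore_seconds(vm_count: int, seconds_per_vm: float, parallel_starts: int = 1) -> float:
    """Rough time to start vm_count VMs when parallel_starts can boot at once.

    Simplified model: VMs are started in batches of parallel_starts, and each
    batch takes seconds_per_vm. Real systems will deviate from this.
    """
    if vm_count < 0 or seconds_per_vm < 0:
        raise ValueError("vm_count and seconds_per_vm must not be negative")
    if parallel_starts < 1:
        raise ValueError("parallel_starts must be at least 1")
    batches = math.ceil(vm_count / parallel_starts)
    return batches * seconds_per_vm


if __name__ == "__main__":
    # Illustrative only: 500 VMs, 45 seconds each, 10 starting at a time.
    seconds = estimate_restore_seconds(500, 45, parallel_starts=10)
    print(f"{seconds / 60:.1f} minutes")  # 37.5 minutes
```

Even a crude estimate like this turns "a single VM takes under a minute" into a realistic answer for customers when hundreds of VMs need to come back at once.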
Whilst nothing major like this is expected to happen, it is better to know than be blind.
Another important factor is knowing whom we can contact should the software produce unexpected results, whether that be bugs, configuration issues or something else.
You should always understand the support processes of any third-party vendors whom you rely on heavily for critical services like our Blaze Cloud VPS service.
Each company’s support operates in a different way. Prior to choosing OnApp we checked which support channels OnApp has, including 24 x 7 support and so on. They ticked all the right boxes for us, but it is one thing to tick a box and another thing to put the support through its paces.
As a result we (on purpose) threw all sorts of questions at the OnApp support team during this testing phase.
We asked about things like:
- Setup questions
- Hardware configurations
- VM recovery / repair
- Best practices
- Driver issues
- Kernel issues
This gave us a better idea of what their support was capable of, and how we should best approach raising particular types of issues. In all cases OnApp support got us the info we needed, or pointed us in the right direction.
Ultimately, we do all of this to ensure we understand our system from all angles. This not only allows us to provide a better platform, but also gives us the necessary background to investigate issues that do crop up. And they do happen: no solution is perfect (although we try!).
What I’ve mentioned in this article is but a snippet of the lengths we went to in order to ensure we fully understood the OnApp software.
Thankfully we had learnt quite a number of lessons from our first in-house cloud platform, so we were able to take that experience and apply it to the roll-out of the OnApp solution.
The more time you can allocate to planning and testing, the better the end result for you and your customers. This is always a tough decision, as businesses are often looking to deliver new products quickly, with a controlled investment in both time and funds. Given the “out of box” design of OnApp, this is easy to do up to a point, but if you are looking to offer a premium solution, you need to put the effort into understanding every aspect of that solution.
I hope this has given you some insight into the effort involved in rolling out a new platform. If you have any questions on any particular areas, please don’t hesitate to leave a comment.