Updating the Bugfender infrastructure
In this post we’ll be highlighting some of the recent improvements we’ve made to Bugfender’s infrastructure. We’ve written it to ensure total transparency for our readers and (we hope) inspire other companies to make their own internal improvements.
In our transition from fast-growing startup to stable, scalable company, we decided to make Docker images for every component of the platform.
This helps a lot in testing because we can spin up the exact image that will be used in production and test it. Even more importantly, it removes any dependency on the configuration of the host machine. Bugfender doesn’t need any package or configuration on the host machine besides Docker.
Kubernetes and Rancher
We run the containers with Kubernetes and Rancher. This lets us document the whole platform as code and separate configuration from infrastructure.
We can now deploy new versions of Bugfender without downtime and we can roll back effortlessly when something is not right.
Another important advantage is that the Kubernetes containers can run on multiple servers transparently, enabling us to scale the platform horizontally when needed. And of course, we can run the exact same platform multiple times, for example for the development and staging environments.
Automated host machines
In our experience, the host machine configuration doesn’t matter that much. But still, setting up the operating system, fine-tuning the kernel details, installing Docker, etc. takes a lot of time. Not to mention that it is a repetitive and error-prone task.
From this perspective, host machine configuration looked like a perfect task to automate. And we did it with Terraform.
This has removed the typical errors that happened from time to time when setting up new machines manually. In addition, it has sped up the process of adding a new machine to the cluster whenever one is required.
Replicating everything x3 and finding the perfect location
Bugfender was previously operating with redundant hardware in a data center, plus a disaster recovery plan in a different data center.
Data centers usually have everything redundant: at least two power grid connections, two backup power generators, two internet connections... you get the gist. A failure in any one, or even several, of those elements does not usually prevent the data center from working. Also, we used to have multiple servers and multiple copies of the data, so a total failure would be quite unlikely, right?
Well, one of the biggest electrical storms ever recorded in Spain happened in the middle of summer, while everyone was using air conditioning. The storm made the electrical grid fail. Several transformers burned out and there was a general blackout in the area that lasted for almost a day, affecting our data center.
This usually wouldn’t be a big issue, since the data center can operate on batteries and backup generators for several days if necessary. However, other elements of the center were damaged by the lightning strike and triggered a cascade of failures, resulting in the data center blacking out.
Unfortunately, this event caused the longest downtime we’ve ever suffered at Bugfender. It lasted several hours, and our clients lost access to the Bugfender console for a while.
Fortunately, the Bugfender SDKs are prepared for these rare events, and they retained the information until the servers were back online. However, we still needed to do something to be prepared for a hypothetical future storm.
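To illustrate the idea of an SDK holding on to data during an outage, here is a minimal Python sketch. It is not the Bugfender SDK code: the class name `BufferedLogClient`, the `send` callback and the `max_buffered` limit are all hypothetical, and we simply assume that `send` returns `True` when the server accepted a line.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferedLogClient:
    """Hypothetical client that keeps log lines locally until the server accepts them."""

    def __init__(self, send: Callable[[str], bool], max_buffered: int = 1000) -> None:
        # `send` is an assumed transport callback: True means "server accepted the line".
        self._send = send
        # Bounded queue: when it is full, the oldest pending lines are dropped.
        self._pending: Deque[str] = deque(maxlen=max_buffered)

    def log(self, line: str) -> None:
        """Queue a line and immediately try to deliver everything that is pending."""
        self._pending.append(line)
        self.flush()

    def flush(self) -> int:
        """Deliver pending lines in order; stop at the first failure. Returns lines sent."""
        delivered = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # server unreachable: keep the line and retry on the next flush
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> List[str]:
        """Snapshot of the lines that have not been delivered yet."""
        return list(self._pending)
```

While the server is down, every call to `log` just grows the local queue. Once `send` starts succeeding again, the next `flush` delivers the backlog in its original order.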
Not being data center experts ourselves, we learned a valuable lesson that day. It turns out that in Spain you cannot choose the source of your electricity. There is a single electrical grid, connected to everything. But while this guarantees that blackouts are very rare events in Spain, even when different power sources fail, rare is not the same as impossible.
We started to research the perfect place, and it turns out that this place is in the Netherlands. Bugfender is now running in three twin data centers, all supplied by different power sources and served by different network operators. Still, they are close enough to keep the ping low and the synchronization performance excellent.
So, Bugfender now replicates all the data intelligently across the three data centers and it can keep working even if one of the data centers goes completely offline. How amazing is that?
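As a rough illustration of why three copies tolerate the loss of one site, here is a small Python sketch. It does not describe Bugfender’s actual replication logic: the `DataCenter` class, the `replicated_write` function and the majority rule are all assumptions made for the example.

```python
from typing import Dict, List


class DataCenter:
    """Hypothetical in-memory stand-in for one data center."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.store: Dict[str, str] = {}

    def write(self, key: str, value: str) -> bool:
        """Store the value and acknowledge, or refuse if the site is offline."""
        if not self.online:
            return False
        self.store[key] = value
        return True


def replicated_write(centers: List[DataCenter], key: str, value: str) -> bool:
    """Send the write to every data center; succeed if a majority acknowledged it.

    The majority rule is an assumption for this sketch. A real system would
    also have to bring an offline replica up to date once it comes back.
    """
    acks = 0
    for dc in centers:
        if dc.write(key, value):
            acks += 1
    return acks > len(centers) // 2
```

With three sites, a majority is two, so a write still succeeds when any single data center is down. With two sites offline, the same rule rejects the write.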
As we were moving everything to another data center, this was the perfect opportunity to encrypt all the data at rest. This was not strictly necessary because Bugfender runs on dedicated servers and physical and logical access to them is already strictly controlled. But extra layers of security are always a good idea, so we went for it.
Bugfender is now more stable than ever (you can check it here, and here) but there are always opportunities to improve, and we expect to do so in the coming months. Here is what’s keeping us busy:
- We will continue to accommodate the steadily increasing number of customers by evolving our data center and services architecture.
- We will open new data centers for geographic localization and maybe take advantage of cloud services for elasticity.
- We are adopting the ISO 27001 standard to provide comprehensive risk management across the whole company. This risk management will give us a continuous stream of security improvements, both in the organization’s policies and procedures and in the Bugfender product itself.