An IT system is a collection of interrelated elements in which data flows or is processed automatically, most often using computer technology. IT systems consist primarily of hardware and the software installed on it.
What should a successful IT system be like? Reliability is one of the key desirable features. The system should work at all times, and it should always work correctly. Is that even possible?
Systems are created by people who make mistakes from time to time, and so they fail sometimes. There is a popular saying: “those who make no mistakes do not usually make anything.” Various types of errors may be committed at every stage of creating and managing a system, adversely affecting its operation.
System reliability also depends on the infrastructure used, which covers components such as application servers, as well as all the network devices and transmission channels that connect these servers with end users. Each element of the infrastructure can fail due to various internal and external factors.
So we have determined that all IT systems are built of unreliable components by people who commit errors at times. Is it possible to create a reliable system then? I could say ‘no’ and end this article here. However, the topic is a bit more complex and a deeper analysis of it may lead us to slightly different, interesting conclusions.
It is not possible to create a system that is 100 percent reliable, but it is possible to create one that is sufficiently reliable within an assumed level of tolerance of, say, 99.9 percent. Appropriate mathematical models make it possible to calculate how a system should be built and what resources it needs in order to meet specific reliability criteria. At the same time, recent technological advances have made it possible to reduce the cost of building systems with ever higher reliability. So how should the system be developed?
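To make a tolerance level such as 99.9 percent more tangible, it can be converted into a yearly downtime budget. The following is a minimal sketch; the function name `allowed_downtime_hours` and the 365-day year are assumptions made for illustration, not part of any standard.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget (in hours) implied by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


# Compare a few common targets.
for target in (0.99, 0.999, 0.9999):
    hours = allowed_downtime_hours(target)
    print(f"{target:.2%} available -> {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 99.9 percent target leaves a little under nine hours of downtime per year, which shows how quickly each additional "nine" tightens the requirements.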
The key word is ‘redundancy’, which means deliberately using more resources than are strictly needed. For example, if a component has a 50 percent chance of failing, duplicate it so that either of the two components, operating simultaneously, is sufficient to keep the system working properly. Assuming the two components fail independently of each other, the risk of failure drops from 50 to 25 percent. If you add a third component working simultaneously, the risk is halved again, to 12.5 percent. And so on… This way, you could theoretically reduce the risk of a given type of failure to zero, and thus ensure 100 percent reliability of a given part of the system, by using an infinite number of duplicate elements. This is obviously impossible to achieve, but you can come as close to that ideal as you want, provided that you have the budget for purchasing the required components, housing them, and supplying them with power.
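The arithmetic above can be sketched in a few lines. This is only an illustration and rests on one strong assumption: the components fail independently of each other, which real components sharing a rack, a power feed or a software bug often do not. The function names are hypothetical.

```python
def combined_failure_probability(p_fail: float, copies: int) -> float:
    """Probability that all `copies` independent components fail at once."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return p_fail ** copies


def copies_needed(p_fail: float, max_risk: float) -> int:
    """Smallest number of independent copies that keeps the risk <= max_risk."""
    if not (0.0 < p_fail < 1.0 and 0.0 < max_risk < 1.0):
        raise ValueError("both probabilities must be strictly between 0 and 1")
    copies, risk = 1, p_fail
    while risk > max_risk:
        copies += 1
        risk *= p_fail
    return copies


for n in (1, 2, 3):
    print(n, "component(s):", combined_failure_probability(0.5, n))

print("copies for a 0.1% risk:", copies_needed(0.5, 0.001))
```

The risk shrinks geometrically but never reaches zero, which is why the number of copies is ultimately chosen by budget rather than by mathematics.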
The reliability of the system is strongly influenced by the reliability of its servers. Hard drives are among the most failure-prone components of a server. For this reason, it is standard practice to use RAID arrays, most of which store data redundantly across several drives, either as mirrored copies or together with parity information. Depending on the RAID level, this ensures that the failure of a single drive, or in some configurations even several drives, does not disrupt the operation of the entire system. What’s more, some RAID levels can additionally speed up reading and writing data, and a failed drive can usually be replaced while the system is running, without shutting down the server.
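Keeping identical copies on several drives is the simplest form of this redundancy. Parity-based RAID levels such as RAID 5 achieve a similar effect with less overhead by storing an XOR checksum of the data blocks. The sketch below illustrates only that idea on in-memory byte strings; the "drives" and their contents are made up, and a real array also handles striping, parity rotation and much more.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if not blocks:
        raise ValueError("at least one block is required")
    if any(len(block) != len(blocks[0]) for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data drives and one parity drive.
data_drives = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_drives)

# Simulate losing the second drive and rebuilding it from the survivors.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_drives[1])
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks and the parity, but a second simultaneous loss cannot be recovered this way.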
The power supply unit is another vulnerable component of the server, so it is worth equipping the server with at least two such units. This also makes it possible to connect the server to two independent power sources, so that it does not shut down when one of the sources fails.
How do you obtain two power sources? Usually, there is one external power source, but you can and should have an alternative in the form of a UPS, which uses its batteries and electronics to maintain the power supply to your devices during power outages without any significant interruption.
The more batteries you install in a UPS device, the longer power will be available. At the same time, however, the device will cost more and take up more space. The weight of the batteries may be an additional problem: they must be arranged so that they do not exceed the permissible load capacity of the floor, otherwise they could cause structural damage to the building.
A UPS is a key device in maintaining the power supply, but in order to protect your equipment in the event of a prolonged power outage, you may need an additional solution in the form of a power generator driven by a diesel engine similar to those found in cars. Computers powered by such a generator will work as long as you have enough fuel, which can be refilled on a regular basis.
Even the best-thought-out and secured server can fail, so you should have at least two servers.
To keep your heat-emitting computers from overheating and to ensure they function well for a long time, it is essential to maintain an optimal ambient temperature. According to a commonly cited rule of thumb, a sustained rise in temperature of roughly 10 degrees Celsius can cut the life of some server components and UPS batteries in half, and thus greatly increase the risk of failure. Adequate air conditioning is needed to deliver cool, dry air to the front of the servers and to remove the warm air blown out by the fans at the back. The air conditioning system itself may malfunction, resulting in a sharp rise in temperature that causes the servers to overheat and shut down. This is why at least two independent cooling systems are required.
In order for the system to work reliably, users need a stable connection to your servers, which requires appropriate telecommunications links. Since no link will provide you with 100 percent reliability, there should be at least two links, fully independent of each other. This means that they should not have any common elements, so that the failure of one element does not disconnect your server room from the world. The shared element must not be the air either: the two links should not both be wireless radio links, because weather anomalies such as thunderstorms and heavy rain could seriously disturb all such links at once.
Cable links, in particular optical fibers, are more stable, but the two cable bundles should not run next to each other. Otherwise, one day an unsuspecting excavator operator may literally cut your system off from the world by severing a delicate fiber-optic bundle running underground, causing a link failure that lasts several hours. Surprisingly, such incidents happen often.
So it seems that adequate protection of your server can cost many times more than even a high-quality server itself. In addition, you would have to obtain information from telecommunications operators (on the construction of links and the location of cable bundles) that they usually do not want to share with clients.
In addition, you need an appropriate fire protection system that can nip any source of fire in the bud without damaging the data stored on the servers. For this reason, a standard water-based extinguishing system is not suitable. A common choice is an inert-gas installation, for example one using argon, which floods a properly sealed room with non-flammable gas so that the fire is extinguished by a lack of oxygen. Besides an expensive set of gas extinguishing devices, it is also necessary to seal the server rooms off completely and tightly from the rest of the building.
Even if you were able to meet all the above conditions, you cannot assume full system reliability, because a catastrophe that could destroy your server room despite all security measures (an earthquake, terrorist attack, flood or other cataclysm) cannot be ruled out. So what should you do to protect yourself against these types of danger? Again, you can apply the redundancy method, but this time at the level of entire server rooms, that is, by building at least two server rooms many kilometres apart, each of which meets all the conditions described above. In such a case, the costs will at least double, but theoretically no single catastrophe would be able to immobilize your system…
Securing your system by distributing it across different physical servers and locations brings additional challenges. How do you make a system distributed over different computers work in a consistent manner? How do you keep individual servers synchronized so that the user sees the same thing, regardless of which part of the infrastructure is operational and available at the moment and which is not? This is a topic for a separate series of articles… Nevertheless, to summarize it as briefly as possible, the help here comes primarily from increasingly advanced virtualization. Until recently, the most revolutionary solution was whole-server virtualization, which made it possible to run multiple virtual servers on one physical server and to move them between physical hosts. However, these types of operations were usually neither simple nor quick. Docker turned out to be another revolution. It popularized packaging applications into containers that are much smaller, lighter, and more narrowly focused on a single function than virtual servers. Containers can be quickly created, stopped and moved depending on the availability of hosts and on the current system load. All these activities can be fully automated with an orchestrator, such as Kubernetes or Docker Swarm, which controls these containers.
After reading all of the guidelines listed above, there may be one important question left: Is it possible to build a system in the manner described above, using a network of independent, well-equipped and secured data centres, and still make it profitable? In most cases, no. Not many systems in the world generate enough revenue or are significant enough to justify all these costs.
However, there is a way to significantly reduce costs while maintaining the highest standards. This solution is based on the fact that large companies that already have the appropriate infrastructure make space on their servers available to other clients and their systems. As a result, huge costs become significantly more acceptable when split between multiple clients.
Instead of having your own very expensive infrastructure, you can lease a part of someone else’s infrastructure, trusting that the company managing it maintains the stability of the platform on which your system will operate. This solution is commonly called the cloud, and I think that the above arguments explain its great popularity well. The cloud has become a standard and is used almost everywhere. Storing data in the cloud has become standard even for very sensitive applications such as personal password managers. This makes sense provided that the data is strongly encrypted on the client’s side, so that even unwanted access to the data stored on cloud servers cannot reveal the passwords.
Cloud solution providers, especially the leading ones (among them Amazon Web Services, Microsoft Azure, Google Cloud and others) undergo regular audits and have appropriate certificates confirming the quality and reliability of the solutions used and the security of the data entrusted to them. When choosing a cloud provider it is worth paying attention to their certificates, which can form the basis on which to build a level of trust.
An IT system involves not only hardware infrastructure, but also the software running on it, which is the result of the work of a team of people. Various errors made in the design or production of this software may be an important factor affecting the level of reliability. Just as hardware failures cannot be avoided, software errors cannot be fully eliminated, but here too you can use a range of tools to control and minimize the risk of error.
It is important to design and maintain the system in an appropriate and thought-out manner, according to the available good practices. They are described by standards such as ITIL, created on the basis of many years of experience, which can help you avoid costly mistakes.
The process of creating the code itself also requires a comprehensive, thoughtful approach and appropriate software development rules that improve productivity and minimize the number of errors.
Errors can occur at any stage. One of the first steps is to establish guidelines regarding the functionality of the system being built. Particularly in the case of systems modelling complex business processes, understanding the client’s requirements and planning the appropriate business logic can turn out to be a major challenge, and errors at this stage can result in a series of problems that are difficult to solve at subsequent stages. This is where Domain-Driven Design comes in handy, and in particular Event Storming, a set of techniques for acquiring knowledge from domain experts and then designing the system so that it faithfully follows the described model of business processes. The key to success is good communication between the client, analyst and programmer, so that the client’s expectations are faithfully reflected in the system code. The techniques mentioned above can really help with this.
As with the infrastructure, some sort of redundancy is also helpful in the programming process. The risk of errors is reduced when each piece of code is created and verified by more than one person. In practice, it comes down to a mandatory code review before deploying even the smallest change in the software. Each such change must be verified by a second developer, and deployment takes place only when at least two people are convinced that the code is sound and free of errors they can find.
Nevertheless, such conviction alone is insufficient, especially in the case of large systems with many, often non-obvious, dependencies. Tests are also important.
All essential system features should be covered with unit tests that verify the operation of a new piece of code and its impact on the rest of the system. It should be mandatory for such tests to be run before a patch is deployed. You can also take it a step further and apply what is known as Test-Driven Development. In this approach, you first write unit tests for the planned change, and these tests initially fail. Once the tests are in place, programming the application comes down to writing code that makes the previously created tests pass. This approach makes the development process itself slightly easier and reduces the risk of errors, provided, however, that you have prepared in advance a well-thought-out test suite of appropriate quality.
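The test-first order can be sketched with Python’s standard `unittest` module. The function `remaining_downtime_minutes` and its behaviour are invented purely for this illustration; the point is only the sequence: tests first, then the simplest code that satisfies them.

```python
import unittest


# Step 1: write the tests first. They describe the behaviour we want
# from a function that does not exist yet.
class RemainingDowntimeTest(unittest.TestCase):
    def test_no_incidents_leaves_full_budget(self):
        self.assertEqual(remaining_downtime_minutes(60, []), 60)

    def test_incidents_are_subtracted(self):
        self.assertEqual(remaining_downtime_minutes(60, [10, 20]), 30)

    def test_budget_never_goes_below_zero(self):
        self.assertEqual(remaining_downtime_minutes(60, [50, 40]), 0)


# Step 2: write just enough code to make the tests pass.
def remaining_downtime_minutes(budget_minutes, incident_minutes):
    """Return how much of the downtime budget is left, never below zero."""
    return max(0, budget_minutes - sum(incident_minutes))


# Step 3: run the suite. In a real project a test runner or CI job does this.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RemainingDowntimeTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

If step 2 is skipped, every test fails with a `NameError`, which is exactly the failing state that Test-Driven Development starts from.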
The need for testing is not limited to automated unit tests. They are followed by subsequent stages of system testing, in an increasingly broader sense: manual tests, integration tests and finally acceptance tests.
After the tests, the changes are deployed, and this also requires a proper approach that minimizes the risk of error at this stage. Deployment often involves quite complex but usually repetitive activities, so it is definitely best to automate them using the available CI/CD tools. This is how the risk of errors at this stage can be greatly reduced. Designing and configuring a well-functioning deployment process requires specific knowledge and experience in the field of DevOps.
If an error is detected at any of the stages of testing, it obviously needs fixing. If this is not feasible in a short period, it may be necessary to roll back a specific change or to trace the history of changes in the system. This is why version control is necessary, in particular Git, which is now an unrivalled standard. Version control is also essential for the development process itself, especially when several people are working simultaneously on changes to the same system component and must not interfere with each other’s work.
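The combination of automated checks, deployment and rollback described above can be sketched as a simple gate. This is not how any particular CI/CD tool works internally; `deploy_with_rollback` and the four callables passed to it are hypothetical stand-ins for the real test, deploy, monitoring and rollback steps.

```python
from typing import Callable


def deploy_with_rollback(
    run_tests: Callable[[], bool],
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Deploy only if tests pass; roll back if the new version is unhealthy."""
    if not run_tests():
        return "rejected"  # nothing was deployed
    deploy()
    if health_check():
        return "deployed"
    rollback()  # for example, redeploy the previous revision
    return "rolled back"


# Usage with stand-in callables that only record what happened.
log = []
outcome = deploy_with_rollback(
    run_tests=lambda: True,
    deploy=lambda: log.append("deploy"),
    health_check=lambda: False,  # simulate a broken release
    rollback=lambda: log.append("rollback"),
)
print(outcome, log)
```

Encoding the sequence once, rather than repeating it by hand for each release, is what removes most of the opportunity for human error at this stage.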
It is not possible to create a completely reliable system. However, you can come arbitrarily close to that ideal, although each further step can mean rapidly growing expenses. It is worth knowing how to calculate the level of reliability in individual areas of the system environment and what guidelines to choose in order to balance the tolerance level against the cost level. It is a good idea to consider the various available solutions (the cloud, Docker, software development methodologies), choose among them depending on your needs and use them wisely, being fully aware of their advantages and of the threats they may introduce. Ensuring adequate reliability is, above all, a job for a team of experienced professionals from many areas cooperating with each other. System security and reliability are complex topics covering many different areas. Each of them should be approached with due care, bearing in mind that the entire system is only as good as its weakest link.
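The weakest-link effect can be illustrated with a short calculation. The sketch assumes that every element in the chain must work for the system to work and that elements fail independently; the availability figures are invented for illustration and are not measurements.

```python
import math


def series_availability(parts) -> float:
    """Availability of a chain in which every element must work."""
    return math.prod(parts)


def parallel_availability(parts) -> float:
    """Availability of redundant elements where any single one suffices."""
    return 1.0 - math.prod(1.0 - a for a in parts)


# Illustrative figures only.
chain = {"power": 0.9999, "cooling": 0.9995, "network": 0.999, "server": 0.99}
before = series_availability(chain.values())

# Strengthen the weakest link by running two servers side by side.
improved = dict(chain, server=parallel_availability([0.99, 0.99]))
after = series_availability(improved.values())

print(f"single server: {before:.4%}")
print(f"two servers:   {after:.4%}")
```

With these numbers, the chain as a whole is less available than its worst element, and duplicating that one element improves the result far more than polishing any of the others would.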