The data center has evolved from mainframes to the current cloud model, driven by various business needs and technological constraints. Certain attributes are essential; without them, a modern data center is of little use to the business organization.
Availability: According to NARA, 93% of businesses that lost availability in their data center for 10 days or more filed for bankruptcy within one year. Likewise, when financial data provider Bloomberg went dark one morning in early April 2015, it interrupted the sale of three billion pounds of treasury bills by the United Kingdom's Debt Management Office. That data center outage was caused by a combination of hardware and software failures in the network, which led to disconnections lasting one to two hours for most customers. These instances show how serious the problem of data center downtime (unavailability) is. Downtime can cost a great deal of money and significantly affect how customers perceive a company. Reliability is the ability of a system to perform its required functions under stated conditions for a specified period of time, whereas availability is the proportion of time a system is in a functioning condition. Availability is often expressed mathematically as 100% minus unavailability.
Most of the time, one hears about 99.999% availability (often called "five 9s"). For illustration, a system with 99.999% availability is highly available, delivering its service to the user 99.999% of the time (uptime) with a total downtime of approximately five minutes and fifteen seconds per year.
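As a rough illustration, the yearly downtime budget implied by a given number of nines follows directly from the definition above. This is a minimal sketch, not tied to any particular monitoring tool:

```python
# Convert an availability percentage into its yearly downtime budget,
# using availability = uptime / (uptime + downtime).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Downtime budget in minutes per year for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)

for nines in (99.9, 99.99, 99.999, 99.9999):
    print(f"{nines}% availability -> {allowed_downtime_minutes(nines):.2f} min/year")
# 99.999% -> ~5.26 min/year, i.e. about five minutes and fifteen seconds
```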
The goal for many companies is 99.9999% availability, but with each nine added, costs can increase greatly. Moving from one level to the next can involve anything from redundant servers to redundant storage frames or even duplicate data centers, and the journey to 99.9999% uptime can cost anywhere from thousands to millions of dollars. The decision to pursue this level of uptime should therefore not be an IT decision, but a business decision.
Availability in the Modern Data Center
So what does it take to maintain high levels of uptime in the data center? Is buying highly available infrastructure enough? It turns out not, and here is why. Reputable studies have shown that roughly 75% of downtime is the result of some sort of human error, with the rest due to equipment or software failure. Even well-trained IT people make mistakes when they are in a rush, tired, not really thinking, or simply taking a shortcut. With ever-growing data center complexity, it is impossible to prevent every human error or equipment failure that could lead to an outage. The question in front of us is whether our data centers are really as resilient as desired, or whether occasional outages are just a fact of life.
The fundamental problem stems from applications depending on 100% infrastructure availability. Imagine instead that application designers could relax their infrastructure availability requirements and design with the idea that outages are normal in the data center. Embracing failure gives us true application resiliency, because failure protection is no longer an infrastructure problem alone. This shift in thinking led Google to build the Google File System back in 2003, a distributed file system for its data centers designed with component failures in mind.
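As a rough illustration of this philosophy, consider an application-level read that treats a failed replica as a normal event and simply moves on to another copy. This is only a hedged sketch: `read_block`, `ReplicaUnavailable`, and the replica list are hypothetical names, not the actual GFS interfaces.

```python
import random

class ReplicaUnavailable(Exception):
    """A single replica could not serve the request (an expected event)."""

def read_block(replica: str, block_id: str) -> bytes:
    # Hypothetical stand-in for an RPC to a storage node; fails randomly
    # to simulate the assumption that component failure is normal.
    if random.random() < 0.3:
        raise ReplicaUnavailable(replica)
    return f"data for {block_id} from {replica}".encode()

def resilient_read(replicas: list[str], block_id: str) -> bytes:
    """Try replicas in random order; fail only if every replica is down."""
    for replica in random.sample(replicas, k=len(replicas)):
        try:
            return read_block(replica, block_id)
        except ReplicaUnavailable:
            continue  # tolerate the individual failure and move on
    raise RuntimeError(f"all replicas failed for block {block_id}")

print(resilient_read(["node-a", "node-b", "node-c"], "blk-42"))
```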
Relaxing the availability requirements placed on the infrastructure means one no longer needs high-end, costly systems. Continued innovation in CPUs (multicore) has made them more powerful, terabyte-capacity disks are commonplace nowadays (lower $/GB), and networks have become much faster (10/40 GbE). Thus, today a commodity server packs the power of a mainframe into a 1U or 2U form factor at a fraction of a mainframe's cost. These modular commodity servers, packed with redundant components for availability and backed by a better supply chain, are well suited for scaling data center capacity on demand.
In summary, the big shift in the data center is in how availability is viewed from the application's point of view. Today's applications are distributed, designed with failure in mind, and can scale to 1000+ nodes on commodity servers. This is apparent with Netflix and its Chaos Monkey engineering group. After facing a massive reboot of its application instances in the cloud, Netflix built a practice of repeatedly and regularly exercising failures in its distributed applications, continually testing and correcting issues before they can create widespread outages; in effect, Netflix has created a service designed with failure in mind to ensure availability at lower cost.
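The core idea behind such failure injection can be expressed in a few lines. The sketch below is hypothetical (the fleet list and `terminate` callback are assumptions), not Netflix's actual Chaos Monkey implementation:

```python
import random

def chaos_step(instances: list[str], terminate) -> str:
    """Pick one running instance at random and terminate it, so that any
    weakness in the application's failure handling surfaces early."""
    victim = random.choice(instances)
    terminate(victim)
    return victim

# Usage sketch: run regularly (e.g. during business hours) so engineers can
# observe and fix any resulting degradation before it becomes a real outage.
fleet = ["web-1", "web-2", "web-3"]
killed = chaos_step(fleet, terminate=lambda name: print(f"terminating {name}"))
print(f"injected failure into {killed}; the service should stay available")
```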
Agility: A simple way to measure the agility of an organization is to assess how fast it can respond to changing business circumstances. For the data center, it means how fast a new application deployment request can be fulfilled, whether by buying, building, or repurposing existing IT infrastructure. For example, by adopting an agile IT infrastructure, PayPal was able to execute product cycles seven times faster than a year earlier, whereas previously it took 100 tickets and three weeks to provision new servers.
Traditionally, IT managers are tasked with planning capacity requirements ahead of time to avoid unplanned downtime, procurement delays, and similar overheads, so that IT staff can concentrate on developing new applications that bring new business to the organization. Capacity planning usually has the following steps:
- Determine the SLAs required by the business
- Analyze how the current infrastructure is meeting those SLAs
- Project future capacity requirements through modeling
There is always a risk of underestimating future requirements, so the model includes headroom capacity for unplanned demand. In reality, most of the time the allocated capacity is higher than what is actually required (Figure 1), resulting in wasted capacity and money spent on unused IT. In a nutshell, such capacity planning usually ends up spending more than necessary, and in the event of business changes it becomes a huge task for IT to repurpose the existing infrastructure, which can sometimes also lead to undersupply.
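To make the over-provisioning pattern concrete, here is a simple projection sketch; the growth rate and headroom figures are illustrative assumptions, not numbers from the text:

```python
def projected_capacity(current_units: float, annual_growth: float,
                       years: int, headroom: float = 0.3) -> float:
    """Forward-project required capacity, then add headroom for unplanned demand."""
    forecast = current_units * (1 + annual_growth) ** years
    return forecast * (1 + headroom)

# Example: 100 servers today, 20% expected annual growth, 3-year horizon.
needed = 100 * (1 + 0.2) ** 3                  # ~173 servers actually required
provisioned = projected_capacity(100, 0.2, 3)  # ~225 servers bought up front
print(f"needed ~{needed:.0f}, provisioned ~{provisioned:.0f}: "
      f"~{provisioned - needed:.0f} servers sit idle if the forecast holds")
```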
Agility in the Modern Data Center
The advent of distributed/decentralized systems, with their ability to scale on demand in small increments to thousands of nodes on commodity hardware, has made capacity planning a thing of the past. Distributed systems provide the ability to start small and then grow at the pace of organizational growth, leading to "pay as you grow" services. This is the basis on which Software as a Service (SaaS) cloud computing is offered. The distributed architecture enables deployments to grow quickly, shrink, or be repurposed for something else in just a few clicks.
Today, many distributed applications (e.g., Hadoop, Spark, MongoDB, and Cassandra) are churning through big data to produce actionable business value for organizations. The need of the hour is a data center that can scale to these application demands seamlessly. Apache Mesos is one such framework: it addresses the static-partitioning problem among distributed applications by providing an API for dynamic sharing of cluster resources.
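Conceptually, the dynamic-sharing idea works roughly as follows. This is a highly simplified, hypothetical sketch of a resource-offer model, not the actual Mesos scheduler API:

```python
from dataclasses import dataclass

@dataclass
class Offer:
    node: str
    cpus: float
    mem_gb: float

class Framework:
    """A toy framework (e.g. an analytics or database scheduler) that accepts
    offers until its demand is met."""
    def __init__(self, name: str, cpus_needed: float):
        self.name, self.cpus_needed = name, cpus_needed

    def consider(self, offer: Offer) -> bool:
        if self.cpus_needed > 0:
            self.cpus_needed -= offer.cpus
            print(f"{self.name} launched tasks on {offer.node} ({offer.cpus} cpus)")
            return True
        return False  # declined offers stay in the shared pool

def share_resources(offers: list[Offer], frameworks: list[Framework]) -> None:
    """Offer each node's free resources to frameworks in turn instead of
    statically partitioning the cluster per application."""
    for offer in offers:
        for fw in frameworks:
            if fw.consider(offer):
                break

share_resources(
    [Offer("node-1", 4, 16), Offer("node-2", 4, 16), Offer("node-3", 4, 16)],
    [Framework("spark", 4), Framework("cassandra", 8)],
)
```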
In summary, going forward, distributed applications and commodity hardware will dominate the data center, providing organizations the much-needed agility to respond quickly to changing business requirements, all in just a few clicks.