Investigating the Root Causes of Downtime

In today’s fast-paced world, IT infrastructure and data centers have evolved into strategic assets for businesses. These assets shape overall business performance and customer satisfaction. The core function of a data center is to provide uninterrupted uptime for the mission-critical applications it hosts. At the same time, data operations are becoming more critical and more complex with every passing day, which has increased both the direct financial cost and the indirect cost of downtime.

Unplanned data center outages not only disrupt business operations but also result in lost revenue, user productivity, IT throughput, and customer satisfaction. According to a recent report by Gartner, businesses typically experience over 87 hours of downtime every year, at a cost exceeding $3.6 million.
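
To put those figures in perspective, here is a minimal Python sketch that converts annual downtime hours into an availability percentage and an average cost per hour. The function names are illustrative, and the calculation assumes a non-leap year of 8,760 hours and treats the quoted 87 hours and $3.6 million as exact values rather than lower bounds.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def availability_percent(downtime_hours: float) -> float:
    """Share of the year the service was up, as a percentage."""
    return 100.0 * (1.0 - downtime_hours / HOURS_PER_YEAR)


def cost_per_downtime_hour(annual_loss: float, downtime_hours: float) -> float:
    """Average loss for each hour of downtime."""
    return annual_loss / downtime_hours


if __name__ == "__main__":
    # Figures quoted in the article, treated here as exact values.
    print(f"Availability: {availability_percent(87):.2f}%")
    print(f"Cost per hour: ${cost_per_downtime_hour(3_600_000, 87):,.0f}")
```

Seen this way, 87 hours of downtime corresponds to roughly 99% availability, which shows how much room is left between "two nines" and the uptime targets most businesses aim for.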

What factors cause downtime?

Downtime can occur for many reasons. Let’s take an in-depth look at the root causes of data center downtime.

Human Error

As weird as it may seem, human error is a huge contributor to data center downtime. They say “To err is human,” and they’re right: however simple the task, people make mistakes. Ponemon’s most recent survey, from 2013, shows that over 22% of data center outages stem from human error. Businesses should not turn a blind eye to this, because human error can be minimized through training and equipment modifications. Equipment modifications include shielding and clearer labeling of emergency shut-off buttons. Staff must be trained to respect the environment of a data center floor. Access to data center floors should also be restricted and logged to avoid data breaches or equipment damage.

UPS Failure

The most common reason for unplanned data center outages is UPS failure. During a power outage, the UPS keeps the data center running by drawing power from backup batteries. These batteries are rarely checked for their health, which leads to outages, and higher operating temperatures can shorten battery life. A UPS can also fail when the power drawn exceeds its capacity, or because of battery or equipment failure. Furthermore, businesses sometimes select UPS designs whose capacity falls below the required IT load. UPS and battery health should therefore be checked regularly to ensure that UPS performance is in line with data center needs.
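
As a rough illustration of the sizing check described above, the following Python sketch flags any UPS whose IT load is close to or above its rated capacity. The `Ups` class, its field names, and the 80% utilization threshold are assumptions made for this example, not values taken from any vendor or standard.

```python
from dataclasses import dataclass


@dataclass
class Ups:
    name: str
    rated_kw: float  # capacity the UPS is rated to supply
    load_kw: float   # IT load currently attached to it

    def utilization(self) -> float:
        """Fraction of rated capacity in use (above 1.0 means overloaded)."""
        return self.load_kw / self.rated_kw


def flag_overloaded(units, max_utilization=0.8):
    """Return the names of units running above the chosen threshold.

    The 0.8 default is an illustrative safety margin, not a standard.
    """
    return [u.name for u in units if u.utilization() > max_utilization]


if __name__ == "__main__":
    fleet = [Ups("ups-a", 100, 60), Ups("ups-b", 100, 95), Ups("ups-c", 50, 55)]
    print(flag_overloaded(fleet))
```

A check like this only covers capacity. Battery health still needs its own regular inspection, since a correctly sized UPS with a degraded battery can fail just the same.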

Cyber Crimes

Cyber attacks have jumped up the ranks to 22% of outages, becoming a major contributor to data center downtime. While some protection is offered at layers 3 and 4 of the network, companies can add extra defense at layer 7 against HTTP GET floods and similar attacks, and they can also ensure that their compliance certificates are kept up to date. More protection can be added by combining a firewall, IPS/IDS, and DDoS mitigation services. Another solution is to automate the data center’s security management so that attacks are detected and prevented early, thereby reducing unplanned outages.
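
One common layer 7 defense against HTTP GET floods is per-client rate limiting. The Python sketch below shows the idea with a simple sliding-window counter. The class name, the limits, and the use of a client identifier such as an IP address are all assumptions for illustration; a production setup would normally rely on a dedicated WAF or DDoS mitigation service rather than hand-written code.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or challenge the request
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, limiter.allow("203.0.113.7", now=t))
```

Passing the timestamp in explicitly keeps the sketch easy to test. In a real request handler it would come from a monotonic clock.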

Overheating Issues

As workload densities rise, computer room air conditioning (CRAC) failures have also increased. CRAC cooling systems were not originally designed for today’s very high data center densities. As a solution, N+1 redundancy can be used to establish good load management and mitigate water, heat, or CRAC failures. Businesses can also use chemical refrigerants instead of water-based systems if the data center design has rack-level cooling with an immersion element.
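
N+1 redundancy means the remaining cooling units can still carry the full heat load if any single unit fails. The Python sketch below checks that condition for a list of unit capacities. The function name and the kW figures are hypothetical and serve only to illustrate the rule.

```python
def survives_single_failure(unit_capacities_kw, heat_load_kw):
    """Return True if the heat load is still covered after losing any one unit.

    Losing the largest unit is the worst case, so it is enough to check
    the total capacity minus the largest single capacity.
    """
    if not unit_capacities_kw:
        return False  # no cooling units at all
    remaining = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining >= heat_load_kw


if __name__ == "__main__":
    # Hypothetical room: three 30 kW CRAC units.
    print(survives_single_failure([30, 30, 30], heat_load_kw=60))
    print(survives_single_failure([30, 30, 30], heat_load_kw=70))
```

With three 30 kW units, a 60 kW load is covered after one failure, while a 70 kW load is not, even though all three units together could handle it.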

Combat the Root Causes

Every business should strive for 100% uptime to maximize revenue and customer satisfaction. Businesses that want to minimize downtime, human error, financial losses, and security threats can invest in a HyperConverged Appliance (HCA), which relies on automation and reduces the need for staff training. Built on Dell OEM servers and a 100% software-defined hyper-converged infrastructure, HCA takes over the responsibility of picking the right hardware and software, migrating applications, and integrating the appliance into the data center. Using just a single onsite node and HCA ProActive Support, the appliance monitors the cluster 24/7 for failure prediction and prevention.