Managing a cloud system properly entails numerous tasks: performance monitoring, response-time and latency tracking, uptime management, security, compliance, and disaster recovery. Together, these tasks form a comprehensive strategy that accounts for the scenarios an operation is likely to face.
In today’s demanding business landscape, running a 24/7 network operations center (NOC) for cloud operations has become mandatory, as has maintaining high availability of network services. These two components are primary concerns for all enterprises. If a system, network or application experiences any form of downtime, the result can be a significant loss in business revenue.
The 24/7 NOC team should analyze online business availability requirements and make sure that a series of optimized, cost-effective contingency plans are in place and tested regularly, so that the online service is robust and reflects business needs. For example, Internet ecommerce systems may have a near-zero Recovery Time Objective (RTO), whereas less critical, non-customer-facing applications may tolerate a recovery time of a few days and can therefore be provisioned on less expensive cloud infrastructure with limited redundancy.
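To make the RTO comparison concrete, here is a minimal sketch of checking recovery-drill results against per-service RTO targets. The service names, target values, and drill timings are all hypothetical and chosen only for illustration.

```python
from datetime import timedelta

# Hypothetical RTO targets; real values come from business requirements.
RTO_TARGETS = {
    "ecommerce-storefront": timedelta(minutes=5),   # customer-facing, near-zero RTO
    "internal-reporting": timedelta(days=2),        # non-customer-facing, relaxed RTO
}


def meets_rto(service: str, measured_recovery: timedelta) -> bool:
    """Return True if a recovery drill finished within the service's RTO."""
    return measured_recovery <= RTO_TARGETS[service]


# Made-up results from a recovery drill.
drill_results = {
    "ecommerce-storefront": timedelta(minutes=12),
    "internal-reporting": timedelta(hours=20),
}

# Services whose contingency plan needs rework before the next drill.
failing = [name for name, took in drill_results.items() if not meets_rto(name, took)]
print(failing)
```

Running such a check after every drill is one way to confirm that contingency plans are not only in place but actually tested against the stated objectives.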
Based on past experience, there are three major factors to consider when pursuing effective uptime management. When combined, these factors form the basis of a successful network operation in the cloud.
The three factors for successful uptime management are:
24/7 NOC & Real-Time Monitoring
Effective uptime management cannot be achieved without a well-functioning 24/7 NOC team. The NOC team controls the network operation and manages the entire application and service operation. In other words, the NOC team offers a broad, overarching analysis of the entire system operation as well as accurate, business-oriented monitoring conducted in real time.
The NOC team relies on a comprehensive monitoring platform, which should be matched to the operation's specific needs. Together, the monitoring platform and the NOC team form an ideal combination for averting emerging issues. This ‘team’ also provides a unified view of the entire IT operation in real time. The monitoring should be carried out in a centralized dashboard covering the following operational aspects: bare-metal, network, SLA, application and QoS. This monitoring facilitates a smooth service workflow and serves as an effective tool for critical decision making.
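As a rough illustration of a unified view, the sketch below rolls the five monitored aspects up into a single status. The `Status` levels, function names, and sample readings are assumptions for the example and do not describe any particular monitoring platform.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


# The five aspects named in the text; the readings are made-up sample data.
layer_status = {
    "bare-metal": Status.OK,
    "network": Status.WARNING,
    "SLA": Status.OK,
    "application": Status.OK,
    "QoS": Status.OK,
}


def overall_status(layers: dict) -> Status:
    """The unified view is only as healthy as its worst layer."""
    return max(layers.values())


def layers_needing_attention(layers: dict) -> list:
    """List the layers that are not OK, sorted by name."""
    return sorted(name for name, status in layers.items() if status != Status.OK)


print(overall_status(layer_status).name, layers_needing_attention(layer_status))
```

Taking the worst layer as the overall status is a deliberately conservative choice; a real dashboard might weight layers by business impact instead.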
Uptime Management Protocol – Runbook Management
With the KPIs defined, an uptime management protocol should be the next stage in the cloud service strategy. The protocol constitutes the basis for run-book management. The run-book sets out the appropriate actions for each anticipated scenario, enabling a real-time response with minimal downtime. An important aspect of run-book management is that it should remain dynamic and open to change as previous events offer new insights. Additionally, the protocol should be applied proactively within the enterprise to head off potential problems before they become severe.
The run-book process captures otherwise undocumented knowledge gathered by team members and turns it into up-to-date event documentation, reducing overall dependency on specific individuals within an organization.
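One way to picture a dynamic run-book is as a lookup from event type to documented steps that can be amended after each incident. The following is a minimal sketch under that assumption; the event name, steps, and helper functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunbookEntry:
    steps: list[str]
    revisions: list[str] = field(default_factory=list)  # notes on why the entry changed


# Hypothetical starting run-book with a single documented scenario.
runbook: dict[str, RunbookEntry] = {
    "disk_full": RunbookEntry(
        steps=["Rotate and compress old logs", "Confirm free space has recovered"]
    ),
}


def steps_for(event_type: str) -> Optional[list[str]]:
    """Return the documented steps, or None if the scenario is not covered yet."""
    entry = runbook.get(event_type)
    return entry.steps if entry is not None else None


def record_lesson(event_type: str, new_step: str, note: str) -> None:
    """Add a step learned from a past event, creating the entry if needed."""
    entry = runbook.setdefault(event_type, RunbookEntry(steps=[]))
    entry.steps.append(new_step)
    entry.revisions.append(note)


record_lesson("disk_full", "Check for runaway debug logging", "Added after an incident review")
print(steps_for("disk_full"))
```

Keeping a revision note next to each change records why a step exists, which helps the documentation stay useful after the people who wrote it move on.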
Tier 1+2 Services
Tier 1 – The NOC team responds to initial errors and should be able to handle up to 80% of support requests. The Tier 1 NOC team should handle all events according to the pre-defined run-book protocol. When an event needs to be escalated, the Tier 2 team should be notified. An essential capability of the Tier 1 NOC team is recording all events and actions so that the uptime protocol can be updated.
Tier 2 – This level of support consists of DevOps engineers who are equipped with the tools to investigate and solve more complex events. Their expertise enables them to check logs and take action to resolve configuration issues. To reduce similar escalations in the future, these engineers should report their findings back to the internal NOC team.
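The split between the two tiers can be sketched as a simple routing rule: Tier 1 follows the run-book when the event is covered and escalates to Tier 2 otherwise, logging every event either way. The event names and run-book contents below are hypothetical.

```python
# Hypothetical run-book: event type -> documented steps.
RUNBOOK = {
    "service_unresponsive": ["Restart the service", "Confirm the health check passes"],
}

# Every event and action is recorded so the run-book can be updated later.
event_log = []


def handle_event(event_type: str) -> str:
    """Return which tier ends up owning the event and log what happened."""
    steps = RUNBOOK.get(event_type)
    if steps is None:
        # Not covered by the run-book: notify Tier 2.
        event_log.append((event_type, "escalated to Tier 2"))
        return "tier2"
    event_log.append((event_type, "Tier 1 followed run-book: " + "; ".join(steps)))
    return "tier1"


print(handle_event("service_unresponsive"))
print(handle_event("config_drift"))
```

Once Tier 2 reports back on an escalated event, adding its resolution to the run-book lets Tier 1 handle the same scenario directly next time.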