While your regular job may allow you to make your snooze button your best friend, what would you say to a job that requires 6-8-hour shifts (probably night shifts too) and demands that you be on call 24/7? Welcome to the life of a NOC engineer.
A NOC (network operations center) engineer supervises a business's servers and networks and monitors them for internal and external issues. NOC engineers perform the daily functions that keep an IT department going and manage its IT services. Typically, they investigate events as they occur, troubleshoot and respond to incidents, and communicate with the teams or third parties responsible for resolving them. They are also responsible for tracking every issue until it has been resolved.
As we have mentioned, you’re going to be working 6-8-hour shifts, and some of those may not be in the daytime. Sometimes you know you’re in for a few busy days, for example when there’s a new product or service launch, or an event that may cause issues. Unfortunately, cloud issues can be caused by a variety of unpredictable external factors, ranging from unexpected traffic spikes to malicious attacks. This makes it hard to guess when an issue may arise, so your department needs to stay alert and have engineers on hand.
Using so many monitoring solutions wreaks havoc on the technology stack. The same tools implemented to help become a burden that drowns the team in information. Here are just a few of the challenges this problem presents.
You start your day with a basic system check to make sure all systems are running at peak efficiency, and to catch any issues and resolve them before they can affect customers.
In some cases, engineers start their shift with a health check of all cloud services using the NOC’s monitoring tools. Other engineers prefer to start their shift by reviewing any issues or updates from the previous shift, in case something requires their immediate attention.
If an issue arises, it’s your responsibility to make sure it’s resolved, either by reporting and escalating it to the responsible party or by handling it yourself. Having a runbook, especially an automated runbook, makes this entire process of escalation and resolution much easier.
A majority of your time on shift, even if you’re working nights, is going to be spent dealing with issues and responding to tickets and requests from employees in other departments within the organization. Most minor issues will be resolved within an hour or two at most, though system-wide interruptions can shut down services for several hours.
You should do your best to answer these requests as soon as possible and let your customer know that a NOC engineer is available and attempting to resolve the issue. Any customer calls or tickets should receive some response within minutes.
You also have the option of opening a ticket yourself, after locating the potential issue with the server and client. If you find that the customer has connectivity problems, a VPN issue, or a hard drive that’s getting a little too full before the customer has either found or reported the issue, you may send a ticket to the customer letting them know.
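A proactive check like the "hard drive that's getting a little too full" case can be sketched in a few lines of Python. This is only an illustration: the 90% threshold, the host name, and the function names `usage_fraction` and `disk_ticket_message` are assumptions made for the example, not part of any particular NOC tool or ticketing system.

```python
import shutil
from typing import Optional


def usage_fraction(used_bytes: int, total_bytes: int) -> float:
    """Return the fraction of the disk that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def disk_ticket_message(host: str, used_bytes: int, total_bytes: int,
                        threshold: float = 0.90) -> Optional[str]:
    """Return ticket text if usage is at or above the threshold, else None.

    The 0.90 default is an assumed value; a real team would pick its own.
    """
    fraction = usage_fraction(used_bytes, total_bytes)
    if fraction < threshold:
        return None
    return (f"[{host}] disk is {fraction:.0%} full "
            f"(threshold {threshold:.0%}); please free space or expand the volume.")


if __name__ == "__main__":
    # Check the local root filesystem; prints None when usage is below the threshold.
    usage = shutil.disk_usage("/")
    print(disk_ticket_message("localhost", usage.used, usage.total))
```

Keeping the threshold decision separate from the disk measurement means the same function could be fed numbers from any monitoring source, not only the local machine.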
One of the most important aspects of your job as a NOC engineer is to care for and maintain the data center equipment. A significant chunk of your shift will be dedicated to the regular maintenance of the servers, including performing scans in case of infection and installing patch updates.
Ensuring your VMs are patched and up to date reduces their risk of infection or malfunction and optimizes their performance. Patches can affect each VM differently, so it’s important to judge case by case whether a patch installation also requires a component update.
Continuous troubleshooting and network supervision are crucial to avoid any issues or crises that could impact operational efficiency.
Other duties of the NOC engineer may include compiling the NOC daily report, which summarizes any crises, issues, or anomalies encountered in the last 24 hours, or performing routine maintenance such as installing new servers or dusting a series of cables.
Whether you’ve found an issue before it’s come to anyone’s attention or you’re dealing with a full-blown disaster, these problems need to be dealt with quickly. Generally, companies that maintain a runbook have an easier time catching issues early on and solving them quickly, especially if runbook automation is implemented.
A runbook allows you to look back and see if there is a precedent for the issue you’re dealing with. It records previous occurrences and how they were solved, and it lets you analyze the factors that caused the issue so you can anticipate a recurrence. This is critical information you can use to implement your solution quickly, saving everyone time and leaving your schedule free to work on more important things.
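To make the idea of looking up a precedent concrete, here is a minimal sketch of a runbook as a searchable list of past incidents. The `RunbookEntry` fields, the keyword-overlap matching, and the sample entries are all assumptions chosen for illustration; real runbook tools store and match incidents in their own ways.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunbookEntry:
    """One previously seen issue and how it was resolved (hypothetical schema)."""
    title: str
    symptoms: set[str]   # lowercase keywords describing the issue
    resolution: str


def find_precedents(runbook: list[RunbookEntry],
                    observed: set[str]) -> list[RunbookEntry]:
    """Return entries sharing at least one symptom, best match first."""
    observed_lower = {s.lower() for s in observed}
    scored = []
    for entry in runbook:
        overlap = len(entry.symptoms & observed_lower)
        if overlap:
            scored.append((overlap, entry))
    # Stable sort: entries with equal overlap keep their runbook order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


if __name__ == "__main__":
    runbook = [
        RunbookEntry("VPN tunnel drops", {"vpn", "timeout"},
                     "Restart the tunnel and verify the gateway certificate."),
        RunbookEntry("Disk full on app server", {"disk", "timeout"},
                     "Rotate logs and expand the volume."),
    ]
    for match in find_precedents(runbook, {"VPN", "timeout"}):
        print(match.title, "->", match.resolution)
```

Ranking by the number of shared symptoms is the simplest possible matching rule; it is used here only to show the lookup step, not as a recommendation.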
On occasion you may find a problem you can’t solve. It is then your responsibility as the NOC engineer to remain in communication with whichever third party is handling the issue, and continue to monitor it until the issue is resolved. Once it is, your runbook should be updated for future occurrences.
It’s not easy to be a NOC engineer; it’s a demanding job, requiring close attention to detail and a willingness to work unusual hours and be on call 24/7. You could even say it will turn you NOCturnal, but it’s fulfilling and intellectually stimulating. A NOC engineer keeps the whole system running. Despite the demands of the job, there are ways to ease the burden, such as using runbooks to save engineers time searching for solutions.
Reach out to us today to see how we can help you monitor your cloud systems more effectively.