IT Monitoring Clickbait
Here are a few common alerting problems, along with the reasons they often crop up and how to solve them
May. 1, 2016 12:45 PM
It is a sad but very real truth that many, dare I say most, IT professionals consider alerts to be the bane of their existence. After all, they're annoying, noisy, mostly useless and frequently false. Those of us who specialize in IT monitoring are therefore well acquainted with the sinking feeling that comes from discovering that an alert we so painstakingly crafted is being ignored by the team that receives it.
In that moment of professional heartbreak, you may have considered changing those alerts to make them more eye-catching, more interesting and more urgent. To achieve this, you might have even considered choosing from a menu of possible alert messages. For example:
- Snarky: Hey server team! Do you even read these alerts anymore?!?
- Hyperbolic: DANGER, WILL ROBINSON! Router will EXPLODE in 5 minutes!
- Sympathetic (or just pathetic): Hey, I'm the IIS server and it just got really dark and cold in here. Can someone come turn the lights back on? I'm afraid of the dark.
Or you may have considered going the clickbait route. For example:
- This server's response time dropped below 75 percent. You won't believe what happened next!
- We showed these sysadmins the cluster failure at 2:15 a.m. Their reactions were priceless.
- You swore you would never restart this service. What happened at 2:15 a.m. will change your mind forever.
- Three naughty long-running queries you never hear about.
- Hot, Hotter, Hottest! This wireless heat map reveals the Wi-Fi dead zones your access points are trying to hide!
- Watch what happens when this VM ends up next to a noisy neighbor. The results will shock you!
While all of the above approaches are interesting to say the least, they, of course, miss the larger point: it is deceptively difficult to craft an alert that is meaningful, informative and actionable. To combat this issue and ensure teams are poring over your alerts in the future without needing tantrums, gimmicks or bribery, here are a few common alerting problems, along with the reasons they often crop up and how to solve them.
Problem: Multiple alerts (and tickets) for the same issue, every few minutes
This issue is called "sawtoothing." It describes a situation in which a particular incident or condition happens, then resolves, then happens again, and so on, and your monitoring system creates a new alert each time.
To solve this, first understand that some sawtoothing is an indication of a real problem that needs to be fixed, such as a device that is repeatedly rebooting. But usually it happens because a device is "riding the edge" of a trigger threshold; for example, a CPU alert is set to trigger at 90 percent, and the device is hovering between 88 and 92 percent.
There are a few common approaches to solving the issue:
- Set a time-based delay in the alert trigger so that the device has to be over a certain percentage CPU for more than a pre-set number of minutes. Now, the alert will only find devices that are consistently and continuously over the limit.
- Use the reset option built into any good monitoring solution and set it lower than the trigger value. For example, set the trigger when CPU is over 90 percent for 10 minutes, but only reset the alert when it's under 80 percent for 20 minutes. The gap between the trigger and reset values means a device has to be demonstrably stable before the alert clears and can fire again.
- Use the ticket system API to create two-way communication between the monitoring solution and the ticket system that ensures a new ticket cannot be opened if there is already an existing ticket for a specific problem on that device.
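The first two approaches, a time-based delay plus a lower reset value, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, field names and the one-sample-per-minute polling interval are all hypothetical, and a real monitoring product would expose the same idea through its own alert settings rather than code like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HysteresisAlert:
    """Sustained trigger with a lower, sustained reset (all names hypothetical)."""
    trigger_pct: float = 90.0   # fire when CPU stays above this...
    trigger_minutes: int = 10   # ...for this many consecutive samples
    reset_pct: float = 80.0     # clear only when CPU stays below this...
    reset_minutes: int = 20     # ...for this many consecutive samples
    active: bool = False
    _over: int = 0              # consecutive samples above trigger_pct
    _under: int = 0             # consecutive samples below reset_pct

    def observe(self, cpu_pct: float) -> Optional[str]:
        """Feed one per-minute sample; return "TRIGGER", "RESET" or None."""
        # Any sample on the wrong side of a limit restarts that counter.
        self._over = self._over + 1 if cpu_pct > self.trigger_pct else 0
        self._under = self._under + 1 if cpu_pct < self.reset_pct else 0

        if not self.active and self._over >= self.trigger_minutes:
            self.active = True
            return "TRIGGER"
        if self.active and self._under >= self.reset_minutes:
            self.active = False
            return "RESET"
        return None
```

With these settings, a device bouncing between 88 and 92 percent never produces an alert, because it never stays over 90 percent for 10 samples in a row. Once an alert does fire, it stays open until the device has spent 20 consecutive samples under 80 percent, so a single alert (and a single ticket) covers the whole episode.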
Problem: A key device goes down - for example, the edge router at a remote site - and the team gets clobbered with alerts for every other device at the site
If the monitoring system can no longer see a particular device, it will often report that device as "down." However, that doesn't necessarily mean the device itself has failed; a device upstream could be down, in which case nothing beyond it can be monitored until it comes back up.
Any worthwhile monitoring solution will have an option to suppress alerts based on "upstream" or "parent-child" connections. Make sure this option is enabled and that the monitoring solution understands the device dependencies in your environment.
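To make the idea concrete, here is a rough Python sketch of parent-child suppression. It assumes the dependency map is a simple device-to-parent dictionary and that the poller hands us the set of devices it currently cannot reach; the function name and the device names are made up for illustration, and real tools typically build this map for you through discovery or topology settings.

```python
def alerts_to_send(down: set, parent_of: dict) -> set:
    """Return only the unreachable devices that have no unreachable ancestor.

    down:      names of devices the poller currently cannot reach
    parent_of: maps each device to its upstream (parent) device
    """
    def has_down_ancestor(device: str) -> bool:
        seen = set()  # guards against an accidental loop in the map
        parent = parent_of.get(device)
        while parent is not None and parent not in seen:
            if parent in down:
                return True
            seen.add(parent)
            parent = parent_of.get(parent)
        return False

    return {device for device in down if not has_down_ancestor(device)}


# Hypothetical remote site: everything sits behind the edge router.
parent_of = {
    "site-switch": "edge-router",
    "file-server": "site-switch",
    "printer": "site-switch",
}
unreachable = {"edge-router", "site-switch", "file-server", "printer"}
print(alerts_to_send(unreachable, parent_of))  # only the edge router remains
```

The team gets one alert for the edge router instead of four. If only the printer were unreachable, it would still alert on its own, because nothing upstream of it is down.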
Problem: You have to set up multitudes of alerts because each machine is slightly different
You may find yourself having to set up the same general alert (CPU utilization, disk full, application down, etc.) for an ungodly number of devices because each machine requires a slightly different threshold, timing, recipient or other element.
We monitoring engineers find ourselves in this situation when we (or the tool we're using) don't leverage custom fields. In other words, any sophisticated monitoring solution should allow for custom properties for things like "CPU_Critical_Value." This is set on a per-device basis, so that an alert goes from looking like this, "Alert when CPU % Utilization is >= 90%," to this, "Alert when CPU % Utilization is >= CPU_Critical_Value."
This solution allows each system to have its own customized threshold, but a single alert can handle it all. The same technique can be used for alert recipients. Instead of having separate but identical CPU alerts for the server, network and storage teams, each device can have a custom field called "Owner_Group_Email" that holds the address of an email group. Then you create a single alert that is sent to whatever is in that field.
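As a small illustration, the sketch below evaluates one alert definition against per-device custom properties. The property names "CPU_Critical_Value" and "Owner_Group_Email" come from the text above; the device names, email addresses, dictionary layout and function name are assumptions made up for the example.

```python
from typing import Optional

# Hypothetical inventory: each device carries its own custom properties.
DEVICES = {
    "db01": {"CPU_Critical_Value": 95, "Owner_Group_Email": "dba-team@example.com"},
    "web01": {"CPU_Critical_Value": 80, "Owner_Group_Email": "server-team@example.com"},
}


def cpu_alert(device: str, cpu_pct: float, devices: dict = DEVICES) -> Optional[tuple]:
    """Return (recipient, message) if the device is at or over its own threshold."""
    props = devices[device]
    threshold = props["CPU_Critical_Value"]
    if cpu_pct >= threshold:
        message = f"{device}: CPU at {cpu_pct}% (critical value {threshold}%)"
        return props["Owner_Group_Email"], message
    return None


print(cpu_alert("web01", 85))  # goes to the server team: 85 >= 80
print(cpu_alert("db01", 85))   # None: db01 is allowed to run up to 95
```

The alert logic is written once. Changing a threshold or rerouting a device to a different team means editing a property on that device, not cloning the alert.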
Problem: Certain devices trigger at certain times because the work they're doing causes them to "run hot"
During the normal course of business, some systems have periods of high utilization that are completely normal, but also completely above the regular run rate. This could be due to month-end report processing; code compile sequences overnight or on the weekend; or any other cyclical, predictable operation.
The problem here is that the normal threshold for the condition in question is fine most of the time, but the "high usage" value is above it, so an alert triggers during every busy period. And if you raise the threshold for that system to the "high usage" level, you will miss real issues that occur during normal periods, because they never climb that high.
Rather than triggering a threshold on a set value - even if it is set per device as described earlier - you can use the monitoring data to your advantage. Remember, monitoring is not an alert or page, nor is it a blinky dot on a screen. Monitoring is nothing more (or less) than the regular, steady, ongoing collection of a consistent set of metrics from a set of devices. All the rest - alerts, emails, blinky dots and more - is the happy byproduct you enjoy when you do monitoring correctly.
If you've been collecting all that data, why not analyze it to see what "normal" looks like for each device? This is called a "baseline" and it reflects not just an overall average, but also the normal run rate per day and even per hour. If you can derive this "baseline" value, then your alert trigger can go from, "Alert when CPU % utilization is >= <some fixed value>," to, "Alert when CPU % utilization is >= 10% over the baseline for this time period."
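Here is one way that could look in Python, offered as a sketch rather than as how any particular product calculates baselines. It assumes the baseline is the plain average of past samples for the same weekday and hour, and it reads "10% over the baseline" as a relative margin (a baseline of 50 gives a trigger at 55); a tool could just as reasonably use percentage points or a statistical band. The function names and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def build_baseline(samples):
    """samples: iterable of (datetime, cpu_pct) -> {(weekday, hour): average cpu_pct}."""
    buckets = defaultdict(list)
    for ts, cpu in samples:
        buckets[(ts.weekday(), ts.hour)].append(cpu)
    return {slot: mean(values) for slot, values in buckets.items()}


def over_baseline(ts, cpu_pct, baseline, margin_pct=10.0):
    """True if cpu_pct is at least margin_pct percent above the baseline for this slot."""
    expected = baseline.get((ts.weekday(), ts.hour))
    if expected is None:
        return False  # no history for this time slot yet, so stay quiet
    return cpu_pct >= expected * (1 + margin_pct / 100)


# Hypothetical history: a heavy batch job on Saturdays at 2 a.m., quiet Monday mornings.
history = [
    (datetime(2016, 4, 16, 2), 90), (datetime(2016, 4, 23, 2), 92), (datetime(2016, 4, 30, 2), 94),
    (datetime(2016, 4, 18, 10), 30), (datetime(2016, 4, 25, 10), 34),
]
baseline = build_baseline(history)

print(over_baseline(datetime(2016, 5, 7, 2), 95, baseline))   # False: normal for the batch window
print(over_baseline(datetime(2016, 5, 2, 10), 60, baseline))  # True: far above a quiet Monday
```

Note that 95 percent during the batch window stays quiet while 60 percent on a Monday morning alerts, which is exactly the behavior a single fixed threshold cannot give you.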
IT pros tried these weird monitoring tricks and the results will shock you!
When monitoring engineers implement and use the capabilities of their monitoring solutions to the fullest, the results are liberating for all parties. Alerts become both more specific and less frequent, which gives teams more time to actually get work done. This in turn causes those same teams to trust the alerts more and react to them in a timely fashion, which benefits the entire business. Best of all, everyone experiences the true value that good monitoring brings and starts engaging us monitoring engineers to create alerts and build insight that helps stabilize and improve the environment even more.
"Monitoring Team Saved This Company $$$" isn't some fake headline designed to get clicks. With a little work, it can be the truth for every organization.
For even more alerting insights, check out the latest episode of SolarWinds Lab here - what happens at 22:17 will blow you away!