March 1, 2017 by Kenneth Fisher
One of the most common ways to get an event notification is by email. So what happens when you get 500 emails a day and only one or two are actionable? Do you read every single email, spending quite literally hours to find those one or two gems? Or do you just ignore the whole lot and wait for some other notification that there is a problem, say a user calling you?
Next, let’s say you have a job that runs every few minutes checking whether an instance is down. When that instance goes down you get an immediate email. Which is awesome! Of course, while you are trying to fix the issue, you then get dozens more emails about the same outage. That is at best distracting and at worst makes it take longer to fix the issue.
Ok, obviously event storming is bad. (Quick note: not all event storms are emails, that’s just the example I’m using here.) That said, what do we do about it? Reduce the number of events, obviously. There are a few common methods.
Let’s say you are alerting on the percentage of disk used, with a default threshold of > 80% used. Do you really need to know when you still have 500GB free on your 3TB disk? You also want to collect baseline information (for many reasons) so you have a better idea of what your actual numbers look like and can alert when they move outside of your reasonable baselines. Erin Stellato (b/t) has a good article on SQL Server Central, Back to Basics: Capturing Baselines on Production SQL Servers, to help you get started.
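To make that concrete, here’s a minimal sketch of baseline-based alerting instead of a flat percentage. Everything here is hypothetical: the function name, the sample history, and the thresholds are illustrative assumptions, not a real monitoring API.

```python
from statistics import mean, stdev

def should_alert(used_history_gb, used_now_gb, disk_size_gb,
                 min_free_gb=100, sigmas=3):
    """Alert when usage is abnormal for THIS disk, or when free space
    is genuinely low -- not just because a fixed percentage tripped."""
    baseline = mean(used_history_gb)      # what's "normal" for this disk
    spread = stdev(used_history_gb)
    abnormal_growth = used_now_gb > baseline + sigmas * spread
    low_free = (disk_size_gb - used_now_gb) < min_free_gb
    return abnormal_growth or low_free

# A 3TB disk that sits steadily around 2.5TB used: 83% full, but normal.
history = [2480, 2490, 2500, 2495, 2505]
print(should_alert(history, 2510, 3000))   # False: within baseline, 490GB free
print(should_alert(history, 2950, 3000))   # True: only 50GB free
```

A flat 80% rule would have alerted on both samples; the baseline version stays quiet until something actually changes.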
Collect your data into a central repository, then send out alerts on the combined data. If you have 100 databases, or 1000 instances, you don’t want to be sending out alerts on each individual piece of information. Collect it together, parse it, then send out a single consolidated alert. I wrote one system a while back that sent out two emails. One listed every database on every instance that didn’t have a backup within a reasonable amount of time, and the other checked things like online status. Each morning we had a status report of all of the problems (of those types) across all of our instances at once. A single email, not dozens.
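The consolidation step might look something like this: take all the findings gathered from the central repository and build one digest body instead of one email per finding. The instance names, issue strings, and function name are all made up for illustration.

```python
from collections import defaultdict

def build_digest(findings):
    """findings: list of (instance, database, issue) tuples pulled from
    the central repository. Returns one email body grouping every
    problem by issue type, instead of one email per problem."""
    by_issue = defaultdict(list)
    for instance, database, issue in findings:
        by_issue[issue].append(f"{instance}.{database}")
    lines = []
    for issue, targets in sorted(by_issue.items()):
        lines.append(f"{issue} ({len(targets)}):")
        lines.extend(f"  {t}" for t in sorted(targets))
    return "\n".join(lines)

findings = [
    ("SQL01", "Sales", "no recent backup"),
    ("SQL02", "HR", "no recent backup"),
    ("SQL02", "Audit", "database offline"),
]
print(build_digest(findings))
```

Three problems, one email body. The morning status report described above is just this pattern run on a schedule.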
Collect your data (like I said above) and apply logic to it. If you sent this alert out five minutes ago, you don’t really need another one right now. Or maybe you do, if the situation has gotten worse? Eventually, if the problem persists, you do want to send out another alert. But how long should you wait? Well, it depends.
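That suppress-unless-worse logic can be sketched as a small throttle. The class name, key format, and 30-minute cooldown are assumptions for the example; tune the cooldown to whatever "it depends" works out to for you.

```python
import time

class AlertThrottle:
    """Suppress repeat alerts for the same key until a cooldown passes,
    unless the situation has gotten worse (higher severity)."""

    def __init__(self, cooldown_seconds=1800):
        self.cooldown = cooldown_seconds
        self.last_sent = {}   # key -> (timestamp, severity)

    def should_send(self, key, severity, now=None):
        now = time.time() if now is None else now
        prev = self.last_sent.get(key)
        if prev is not None:
            sent_at, prev_severity = prev
            if now - sent_at < self.cooldown and severity <= prev_severity:
                return False   # same or lower severity, too soon: suppress
        self.last_sent[key] = (now, severity)
        return True

t = AlertThrottle(cooldown_seconds=1800)
print(t.should_send("disk:SQL01:D", severity=1, now=0))     # True: first alert
print(t.should_send("disk:SQL01:D", severity=1, now=300))   # False: too soon
print(t.should_send("disk:SQL01:D", severity=2, now=600))   # True: got worse
```

Once the cooldown expires, the next check fires again on its own, which covers the "if the problem persists" case too.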
Also, time your alerts. Did your daily differential fail? You’re still recoverable, right? Your log backups are succeeding and you have a full from Saturday, so you probably don’t need to send out an email about the failure at 3 AM. It can wait until morning.
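One way to sketch that timing decision: page immediately only for severe failures, and hold everything else for business hours. The severity labels and the 7 AM–6 PM window are purely illustrative assumptions.

```python
from datetime import datetime, time as dtime

def deliver_now(severity, failed_at,
                business_start=dtime(7, 0), business_end=dtime(18, 0)):
    """Page immediately only for critical failures; everything else
    waits for business hours. A failed differential with healthy log
    backups and a recent full is still recoverable, so it can wait."""
    if severity == "critical":
        return True
    return business_start <= failed_at.time() <= business_end

print(deliver_now("warning", datetime(2017, 3, 1, 3, 0)))    # False: 3 AM, hold it
print(deliver_now("critical", datetime(2017, 3, 1, 3, 0)))   # True: page now
```

Anything held overnight can then be folded into the consolidated morning report rather than sent as its own email.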
You won’t always be able to control what alerts you are getting. They may be coming from other systems you don’t have access to, or you may just need to clear the chaff so you can start working on reducing what you can change. Create email rules. Shift events you know aren’t important off to a folder to be reviewed every now and again. You may lose some true positives, but if you are careful it won’t be many. And IMHO it’s better to lose a few positives (depending on what they are) than to start ignoring everything because you are being flooded.