Event storming! It’s raining emails!

March 1, 2017 by Kenneth Fisher

One of the most common ways to get an event notification is by email. So what happens when you get 500 emails in a day and only one or two are actionable? Do you read every single email, spending quite literally hours to find those one or two gems? Or do you just ignore the whole lot and wait for some other notification that there is a problem? Say, a user calling you?

Next, let’s say you have a job that runs every few minutes checking if an instance is down. When that instance goes down you get an immediate email. Which is awesome! Of course, while you are trying to fix the issue, you then get dozens more emails about the same outage. That is at best distracting and at worst makes the fix take longer.

Ok, obviously event storming is bad. (Quick note: not all event storms are emails; that’s just the example I’m using here.) That said, what do we do about it? Reduce the number of events, obviously. There are a few straightforward ways to do that.

Better Thresholds
Let’s say you are alerting on the percentage of disk used, and you are using a default of > 80% used. Do you really need to know that you only have 500 GB left on your 3 TB disk? You also want to collect baseline information (for many reasons) so you have a better idea of what your actual numbers are and can alert when these numbers move outside of your reasonable baselines. Erin Stellato has a good article on SQL Server Central, Back to Basics: Capturing Baselines on Production SQL Servers, to help you get started.
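
As a rough illustration of alerting against a baseline rather than a fixed percentage, here is a minimal Python sketch. The function name, the sample numbers, and the "three standard deviations" rule are all assumptions made for the example, not something from a particular monitoring tool; pick whatever definition of "outside the baseline" fits your own data.

```python
from statistics import mean, stdev


def is_outside_baseline(history, current, sigmas=3.0):
    """Return True when `current` is more than `sigmas` standard
    deviations away from the mean of the collected baseline."""
    if len(history) < 2:
        return False  # not enough samples to call anything a baseline
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return current != mu  # perfectly flat baseline: any change is news
    return abs(current - mu) > sigmas * sd


# Hypothetical daily "percent used" samples for a disk that always sits near 83%.
baseline = [82.0, 82.5, 83.0, 82.8, 83.1, 82.9, 83.2]

print(is_outside_baseline(baseline, 83.4))  # normal for this disk
print(is_outside_baseline(baseline, 91.0))  # a real jump worth an alert
```

With a fixed 80% threshold this disk would alert every single day. Against its own baseline it stays quiet until the number actually moves.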

Consolidated Emails
Collect your data into a central repository, then send out alerts on the combined data. If you have 100 databases, or 1000 instances, then you don’t want to be sending out alerts on each individual piece of information. Collect it together, parse it, then send out a single consolidated alert. I wrote one system a while back that sent out 2 emails. One listed every database on every instance that didn’t have a backup within a reasonable amount of time, and the other checked things like online status. Each morning we had a status report of all of the problems (of those types) across all of our instances at once. A single email, not dozens.
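
The consolidation step might look something like the sketch below. It assumes the findings have already been collected into a list of (instance, database, problem) tuples; the instance names, problem labels, and function name are hypothetical, and actually sending the email is left out.

```python
from collections import defaultdict


def build_consolidated_report(findings):
    """Group (instance, database, problem) findings into one message body,
    with one section per problem type."""
    by_problem = defaultdict(list)
    for instance, database, problem in findings:
        by_problem[problem].append(f"{instance}.{database}")

    lines = []
    for problem in sorted(by_problem):
        items = sorted(by_problem[problem])
        lines.append(f"{problem} ({len(items)}):")
        lines.extend(f"  {item}" for item in items)
    return "\n".join(lines)


# Hypothetical findings pulled from the central repository.
findings = [
    ("SQL01", "Sales", "no recent backup"),
    ("SQL02", "HR", "no recent backup"),
    ("SQL01", "Archive", "offline"),
]

report = build_consolidated_report(findings)
print(report)
```

Three findings, one message body. The same function works whether the list holds three rows or three thousand.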

Logic
Collect your data (like I said above) and apply logic to it. If you sent this alert out 5 minutes ago you don’t really need another one right now. Or maybe you do, if the situation has gotten worse. Eventually, if the problem persists, you want to send out another alert. But how long should you wait? Well, it depends.
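
One way to express that logic is a small throttle that remembers what it last sent for each problem. This is only a sketch: the class name, the 30 minute cooldown, and the numeric severity scale are assumptions for the example, and the right cooldown really does depend on the alert.

```python
from datetime import datetime, timedelta


class AlertThrottle:
    """Suppress repeats of the same alert during a cooldown window,
    unless the severity has gone up since the last one was sent."""

    def __init__(self, cooldown=timedelta(minutes=30)):
        self.cooldown = cooldown
        self._last = {}  # alert key -> (time sent, severity sent)

    def should_send(self, key, severity, now):
        previous = self._last.get(key)
        if previous is not None:
            sent_at, sent_severity = previous
            still_cooling = now - sent_at < self.cooldown
            if still_cooling and severity <= sent_severity:
                return False  # same problem, no worse, sent recently
        self._last[key] = (now, severity)
        return True


throttle = AlertThrottle()
t0 = datetime(2017, 3, 1, 9, 0)

print(throttle.should_send("SQL01 down", 1, t0))                          # first alert
print(throttle.should_send("SQL01 down", 1, t0 + timedelta(minutes=5)))   # repeat, suppressed
print(throttle.should_send("SQL01 down", 2, t0 + timedelta(minutes=10)))  # worse, sent
print(throttle.should_send("SQL01 down", 2, t0 + timedelta(minutes=45)))  # persisted past cooldown
```

The first alert goes out, the repeat five minutes later is dropped, the escalation gets through, and a problem that persists past the cooldown sends a reminder.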

Also, time your alerts. Did your daily differential fail? You’re still recoverable, right? Your logs are succeeding and you have a full from Saturday, so you probably don’t need to send out an email about the failure at 3am. It can wait until morning.
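
Holding a non-urgent alert until morning can be sketched as a function that decides when an alert should be delivered. The 8am to 6pm working window and the `urgent` flag are assumptions for the example; how you decide what counts as urgent is up to you.

```python
from datetime import datetime, time, timedelta


def delivery_time(raised_at, urgent, day_start=time(8, 0), day_end=time(18, 0)):
    """Return when an alert should go out: immediately if it is urgent or
    raised during the working window, otherwise at the next day_start."""
    if urgent or day_start <= raised_at.time() < day_end:
        return raised_at
    next_morning = datetime.combine(raised_at.date(), day_start)
    if raised_at.time() >= day_end:
        next_morning += timedelta(days=1)  # raised in the evening: wait for tomorrow
    return next_morning


failed_at = datetime(2017, 3, 1, 3, 0)
print(delivery_time(failed_at, urgent=False))  # held until 8am the same day
print(delivery_time(failed_at, urgent=True))   # sent right away
```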

Email Rules
You won’t always be able to control what alerts you are getting. They may be coming from other systems you don’t have access to, or you may just need to clear the chaff so you can start working on reducing what you can change. Create email rules. Shift events you know aren’t important off to a folder to be reviewed every now and again. You may miss some real alerts, but if you are careful they won’t be many. And IMHO it’s better to miss a few real alerts (depending on what they are) than to start ignoring everything because you are being flooded.

5 thoughts on “Event storming! It’s raining emails!”

  1. notarian says:

    It’s raining men, hallelujah, it’s raining men, amen
    I’m gonna go out to run and let myself get
    Absolutely soaking wet

  2. I swear to god man it’s like you’re actually baiting me sometimes. Cause you know this is exactly what Minion Enterprise does and I think you do this just so you can see how long I’ll last before I have to say something.
    Yes folks, this is exactly what Minion Enterprise does. It never event storms, even right out of the box with no extra config.

    • Wait, you mean I might have had ME in mind while I was writing this? 🙂

      Actually I did, but the post was more written for those people who don’t have ME, or have ME and some other process.
