Why is as important as How
August 28, 2019 by Kenneth Fisher
I recently had someone tell me, and I’m paraphrasing a bit here:
We know how to fix this now, so all this discussion of why it (the fix) works is superfluous.
And that’s great in the short run. I mean, we had to get production up and running again, right? But what about next week when the problem happens again? It may not be on that server, or that database, or even in that particular piece of code, but most problems turn up at least occasionally, and without understanding not just how to fix it but why the fix works, we are just throwing possible solutions at a problem and hoping one of them sticks.
Just as an example: “XYZ is slow, we need you to rebuild the indexes.” Why is that fixing your problem? Which index is the problem? I mean, it’s probably not just one, right? Is an index really the problem, or do the stats need to be updated? Or maybe it’s because when the stats are updated it invalidates the current query plan for one specific stored procedure. If you know exactly what’s going on, you can recompile that one procedure, or even better, work on tuning it so that you don’t have that particular problem in the future.
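As a sketch of how much narrower the targeted fix is than “rebuild the indexes,” here are the two alternatives mentioned above in T-SQL. The object names (Sales.OrderHeader, dbo.GetOrders) are hypothetical, purely for illustration:

```sql
-- Hypothetical table name: refresh statistics on just the one table
-- whose stale stats are driving the bad estimates, rather than
-- rebuilding every index on the server.
UPDATE STATISTICS Sales.OrderHeader WITH FULLSCAN;

-- Hypothetical procedure name: mark just the one affected stored
-- procedure for recompilation, so it gets a fresh query plan on its
-- next execution without touching anything else in the plan cache.
EXEC sp_recompile N'dbo.GetOrders';
```

Either statement takes seconds and affects one object; a blanket index rebuild can run for hours and still not address the actual cause.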
Fixing a problem is important. Figuring out what went wrong will help you solve it faster in the future, or even better stop it from happening again.
You’re hitting on a big pet peeve of mine. I’ll say that the ‘why’ is more important than the ‘how’. It is not just database-specific. Too many people use what I call the ‘shotgun’ approach to debugging an issue. When they happen to find an approach that appears to work, it becomes their go-to response. Often that response is a band-aid, not a cure for the underlying issue. Understanding the underlying issue leads to a real fix, not a band-aid. Maybe this phenomenon is a product of the compressed timelines we are all faced with nowadays.
Yeah, that compressed timeline is definitely at least part of the problem. In this particular case we had production down and needed to get it up. Speed was essential, and any fix, shotgun or not, was what we needed. But only right then. Once it was fixed we should have gone back and done some research to figure out what had actually happened. And to be fair, we are doing that, just not with as much support (that I can see) as I would like.
When Production is down, you have the resources that you need. Once it is back up, there is usually something more important that needs your resources, so it is much harder to do the necessary investigation.
I find that this is especially notable with interfaces to other systems, especially at other companies… When testing an interface, it is almost impossible to get everyone together to fix it: your staff, the company that provides your software, the company that hosts your system, and the company you are interfacing with. But when PRODUCTION isn’t working, suddenly the interface becomes everyone’s top priority.
I don’t like to operate like this, but it does work. So it is really hard to avoid it, especially since testing the interface in a test environment provides no guarantee that it will work in production anyway.
In my experience, it is also helpful to identify why the problem is happening now. What has changed? If nothing has changed, it sometimes means that there is a major underlying problem which was not apparent and someone has finally noticed a symptom of it. Fixing that symptom just keeps the underlying problem hidden, allowing it to do even more damage.
Exactly. Although knowing what changed is only a starting point. We’d upgraded from 2008 to 2016. We know what changed, but we couldn’t roll back at this point and had to find a way to move forward. Knowing what the change was, we still needed to know exactly why the change was causing the problem and what would fix it.