Oracle Performance: LAST Cause is not the same as ROOT Cause
Looking for "What Changed" not so Helpful
When faced with a new problem to troubleshoot, many smart people immediate ask the question, "What Changed?" This seems logical, but is actually a weak way to solve a performance problem.
By looking for recent changes, you are making a questionable assumption. You are assuming that everything was indeed fine until the problem was reported. That is a bad assumption--especially in the world of performance tuning.
Just because something was running pretty well before, that doesn't mean it was well-designed; it could just mean you sometimes "get away" with a poor design.
In philosophy, the most recent cause is called the "Proximate" cause. The root cause is called the "Ultimate" cause.
Two examples will clarify:
Example 1: Caching Effect
A poorly designed sql statement requires 100,000 disk reads. The designer is not aware of how poorly the code is written, since the report runs "okay." Normally, all the blocks are "cached up," so that the performance isn't too bad. The key is, good performance requires all the blocks to be cached. Even a small reduction in caching will have a disastrous effect.
Then, other jobs are added to the server, and the caching goes down from 99% to 90%. Suddenly, the report takes much longer. The users mistakenly search for the "proximate" cause, and ask, "What changed?" They don't realize that the code was poor to begin with, and suggest that some other job is causing their bad performance.
I have personally worked on many cases just like Example 1. It is very common for users to blame another job, when their own design is really the root cause.
Example 2: Unstable Execution Plans
In this case, the sql is overly complicated, with lots of bind variables and awkward joins. The optimizer struggles to figure out a good execution plan. Sometimes it gets it right, sometimes it doesn't. On the current server, given the current statistics and perceived bind variables, the optimizer has a decent plan, the sql runs well.
Now, the same code is run on another database. Stats are re-gathered, and new queries are run. This time, the code runs horribly. The users again ask, "What changed?" assuming that that question is a good place to start.
True Story: The Car Wash
A friend of mine took his car to the local car wash. The car came out nice and clean, but with just one problem: the windshield was cracked! Well, the car owner was mad, and demanded the car wash owner pay for the damage.
The owner of the car wash, being a former Oracle DBA and regular reader of this blog, refused to accept liability for the damage. He argued,
"Our car wash was only the proximate cause--not the ultimate cause."
Well, okay, the car wash owner didn't really use the term, "proximate cause," but he understood the principle. The car wash owner suggested the windshield must have already been damaged, since many tens of thousands of cars had been processed over the years without causing any such damage.
The car wash was being blamed just because it was the nearest cause, not actually the root cause.
Lessons Learned
I am reluctant to begin any investigation with "What changed." A good DBA normally has far better methods to use.
Lesson #1: You Don't Really Know What Changed
Folks mean well, but rarely does anyone really know for a fact what has changed and what hasn't. How often do people say, "We haven't changed anything," but really can't guarantee that.
Lesson #2: Some Things Always Change
Even when a database is cloned to another server, many things will soon be different, even when everyone says, "We haven 't changed anything." For example, the bind variables specified will usually be different. How the optimizer assesses the bind variables can also change. Here's another: The caching effect will always be different, depending on recent jobs run. When a user claims the the report being run is identical as before, they cannot really know that for a fact.
Lesson #3: Poor Design Can Yield Varying Performance
You can write terrible sql, and sometimes you get away with it; the optimizer happens to get it right in your particular case. The problem is, it's an unstable solution, and can easily shift to a bad solution. The slightest change--such as stats, how bind variables are used, size of objects, or caching can trigger plan changes and miserable results.
It would be easy to get caught up analyzing all the factors that cause plan changes, but they are not relevant to the root cause.
Wrap-up: We can Do Much Better
By looking for "What has changed," you are aiming at the wrong target.
By focusing on recent events, you are actually analyzing the consequence of the bad design, not isolating the bad design itself.
Of course, I must admit, you can sometimes guess the right cause, but the "What's Changed" method is not a consistently good approach. We can do much better.
DBAs have excellent ways to find root cause. Here's one: Become adept at using the Active Session HIstory views.