Entering the Stream: There is a cause.

Meditation:

One of the luxuries of our practice in IT is a rational, empirical framework. There is always a root cause.

Story:

You are wearing your lucky socks. You have been issued your security badge. The HR paperwork is complete.

The friendly smiles from the interview process are strained, this Monday morning. There have been 37 application timeouts since 6 AM. Cash posting is delayed.

The department has been informed, that you, the new DBA are the miracle cure for all existing application problems.

Within five minutes of sitting down at your new desk, you get a call from the VP of Finance. They really need to figure out what's going on- and get it fixed- immediately if possible. It's costing hours per day. The folks in Order Entry have come to hate the slow, malperforming system.

You get another call. There's been another five timeouts in the last few minutes. Things are getting worse.

You pull up the familiar outlines of SQL Management Studio.

Nothing.

Your phone rings again. The comptroller is getting security warnings trying to run a critical report.

Right-click connect. At this point, you know none of the architectural internals of any applications. In fact, all you know are server names. No one knows which databases are exactly where, or even what is being hosted per database. The world is unknown, sea monsters lurk:

The room is very, very small, and very very hot. It takes 30 seconds to connect to the primary server and login.

Nothing is documented anywhere.

What now?

Thus began my first pure DBA assignment. I'd been working as a DBA for about ten years, but always as a member of a team, or in a secondary role. This was my first morning of total ownership.

I could feel the dissatisfaction growing with every passing moment. 50 time outs. The cash posting group was wondering if maybe they should go home.

This particular gig hadn't had a DBA onsite for several years.

Breathe.

At the root, our practice is about getting data to and from disk, then into memory, and into the CPU for processing. All problems stem from blocked flow in one of these areas. This is always true.

Breathe.

SSMS showed no blocked sessions. Perfmon showed some disk activity- but only moderate memory use. The CPU was pegged at 95 %.

An hour has passed; Nothing is better. You are sweating.

An investigation of wait state DMV's shows hundreds of CX_PACKET waits. You think you read a paper about that once.

Can't that be caused by intraquery parallelism gone bad? Poor indexing?

You right click on 'Reports->Index Physical Statistics.'

You stifle a scream. Everything is fragmented to 99 %.

A few emergency rebuilds later, concentrating on the indexes with the highest utilization, and the timeouts start dying off.

By lunchtime, there hasn't been a timeout in an hour.

'Thank god you're here,' a colleague says.

It took three months to solve in its entirety. The root cause was a series of bad/missing indexes coupled with poor query design. This was leading to excessive CPU utilization due to bad query plans, under unconstrained parallelization.

The fix was some new indexes- but most importantly- weekly maintenance to make sure that indexes were unfragmented and statistics were correct.

Which I guess yields the takehome: begin at the beginning.

Entering the Stream

Sunday, June 27, 2010

There is a cause.

0 comments:

Labels