22 September 2008

Monitoring is everything

I was chatting with a colleague the other day, and he was telling me about a new drive they have to reduce their operational costs. Too much money is being spent on calling out engineers overnight, and IT is reactive rather than proactive.

It is valuable insight and, on the whole, true. Spending a little effort during the day might easily reduce your overnight callouts: think of that script you run to see how many servers are running low on disk space, and so on.
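As a rough idea of what that daytime script might look like, here is a minimal sketch. It assumes you have already collected total and free bytes per server by some means (an agent, a remote query, whatever you use); the 10% threshold, the function names and the second server name are my own illustrative choices, not anything from Chris' setup.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Percentage of the volume that is still free (0 if the size is unknown)."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * free_bytes / total_bytes


def low_on_space(usage, threshold_pct=10.0):
    """usage maps server name -> (total_bytes, free_bytes).

    Returns the sorted names of servers whose free space is below the threshold.
    """
    return sorted(
        name
        for name, (total, free) in usage.items()
        if free_percent(total, free) < threshold_pct
    )


def local_usage(path="/"):
    """Collect (total, free) for a local path; remote collection is out of scope here."""
    du = shutil.disk_usage(path)
    return (du.total, du.free)


# Hypothetical figures: one server nearly full, one healthy.
sample = {"ITL1782": (100, 3), "ITL1783": (100, 40)}
print(low_on_space(sample))
```

Run during the day, a report like this gives someone the chance to clear space before the 22:37 alert ever fires.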

What really matters, though, is your call out. To illustrate this, we’re going to use Chris’ example from his organization, based in sunny Canary Wharf. His monitoring solution is linked to their inventory: when they get an alert, the alert states the team that should be notified to action it. The inventory is therefore almost as crucial as the alert, because an incorrect entry can add cost and delay to fault resolution.

  • Server ITL1782 runs out of disk space on its C drive at 22:37 and an alert is generated and sent to the operations team.
  • Operations call out the Windows server team. The Windows server team respond and ask for more information, including the server name and the application. “Sorry, we don’t look after ITL servers, that needs to go to retail support”.
  • Operations call out Retail Support requesting they look at server ITL1782. Retail respond that they will look into it.
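The wasted call out in the steps above comes down to a lookup that nobody could do at 22:37. A minimal sketch of that lookup, assuming the inventory can map a server-name prefix to an owning team (the prefixes, team identifiers and function names below are hypothetical, not Chris' actual schema):

```python
# Hypothetical inventory: server-name prefix -> owning team.
INVENTORY = {
    "ITL": "retail-support",
    "WIN": "windows-server-team",
}

DEFAULT_TEAM = "operations"  # fallback when the inventory has no matching entry


def owning_team(server_name, inventory=INVENTORY):
    """Return the team to call out, using the longest matching prefix."""
    matches = [p for p in inventory if server_name.upper().startswith(p)]
    if not matches:
        return DEFAULT_TEAM
    return inventory[max(matches, key=len)]


def route_alert(server_name, message, inventory=INVENTORY):
    """Build the call-out record that operations would act on."""
    return {
        "server": server_name,
        "team": owning_team(server_name, inventory),
        "message": message,
    }


alert = route_alert("ITL1782", "C: drive out of disk space")
print(alert["team"])
```

With the entry in place the alert goes straight to Retail Support. Remove or mistype the "ITL" entry and the same alert falls back to operations, which is exactly how the extra call out happens.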

This is a simple example. The more complex your applications, the way you organize and support them, or even the way you cross charge your IT, the more your inventory and your monitoring need to be kept in line. Certain teams may only be responsible for specific elements of an application’s service, or for specific parts of ‘the server’. In the example above, we’ve called out two teams when we only needed to call out one. In the grand scheme of things, not a big deal. But as we scale out the infrastructure from 7 servers to 800, and from four teams to teams across business lines, those costs increase. An out-of-date inventory or monitoring configuration causes:

  • Delay in incident identification and response by relevant teams involved
  • Delay in fault resolution and disruption to service
  • Unnecessary cost from calling out teams that are not responsible for those systems/applications

This is why we need to move to a follow-the-sun support model, and why we need to empower end users with the monitoring tools so that they can be part of the support process.

I’m not asking an application owner to know what thresholds to set on their AIX servers for disk queue lengths. But I do want to know the following:

  • Application processes that run on the server to be monitored
  • What represents an application failure – if service19 fails do we call out or is that the maintenance task?
  • Log watching – can we monitor your logs and call out as appropriate? For example, error means call out, halt means monitor the batch.
  • What performance thresholds might affect your application outcome/results – if the CPU reaches 100% is this a problem? Or is this normal batch activity between 9pm and 6am?
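The answers to those four questions could be captured as a small per-application profile. The sketch below is only one way to do it: the class, field names and action strings are my own invention, and I have assumed that service19 turns out to be the maintenance task and that the batch window is the 9pm to 6am one mentioned above.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class AppProfile:
    name: str
    critical_processes: set = field(default_factory=set)  # failure means call out
    log_actions: dict = field(default_factory=dict)       # keyword -> action
    batch_start: time = time(21, 0)                       # assumed batch window
    batch_end: time = time(6, 0)

    def in_batch_window(self, now):
        """True if 'now' falls in the batch window, which may wrap midnight."""
        if self.batch_start <= self.batch_end:
            return self.batch_start <= now < self.batch_end
        return now >= self.batch_start or now < self.batch_end

    def on_process_failure(self, process):
        return "call_out" if process in self.critical_processes else "log_only"

    def on_log_line(self, line):
        lowered = line.lower()
        for keyword, action in self.log_actions.items():
            if keyword in lowered:
                return action
        return "ignore"

    def on_cpu(self, percent, now):
        # 100% CPU is treated as normal during the batch window only.
        if percent >= 100 and not self.in_batch_window(now):
            return "call_out"
        return "ignore"


profile = AppProfile(
    name="retailweb",
    critical_processes={"retailweb-app"},  # hypothetical process name
    log_actions={"error": "call_out", "halt": "monitor_batch"},
)
print(profile.on_process_failure("service19"))
print(profile.on_cpu(100, time(23, 0)))
```

The point is not the code but who fills in the values: the infrastructure team owns the mechanics, the application owner owns the answers.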

I want the application team to take part in application profiling and to understand how we can effectively monitor the infrastructure to benefit them. That allows us to have an application-centric view of the service: retailweb is green, meaning all is well.
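One simple way to get to "retailweb is green" is to roll the state of each monitored component up to the application, worst state wins. This is a sketch under my own assumptions: the green/amber/red scale and the component names are illustrative, and a real tool would likely weight components differently.

```python
# Assumed three-level scale; higher number means worse.
SEVERITY = {"green": 0, "amber": 1, "red": 2}


def app_status(component_states):
    """Roll component states up to one application status (worst state wins).

    component_states maps component name -> "green" | "amber" | "red".
    An application with no monitored components is reported as "unknown".
    """
    if not component_states:
        return "unknown"
    return max(component_states.values(), key=lambda state: SEVERITY[state])


# Hypothetical components behind the retailweb application.
retailweb = {"web-frontend": "green", "database": "green", "batch": "green"}
print("retailweb is", app_status(retailweb))
```

If any one component goes amber or red, the application view changes with it, which is the view the application team actually cares about.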
