I got an email from someone who shall remain nameless. He's been having issues with his team's settlements application and was asking what I thought about the whole thing. I asked him what had gone on, and basically there had been numerous unrelated issues which resulted in the reports being delayed on different days during the past four days. The following happened over the last week:

  • The data feed failed because the application feed server ran out of memory – settlements didn’t get the data feed they need for the adjustment/calculation reports
  • The data drive on the settlements report server filled up and the application team had to archive off data – settlements IT support had not archived files, and when the application failed it dumped the entire system memory to a log file, needing even more space
  • The settlements reports system, which generates the reports into an XML feed, reconfigures them and emails them out, hung, so no emails were sent out with people’s positions
  • The Rendezvous network feed went on its holidays for a while due to an application error

The result has been a rather emotional service delivery manager, a quite flustered settlements support team leader, and traders who are a little bit emotional and questioning IT’s ability and parentage. As a result the following actions have been taken by IT:

  1. Failed over the application to the disaster recovery infrastructure
  2. Halted all changes to the production and disaster recovery settlements systems
  3. Instigated a service improvement plan for the settlements application infrastructure
  4. Started evaluating an upgrade of the application code

You see, users don’t understand that, for no reason whatsoever, IT will every now and again get ‘high maintenance’. You might get a six month run where everything works wonderfully, then a few weeks where one bit breaks, then an unrelated (or related) component seizes, and before you know it the world, from a user standpoint, has ended. For the example above we should be following these procedures:

  • Don’t panic – that is not to say be uninterested, but we need to remain calm and focussed to fix the issue
  • Respond to the call, the issue and tell everyone you are looking at it and will get back to them – keep the call updated with all relevant information
  • Verify that the core is stabilized:
    • Identify all internal/external upstream and downstream application or data feeds to the settlements system – do a health check on the underlying infrastructure of these feeds (if possible)
    • Identify the systems, the operating system, the middleware and the database involved in the settlements application and perform a health check and configuration review of those systems
    • Identify the network switches/ports and check if they’re OK – there have been times when we’ve come across a switch that could have done with a reboot, or a port that was at fault or erroring out
  • Have the application team verify everything they can
  • Separate out each component failure (reading the list above, a number of different things failed) and explain what happened, in order
  • Stop messing with it: stop changes, stop failovers, stop service restarts, stop everything unless it is necessary (this will sound ridiculous), and let it be. Only by doing a few runs of the batch can we identify the issues, see the trends and the bottlenecks, and then make the necessary infrastructure and application changes/projects to upgrade or refresh components/code.
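
To make the "verify the core" step a bit more concrete, here is a minimal sketch in Python of the kind of health check I have in mind. It only covers two of the things that bit us above (free disk space and whether a feed endpoint is reachable at all), and every name in it (`CheckResult`, `check_disk`, `check_tcp`, `run_checks`, the thresholds) is made up for illustration rather than taken from any real settlements system.

```python
import shutil
import socket
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    name: str    # what was checked
    ok: bool     # True if the check passed
    detail: str  # human-readable detail for the call/ticket update


def check_disk(path: str, min_free_pct: float) -> CheckResult:
    """Pass if the filesystem holding `path` has at least min_free_pct free."""
    usage = shutil.disk_usage(path)
    free_pct = (usage.free / usage.total * 100) if usage.total else 0.0
    return CheckResult(f"disk {path}", free_pct >= min_free_pct,
                       f"{free_pct:.1f}% free")


def check_tcp(host: str, port: int, timeout: float = 2.0) -> CheckResult:
    """Pass if a TCP connection to host:port can be opened within timeout."""
    name = f"tcp {host}:{port}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return CheckResult(name, True, "reachable")
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return CheckResult(name, False, str(exc))


def run_checks(checks: List[Callable[[], CheckResult]]) -> List[CheckResult]:
    """Run every check and return the results in the order given."""
    return [check() for check in checks]
```

You would feed it something like `run_checks([lambda: check_disk("/data", 20.0), lambda: check_tcp("feed-server", 7500)])`, where the path, host name and port are placeholders for your own estate. A reachable port only tells you something is listening, not that the feed is healthy, so treat it as a first pass before the application team does their own verification.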

If you look at the example above, the organization (IT and the business) has moved the goalposts. We’re now running on a different set of servers, with different application code/logs, so we now need to run for a few days on that (regardless) to see if we can replicate the behaviour. When you actually look at the issues above, it’s not that one part has failed, it’s that component parts around the settlements application have failed; it could be nothing to do with the settlements application itself. Break down the series of events into their component parts and identify which parts could have been prevented by best practice, which parts need upgrading/reconfiguring and which were unexpected.
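
As a rough illustration of that breakdown, here is a small Python sketch that records each failure against one of the three buckets and groups them. The incidents are the four from the list at the top, but the bucket I have put each one in is my own guess for the sake of the example, not a diagnosis, and the names (`Category`, `Incident`, `group_by_category`) are invented for this post.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Category(Enum):
    PREVENTABLE = "preventable by best practice"
    NEEDS_CHANGE = "needs upgrading/reconfiguring"
    UNEXPECTED = "unexpected"


@dataclass(frozen=True)
class Incident:
    component: str
    what_happened: str
    category: Category


# Category assignments below are assumptions made for illustration only.
INCIDENTS = [
    Incident("application feed server", "ran out of memory", Category.NEEDS_CHANGE),
    Incident("report server data drive", "filled up, files not archived", Category.PREVENTABLE),
    Incident("settlements reports system", "hung, no emails sent", Category.UNEXPECTED),
    Incident("Rendezvous network feed", "application error", Category.UNEXPECTED),
]


def group_by_category(incidents: List[Incident]) -> Dict[Category, List[Incident]]:
    """Group incidents by category, keeping their original order."""
    grouped: Dict[Category, List[Incident]] = defaultdict(list)
    for incident in incidents:
        grouped[incident.category].append(incident)
    return dict(grouped)


if __name__ == "__main__":
    for category, items in group_by_category(INCIDENTS).items():
        print(category.value)
        for item in items:
            print(f"  - {item.component}: {item.what_happened}")
```

The point is less the code than the habit: once every failure sits in a bucket, the preventable ones become housekeeping tasks, the ones needing change become projects, and the unexpected ones are what you watch for over the next few batch runs.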
