One of my friends works for a service provider. She was telling me the story of a contract they had been awarded, which ended up in a series of discussions about the transformation required to deliver service improvements. The issue was that some key servers, each performing a different function, failed, resulting in separate issues and incidents being logged. Because the incidents happened so close together, however, the feeling was that the IT kept failing and that service improvements were needed. My friend’s role was to provide a report and lead discussions about the steps needed to resolve the situation. The interview explains it all; my questions are in italics.

Explain the issues that you had

  1. The primary SMS server failed, which left us unable to deploy new desktop applications to users’ desktops. For a few days applications had to be dropped manually, causing a perceived delay and prompting discussions about inventory and security patching/virus scan software.
  2. One of the application servers failed, leaving settlements archive data unavailable. Again, this was a disruption to archive data rather than to live service, and the data could be obtained by other means in other applications.
  3. The primary file server for one business line got full, leaving users unable to save their spreadsheets for specific in-house written applications.

What was the key factor in each incident, and how was each one resolved?

  1. The SMS server failed as a result of age. It was a Compaq ProLiant 1600R with two array shelves, running Windows 2000 with SMS. The array controller failed and corrupted the disk configuration. The server team tried a Windows repair, but this was unsuccessful. The server was then handed over to the hardware team, who replaced a faulty disk and the array controller and upgraded the firmware, after which the Windows guys re-installed Windows and SMS. That part only took a few hours; the hardware part took the best part of a day whilst we ran diagnostics and obtained parts.
  2. The application server had two things go wrong with it: the operating system blue screened, complaining that the registry was corrupt, and it was discovered that there was no backup due to the volume of data (700GB). The team therefore had to very carefully get Windows to start to the point where they could copy off the data, re-install Windows and then copy the data back.
  3. The file server had a 130GB volume and it got full. One of the applications that ran on the server failed, creating very large logs in a short period of time; these filled the drive while users continued to work on the file server and save their work. The fix was to delete the application logs, recovering 10GB of disk space.
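The third incident is the kind of failure a simple free-space check can flag early. The following is a minimal sketch in Python, not something the team in the interview is known to have used; the function names and the 10% threshold are assumptions chosen for illustration.

```python
import shutil


def is_low_on_space(total_bytes: int, free_bytes: int,
                    min_free_fraction: float = 0.10) -> bool:
    """Return True when free space is below the given fraction of the volume."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_volume(path: str, min_free_fraction: float = 0.10) -> bool:
    """Check the volume that holds `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return is_low_on_space(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # "." is just a placeholder for a share or mount point.
    print("low on space:", check_volume("."))
```

With the figures from the interview, a 130GB volume with only 10GB free is at roughly 7.7% free, so a 10% threshold would still flag it after the logs were deleted.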

In summary, this was legacy infrastructure which lacked investment and best practice configuration. The SMS server was barely in support, and the application server was all RAID5 with no thought given to backups or to what happens if something fails. The file server was simply one of those everything-to-everyone servers, cross application and cross purpose: the accounts team stored their accounts on it, whilst development ran their applications on desktops but output their logs to it. There was a lack of architecture, and too much growth and evolution of the business but not of the platform.

What steps did you recommend going forward?

For the SMS server, two things: an SMS upgrade to a newer version, and a new server. Nothing expensive, just something in support: a DL385 with local storage which met our requirements, running Windows 2003 and a new version of SMS, all configured in line with standards. The new server proved to be a lot easier to support, more scalable, and took up a lot less space.

For the application, we asked for clarification about data requirements: how much of the data actually needed to be online, and what could be archived once off to tape (and restored on demand). The feedback was that we only needed 6 months of data, which was 180GB of the 700GB. We therefore performed a full archive of the drive, tested the restore a few times and deleted all but the 6 months of data required. With the data size down, it could now be added to the backups, as there was sufficient capacity in the backup infrastructure for incremental and weekly backups.
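Deciding what stays online and what goes to tape comes down to splitting the data at a cutoff date. Here is a hedged sketch of that split in Python. It assumes file modification time is a fair proxy for the age of the data, which the interview does not state, and the names `split_by_age`, `scan` and the 183-day window are illustrative only. It only lists candidates; nothing is deleted, in keeping with the archive, test restore, then delete order described above.

```python
from datetime import datetime, timedelta
from pathlib import Path


def split_by_age(entries, now, keep_days=183):
    """Split (name, modified_time) pairs into (keep, archive) lists.

    keep_days=183 approximates six months and is an assumption.
    """
    cutoff = now - timedelta(days=keep_days)
    keep, archive = [], []
    for name, modified in entries:
        (keep if modified >= cutoff else archive).append(name)
    return keep, archive


def scan(root):
    """Yield (path, modified_time) for every file under root."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            yield str(path), datetime.fromtimestamp(path.stat().st_mtime)


if __name__ == "__main__":
    # "." is a placeholder for the data drive.
    keep, archive = split_by_age(scan("."), datetime.now())
    print(len(keep), "files to keep online;", len(archive), "archive candidates")
```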

The file server was trickier due to the nature of the setup at that time. We had numerous file servers which lacked structure and had grown out of additional requirements. As one server got full, we moved that role/function to another one, creating mixed environments with mixed requirements.

The immediate step taken was data role consolidation, through the creation of three data environments:

  • Application – in-house written applications which ran from or wrote to a shared network drive.
  • Infrastructure – data which was needed for support or shared purposes, for example IT documentation, install media, and the location users ran Office etc. from.
  • User data – where users’ shared drives for Outlook and their profile data would be stored.

The three file servers we had were then rebuilt one at a time, using backup/restore and migration techniques to move data around, with a separate partition left for the application which ‘caused’ the issue, so to speak. By doing this we could also illustrate storage requirements and utilization more effectively, and take one part of the IT down without affecting the others, so upgrading ud1 only affected user data, not users’ ability to access their applications or run Windows.
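Once every share belongs to exactly one environment, reporting utilization per environment is a simple roll-up. A small sketch in Python, using made-up share names and sizes purely for illustration:

```python
from collections import defaultdict


def usage_by_environment(shares):
    """Sum used space per environment from (share, environment, used_gb) rows."""
    totals = defaultdict(int)
    for _share, environment, used_gb in shares:
        totals[environment] += used_gb
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical shares and sizes; none of these come from the interview.
    rows = [
        ("apps1", "application", 50),
        ("docs", "infrastructure", 20),
        ("profiles", "user data", 30),
        ("apps2", "application", 10),
    ]
    for environment, used_gb in usage_by_environment(rows).items():
        print(f"{environment}: {used_gb}GB")
```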

What was the most significant challenge?

From my standpoint, the main step in terms of delivery was to remind everyone that there were three separate component failures:

  • SMS server hardware failure resulting in operating system corruption
  • Application server Windows failure
  • File server running out of space

This was important, as I wanted to highlight that IT wasn’t falling apart and that we were not making mistakes; these were unfortunate, but in some respects to be expected, failures within legacy components of the infrastructure. Some immediate, small investment was needed to prevent this from happening again, along with some data re-organization to meet our changing business requirements and bring the IT in line with our dependence on it. When we first used SMS it was an IT tool for testing. As the business requirements and user base evolved, the importance and role of SMS changed but the underlying technology did not, just like the management of the data on the application server.

Moving past the IT failure discussion meant reminding the stakeholders that the IT was old and out of support, and that we had got our return on investment. There was also some hesitation about spending money, so we had to do some cost due diligence. Replacing that Compaq 1600 and its two array shelves with a DL385 let us reclaim a lot of power and data center rack space, and we backed this up with some performance analysis and a discussion about reliability and being in support.

The data migration and consolidation was also a bit tricky, simply because of the need to obtain sign off, ensure everyone was aware of what we were doing, obtain the downtime window, and manage everyone’s requirements in terms of drive letters, shares and access. But with the new process in place and the solution working quite well, we have improved the senior management team’s understanding of what is on which server, and re-branded everything within the IT space so that when we have an issue it can be identified quickly and we know what is being affected.



