One of my friends works for a service provider. She was telling me about a contract they had been awarded, which ended up in a series of discussions about the transformation needed to deliver service improvements. The issue was that several key servers, each performing different functions, failed, and each failure was logged as a separate issue and incident. Because the incidents happened so close together, the feeling was that the IT kept failing and that service improvements were needed. My friend’s role was to provide a report and discuss the steps to take to resolve the situation. The interview explains it all; my questions are in italics.
Explain the issues that you had
What was the key factor in each incident? Summarize how each was resolved.
In summary, legacy infrastructure that lacked investment and best-practice configuration. The SMS server was barely in support. The application server was all RAID 5, with no thought given to backups or to the "what happens if" scenarios. The file server was simply one of those everything-to-everyone servers, cross-application and cross-purpose: the accounts team stored their accounts on it, while development ran their applications on desktops but output their logs to it. There was a lack of architecture, and too much growth and evolution of the business but not of the platform.
What steps did you recommend going forward?
For the SMS server, two things: a new server and an upgrade to a newer version of SMS. Nothing expensive, just something in support: a DL385 with local storage, which met our requirements, running Windows 2003 and the new version of SMS, all configured in line with standards. The new server proved to be a lot easier to support, was more scalable and took up a lot less space.
For the application server, we asked for clarification about data requirements: how much of the data actually needed to be online, and what could be archived to tape as a one-off (and restored on demand). The feedback was that we only needed 6 months of data, which was 180GB of the 700GB. We therefore performed a full archive of the drive, tested the restore a few times and deleted all but the 6 months of data required. With the data size down, it could now be added to the backups, as there was sufficient capacity in the backup infrastructure for incremental and weekly backups.
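To make the retention split concrete, here is a minimal sketch of how files could be sorted into "keep online" and "archive to tape" by age. It is not the tooling my friend used. The file names, dates and individual sizes are hypothetical, chosen only so the totals match the 180GB and 700GB figures above, and the 183-day cutoff is an assumed stand-in for "6 months".

```python
from datetime import datetime, timedelta


def split_by_retention(files, now, retention_days=183):
    """Split (path, modified, size_gb) records into keep-online and archive lists.

    retention_days=183 is an assumed approximation of "6 months".
    """
    cutoff = now - timedelta(days=retention_days)
    keep = [f for f in files if f[1] >= cutoff]
    archive = [f for f in files if f[1] < cutoff]
    return keep, archive


def total_gb(files):
    """Sum the size column of a list of (path, modified, size_gb) records."""
    return sum(f[2] for f in files)


# Hypothetical inventory: names, dates and per-file sizes are made up.
now = datetime(2008, 1, 1)
inventory = [
    ("data/2007-11.dat", datetime(2007, 11, 15), 100.0),
    ("data/2007-09.dat", datetime(2007, 9, 1), 80.0),
    ("data/2006-12.dat", datetime(2006, 12, 1), 320.0),
    ("data/2005-06.dat", datetime(2005, 6, 1), 200.0),
]

keep, archive = split_by_retention(inventory, now)
print(f"keep online: {total_gb(keep)} GB, archive to tape: {total_gb(archive)} GB")
```

The point of the sketch is the order of operations rather than the code itself: nothing in the archive list should be deleted until the full archive has been taken and a restore has been tested, as described above.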
The file server was trickier due to the nature of the setup at that time. We had numerous file servers that lacked structure and had grown out of additional requirements: as one server got full, we moved that role or function to another one, creating mixed environments with mixed requirements.
The immediate step taken was data role consolidation, through the creation of three data environments:
The three file servers we had were then rebuilt one at a time, using backup/restore and migration techniques to move data around, with a separate partition left for the application that ‘caused’ the issue, so to speak. By doing this we could also illustrate storage requirements and utilization more effectively, and take one part of the IT down without the other, so upgrading ud1 only affected user data, not users’ ability to access their applications or run Windows.
What was the most significant challenge?
The main step from my standpoint, in terms of delivery, was to remind everyone that there had been three separate component failures.
This is important as I wanted to highlight that IT wasn’t falling apart and that we were not making mistakes; these were unfortunate, but in some respects to be expected, failures within legacy components of the infrastructure. Some immediate, small investment was needed to prevent this from happening again, as well as some data re-organization to meet our changing business requirements and bring the IT in line with our dependence on it. When we first used SMS, it was an IT tool for testing. As the business requirements and user base evolved, the importance and role of SMS changed, but the underlying technology did not, just like the management of the data on the application server.
Once we had moved past the IT failure discussion and reminded the stakeholders that the IT was old and out of support, and that we had got our return on investment, there was still some hesitation about spending money, so we had to do some cost due diligence. Replacing that Compaq 1600 and its two array shelves with a DL385 reclaimed a lot of power and data center rack space, and we backed that up with some performance analysis and a discussion about reliability and being in support.
The data migration and consolidation was also a bit tricky, simply because of the need to obtain sign-off, ensure everyone was aware of what we were doing, obtain the downtime window and manage everyone’s requirements in terms of drive letters, shares and access. But with the new process in place and the solution working quite well, we have improved the senior management team’s understanding of what is on what server, and re-branded everything within the IT space so that when we have an issue, it can be identified quickly and we know what is being affected.