Chris called me up. He remains a happy man walking around sunny (but cold) Canary Wharf and is now leading the Windows operations team for a multinational. We continue to keep in touch discussing what's going on, and he emailed me a single-line question.
We're building mediums; do you bother with production checks anymore?
A bit of background. Chris' company has implemented a self-service imaging process, so to request a machine you use an automated deployment tool which builds a new machine from their gold image. They can now deploy a virtual machine in about 35 minutes. Chris selects the image (there are only a few, Windows 2003/2008) and the number of CPUs and the amount of memory, and a virtual machine is built from the template. It gets a DHCP address, and when the build is complete he allocates a new IP address and updates DNS.
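That final post-build step (swapping the DHCP lease for a static address and updating DNS) is the one Chris's team is scripting in PowerShell. As a minimal sketch of the idea in Python, assuming a hypothetical static pool and placeholder step names rather than their actual tooling:

```python
import ipaddress

def plan_post_build(hostname, dhcp_ip, pool_start, pool_end, in_use):
    """Pick the first free static IP from the pool and describe the
    post-build reconfiguration: release the DHCP lease, assign the
    static address, then update the DNS A record. Purely illustrative;
    the step names and pool layout are assumptions, not Chris's setup."""
    end = ipaddress.IPv4Address(pool_end)
    used = {ipaddress.IPv4Address(ip) for ip in in_use}
    ip = ipaddress.IPv4Address(pool_start)
    while ip <= end:
        if ip not in used:
            break
        ip += 1
    else:
        raise RuntimeError("static pool exhausted")
    return [
        ("release-dhcp", dhcp_ip),
        ("assign-static", str(ip)),
        ("update-dns-a-record", hostname, str(ip)),
    ]
```

In a real script each tuple would be an actual call out to the OS and the DNS server; returning a plan first makes the step easy to log and dry-run.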
This has resulted in his phrasing around "mediums": he is increasingly building virtual machines with a similar configuration, which the design terms refer to as medium, that is, the mid-range-spec virtual machine. He is therefore, as he jokes, deploying "mediums". With that in mind, and on the basis that it's all templated and automatic (except for the change of IP address, which is apparently being scripted in PowerShell), are the olden-days production checks still necessary?
What checks do you do, I ask?
- Check page file size
- Compare the original helpdesk request with what's been built (that the specs match what was requested)
- Verify the security patches are up to date
- That the inventory has been updated and the machine is registered for backups and with the performance monitoring tools
- Well, shouldn't most of that be scripted?
Yes, he says, we're working on it. What's your take?
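Every one of those checks is a candidate for automation. A rough sketch of what the scripted version might look like, in Python rather than the PowerShell Chris's team is actually using, and with field names and thresholds that are my assumptions rather than theirs:

```python
def run_production_checks(server):
    """Run the manual go-live checks as automated assertions.
    `server` is a dict of facts gathered from the new machine;
    the keys and the 1.5x page-file rule of thumb are illustrative."""
    failures = []

    # 1. Page file size: at least 1.5x RAM (a common rule of thumb).
    if server["pagefile_mb"] < 1.5 * server["memory_mb"]:
        failures.append("page file smaller than 1.5x RAM")

    # 2. Built spec matches the helpdesk request.
    req = server["request"]
    if (server["cpus"], server["memory_mb"]) != (req["cpus"], req["memory_mb"]):
        failures.append("built spec does not match the request")

    # 3. No outstanding security patches.
    if server["missing_patches"]:
        failures.append("%d security patches missing" % len(server["missing_patches"]))

    # 4. Registered in inventory, backups and performance monitoring.
    for system in ("inventory", "backup", "perf_monitoring"):
        if not server["registered_in"].get(system):
            failures.append("not registered in %s" % system)

    return failures  # an empty list means the server passes
```

Wired into the end of the build pipeline, an empty result means go-live; anything else gets raised before the machine is handed over.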
This is a topic which caused me quite a bit of thought. To me it all depends on how you see your server: if it's your central point of failure, the system you are not going to swap out, rebuild or replace unless it is completely unrecoverable, then yes, you need production and verification checks.
For Chris though, who is deploying "mediums" in a disposable-server world (where if one breaks, the app team typically has a resilient solution, and more importantly they can deploy a replacement in 35 minutes), I wonder if they are relevant. We script as much as possible, isolate all the tasks that can be integrated into the image and the post-build activities, and from an operational-risk standpoint make a judgement call balancing delivery (which is key) against operational risk and acceptance. In a disposable world, if we get the image, the rapid provisioning and the automation steps right, we shouldn't need any manual steps; we just need the existing monitoring, security management or policy management tools to verify deviation from operating standards so that it can be addressed.
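That ongoing verification can be as simple as continuously diffing each machine's reported state against the operating standard and handing the deviations to the policy tools. A toy sketch, with the setting names invented for illustration:

```python
def find_drift(standard, observed):
    """Compare a machine's observed settings with the operating
    standard and report every deviation. Both arguments are flat
    dicts of setting name to value; the keys are hypothetical."""
    return {
        key: {"expected": expected, "actual": observed.get(key)}
        for key, expected in standard.items()
        if observed.get(key) != expected
    }
```

Run on a schedule across the estate, an empty result per machine replaces the one-off manual go-live check with continuous assurance.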
Put another way, we shouldn't be innovating on the server deployment process without considering the go-live one. It's deferred success, to say the least, to deliver a server to the end user in 35 minutes and then ask them to wait days or weeks while manual go-live acceptance checks are performed before it's "supported".