Blade Watch is your hub for blade, grid and high performance financial computing news blogged by Martin MacLeod, Blade Consultant. Put this in your feed reader and have a scan every now and then to track what’s cooking around the blade world.
Blade Watch

Archive for January, 2007

How to robocopy

One of the things that comes with network attached storage is copying data. It might be because you’re allocating a new D drive with double the space, or because you’re migrating data from one filer to another.

The key things to watch out for are:

  • Amount of data to copy in MB/GB
  • Number of files
  • Type of files
  • Fragmentation
  • How you’re copying them.

Say you’ve got 100GB to copy. The time taken will depend on a few factors: how busy your network is and what speed it is, but more importantly how fragmented the disk is, how many files there are, how many folders and sub folders they sit in and, crucially, their size.

Windows is rather shocking at handling large numbers of small 1k files, particularly if they are all in sub folders such as d:\data\folder1\folder1.1\folder1.1.1\folder1.1.1.1 etc.

So let’s do the basics. When presented with a new volume, format the disk, taking a close look at the cluster size. If you know the volume is going to hold mainly small 1k files, set the cluster size to 1k; if it’s going to be SQL, you might want to select a larger cluster size. Microsoft covers all this anyway.
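To put rough numbers on the cluster size point, here is a minimal sketch that works out how much disk space a set of files occupies once each file is rounded up to whole clusters. The file counts and sizes are made-up examples, and the calculation ignores file system specifics (for instance, very small files can be stored inside the file table itself), so treat it as an illustration rather than a sizing tool.

```python
def allocated_bytes(file_size, cluster_size):
    """Space a file occupies on disk when rounded up to whole clusters."""
    clusters = (file_size + cluster_size - 1) // cluster_size
    return clusters * cluster_size


def slack_report(file_sizes, cluster_size):
    """Return (data_bytes, allocated_bytes, wasted_bytes) for a list of file sizes."""
    data = sum(file_sizes)
    allocated = sum(allocated_bytes(size, cluster_size) for size in file_sizes)
    return data, allocated, allocated - data


if __name__ == "__main__":
    # Hypothetical workload: 100,000 files of 1 KB each.
    sizes = [1024] * 100_000
    for cluster in (1024, 4096, 65536):
        data, allocated, wasted = slack_report(sizes, cluster)
        print(f"cluster {cluster:>6} bytes: "
              f"{allocated / 2**20:.1f} MB allocated, {wasted / 2**20:.1f} MB slack")
```

The bigger the cluster relative to the typical file, the more of the volume goes on slack rather than data.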

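Log on to either the server you’re copying from or the server you’re copying to (if using terminal services, remember timeout settings might apply).
Map a drive to the other server:

  • net use p: \\servername\sharename

Make sure all the paths are correct, then run your robocopy. The source comes first and the destination second, so the example below, with p: as the source, assumes you are logged on to the destination server and have mapped p: to the server you’re copying from. /s copies sub folders (except empty ones) and /sec copies the files with their security:

  • Robocopy p:\data\copyfiles d:\data\copyfiles /s /sec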

Run a test at a convenient time and see how long it takes. 100GB can take anything from a few hours to a few days, depending on the number and size of the files, whether you’ve got compression turned on and whether the drive is fragmented.
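Before the test run it can help to know what you are dealing with. The sketch below counts files, folders and bytes under a folder and turns them into a very rough time estimate. The throughput and per-file overhead figures are assumptions you would replace with numbers from your own test copy; they are not measurements, and the function names are just for illustration.

```python
import os


def survey(root):
    """Walk a folder tree and return (file_count, folder_count, total_bytes)."""
    files = folders = total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folders += len(dirnames)
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
                files += 1
            except OSError:
                # Skip files that vanish or cannot be read during the walk.
                continue
    return files, folders, total


def estimate_hours(total_bytes, file_count, mb_per_sec, per_file_overhead_sec):
    """Rough copy time: bulk transfer time plus a fixed cost per file."""
    transfer_sec = total_bytes / (mb_per_sec * 1024 * 1024)
    overhead_sec = file_count * per_file_overhead_sec
    return (transfer_sec + overhead_sec) / 3600


if __name__ == "__main__":
    # Assumed figures: 100 GB in two million small files, 20 MB/s sustained,
    # 0.05 s of overhead per file. Use survey(r"d:\data") for real counts.
    hours = estimate_hours(100 * 1024**3, 2_000_000, 20, 0.05)
    print(f"Estimated copy time: {hours:.1f} hours")
```

With lots of small files the per-file overhead term dominates, which is why the same 100GB can take hours or days.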

Any comments? Get in touch…

Vista and the environment

http://green.itweek.co.uk/2007/01/vista_could_cau.html

Microsoft’s long anticipated launch of Windows Vista was somewhat overshadowed today by accusations the new operating system could result in “massive” environmental damage.

The accusations were led by Tony Roberts, chief executive of UK charity Computer Aid International, which provides refurbished second hand computers to education and social projects in the UK and developing world economies. He argued that the stiff hardware requirements needed to run Vista will force many users to upgrade or replace existing PCs that are still perfectly effective.

Excellent article discussing how Vista has left a large number of computers ‘below’ the Microsoft recommended system requirements.

What remains a challenge though is the used computer market: as long as people feel they can get some economic value out of their computer, they will. Maybe if the charities took ownership and publicised the issues, they might get people to give that 6 year old PC away free rather than sell it on eBay for £35.

What’s key is that the PCs are re-used or recycled in an environmentally sound way; you can see online the problems that developing economies are having with our computer waste.

Recycling the network?

http://green.itweek.co.uk/2007/01/used_network_ki.html

But can second hand bargains also work when businesses are looking to buy enterprise class network kit?

Well, there are plenty of firms out there who do successfully buy second hand network kit from enterprises or leasing companies. Some enterprises refresh their kit every three to five years and lease companies pick up the kit after the lease runs out so both can supply a steady stream of second hand equipment to the market.

Some critics would question the reliability of such aged machines, but remember, this kit has typically been sat in a nice, environmentally controlled datacentre, and if it had the right configuration programmed in when it was installed, then the only hands that might have touched it are those of a network technician just plugging in the requisite cables. After that, it’s remote management all the way.

An interesting article and very relevant. Re-using hardware can make sense, particularly for non-production environments. The things to watch are the risk/cost ratio, the availability of parts and replacements, and possibly the power utilization.

Grid market set to expand

http://home.businesswire.com/portal/site/google/index.jsp?ndmViewId=news_view&newsId=20070131005557&newsLang=en

NEW YORK–(BUSINESS WIRE)–DataSynapse, Inc., the global provider of application virtualization software, today announced another solid year of growth in 2006 with revenues and client wins up 60 percent from 2005, as global demand for virtualization technology continues to increase.

According to Gartner Inc.’s 2006 Top 10 Strategic Technologies list, virtualization was a technology most likely to gain widespread adoption this past year. DataSynapse’s market leadership confirmed this industry forecast for growth in emerging virtualization technologies. Expanding its global footprint, DataSynapse’s customer base now spans 13 countries, where leading companies adopted DataSynapse solutions for their application virtualization strategies.

“Performance at the application layer is crucial to IT operations and leading enterprises have found that DataSynapse’s products – GridServer® and FabricServer™ – address the principal challenges at this layer,” said Anne MacFarland, director of data strategies and information solutions at The Clipper Group. “Its tremendous customer success in 2006 illustrated that whether an enterprise was challenged by scale, optimization, heterogeneity, enrichment or complexity, DataSynapse provided tools to address the problems.”

Grid technology remains an area of interest for many organizations. When implemented effectively, the financial and commercial benefits can be significant; this can be seen in particular with risk applications in finance and with applications in the oil industry.

Managing datacenter risk

One of the challenges in co-location is getting sign off from the IT Security team: sign off that the risk has been assessed properly and therefore that the organization isn’t risking its reputation or its data unnecessarily.

So last weekend I was very kindly given a tour of a new datacenter that one of the large corporates has just signed up to.

It’s very nice you know, very new, very efficient. I was shown the security (I think they were hoping I’d sign up):

  1. Hand scanners for the front entrance
  2. Security cameras on the roof
  3. Eye scanners available
  4. Security cameras at every door
  5. Very thick walls able to withstand something shocking
  6. 24/7 security
  7. Individual datacenter keys for each business
  8. Lights out until you enter the datacenter
  9. 24/7 monitoring of the datacenter with escalation to your ops team
  10. Dual power feeds
  11. Dual backup power - batteries and generator
  12. Eight different communications suppliers coming into the site
  13. Fire proof doors
  14. Fire proof operations center to monitor your datacenter

As I walked around this facility, two thoughts came to mind:

  • Fantastic security, just what most banks would want
  • How much is hosting without items 1, 3, 4, 5 and 14?

The large corporates need to hand over day-to-day datacenter management to someone else: it’s not their core business, and it’s expensive and time consuming. At the same time, I can imagine our IT security team walking around, notepad in hand, writing down how secure and fantastic it was (look, hand scanners, brilliant). But over the 3 years I’ve committed to this datacenter, how much is that costing me?

There is a cost associated with security. By all means, the Financial Services Authority mandates that our IT infrastructure is secure; however, the typical setup in an office building is a couple of video cameras and a door pass. There needs to be a risk vs. cost assessment: for grid engines, which are raw calculation boxes, the data tends to be near real time and of very little value after the event.

IT security needs to be risk averse within a logical, cost based approach: assess the risk and the exposure, get the risk signed off, next.

Grid engine - export logs

Another script, this one to export the event logs, copy the grid log files, name the files and put them in a folder named after the server:

In this case the script is run against a server list, servers.txt: it is called once for each server in the file, and the server name is passed in as the first argument, %1. It uses the psloglist tool from Sysinternals (microsoft.com).

  • rem %1 is the server name passed in from servers.txt
  • tasklist /s %1
  • rem export the application and system event logs, named after the server
  • psloglist \\%1 application -s > %1_app.csv
  • psloglist \\%1 system -s > %1_sys.csv
  • md logs
  • md logs\work
  • md logs\temp
  • move %1_app.csv logs
  • move %1_sys.csv logs
  • rem copy the grid engine work files and temp files
  • xcopy /E "\\%1\c$\Program Files\DataSynapse\Engine\work" logs\work
  • xcopy /E \\%1\c$\windows\temp\06S* logs\temp
  • rem rename the folder to the server name
  • ren logs %1
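If you want to check what will be run before touching any servers, a dry run helps. The sketch below is in Python rather than batch and is only an illustration: it parses a server list and prints the two psloglist command lines from the script above for each server, with the output going to files named after the server. It does not execute anything, and the server names in the demo are hypothetical.

```python
def parse_servers(text):
    """Return server names from the contents of servers.txt, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def export_commands(server):
    """Build the two psloglist command lines for one server (not executed)."""
    return [
        f"psloglist \\\\{server} application -s > {server}_app.csv",
        f"psloglist \\\\{server} system -s > {server}_sys.csv",
    ]


if __name__ == "__main__":
    # Hypothetical contents of servers.txt.
    sample = "gridnode01\ngridnode02\n\n"
    for server in parse_servers(sample):
        for command in export_commands(server):
            print(command)
```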

Reset grid engine, and move to new gridserver

Script to reset a grid engine - please note your configuration will differ so things might need adapted: 

In this case the script is run against a server list, servers.txt: it is called once for each server in the file, and the server name is passed in as the first argument, %1.

It does the following:

  • Export tasklist to a file, then rename it to the name of the server.
  • Stop the datasynapse service.
  • Kill processes.
  • Delete configuration files/log files and copy over new files.
  • Start the datasynapse service to register to the other grid.

Excuse us, our server’s having an emotional outburst

So Chris called me up, he’d been called in at short notice to discuss a problem an application team were having. It had become clear that the current server used for this overnight batch is not up to the job and is running low on capacity in disk, memory and CPU.

Anyway, Chris made the suggestion:

“Well the server is about 6 years old, upgrading it wouldn’t make sense, we could either virtualize it or order a replacement physical server.”

He was surprised when the reply was:

“There’s no budget. We’re not using vmware for this system.”

This is where Chris asked me:

“What does he mean no budget?”

“But we’re a bank?” I explained. “We’re only talking about £3500 with all the bits, surely there’s money somewhere? What’s the issue?”

What you find in these situations is the following. The application team/business line have not allocated (or have chosen not to make available) funds to buy a new server. Therefore there is no budget at this point in time.

So the meeting continues, and the application team ask what IT is going to do about it.

Chris replies:

“Well we can certainly see if we can improve the system configuration, but that’s about it, we might have some memory about, but not 4GB.”

The manager then mentioned that unfortunately they can’t buy a replacement server for them.

What we’re being stopped by here is in theory “the budget”. In practice however, it is short term thinking, looking after MY own costs, MY budget as requested.  The rather trivial issue of willingness rather than ability to pay is the delay here.

In situations like these you might say IT should pick up the cost and keep everyone happy, but they are behaving exactly as an outsourced provider would. “You want an additional server, sure, it’s £3500, now how would you like to pay?”

Yes, costs matter, but so does delivery to the end user.

Quick tips to managing a helpdesk queue

The different corporates all use different helpdesk logging systems, whether it’s Remedy, Assyst, Axio or Eden. How you manage your queue from a service delivery viewpoint can be important to the way you manage your team, and to how the team is perceived by your management and the end users.

How do you manage success and failure within the system?
What measurements are in place?

There are typical ones that are put in place to measure time to complete and quality of workload:

  • Time taken to complete
  • Number of times call is chased by user
  • Number of times call is passed to another team
  • Number of calls re-opened (not completed as requested by the user)
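As a minimal sketch of how these four measurements could be pulled together for a queue, the example below uses a made-up record layout; the field names are assumptions for illustration and do not come from Remedy, Assyst or any other helpdesk system.

```python
from statistics import mean
from typing import NamedTuple


class Call(NamedTuple):
    """One closed call. Field names are hypothetical."""
    hours_to_complete: float
    times_chased: int
    times_passed_on: int
    reopened: bool


def queue_metrics(calls):
    """Summarize the four typical measurements for a list of closed calls."""
    if not calls:
        raise ValueError("no calls to measure")
    return {
        "average_hours_to_complete": mean(c.hours_to_complete for c in calls),
        "total_chases": sum(c.times_chased for c in calls),
        "total_handoffs": sum(c.times_passed_on for c in calls),
        "reopen_rate": sum(1 for c in calls if c.reopened) / len(calls),
    }


if __name__ == "__main__":
    # Made-up sample data.
    sample = [
        Call(2.0, 0, 0, False),
        Call(4.0, 2, 1, True),
        Call(30.0, 1, 2, False),
    ]
    for name, value in queue_metrics(sample).items():
        print(f"{name}: {value}")
```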

Typically one person will ‘own’ incidents; can we not have someone ‘owning’ requests too? Requests often stack up when they aren’t that difficult to resolve, particularly if they are the standard kind of thing: create a share on the filer, allocate more storage to a share.

When managing a call queue, there are two key things to think about:

  • Managing failure
  • Managing success

Managing failure

Nobody likes ‘failure’, but dealing with it in the right way is crucial. Find out why the call was ‘breached’: was the call logged incorrectly, was it waiting for sign off, or was nobody able to get hold of the user?

What steps can you take within your world to minimize failure? Can you improve communication within the team? You never get the best results from your team through fear or by holding people accountable in public. Not in the long term anyway.

Managing success

So the team consistently exceeded the SLAs appointed to it, how do we reward it? Yes, it’s their job, but doing little things can change a working environment for the better at very little cost:

  • Large box of biscuits?
  • Round of coffees from Starbucks?
  • 24 pack of diet coke?

Even if it became sort of expected, £3.95 (the cost of 24 diet coke cans) is a very affordable way of maintaining your SLA, maintaining a pleasant working environment and on a simple level showing that the workload is appreciated.

Can I log a call to…

In the olden days, the typical IT call logging system consisted of requests and incidents. The ‘change process’ came along for changes to production systems, then some companies brought in the problem request.

Defining what we mean by this on a simple level is important. If a call is logged incorrectly, from the user viewpoint it might get closed or dealt with incorrectly; from the IT perspective, I might get fined for missing my SLA, my targets for resolving issues.

  • An incident is a call that requires immediate attention: it is a time-critical call which describes the issue affecting a user and requests that it is fixed.
    • The resolution time might be minutes for a business critical incident.
  • A request is a call that requests an action is performed; it might be important but the resolution time might be days or weeks.
    • It might be something like a new share on the filer, or a new laptop for oncall, or even a server install.
  • A change is a request to change something, for example change the memory configuration in an application or upgrade a server’s operating system.
    • Changes usually only apply to production systems, to monitor production and prevent unnecessary impact to the business users.
  • A problem tackles ongoing ‘problems’: issues that might get temporarily resolved but need fixed in the long term.
    • For example the trading server needs rebooted every day at 8pm due to a memory leak. It’s got a temporary fix, to reboot it, but on a permanent basis we need support to resolve this with the application team if necessary.

How do you manage these different requests?
What is an acceptable time delay in resolving an issue?

They are equally important, but at the same time you want the support teams to focus on incidents, on keeping everything working first, then dealing with requests.
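One simple way to express "incidents first, then requests" is to sort the queue by call type while keeping calls of the same type in the order they were logged. The sketch below does that. Putting incidents ahead of requests follows the point above; where changes and problems sit in the order is purely an assumption, as are the record layout and the sample calls.

```python
from typing import NamedTuple

# Lower number = looked at sooner. Incidents before requests follows the post;
# the positions of "change" and "problem" are assumptions for illustration.
PRIORITY = {"incident": 0, "request": 1, "change": 2, "problem": 3}


class LoggedCall(NamedTuple):
    """One call in the queue. Field names are hypothetical."""
    kind: str
    summary: str


def work_order(calls):
    """Return calls with incidents first; ties keep their logged order."""
    # sorted() is stable, so calls of the same kind stay in queue order.
    return sorted(calls, key=lambda call: PRIORITY[call.kind])


if __name__ == "__main__":
    queue = [
        LoggedCall("request", "new share on the filer"),
        LoggedCall("incident", "trading server not responding"),
        LoggedCall("problem", "memory leak needs a permanent fix"),
        LoggedCall("incident", "user cannot log on"),
    ]
    for call in work_order(queue):
        print(f"{call.kind:<9} {call.summary}")
```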
