Neeraj Bhatia's Blog

March 1, 2015

Top Challenges in the Life of an IT Capacity Manager

Filed under: Capacity Management — neerajbhatia @ 16:42

Let me first confess. I don’t have double-digit years of experience in IT Capacity Management. But over the last 9 years I have defined the process from scratch multiple times. I have seen the capacity management process gain traction. Several times I was part of the journey in which the process matured from chaotic to efficient. I have also witnessed cases where management couldn’t justify the investment in the process and it died a natural death. Over the years I have talked to numerous people working in the same or related domains, which helped me understand how others are doing it. I have also been actively involved in interviewing many candidates, which gave me insight into how mature the process is in their organizations and what challenges they face.

All this makes me eligible to write this post, in which I have highlighted (or tried to) the key challenges an IT Capacity professional faces. To keep it readable, interesting and short, I have restricted myself to the top 5, but that doesn’t mean no other challenges exist. My criterion for picking them was that each forms part of the foundation of the process and lies within technical boundaries. Of course, political and cultural aspects, organization structure and similar factors also play an important role in success.

I would request readers to comment at the bottom of the post if they have experiences, ideas or stories to share. If you feel there is an aspect which deserves a place in this list, please comment; after review I will tweak the list.

So let’s start with the list of top 5 challenges.

1) Importance of Capacity Management: You may be surprised to find this at the top of my list, but believe me, it’s a fact. Among the very small number of organizations that have a dedicated team or resources responsible for Capacity management, very few actually understand the value that effective Capacity management can bring to the table. What I have observed is that the reason for their existence is often to support audit/regulatory requirements, or just to tick a box. Most of the time, system admins or incident management teams deal with capacity management responsibilities, but in almost all cases in a purely reactive way. Capacity planners often spend their office time providing data to be consumed by the IT service management community, infrastructure people or business teams.

But this is not what Capacity Management is all about. ITIL defines the goal of Capacity Management as:

“The goal of the Capacity Management process is to ensure that cost-justifiable IT capacity in all areas of IT always exists and is matched to the current and future agreed needs of the business, in a timely manner”.

I often say it is also about “doing more with less”, and we can achieve this only when we use the techniques to their full potential and work in a proactive fashion. Firefighting is part of a capacity management professional’s job, but the true value can only be realized when we work proactively.

Many times I have seen an established, working process die a natural death because management starts to believe that it adds no value and that capacity-related incidents can be resolved pretty quickly by incident management teams. Part of the problem lies with Capacity Management professionals as well, who on multiple occasions fail to highlight the benefits. I understand cost has a big role to play in the overall equation, and here you can highlight the benefit in either of two categories: cost saving or cost avoidance. Other than cost, an effective process can reduce panic buying (which itself saves cost), escalations and service disruptions. In today’s digital world, where there are great expectations from IT for super-fast time to service (TTS) and unlimited capacity on demand (a perception created by the Cloud), it is even more important to have effective and efficient capacity management practices.

No story is better told than one backed by data and evidence. Imagine communicating the achievements to management in the form of a reduction in incidents, improved utilization of IT infrastructure, and cost avoidance through code optimization, configuration changes, and the release of unused capacity for reuse elsewhere, which reduces the pressure to buy additional capacity.

So the crux is: it is a challenge to justify the need for a Capacity management process, but with real meat in terms of numbers and metrics it becomes easier.

2) Bad start (under/over provisioning): This one is an interesting challenge because you inherit the difficulties. You were not involved when the services were designed, the architecture was defined and the provisioning happened. People mostly fear that future capacity upgrade requests will either not be entertained in view of economic challenges or will go through a rigorous process that in most cases is too bureaucratic. To play safe with respect to potential performance or capacity issues, they take a defensive approach and exaggerate the requirements, asking on day 1 for the capacity of the next couple of years. If the capacity management process was not consulted or had no say during provisioning, it becomes extremely difficult to reduce the capacity at a later stage. Remember that in the majority of cases the real value of Capacity management is realized by improving capacity utilization: identifying cold areas in your IT estate, releasing the spare capacity and using it somewhere else where it is actually required.

This is also related to the other side of the story: not enough thought was given to planning, which resulted in capacity issues during the early life of the service. This might be due to inadequate capacity or the limited support life of the underlying IT infrastructure. Because of the complexity involved in changing running services, it becomes extremely difficult to upgrade the capacity (scalability issues) or migrate the service.

I would suggest this can be avoided with proper control mechanisms in the provisioning process. For existing issues, the only option is to bite the bullet and fix them once and for all, provided the effort and cost of the fix are offset by the cost of the service disruptions it avoids.
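To make the idea of finding “cold areas” a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `ServerUsage` record, the 30% peak threshold and the 60% target utilization are hypothetical and should be replaced with whatever your own estate and risk appetite dictate.

```python
import math
from dataclasses import dataclass


@dataclass
class ServerUsage:
    """Hypothetical record: one server and its peak CPU use over a review period."""
    name: str
    allocated_cpus: int
    peak_cpu_pct: float


def find_cold_servers(servers, peak_threshold_pct=30.0, target_util_pct=60.0):
    """Return (name, reclaimable_cpus) for servers whose peak stays below the threshold.

    The retained capacity is sized so that the observed peak would sit at or
    below target_util_pct of what is left. Both percentages are illustrative.
    """
    candidates = []
    for s in servers:
        if s.peak_cpu_pct >= peak_threshold_pct:
            continue  # busy enough at peak, leave it alone
        cpus_used_at_peak = s.allocated_cpus * s.peak_cpu_pct / 100.0
        keep = max(1, math.ceil(cpus_used_at_peak / (target_util_pct / 100.0)))
        reclaimable = s.allocated_cpus - keep
        if reclaimable > 0:
            candidates.append((s.name, reclaimable))
    return candidates


# Made-up estate, purely for illustration
estate = [
    ServerUsage("app01", allocated_cpus=16, peak_cpu_pct=10.0),
    ServerUsage("db01", allocated_cpus=8, peak_cpu_pct=75.0),
    ServerUsage("web01", allocated_cpus=4, peak_cpu_pct=25.0),
]
cold = find_cold_servers(estate)
```

A real exercise would look at memory, storage and I/O as well, and at a long enough period to cover month-end or seasonal peaks, before releasing anything.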

3) CMDB: The configuration management database is the heart of effective IT processes, and Capacity management is no exception. What makes it even more important for the Capacity management process is that all the IT configuration items managed by the process should be in the CMDB along with their latest configuration. Failing that, service stability can potentially be impacted. I have experienced this issue to the extent that it took 6-8 months to associate all the core IT assets with their services in the CMDB.

I would say this is more of a governance issue. If proper processes are defined and tight controls exist around them, it can be achieved pretty fast. For past sins, a remediation program can be rolled out, which should have management buy-in. Technology can also be handy: discovery tools can automatically find the components and do most of the legwork for you.

4) Data Issues (resource utilization, workload metrics): This, I guess, is the low-hanging fruit in this list because it is technology dependent. My point here concerns the most basic ingredient of Capacity management, component utilization data and workload metrics: how much is used, who is using it and what is left.

The reason this qualified for the list is that in a large IT setup there is a standard way of doing everything. If the standard is to use a particular monitoring toolset, you are at the mercy of that tool being rolled out to each IT component in the estate. This is quite the opposite of small IT shops, where you can exploit native commands/tools to extract utilization data and put it into a repository; after all, even tools do the same, only in a more sophisticated way. Workload metrics, such as the average number of transactions per second of a particular type and their resource consumption, can be tricky to extract in certain situations if no basic monitoring or measurement practices were followed during code development, and this is where tools can make your life easier.
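As an illustration of the small-shop approach of pulling utilization data with native facilities and putting it into a repository, here is a minimal sketch using only the Python standard library. The table name `utilization`, its columns and the file name `capacity.db` are my own assumptions, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os
import shutil
import socket
import sqlite3
import time


def collect_sample(path="/"):
    """Take one utilization sample from the local host (Unix-like systems only)."""
    disk = shutil.disk_usage(path)
    load1, _load5, _load15 = os.getloadavg()  # 1, 5 and 15 minute load averages
    return {
        "ts": int(time.time()),
        "host": socket.gethostname(),
        "load1": load1,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }


def store_sample(conn, sample):
    """Append a sample to the (hypothetical) utilization table."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS utilization "
            "(ts INTEGER, host TEXT, load1 REAL, disk_used_pct REAL)"
        )
        conn.execute(
            "INSERT INTO utilization VALUES (:ts, :host, :load1, :disk_used_pct)",
            sample,
        )


if __name__ == "__main__":
    conn = sqlite3.connect("capacity.db")  # assumed repository file
    store_sample(conn, collect_sample())
    conn.close()
```

Scheduled from cron every few minutes, something like this already answers “how much is used and what is left” for a handful of servers. It says nothing about who is using it, which is where per-process or per-transaction measurement comes in.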

5) Business demand (forward view of workload data): This is about getting workload estimates for the future. It is the most important ingredient of capacity planning, where we estimate future capacity requirements. What I have observed is that the business either doesn’t share it or shares wrong estimates. For services whose utilization or growth trend is more or less steady or static, this may be fine. But for web-based services, where growth is exponential or seasonal and depends heavily on marketing/sales drives, it becomes vitally important to estimate demand with a reasonable degree of accuracy.

Some of these issues can be resolved by statistically analyzing the historical workload data and forecasting the numbers purely from past trends. Sharing these numbers back with the business or other relevant teams gives them an additional data source, and this way the variance between projections and actuals can be reduced. It goes without saying that all this can happen only when business and IT work in close collaboration and not as two isolated departments.
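Forecasting purely from past trends can start as simply as fitting a straight line to the history. The sketch below does an ordinary least-squares fit in plain Python; the function names and the monthly figures are made up for illustration. A straight line suits steady growth only, so a seasonal or exponential workload of the kind mentioned above would need a different model.

```python
def fit_linear_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Needs at least two observations.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, periods_ahead):
    """Extend the fitted trend line for the next periods_ahead periods."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(periods_ahead)]


# Made-up monthly peak transactions per second
history = [100, 110, 120, 130]
next_two_months = forecast(history, 2)  # [140.0, 150.0]
```

Numbers like these are the ones worth sharing back with the business: they are a starting point for the conversation about marketing drives and launches, not a replacement for it.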

7 Comments

  1. Neeraj – This is a very nicely worded description of the challenges faced by an IT Capacity Planner. I guess you have covered most of the challenges in the above Top 5 list.

    The only input I’d like to provide here is for Point 4 – Data Issues (resource utilization, workload metrics).

    You can incorporate somewhere in this point the significance of the CMIS (as defined by the ITIL standard). It’s very important for the Capacity Management team to establish a mechanism to store their performance or workload data in a common repository, and for some historical period too. This period may vary from 12 to 18 months depending on the organisation.

    The CMIS indeed forms a baseline for the Capacity Planner to work with historical performance/business data, prepare analyses and also project the forward usage of the IT infrastructure.

    Appreciate your article. Keep it up!

    Cheers
    Mahendra

    Comment by Mahendra Hukkeri, March 2, 2015 @ 15:43

  2. Good article

    Comment by Deven Puri, March 3, 2015 @ 16:18

  3. Very nice Neeraj!!

    Comment by Kapil Goyal, May 13, 2015 @ 06:21

  4. Hi Neeraj, this is a nice piece of work to which I would like to add some commentary. First, I will give my background. I have over 30 years of experience in IT operational areas for a variety of organizations. Most recently I worked for 19 years for what has become the world’s largest provider of financial transactions. I led the capacity management efforts for this organization for over 10 years. Two years ago I joined TeamQuest. At TeamQuest, we provide solutions in Capacity Planning and Performance Management. We are also very focused on helping our customers move up the maturity curve.

    1) Re: the Importance of Capacity Management. I would add a few points. First, the performance data (which you cover in point 4) needed for capacity planning is also needed for performance management. Performance management is often involved with the daily ITIL disciplines of problem management and incident management. Get involved with performance management. There are a lot of natural synergies between capacity and performance. Second, find an early win in a capacity issue. Go above and beyond in identifying and resolving the issue. You will gain an ally (the owner of the problem you resolve). Build on this success. Third, be right in your predictions. Prove you are right. We have seen this work with many of our customers.

    2) Get involved in your organization’s performance testing process. If they don’t have one, get one started. If there is an established process, find out what the focus is. Often performance testing centers around completing x transactions in y time without regard to how many resources are used. If you can get them to also measure resources used, you can have a huge impact on the business. Consider an application which rolls out a new feature and plans to offer it for free to existing customers and it is not expected to attract any new customers. It is a customer satisfaction issue they are trying to resolve. The new feature consumes so many resources that you will need to upgrade the infrastructure 3 months after the new feature is rolled out rather than the 2 years you are currently planning on. Armed with this information, the business can make better decisions such as charging for the new feature or leaving the feature in development until they can figure out a way to reduce the resource consumption associated.

    3) CMDB is critically important to Capacity Management, as you point out. Grouping IT resources into their associated business services is a fundamental building block of a capacity management process for the business, as opposed to one which just services IT. However, this is also a 2-way street. Often the metrics (for example hardware configurations) collected as part of the capacity management process are more accurate than other sources. Feed the metrics from the capacity management practice back into the CMDB.

    4) What I would add here, as Mahendra pointed out, is that there are a couple of facets to consider: data collection and data storage. Make sure you collect enough data (what interval you are collecting at and whether you are collecting process-level data are but 2 examples) and store it for the appropriate length of time. Very detailed data (say 1-minute or even 1-second samples) is needed for a few days, as is process-level data. After that, you can aggregate the data to larger sample intervals, say 15 minutes for a couple of months and then an hour for 12 to 18 months.

    5) Find out what business metrics are available and map them to resource utilization. You may find a metric correlation you didn’t know existed.

    Comment by John Miecielica, July 3, 2015 @ 03:33

    • Thanks John for evolving it further.

      Comment by neerajbhatia, July 4, 2015 @ 11:27

  5. Good insight and well written.

    Comment by Anees, July 4, 2015 @ 11:18

  6. Very nice article. I totally agree with all the points, especially point 1, where you mention “Many times Capacity planners spend their office time in providing data to be consumed by IT service management community, Infrastructure people or business teams”. This is so true, and to avoid that one must act and prove to other teams how Capacity management can help in stabilizing the IT infrastructure and smoothing business operations.

    Comment by Manoj, February 8, 2016 @ 12:17

