Let me first confess: I don’t have double-digit years of experience in IT Capacity Management. But over the last 9 years I have defined the process from scratch multiple times. I have seen the capacity management process gain traction, and several times I was part of the journey in which the process matured from chaotic to efficient. I have also witnessed cases where management couldn’t justify the investment in the process and it died a natural death. Over the years I have talked to and interacted with numerous people working in the same or related domains, which helped me understand how others are doing it. I have also been actively involved in interviewing many candidates, which gave me an insight into how mature the process is in their organizations and what challenges they are facing.
All this, I believe, qualifies me to write this post, in which I have highlighted (or tried to) the key challenges an IT Capacity professional faces. To keep it readable, interesting and short, I have restricted myself to the top 5, but that doesn’t mean no other challenges exist. My criterion for picking these was that each one forms a foundation of the process and sits within technical boundaries. Of course, political and cultural aspects, organization structure and similar things also play an important role in success.
I would request readers to comment at the bottom of the post if they have experiences, ideas or stories to share. If you feel there is an aspect which deserves a place in this list, please comment, and after review I will tweak the list.
So let’s start with the list of top 5 challenges.
1) Importance of Capacity Management: You may be surprised to find this at the top of my list, but believe me, it’s a fact. Only a small number of organizations have a dedicated team or dedicated resources responsible for Capacity management, and among those very few actually understand the value that effective Capacity management can bring to the table. What I have observed is that such teams often exist only to support audit/regulatory requirements, or simply as a tick in the box. Most of the time, system admins or incident management teams handle capacity management responsibilities, but in almost all cases in a purely reactive way. Capacity planners often spend their office time providing data to be consumed by the IT service management community, infrastructure people or business teams.
But this is not what Capacity Management is all about. ITIL defines the goal of Capacity Management as:
“The goal of the Capacity Management process is to ensure that cost-justifiable IT capacity in all areas of IT always exists and is matched to the current and future agreed needs of the business, in a timely manner”.
I often say it is also about “Doing more with less”, and we can achieve this only when we use the techniques to their full potential and work in a proactive fashion. Firefighting is part of a capacity management professional’s job, but the true value can be realized only when we work proactively.
Many times I have seen an established, working process die a natural death because management started to feel that it added no value and that capacity-related incidents could be resolved pretty quickly by incident management teams. Part of the problem lies with Capacity Management professionals as well, who on multiple occasions fail to highlight the benefits. I understand that cost has a big role to play in the overall equation, and you can highlight it in either of two categories: cost saving or cost avoidance. Beyond the direct cost, an effective process reduces panic buying (which itself saves cost), escalations and service disruptions. In today’s digital world, where IT is expected to deliver super fast time to service (TTS) and unlimited capacity on demand (a perception created by the Cloud), it is even more important to have effective and efficient capacity management practices.
No story is told better than the one backed by data and evidence. Imagine the achievements being communicated to management in the form of a reduction in incidents, improved utilization of IT infrastructure, and cost avoidance through code optimization, configuration changes, and the release of unused capacity for reuse elsewhere, which reduces the pressure to buy additional capacity.
So the crux is this: it is a challenge to justify the need for a Capacity management process, but when the case is supported by real meat in terms of numbers and metrics, it becomes easier.
2) Bad start (under/over provisioning): This one is an interesting challenge because you inherit the difficulties. You were not involved when the services were designed, the architecture was defined and the provisioning happened. People mostly fear that future capacity upgrade requests will either not be entertained in view of economic challenges or will have to go through a rigorous process that in most cases is too bureaucratic. To play safe with respect to potential performance or capacity issues, they take a defensive approach and ask on day 1 for the capacity of the next couple of years by exaggerating the requirements. If the capacity management process was not consulted or had no say during provisioning, it becomes extremely difficult to reduce the capacity at a later stage. Remember that in the majority of cases the real value of Capacity management is realized by improving capacity utilization: identifying the cold areas in your IT estate, releasing the spare capacity and using it somewhere else where it is actually required.
There is also the other side of the story: not enough thought was given to planning, which resulted in capacity issues during the early life of the service. This might be due to inadequate capacity or to the limited support life of the underlying IT infrastructure. Because of the complexity of changing a running service, it then becomes extremely difficult to upgrade the capacity (scalability issues) or to migrate the service.
I would suggest this can be avoided through proper control mechanisms in the provisioning process. For existing issues, the only option is to bite the bullet and fix them once and for all, provided the effort and cost are offset by the cost of the service disruptions avoided.
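As a rough illustration of how the cold (and hot) areas of an estate mentioned above might be spotted, here is a minimal Python sketch. Everything in it is an assumption made for the example: the server names, the sample values, the use of a 95th-percentile peak, and the 20% / 80% thresholds. Real thresholds depend on the platform and on the service agreements.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of utilization samples (in %)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def classify(peak, cold_below=20.0, hot_above=80.0):
    """Label a component by its peak utilization (thresholds are illustrative)."""
    if peak < cold_below:
        return "cold"   # candidate for releasing spare capacity
    if peak > hot_above:
        return "hot"    # candidate for an upgrade or tuning
    return "ok"

def review_estate(utilization):
    """Map each component name to 'cold', 'hot' or 'ok' using its 95th percentile."""
    return {name: classify(percentile(samples, 95))
            for name, samples in utilization.items()}

if __name__ == "__main__":
    # Hypothetical CPU utilization samples (%) per server.
    estate = {
        "app01": [5, 8, 12, 10, 7],
        "db01": [60, 85, 90, 88, 70],
        "web01": [30, 45, 50, 40, 35],
    }
    for name, label in review_estate(estate).items():
        print(name, label)
```

Using a high percentile rather than the average is a deliberate choice here: an average can hide short peaks, and releasing capacity based on averages alone could create the very incidents the process is meant to prevent.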
3) CMDB: The configuration management database is the heart of effective IT processes, and Capacity management is no exception. What makes it even more important for the Capacity management process is that all the IT configuration items managed by the process should be present in the CMDB along with their latest configuration. Failing that, service stability can potentially be impacted. I have experienced this issue to such an extent that it took 6-8 months to associate all the core IT assets with their services in the CMDB.
I would say this is more of a governance issue. If proper processes are defined and tight controls exist around them, it can be achieved pretty fast. For past sins, a remediation program can be rolled out, which should have management buy-in. Technology can also come in handy, as discovery tools can automatically discover the components and do most of the leg work for you.
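To make the remediation idea a little more concrete, the comparison between what a discovery tool finds and what the CMDB records can be sketched as below. This is only an illustration: the dictionaries stand in for exports from a discovery tool and a CMDB, and the item names and attributes are hypothetical, not any product's actual data model.

```python
def reconcile(discovered, cmdb):
    """Compare discovered configuration items with CMDB records.

    Both arguments map a configuration item name to a dict of attributes.
    """
    found, recorded = set(discovered), set(cmdb)
    return {
        # Running in the estate but unknown to the CMDB.
        "missing_from_cmdb": sorted(found - recorded),
        # Recorded in the CMDB but not seen by discovery.
        "not_discovered": sorted(recorded - found),
        # Present in both, but the recorded configuration is out of date.
        "config_drift": sorted(
            name for name in found & recorded
            if discovered[name] != cmdb[name]
        ),
    }

if __name__ == "__main__":
    # Hypothetical exports.
    discovered = {
        "db01": {"cpu": 16, "ram_gb": 128},
        "app01": {"cpu": 8, "ram_gb": 32},
        "app02": {"cpu": 8, "ram_gb": 32},
    }
    cmdb = {
        "db01": {"cpu": 8, "ram_gb": 64},
        "app01": {"cpu": 8, "ram_gb": 32},
        "old01": {"cpu": 4, "ram_gb": 16},
    }
    print(reconcile(discovered, cmdb))
```

Each of the three lists points to a different kind of follow-up: onboarding, decommissioning checks, and record updates respectively.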
4) Data Issues (resource utilization, workload metrics): This, I guess, is the low-hanging fruit in this list because it is technology dependent. My point here concerns the very basic ingredient of Capacity management – component utilization data and workload metrics: how much is used, who is using it and what is left.
The reason this made it into the list is that in a large IT setup there is a standard way of doing everything. If the standard is to use a particular monitoring toolset, you are at the mercy of that tool being rolled out to every IT component in the estate. This is quite the opposite of small IT shops, where you can exploit native commands/tools to extract utilization data and put it into a repository; after all, the tools do the same thing, just in a more sophisticated way. Workload metrics, such as the average number of transactions per second of a particular type and their resource consumption, can be tricky to extract if no basic monitoring or measurement practices were followed during code development, and this is where tools can make your life easier.
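For the small-shop approach described above, a minimal sketch using only the Python standard library might look like the following. It assumes a Unix-like host (`os.getloadavg` is not available on Windows), and the table name, column names and database file name are made up for the example. A real setup would run this on a schedule and collect far more metrics.

```python
import os
import shutil
import sqlite3
import time

def sample_host(path="/"):
    """Take one utilization sample using native OS facilities (Unix-like only)."""
    load1, _, _ = os.getloadavg()       # 1-minute load average
    disk = shutil.disk_usage(path)      # total/used/free bytes for the filesystem
    return {
        "ts": time.time(),
        "load1": load1,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }

def store_sample(conn, host, sample):
    """Append one sample to a simple utilization table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS utilization "
        "(host TEXT, ts REAL, load1 REAL, disk_used_pct REAL)"
    )
    conn.execute(
        "INSERT INTO utilization VALUES (?, ?, ?, ?)",
        (host, sample["ts"], sample["load1"], sample["disk_used_pct"]),
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("capacity.db")   # file name is an assumption
    store_sample(conn, "localhost", sample_host())
    print(conn.execute("SELECT COUNT(*) FROM utilization").fetchone()[0], "samples stored")
```

Once samples accumulate in such a repository, the questions of how much is used and what is left become simple queries.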
5) Business demand (forward view of workload data): This is about getting workload estimates for the future, the most important ingredient of capacity planning, where we estimate future capacity requirements. What I have observed is that the business either doesn’t share them or shares wrong estimates. For services whose utilization or growth trend is more or less steady or static, this should be fine. But for web-based services, where growth is exponential or seasonal and depends heavily on marketing/sales drives, it becomes vitally important to estimate demand with a reasonable degree of accuracy.
Some of these issues can be resolved by statistically analyzing the historical workload data and forecasting the numbers purely on the basis of past trends. Sharing these numbers back with the business or other relevant teams gives them an additional data source, and in this way the variance between projections and actuals can be reduced. It goes without saying that all of this can happen only when business and IT work in close collaboration and not as two isolated departments.
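As a small illustration of forecasting purely from past trends, here is a sketch that fits a straight line to historical workload by ordinary least squares and projects it forward. The monthly transaction figures are invented for the example, and a straight line is a deliberate simplification: it will not capture the exponential or seasonal growth mentioned above, for which a different model would be needed.

```python
def fit_trend(values):
    """Least-squares straight line through equally spaced observations.

    Returns (slope, intercept), with the first observation at x = 0.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def forecast(values, periods_ahead):
    """Project the fitted trend for the next `periods_ahead` periods."""
    slope, intercept = fit_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(periods_ahead)]

def variance_pct(projected, actual):
    """Signed difference between actual and projected, as a % of the projection."""
    return 100.0 * (actual - projected) / projected

if __name__ == "__main__":
    # Hypothetical monthly transaction volumes (in thousands).
    history = [100, 110, 120, 130]
    projection = forecast(history, 2)
    print("projected next two months:", projection)
    # Later, when the actual figure arrives, compare it with the projection.
    print("variance %:", variance_pct(projection[0], 147))
```

Tracking the variance each period and sharing it with the business is one simple way to start the feedback loop between projections and actuals.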