vSphere's Virtual CPUs - Avoiding the vCPU to pCPU ratio trap


2011 was a year in which, despite the economic constraints, everything Big was seemingly good: Big Data, Big Clouds, Big VMs and so on. Caught up in the industry's lust for this excess, 2011 was also the year I lost count of how many overprovisioned 'Big' production VMs I witnessed. More often than not this was a typical reaction from system admins trying to allay their fears of potential performance problems on important VMs. It was the year I began to hear justifications such as "yes, we are overprovisioning our production VMs... but cost aside, overallocating our available underlying resources to a VM isn't a bad thing; in fact it allows it to scale". It was also the year I lost count of how many times I had to point out that overprovisioning a VM can itself lead to performance problems, specifically when dealing with virtual CPUs.
VMware refers to CPUs as pCPUs and vCPUs. A pCPU, or 'physical' CPU, in its simplest terms refers to a physical CPU core, i.e. a physical hardware execution context (HEC), when hyper-threading is unavailable or disabled. If hyper-threading has been enabled, a pCPU instead constitutes a logical CPU, because hyper-threading enables a single processor core to act like two logical processors. So, for example, an 8-core ESX server with hyper-threading enabled has 16 threads that appear as 16 logical processors, and that constitutes 16 pCPUs.
As for a virtual CPU (vCPU), this refers to a virtual machine's virtual processor and can be thought of in the same vein as the CPU in a traditional physical server. vCPUs run on pCPUs and, by default, virtual machines are allocated one vCPU each. However, VMware has an add-on software module named Virtual SMP (symmetric multi-processing) that allows a virtual machine to access more than one CPU and hence be allocated more than one vCPU. The great advantage of this is that virtualized multi-threaded applications can be deployed on multi-vCPU VMs to support their numerous processes. Instead of being constrained to a single vCPU, SMP enables an application to use multiple processors to execute multiple tasks concurrently, consequently increasing throughput. With such a feature, and all the excitement of being 'Big', many easily assumed that provisioning additional vCPUs could only ever be beneficial - if only it were that simple.

The typical examples I faced entailed performance problems that were being blamed on the storage or the SAN rather than on CPU constraints, especially as overall CPU utilization for the ESX server hosting the VMs was reported as low. Using Virtual Instruments' VirtualWisdom I was able to quickly conclude that the problem was not related to the SAN or storage at all but to the hosts themselves. By historically trending and correlating the vCenter, SAN and storage metrics of the problematic VMs on a single dashboard, it was apparent that the high number of vCPUs assigned to each VM was the cause. This was indicated by a high reading of what is termed the 'CPU Ready' metric.
To elaborate, CPU Ready is a metric that measures the amount of time a VM is ready to run but is waiting for pCPU time, i.e. how long a vCPU has to wait for an available core when it has work to perform. So while CPU utilization may not be reported as high, if the CPU Ready metric is high then your performance problem is most likely CPU related. In the instances I saw, this was caused by customers assigning four, and in some cases eight, vCPUs to each virtual machine. So why was this happening?
VirtualWisdom Dashboard indicating high CPU Ready
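
Before getting to that, it is worth showing how the raw counter is usually read. The sketch below converts the CPU Ready counter that vCenter reports (a summation in milliseconds over a sampling interval) into a per-vCPU percentage. It assumes the default 20-second real-time sample interval and the often-quoted rule of thumb of roughly 5% ready time per vCPU as the point to start worrying; both the interval and the threshold are assumptions to adjust for your own environment.

```python
# Minimal sketch: interpret vCenter's CPU Ready "summation" counter.
# Assumptions: 20 s real-time sample interval and a ~5% per-vCPU rule of thumb;
# adjust both for your own environment and sampling level.

SAMPLE_INTERVAL_MS = 20 * 1000          # default real-time interval (20 seconds)
WARN_THRESHOLD_PCT = 5.0                # often-quoted per-vCPU warning level


def cpu_ready_percent(ready_summation_ms: float, vcpus: int) -> float:
    """Convert the raw CPU Ready summation (ms) into a per-vCPU percentage."""
    total_possible_ms = SAMPLE_INTERVAL_MS * vcpus
    return (ready_summation_ms / total_possible_ms) * 100


# Example: a 4-vCPU VM reporting 6,400 ms of ready time in one 20 s sample.
pct = cpu_ready_percent(6400, vcpus=4)
print(f"CPU Ready: {pct:.1f}% per vCPU")          # 8.0% - above the 5% rule of thumb
if pct > WARN_THRESHOLD_PCT:
    print("vCPUs are queuing for pCPU time - consider right-sizing the VM")
```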

Well, firstly, the hardware and its physical CPU resource is still shared. Coupled with this, the ESX server itself also requires CPU to process storage requests, network traffic and so on. Then add the fact that most organizations sadly still suffer from 'silo syndrome', so there is rarely a clear dialogue between the system admin and the application owner. The consequence is that while multiple vCPUs are great for workloads that support parallelization, they are not for applications that lack built-in multi-threaded structures. And while a VM with 4 vCPUs requires the ESX server to wait for 4 pCPUs to become available, on a particularly busy ESX server hosting other VMs this can take significantly longer than if the VM in question had only a single vCPU.

To explain this further, let's take the example of a four-pCPU host running four VMs: three with 1 vCPU and one with 4 vCPUs. At best, only the three single-vCPU VMs can be scheduled concurrently; the 4-vCPU VM has to wait for all four pCPUs to be idle. In this example the excess vCPUs actually impose scheduling constraints and consequently degrade the VM's overall performance, typically indicated by low CPU utilization but a high CPU Ready figure. With the ESX server scheduling and prioritising workloads according to what it deems most efficient to run, the consequence is that smaller VMs tend to run on the pCPUs more frequently than the larger, overprovisioned ones. So in this instance overprovisioning was in fact detrimental to performance rather than beneficial. In more recent versions of vSphere the co-scheduling of multiple vCPUs and the de-scheduling of idle vCPUs is not as contentious as it used to be. Despite this, the VMkernel still has to manage every vCPU - a complete waste if the VM's application doesn't use them!
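
To make the scheduling constraint concrete, here is a deliberately simplified toy simulation, under strict co-scheduling, of that four-pCPU host with every VM permanently busy. It is not the real vSphere scheduler (which, as noted above, has relaxed co-scheduling these days), but it shows why the wide VM accumulates far more ready time than the single-vCPU VMs: all four cores are rarely free at the same moment.

```python
# Toy strict co-scheduling model of a 4-pCPU host: three 1-vCPU VMs and one
# 4-vCPU VM, all permanently busy. Not the real vSphere scheduler - purely an
# illustration of why wide VMs accumulate CPU Ready time.
import random

PCPUS = 4
QUANTUM_MS = 10
vms = {"vm1": 1, "vm2": 1, "vm3": 1, "big_vm": 4}     # name -> vCPU count
ready_ms = {name: 0 for name in vms}

for _ in range(1000):                                  # simulate ~10 seconds
    free = PCPUS
    order = list(vms)
    random.shuffle(order)                              # arbitrary dispatch order
    for name in order:
        if vms[name] <= free:
            free -= vms[name]                          # all its vCPUs scheduled together
        else:
            ready_ms[name] += QUANTUM_MS               # has to wait: CPU Ready accrues

for name, ms in ready_ms.items():
    print(f"{name}: {ms} ms of CPU Ready over 10,000 ms ({ms / 100:.0f}%)")
```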

To ensure your vCPU to pCPU ratio is at its optimal level, and that you reap the benefits of this great feature, there are some straightforward considerations to make. Firstly, there needs to be dialogue between the silos to fully understand the application's workload prior to VM resource allocation. Where the workload is not known, the key is not to overprovision virtual CPUs but rather to start with a single vCPU and scale out as and when necessary. Having a monitoring platform that can historically trend the performance and workloads of such VMs is also highly beneficial in determining these factors. As mentioned earlier, CPU Ready is a key metric to consider alongside CPU utilization. Correlating this with memory and network statistics, as well as SAN I/O and disk I/O metrics, enables you to proactively avoid bottlenecks, correctly size your VMs and hence avoid overprovisioning. The same thinking extends to how many VMs you allocate to an ESX server and to ensuring that its physical CPU resources are sufficient to meet the needs of your VMs. As businesses' key applications become virtualized, it is imperative that the correct vCPU to pCPU ratio is allocated, whether the workloads are old legacy single-threaded ones or new multi-threaded ones. In this instance size isn't always everything; it's what you do with your CPU that counts.
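
As a sketch of the kind of correlation just described, the snippet below scans a hypothetical export of per-VM statistics and flags VMs whose CPU Ready is high while utilization stays low, the classic signature of too many vCPUs. The field names and thresholds are assumptions, not the actual schema of vCenter or VirtualWisdom; map them to whatever your monitoring platform exposes.

```python
# Illustrative only: flag oversized VMs from an assumed per-VM metrics export.
# Field names and thresholds are hypothetical; map them to whatever your
# monitoring platform (vCenter, VirtualWisdom, etc.) actually provides.

from typing import Dict, List

READY_PCT_HIGH = 5.0        # per-vCPU CPU Ready above this is suspect...
UTIL_PCT_LOW = 40.0         # ...while average utilization stays below this


def flag_oversized(vms: List[Dict]) -> List[str]:
    """Return descriptions of VMs showing high CPU Ready but low CPU utilization."""
    flagged = []
    for vm in vms:
        if vm["ready_pct"] > READY_PCT_HIGH and vm["cpu_util_pct"] < UTIL_PCT_LOW:
            flagged.append(f'{vm["name"]} ({vm["vcpus"]} vCPUs): '
                           f'ready {vm["ready_pct"]}%, util {vm["cpu_util_pct"]}%')
    return flagged


sample = [
    {"name": "web01", "vcpus": 1, "ready_pct": 0.8, "cpu_util_pct": 55.0},
    {"name": "db01",  "vcpus": 8, "ready_pct": 9.5, "cpu_util_pct": 18.0},
]
for line in flag_oversized(sample):
    print("Right-sizing candidate:", line)
```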

Exchange Completion Time - SAN Storage Performance Redefined

Roll back several years and certain vendors would have had you believe that Fibre Channel was dead and that the future was iSCSI. A few years later, certain vendors were declaring that Fibre Channel was dead again and that the future was FCoE. While this blog is not an iSCSI vs FC or FC vs FCoE comparison (there are plenty of good ones out there, and both iSCSI and FCoE have immense merit), the point being made here is that Fibre Channel, unlike Elvis, really is alive and well. Moreover, Fibre Channel still remains the protocol of choice for most mission-critical applications despite the FUD that surrounds its cost, manageability and future existence. Most storage folk who run enterprise-class infrastructures are advocates of Fibre Channel not only because of its high-performance connectivity but also because of its reliability, security and scalability. Incredibly, this is all while the majority of Fibre Channel implementations are vastly underutilized, poorly managed (due to lack of visibility) and running far from an optimized state amid the constant day-to-day operations of most SAN storage administrators. Indeed, if storage folk were empowered with a metric that gave them better insight into their SAN storage's performance and utilization, the so-called impending death of Fibre Channel might have to take an even further rain check. Well, that metric does exist; cue what is termed the "Exchange Completion Time".

It's now common for me to visit customer environments that run Fibre Channel SANs yet have various factions (server, VM, network or storage teams) complaining that they are suffering performance issues due to lack of bandwidth or throughput. In every single instance, FC utilization has actually been incredibly low, with peaks of 10% at most, and that's in 4Gb/s environments, not 8Gb/s! At worst there may be an extremely busy backup server that single-handedly causes bottlenecks and creates the impression that the whole infrastructure is saturated, but even these occasions are rare. What seems to cause this misconception is the lack of clarity between what is deemed throughput and what actually causes bottlenecks and performance slowdowns, i.e. I/O latency.

Sadly (and I am the first to admit that I was once duped too), storage folk have been hoodwinked into accepting metrics that just aren't sufficient for their requirements. Much like the folklore and fables of Santa Claus told to children at Christmas, storage administrators, architects and engineers have been spun a yarn that MB/s and IOPS are somehow an accurate determinant of performance and design decisions. In a world where application owners, server and VM admins are busily speaking the language of response times, storage folk are engrossed in a foreign vocabulary of RAID levels, IOPS and MB/s, followed by numerous calculations to try to correlate the two languages. But what if an application owner could request storage with a 10ms response time and the storage administrator could allocate it with a guarantee of that performance? That would require the storage engineer to look not just at a one-dimensional view from the back end of the storage array but at the comprehensive transaction time, i.e. from the server to the switch port to the LUN. That would mean considering the Exchange Completion Time.

To elaborate, using MB/s as a measurement of performance is almost akin to how people used to count cars as a measurement of road traffic. Harking back to my days as a student, before the high-tech cameras and satellites that now monitor road traffic, I was 'lucky' enough to have a job counting the cars that went through Trafalgar Square at lunchtime. It was an easy job: I'd see five cars and I'd click five times. But it was hardly accurate, because when there was a traffic jam and all of the lanes were occupied I was still clicking five cars. Herein also lies the problem with relying on MB/s as a measurement of performance. As with the car counting, a more accurate approach would have been to watch each individual car and measure its time from origin to destination. In the same vein, to truly measure performance in a SAN storage infrastructure you need to measure how long a transaction takes from being initiated by the host to being received by the storage and acknowledged back to the host, in real time rather than as averages. This is what is termed the Exchange Completion Time.
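
As a simple illustration of the "watch each car" idea, the sketch below computes completion times per exchange from paired timestamps (command initiated by the host, final status acknowledged back). The record format is invented for the example; tools such as VirtualWisdom derive the metric from the actual FC frames seen in the fabric.

```python
# Illustrative Exchange Completion Time calculation from paired timestamps.
# The (initiated, completed) record format is assumed for the example; a real
# fabric-level tool derives these times from the FC frames themselves.

import statistics

# (exchange id, time command initiated by host, time final status acknowledged), in seconds
exchanges = [
    ("0x1a01", 10.0000, 10.0042),
    ("0x1a02", 10.0010, 10.0251),
    ("0x1a03", 10.0020, 10.0063),
]

ect_ms = [(done - start) * 1000 for _, start, done in exchanges]

print(f"Average ECT : {statistics.mean(ect_ms):.2f} ms")
print(f"Worst ECT   : {max(ect_ms):.2f} ms")   # averages alone hide outliers like this one
```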

While many storage arrays have tools that provide information on IOPS and MB/s, to get a better picture of a SAN storage environment and its underlying latency it is also key to consider the number of frames per second. In Fibre Channel a frame is comparable to a word, a sequence to a sentence and an exchange to the conversation. A standard FC frame has a data payload of 2112 bytes, i.e. roughly a 2K payload. So, for example, an application with an 8K I/O requires 4 FC frames to carry that data portion. In this instance one I/O equates to 4 frames, and 100 IOPS of the same size equates to 400 frames per second. Hence, to get a true picture of utilization, looking at IOPS alone is not sufficient, because there is a magnitude of difference between applications and their I/O sizes, some ranging from 2K up to 256K, and with backup applications the I/O sizes can be even larger. It is therefore a mistake not to take frames per second into consideration when trying to measure SAN performance or to identify whether data is being passed efficiently. For example, even if you are witnessing high throughput in MB/s, you may be missing the fact that frames are carrying a minimal data payload and the exchange (the conversation) is failing to complete. This is often the case when there is a slow-draining device, a flapping SFP or similar in the FC SAN, where instead of data frames making up the traffic you have management frames dealing with logins and logouts, loss of sync, or some other optic degradation or physical-layer issue. Imagine the scenario: a storage administrator is measuring the performance of his infrastructure, or troubleshooting a performance issue, and is seeing lots of traffic in MB/s, unaware that many of the environment's transactions are actually being cancelled across the fabric!
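
The frame arithmetic above is easy to generalise. The sketch below assumes roughly 2048 usable data bytes per frame (the 2112-byte maximum payload includes optional headers) and simply scales frames per I/O by the I/O rate; it reproduces the 8K-at-100-IOPS example from the text, and real traffic would of course also carry command and status frames.

```python
# Frames-per-second estimate from I/O size and IOPS.
# Assumes ~2048 usable data bytes per FC frame (the 2112-byte maximum payload
# includes optional headers); real traffic also carries command/status frames.
import math

FRAME_DATA_BYTES = 2048


def data_frames_per_second(io_size_bytes: int, iops: float) -> float:
    frames_per_io = math.ceil(io_size_bytes / FRAME_DATA_BYTES)
    return frames_per_io * iops


print(data_frames_per_second(8 * 1024, 100))    # 8K I/O at 100 IOPS -> 400 frames/s
print(data_frames_per_second(256 * 1024, 100))  # 256K I/O at the same IOPS -> 12,800 frames/s
```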

This lack of visibility into transactions has also led many storage architects to be reluctant to make aggressive use of lower tiers of storage, as poor I/O performance is often attributed to the storage arrays when bottlenecks elsewhere in the storage infrastructure are actually the root cause. Measuring performance via Exchange Completion Times enables measurement and monitoring of storage I/O performance, ensuring that applications can be correlated with, and assigned to, their most cost-effective storage tier without sacrificing SLAs. With many storage vendors adopting automated tiering within their arrays, some would feel this challenge has now been met. The reality of automated tiering, though, is that LUNs or sub-LUNs are only dynamically relocated to different tiers based on the frequency of data access, i.e. frequently accessed data is deemed more valuable and resides on a higher tier, while infrequently accessed data is moved to lower tiers. So while using historical array performance and capacity data may seem a sufficient way to tier, it is still too simplistic and lacks the insight needed for more optimized tiering decisions. Such an approach may have been sufficient to determine optimum data placement in the days of DAS, when the I/O performance bottleneck was disk transfer rate, but in the world of SANs and shared storage, looking only at transfer rates between SSD, Fibre Channel or SATA drives is a detached and inaccurate way to measure the effect of SAN performance on an application's response time. For example, congestion or problems in the SAN can result in severely degraded response times, or in cancelled transactions that fail to be acknowledged by the back end of the array. Furthermore, incorrect HBA queue depths, the difference between sequential and random requests, and link and physical-layer errors all have an impact on response times and in turn on application latency. By incorporating the Exchange Completion Time metric, i.e. measuring I/O conversations across the SAN infrastructure, into your tiering considerations, tiering can be based on comprehensive real-time performance rather than device-specific views.
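
As a hedged sketch of what tiering on measured response time, rather than on a device view, could look like, the snippet below picks the cheapest tier whose observed Exchange Completion Time still meets an application's SLA. The tier names, latencies and SLAs are invented for illustration; real placement decisions would obviously also weigh capacity, cost and change windows.

```python
# Illustrative tier selection driven by measured end-to-end response time.
# Tier names, observed ECT figures and SLAs are all hypothetical examples.

# Cheapest first: (tier name, observed 95th-percentile ECT in ms on that tier)
tiers = [("SATA", 18.0), ("FC 15k", 9.0), ("SSD", 2.5)]


def cheapest_tier_meeting_sla(sla_ms: float) -> str:
    """Return the cheapest tier whose measured ECT meets the requested SLA."""
    for name, observed_ect_ms in tiers:            # ordered cheapest to most expensive
        if observed_ect_ms <= sla_ms:
            return name
    return tiers[-1][0]                            # fall back to the fastest tier


print(cheapest_tier_meeting_sla(10.0))   # app asking for 10 ms -> "FC 15k", not SSD
print(cheapest_tier_meeting_sla(25.0))   # relaxed SLA -> "SATA" is good enough
```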

Monitoring your FC SAN storage environment in a comprehensive manner that incorporates the SAN fabric and provides metrics such as the Exchange Completion Time rapidly changes FC SAN troubleshooting from a reactive to a proactive exercise. It also enables server, storage and application administrators to share a common language of 'response times', eliminating potential silos. With knowledge of application I/O latency down to the millisecond, FC SAN storage administrators can quickly be transformed from the initial point of blame to the initial point of resolution, while also ensuring optimum performance and availability of your mission-critical data.