vSphere's Virtual CPUs - Avoiding the vCPU to pCPU ratio trap


2011 was a year in which, despite the economic constraints, everything Big was seemingly good: Big Data, Big Clouds, Big VMs etc. Caught in the industry's lust for this excess, 2011 was also the year I lost count of how many overprovisioned 'Big' production VMs I witnessed. More often than not this was the typical reaction of System Admins trying to allay their fears of potential performance problems on important VMs. It was the year I began to hear justifications such as "yes, we're overprovisioning our production VMs... but cost aside, overallocating the underlying resources to a VM isn't a bad thing; in fact it makes the VM scalable". It was also the year I lost count of the number of times I had to point out that overprovisioning a VM can itself lead to performance problems - specifically when dealing with Virtual CPUs.
VMware refers to CPU as pCPU and vCPU. A pCPU or 'physical' CPU in its simplest terms refers to a physical CPU core, i.e. a physical hardware execution context (HEC), if hyper-threading is unavailable or disabled. If hyper-threading has been enabled then a pCPU constitutes a logical CPU. This is because hyper-threading enables a single processor core to act like two processors, i.e. logical processors. So, for example, an 8-core ESX server with hyper-threading enabled would have 16 threads that appear as 16 logical processors, and that would constitute 16 pCPUs.
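
The pCPU arithmetic is simple enough to sanity check in a few lines; a trivial sketch, nothing vSphere-specific:

```python
# A trivial sketch of the pCPU arithmetic above: with hyper-threading
# enabled each core exposes two hardware execution contexts, and each
# of those logical processors counts as a pCPU.
def pcpus(cores: int, hyperthreading: bool) -> int:
    return cores * (2 if hyperthreading else 1)

print(pcpus(8, hyperthreading=True))   # 16 pCPUs, as in the example above
print(pcpus(8, hyperthreading=False))  # 8 pCPUs
```
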
As for a virtual CPU (vCPU), this refers to a virtual machine's virtual processor and can be thought of in the same vein as the CPU in a traditional physical server. vCPUs run on pCPUs and, by default, virtual machines are allocated one vCPU each. However, VMware have an add-on software module named Virtual SMP (symmetric multi-processing) that allows virtual machines access to more than one CPU and hence to be allocated more than one vCPU. The great advantage of this is that virtualized multi-threaded applications can be deployed on multi-vCPU VMs to support their numerous processes. So instead of being constrained to a single vCPU, SMP enables an application to use multiple processors to execute multiple tasks concurrently, consequently increasing throughput. With such a feature and all the excitement of being 'Big', many easily assumed that provisioning additional vCPUs could only ever be beneficial - if only it were that simple.

The typical examples I faced entailed performance problems that were being blamed on the Storage or the SAN rather than CPU constraints, especially as overall CPU utilization for the ESX server hosting the VMs would be reported as low. Using Virtual Instruments' VirtualWisdom I was able to quickly conclude that the problem was not related to the SAN or Storage at all but to the hosts themselves. By historically trending and correlating the vCenter, SAN and Storage metrics of the problematic VMs on a single dashboard, it was apparent that the high number of vCPUs assigned to each VM was the cause. This was indicated by a high reading of what is termed the 'CPU Ready' metric.
To elaborate, CPU Ready is a metric that measures the amount of time a VM is ready to run but waiting for a pCPU, i.e. how long a vCPU has to wait for an available core when it has work to perform. So while CPU utilization may not be reported as high, if the CPU Ready metric is high then your performance problem is most likely related to CPU. In the instances I saw, this was caused by customers assigning four, and in some cases eight, vCPUs to each Virtual Machine. So why was this happening?
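
It's worth knowing how to turn the raw number into something meaningful: vCenter reports CPU Ready as a summation value in milliseconds per sampling interval, so converting it to a percentage of the interval makes VMs comparable. A minimal sketch (the 20-second interval matches vCenter's real-time charts; longer rollups just substitute their own interval length):

```python
# Convert vCenter's CPU Ready summation (milliseconds per sampling
# interval) into a percentage of that interval, per vCPU.
def cpu_ready_percent(ready_ms: float, interval_seconds: int = 20) -> float:
    return (ready_ms / (interval_seconds * 1000.0)) * 100.0

# 2,000 ms of ready time in a 20 s real-time sample = 10%, a level at
# which most sizing guides would already expect visible sluggishness.
print(cpu_ready_percent(2000))  # 10.0
```
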
VirtualWisdom Dashboard indicating high CPU Ready

Well, firstly the hardware and its physical CPU resource are still shared. Coupled with this, the ESX Server itself also requires CPU to process storage requests, network traffic etc. Then add the fact that sadly most organizations still suffer from the 'silo syndrome', so there still isn't a clear dialogue between the System Admin and the Application owner. The consequence is that while multiple vCPUs are great for workloads that support parallelization, the same is not true for applications that don't have built-in multi-threaded structures. And while a VM with 4 vCPUs requires the ESX server to wait for 4 pCPUs to become available, on a particularly busy ESX server with other VMs this could take significantly longer than if the VM in question had only a single vCPU.

To explain this further, let's take the example of a four pCPU host that has four VMs: three with 1 vCPU and one with 4 vCPUs. At best, only the three single-vCPU VMs can be scheduled concurrently; the 4-vCPU VM would have to wait for all four pCPUs to be idle. In this example the excess vCPUs actually impose scheduling constraints and consequently degrade the VM's overall performance, typically indicated by low CPU utilization but a high CPU Ready figure. With the ESX server scheduling and prioritising workloads according to what it deems most efficient to run, the consequence is that smaller VMs will tend to run on the pCPUs more frequently than the larger overprovisioned ones. So in this instance overprovisioning was in fact proving detrimental to performance as opposed to beneficial. Now, in more recent versions of vSphere the scheduling of different vCPUs and the de-scheduling of idle vCPUs is not as contentious as it used to be. Despite this, the VMkernel still has to manage every vCPU - a complete waste if the VM's application doesn't use them!
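
The effect is easy to reproduce with a toy model. The sketch below is emphatically not the real ESX scheduler (which has used relaxed co-scheduling since ESX 3.x), but it shows why, under strict co-scheduling, a wide VM on a small host accrues far more ready time than its single-vCPU neighbours:

```python
# Toy strict co-scheduler: a VM is dispatched in a tick only if ALL of
# its vCPUs can land on free pCPUs at once; otherwise it accrues ready time.
import random

PCPUS = 4
vms = {"vm1": 1, "vm2": 1, "vm3": 1, "big": 4}   # name -> vCPU count
ready_ticks = {name: 0 for name in vms}

random.seed(42)
for tick in range(10_000):
    free = PCPUS
    for name in random.sample(list(vms), len(vms)):  # random dispatch order
        if random.random() > 0.7:                    # VM idle ~30% of the time
            continue
        if vms[name] <= free:
            free -= vms[name]                        # dispatched this tick
        else:
            ready_ticks[name] += 1                   # wanted to run, had to wait

print(ready_ticks)  # the 4-vCPU VM accumulates by far the most ready time
```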

To ensure your vCPU to pCPU ratio is at its optimal level and that you reap the benefits of this great feature, there are some straightforward considerations to make. Firstly, there needs to be dialogue between the silos to fully understand the application's workload prior to VM resource allocation. Where the workload isn't known, it's key not to overprovision virtual CPUs but rather to start with a single vCPU and scale out as and when necessary. Having a monitoring platform that can historically trend the performance and workloads of such VMs is also highly beneficial in determining these factors. As mentioned earlier, CPU Ready is a key metric to consider alongside CPU utilization. Correlating these with Memory and Network statistics, as well as SAN I/O and Disk I/O metrics, enables you to proactively avoid bottlenecks, correctly size your VMs and hence avoid overprovisioning. The same thinking extends to how many VMs you allocate to an ESX Server and to ensuring that its physical CPU resources are sufficient to meet the needs of your VMs. As businesses' key applications become virtualized, it is imperative that the correct vCPU to pCPU ratio is allocated, whether the workloads are old legacy single-threaded ones or new multi-threaded ones. In this instance size isn't everything; it's what you do with your CPU that counts.

Exchange Completion Time - SAN Storage Performance Redefined

Roll back several years and certain vendors would have had you believe that Fibre Channel was dead and that the future was iSCSI. A few years later, certain vendors were declaring that Fibre Channel was dead again and that the future was FCoE. So while this blog is not an iSCSI vs FC or FC vs FCoE comparison list (there are plenty of good ones out there, and both iSCSI and FCoE have immense merit), the point being made here is that Fibre Channel, unlike Elvis, really is alive and well. Moreover, Fibre Channel still remains the protocol of choice for most Mission Critical Applications despite the FUD that surrounds its cost, manageability and future existence. Most Storage folk who run Enterprise-class infrastructures are advocates of Fibre Channel not only because of its high performance connectivity infrastructure but also due to its reliability, security and scalability. Incredibly, this is all while the majority of Fibre Channel implementations are vastly underutilized, poorly managed (due to lack of visibility) and running in a far from optimized state amid the constant day-to-day operations of most SAN Storage administrators. Indeed, if Storage folk were empowered with a metric that enabled them to gain a better insight and understanding of their SAN Storage's performance and utilization, the so-called impending death of Fibre Channel might have to take an even further rain check. Well, that metric does exist; cue what is termed the "Exchange Completion Time."

It's now common for me to visit customer environments that run Fibre Channel SANs yet have various factions, whether Server, VM, Network or Storage teams, complaining that they are suffering performance issues due to lack of bandwidth or throughput. In every single instance FC utilization has actually been incredibly low, with peaks of 10% at most, and that's in 4Gbps environments, never mind 8Gbps! At worst there may be an extremely busy backup server that single-handedly causes bottlenecks and creates the impression that the whole infrastructure is saturated, but even these occasions are rare. What seems to be the cause of this misconception is the lack of clarity between what is deemed throughput and what is the actual cause of bottlenecks and performance slowdowns, i.e. I/O latency.

Sadly (and I am the first to admit that I was also once duped), Storage folk have been hoodwinked into accepting metrics that just aren't sufficient to meet their requirements. Much like the folklore and fables of Santa Claus told to children at Christmas, storage administrators, architects and engineers have been spun a yarn that MB/s and IOPS are somehow an accurate determination of performance and a sound basis for design decisions. In a world where application owners, server and VM admins are busily speaking the language of response times, Storage folk are engrossed in a foreign vocabulary that revolves around RAID levels, IOPS and MB/s, and then in numerous calculations to try to correlate the two languages. But what if an application owner could request Storage with a 10ms response time and the Storage Administrator could allocate it with a guarantee of that performance? That would entail the Storage engineer looking not just at a one-dimensional view from the back end of the Storage Array but at the comprehensive transaction time, i.e. from the Server to the Switch port to the LUN. That would mean considering the Exchange Completion Time.

To elaborate, using MB/s as a measurement of performance is almost akin to how people used to count cars as a measurement of road traffic. Harking back to my days as a student, before all of the high-tech cameras and satellites that now monitor road traffic, I was 'lucky' enough to have a job counting the number of cars that went through Trafalgar Square at lunchtime. It was an easy job: I'd see five cars and I'd click five times. But this was hardly accurate, because when there was a traffic jam and all of the lanes were occupied I was still clicking five cars. Herein also lies the problem with relying on MB/s as a measurement of performance. As with the car counting, a more accurate way would have been to watch each individual car and measure its time from origin to destination. In the same vein, to truly measure performance in a SAN Storage infrastructure you need to measure how long a transaction takes from being initiated by the host to being received by the storage and acknowledged back at the host, in real time as opposed to averages. This is what is termed the Exchange Completion Time.
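
Conceptually the measurement is straightforward, even if doing it at line rate across thousands of ports is not. A minimal sketch of the idea, with illustrative field names rather than any real VirtualWisdom or switch API:

```python
# Pair each I/O's initiating command with its closing status as observed
# on the fabric, and record the elapsed time per exchange (the ECT).
from dataclasses import dataclass

@dataclass
class FrameEvent:
    exchange_id: int
    kind: str         # "CMD" for the initiating command, "STATUS" for completion
    timestamp: float  # seconds, as captured at the analyzer

def exchange_completion_times(events: list[FrameEvent]) -> dict[int, float]:
    opened: dict[int, float] = {}
    ect: dict[int, float] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "CMD":
            opened[ev.exchange_id] = ev.timestamp
        elif ev.kind == "STATUS" and ev.exchange_id in opened:
            ect[ev.exchange_id] = ev.timestamp - opened.pop(ev.exchange_id)
    return ect  # anything left in `opened` never completed - also worth alerting on

events = [FrameEvent(1, "CMD", 0.000), FrameEvent(2, "CMD", 0.001),
          FrameEvent(1, "STATUS", 0.004), FrameEvent(2, "STATUS", 0.051)]
print(exchange_completion_times(events))  # exchange 1: ~4 ms, exchange 2: ~50 ms
```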

While many storage arrays have tools that provide information on IOPS and MB/s, to get a better picture of a SAN Storage environment and its underlying latency it's also key to consider the number of Frames per second. In Fibre Channel a Frame is comparable to a word, a Sequence to a sentence and an Exchange to the conversation. A standard FC Frame has a data payload of 2112 bytes, i.e. roughly a 2K payload. So, for example, an application with an 8K I/O will require 4 FC Frames to carry that data portion; in this instance 1 I/O equates to 4 Frames, and 100 IOPS of the same size to 400 Frames. Hence, to get a true picture of utilization, looking at IOPS alone is not sufficient, because there is a magnitude of difference between particular applications and their I/O sizes, with some ranging from 2K up to 256K - and with backup applications the I/O sizes can be even larger. It is therefore a mistake not to take the number of Frames/sec into consideration when trying to measure SAN performance or identify whether data is being passed efficiently. For example, even while witnessing high throughput in MB/s you may be missing the fact that frames are carrying a minimal payload of data and the Exchange (the conversation) is failing to complete. This is often the case when there's a slow-draining device, flapping SFP etc. in the FC SAN network, where instead of data frames making up the traffic you have a number of management frames dealing with issues such as logins and logouts, loss of sync, or some other optic degradation or physical layer issue. Imagine the scenario: a Storage Administrator is measuring the performance of his infrastructure, or troubleshooting a performance issue, and is seeing lots of traffic via MB/s - unaware that many of the environment's transactions are actually being cancelled across the Fabric!
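
The frame arithmetic above generalizes easily; a quick sketch of how frame counts scale with I/O size:

```python
# Why IOPS alone understate fabric load: the number of data frames per
# I/O scales with I/O size, since a standard FC frame carries at most
# 2112 bytes of payload (~2 KB).
import math

FC_PAYLOAD = 2112  # bytes of payload per full FC data frame

def data_frames(io_size_bytes: int) -> int:
    return math.ceil(io_size_bytes / FC_PAYLOAD)

for io_kb in (2, 8, 64, 256):
    frames = data_frames(io_kb * 1024)
    print(f"{io_kb:>4} KB I/O -> {frames:>4} frames per I/O, "
          f"{frames * 100:>6} frames/sec at just 100 IOPS")
```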

This lack of visibility into transactions has also led to many storage architects being reluctant to aggressively use lower tiers of storage, as poor I/O performance is often attributed to the storage arrays when bottlenecks in the storage infrastructure are actually the root cause. Measuring performance via Exchange Completion Times enables measurement and monitoring of storage I/O performance, hence ensuring that applications can be correlated and assigned to their most cost-effective storage tier without sacrificing SLAs. With many Storage vendors adopting automated tiering within their arrays, some would feel this challenge has now been met. The reality of automated tiering, though, is that LUNs or sub-LUNs are only dynamically relocated to different tiers based on the frequency of data access, i.e. frequently accessed data is deemed more valuable so should reside on a higher tier, while infrequently accessed data should be moved to lower tiers. So while using historical array performance and capacity data may seem a sufficient way to tier, it is still too simplistic and lacks the insight for more optimized tiering decisions. Such an approach may have been sufficient to determine optimum data placement in the days of DAS, when the I/O performance bottleneck was disk transfer rate, but in the world of SANs and shared storage, looking only at external transfer rates between SSD, Fibre Channel or SATA drives is a detached and inaccurate way to measure the effect of SAN performance on an application's response time. For example, congestion or problems in the SAN can result in severely degraded response times or cancelled transactions that fail to be acknowledged by the back end of the array. Furthermore, incorrect HBA queue depths, the difference between sequential and random requests, and link and physical layer errors all have an impact on response times and in turn on application latency. By incorporating the Exchange Completion Time metric, i.e. measuring I/O conversations across the SAN infrastructure, into your tiering considerations, tiering can be accurately based on comprehensive real-time performance as opposed to device-specific views.

Monitoring your FC SAN Storage environment in a comprehensive manner that incorporates the SAN fabric and provides metrics such as the Exchange Completion Time rapidly changes FC SAN troubleshooting from a reactive to a proactive exercise. It also enables Server, Storage and Application administrators to share a common language of 'response times', thus eliminating any potential silos. With knowledge of application I/O latency down to the millisecond, FC SAN Storage administrators can quickly be transformed from the initial point of blame to the initial point of resolution, while also ensuring optimum performance and availability of your mission critical data.

Demystifying the Cloud: IaaS, PaaS, SaaS, MaaS, CaaS & XaaS


Generally IT folk, whether in Storage, Virtualization, Change Management or Project Management, love the use of acronyms and synonyms to express key concepts amongst each other. What other industry would allow an individual to spout a line such as "Have SOX seen the BCP and CAB approval for our VDC's DR SAN and will this then be added to the CMDB by CoB today?" without immediately flinching or bringing in a logopaedics specialist for help? More often than not, IT folk have also used these synonyms and acronyms as smokescreens to prevent outsiders from realizing "well, this IT stuff is actually quite easy to understand and quite straightforward".


Hence it's no surprise that when the seemingly simple concept of Cloud Computing took off, so did an abundance of acronyms and synonyms, spawning a new breed of I.T. professionals who were the only ones who could correctly understand them, i.e. 'The Cloud Specialist'. Despite this, the beauty of the Cloud (or, as most people are starting to realise, the synonym for the Internet) is that it encompasses not only the IT industry and their business demands but also the average end user whose only experience with IT is their iPhone and its App Store. So while EMC's extensive airport advertising may have initially confused a lot of tourists into thinking that the 'Journey to the Cloud' was a slogan for an up-and-coming budget airline, the general public are certainly now becoming aware of 'The Cloud'. End users are now bombarded with Clouds: Microsoft claiming that Windows 7 is your 'Path to the Cloud', Pizza Restaurants offering free access to 'the Cloud' and Apple iPhone owners having iCloud enforced upon them (no comment on the security issues of your email contacts and personal photos being uploaded to Apple's database). So while the ideas of Public, Private and Hybrid Clouds become more familiar and understood even amongst the masses, it's with surprise that I often find people within the IT industry who are still unaware or unsure of Cloud Service acronyms such as IaaS, PaaS, SaaS, MaaS, CaaS or XaaS.


To understand why there are so many acronyms around the Cloud, it is important to appreciate that the Cloud comprises a number of service models, which these acronyms classify. The first of these, IaaS (Infrastructure as a Service), is when the consumer does not deal with the infrastructure; instead, responsibility for the equipment is outsourced to the Service Provider. The Service Provider not only owns the equipment but is also responsible for its running and maintenance, with the consumer charged on a 'pay as you use' basis. IaaS is often offered as a horizontally integrated service that includes not only the server and storage but also the connectivity domains. For example, while the consumer may deploy and run their own applications and operating systems, the IaaS provider would typically provide the replication, backup and archiving (Storage), the powerful computing requirements (Server) and the network load balancing and firewalls (Connectivity domains).
PaaS (Platform as a Service) provides the capability for consumers to have applications deployed without the burden and cost of buying and managing the hardware and software. In other words, these are either consumer-created or acquired web applications or services that are entirely accessible from the Internet. Usually created with programming languages and tools supported by the service provider, these web applications give the consumer control over the deployed applications and, in some circumstances, the application-hosting environment, but without the complexity of the infrastructure, i.e. the servers, operating systems or storage. Offering a quick time to market and services that can be provisioned as an integrated solution over the web, PaaS facilitates immediate business requirements such as application design, development and testing at a fraction of the normal cost.
Software as a Service (SaaS) is the ability for a consumer to use on-demand software provided by the service provider via a thin client device, e.g. a web browser, over the Internet. With SaaS the consumer has no management or control of the infrastructure such as the storage, servers, network or operating systems, and also no control over the application's capabilities beyond limited user-specific configuration. Evolved from what were originally referred to as Application Service Providers (ASPs), SaaS is a quick and efficient delivery model for key business applications such as customer relationship management (CRM), enterprise resource planning (ERP), HR and payroll.


Monitoring as a Service (MaaS) is at present still an emerging piece of the Cloud jigsaw, but an integral one for the future. In the same way that businesses realised that their infrastructure and key applications required monitoring tools to proactively eliminate downtime risks, Monitoring as a Service provides the option to offload a large majority of those costs by having monitoring run as a service, as opposed to a fully invested in-house tool. For example, by logging onto a thin client or a central web-based dashboard hosted by the service provider, the consumer can monitor the status of their key applications regardless of location. Add the advantages of an easy setup and purchasing process, and MaaS could be a key pay-as-you-use model for de-risking applications that are initially being migrated to the Cloud.


Communication as a Service (CaaS) enables the consumer to utilize Enterprise-level VoIP, VPNs, PBX and Unified Communications without the costly investment of purchasing, hosting and managing the infrastructure. With the service provider also responsible for the management and running of these services, the consumer has the further advantage of not requiring their own trained personnel, bringing significant OPEX as well as CAPEX savings.


Finally, XaaS or 'anything as a service' is the delivery of IT as a Service through hybrid Cloud computing, and is a reference to either one or a combination of Software as a Service (SaaS), Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Communications as a Service (CaaS) or Monitoring as a Service (MaaS). XaaS is quickly emerging as a readily recognized term as services that were previously separated on either private or public Clouds become transparent and integrated.


So as the term ‘The Cloud' finally breaks into the minds of the masses and takes meaning, the next phase will be to take the numerous services that are offered by the Cloud, mature them and enable consumers to fully understand their benefits. From Enterprise to SMB to end users, Cloud Services will inevitably bring immense benefits and cost savings. All that is now required is for consumers to know what all those unnecessarily complicated acronyms mean!

vSphere 5 & VAAI Demand Radical Changes to Storage Arrays


The launch of vSphere 5 and its new storage-related features will set the precedent for a complete rethink of how a new datacenter's storage infrastructure should be designed and deployed. vSphere 5's launch is not only an unabashed attempt at cornering every single aspect of the server market but also a result of the growing need for methodical scalability that merges the I.T. silos and consequently combines the information of applications, servers, SANs and storage into a single comprehensive stack. In an almost ironic shift back towards the original principles of the mainframe, VMware's influence has already led vendors such as EMC with their VMAX and HDS with their VSP to adopt a scale-out as opposed to a scale-up approach to Storage. With this direction at the forefront of most storage vendors' roadmaps for the foreseeable future, it dictates a criterion far beyond the storage capacity requirement I.T. departments are traditionally used to. With such considerations, the end of the traditional Storage Array could be sooner than we think as a transformation takes place in the makeup of future datacenters.

With the emergence of the Private Cloud, Storage Systems are already no longer being considered or accepted by the majority of end users as 'data banks' but rather as 'on demand global pools' of processors and volumes. End users have finally matured into accepting dynamic tiering, thin / dynamic provisioning, wide striping, SSDs etc. as standard for their storage arrays as they demand optimum performance. What will dictate the next phase is that the enterprise storage architecture will no longer be accepted as a model of unused capacity but rather as one that facilitates further consolidation and on-demand scalability. Manoeuvres in this direction have already taken place, as some vendors have adopted sub-LUN tiering (in which LUNs are now merely containers) and 2.5 inch SAS disks. The next key shift towards this datacenter transformation is the new integration of the server and storage stack brought about by vSphere 5's initiatives, such as VAAI, VASA, SDRS, Storage vMotion etc.

Leading this revolution is Paul Maritz, who upon becoming the head of VMware set his vision clear: rebrand VMware from a virtualization platform to the essential element of the Cloud. In doing that, VMware needed to showcase their value beyond the Server Admin's milieu. Introduced with vSphere 4.1, the vStorage APIs for Array Integration (VAAI) were the first major shift towards the comprehensive Cloud vision that incorporated the Storage stack. Now, with further enhancements in vSphere 5, VAAI compatibility will be an essential feature for any Storage Array.

VAAI consists of several features, named primitives, aimed at driving this change forward, the first being the Full Copy primitive. Essentially, when a copy of data occurs with VMware, whether via VM cloning or Storage vMotion, the ESX server is responsible for first reading every single block that has to be copied and then writing it back to the new location. Of course this places a fundamental load on the ESX server. So, for example, when deploying a 100GB VM from a template, the entire 100 GB has to be read by the vSphere host and then subsequently written, requiring a total of 200 GB of I/O.


Instead of this host-intensive process, the Full Copy primitive operates by issuing a single SCSI command called XCOPY. An XCOPY is sent for a contiguous set of blocks and is in essence a command instructing the storage to copy each block from one logical block address to another. Hence a significant load is taken off the ESX server and instead placed upon the Storage array, which can easily deal with the copy operations, resulting in very little I/O between the host and the array.
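
To make the saving concrete, here is a toy model (not real SCSI, and the extent size is an arbitrary assumption) contrasting the two paths for the 100GB template deployment described above:

```python
# Toy model: without VAAI every block crosses the fabric twice
# (array -> host on READ, host -> array on WRITE); with Full Copy the
# host sends one small XCOPY per contiguous extent and the array moves
# the data internally.
def host_based_copy_mb(size_mb: int) -> int:
    return size_mb * 2                 # read it all, then write it all

def xcopy_commands(size_mb: int, extent_mb: int = 4096) -> int:
    return -(-size_mb // extent_mb)    # ceiling division: one command per extent

vm_mb = 100 * 1024                     # the 100 GB template from the example
print(f"host-based copy: {host_based_copy_mb(vm_mb):,} MB over the fabric")  # 204,800 MB
print(f"VAAI Full Copy: ~{xcopy_commands(vm_mb)} XCOPY commands; data stays on the array")
```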

The second primitive, Block Zeroing, is also related to the virtual machine cloning process, which in essence is merely a file copy process. When a VM is copied from one datastore to another, all of the files that make up that VM are copied. So for a 100 GB virtual disk file with only 5 GB of data, there would be blocks that are full as well as empty ones with free space, i.e. where data is yet to be written. Any cloning process would entail not just I/O for the data but also numerous repetitive SCSI commands to the array for each of the empty blocks that make up the virtual disk file.

Block Zeroing instead removes the need to send these redundant SCSI commands from the host to the array. By simply informing the storage array of which blocks are zeros, the host offloads the work to the array without having to send commands to zero out every block within the virtual disk.

The third VAAI primitive is named Hardware Assisted Locking. The original implementation of VMFS used SCSI reservations to prevent data corruption when several servers shared a LUN; what typically occurs in such situations without VAAI are what are termed SCSI reservation conflicts. A normal part of the SCSI protocol, SCSI reservations give exclusive access to a LUN so that competing devices do not cause data corruption. This especially used to occur during VMotion transfers and metadata updates in earlier versions of ESX, and caused serious performance problems and poor response times as devices had to wait for access to the same LUN. Although these reservations typically last less than 1ms, many of them in rapid succession can cause a performance plateau for VMs on that datastore.


Hardware Assisted Locking is, in essence, the elimination of this LUN-level locking based on SCSI reservations. How it works is that the ESX server initially makes a first read of the lock. If the lock is free, the server sends a Compare and Write command containing not only the lock data the server wants to place into the lock but also the original free contents of the lock. The storage array then reads the lock again and compares the current data in the lock to what is in the Compare and Write command. If they are found to be the same, the new data is written into the lock. All of this is treated as a single atomic operation and is applied at the block level (not the LUN level), making parallel VMFS updates possible.
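
The primitive is essentially an atomic compare-and-swap applied to an on-disk lock record. A minimal sketch of the semantics (a Python stand-in, with a mutex modelling the atomicity the array guarantees, not actual SCSI):

```python
# Atomic test-and-set: the initiator supplies both the value it expects
# the lock to contain and the value it wants to write; the array applies
# the swap only if the expectation still holds.
import threading

class LockSector:
    """Stands in for the on-disk lock record; the array serializes access."""
    def __init__(self):
        self._value = b"FREE"
        self._mutex = threading.Lock()  # models the array's atomicity

    def compare_and_write(self, expected: bytes, new: bytes) -> bool:
        with self._mutex:
            if self._value == expected:
                self._value = new
                return True      # lock acquired in one round trip
            return False         # another host won; retry without any SCSI reservation

lock = LockSector()
print(lock.compare_and_write(b"FREE", b"HOST-A"))  # True  - host A gets the lock
print(lock.compare_and_write(b"FREE", b"HOST-B"))  # False - host B must retry
```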

With the advent of vSphere 5, enhancements with VAAI have also come about for infrastructures deploying array-based thin-provisioning. Thin-provisioned LUNs have brought many benefits but also challenges such as the monitoring of space utilization and the reclamation of dead space.
To counter this, the VAAI Thin Provisioning primitive has an Out-of-Space Condition that monitors space usage on thin-provisioned LUNs. Via advance warnings, users can prevent themselves from catastrophically running out of physical space.

Another aspect of this primitive is what is termed Dead Space Reclamation (originally coined as SCSI UNMAP). This provides the ability to reclaim blocks of thin-provisioned LUNs whenever virtual disks are deleted or migrated to different datastores.

Previously, when the deletion of a snapshot or a VM, or a Storage vMotion, took place, VMFS would delete the file, sometimes merely reallocating filesystem pointers rather than issuing a SCSI WRITE ZERO command to zero out the blocks. This would leave blocks previously used by the virtual machine still reported as in use, resulting in the array providing incorrect information with regard to storage consumption.

With vSphere 5 this has changed: now, instead of a SCSI WRITE ZERO command or a reallocation of VMFS pointers, a SCSI UNMAP command is used. This enables the array to release the specified Logical Block Addresses back to the pool, leading to better reporting and monitoring of disk space consumption as well as the reclamation of unused blocks.
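
A toy model makes the accounting problem obvious; nothing below is a real array interface, it simply tracks which logical blocks the thin pool believes are in use:

```python
# Why UNMAP fixes the array's space accounting: deleting a VMDK only
# updates filesystem pointers, so without an UNMAP the thin pool never
# learns that the blocks are free again.
class ThinPool:
    def __init__(self):
        self.allocated: set[int] = set()  # LBAs backed by physical space

    def write(self, lbas: set[int]):
        self.allocated |= lbas

    def unmap(self, lbas: set[int]):
        self.allocated -= lbas            # blocks returned to the free pool

    def reported_usage(self) -> int:
        return len(self.allocated)

pool = ThinPool()
vmdk_blocks = set(range(1000))
pool.write(vmdk_blocks)
# VM deleted: VMFS frees its pointers...
print(pool.reported_usage())   # 1000 - the array still thinks the space is used
pool.unmap(vmdk_blocks)        # ...but only an UNMAP reclaims the dead space
print(pool.reported_usage())   # 0
```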

Back in 2009, Paul Maritz boldly proclaimed the dawning of 'a software mainframe'. With vSphere 5 and VAAI, part of that strategy aims at transforming the role of the Storage Array from a monolithic box of capacity to an off-loadable virtual resource of processors that helps ESX servers perform better. Features such as SRM 5.0, SDRS and VASA (details on these to follow soon in upcoming blogs) are aimed at further enhancing this push. At VMworld 2011, Maritz proudly stated that a VM is born every 6 seconds and that there are more than 20 million VMs in the world, with more than 5.5 vMotions per second. With such serious figures, whether you're deploying VMware or not, it's a statement that would be foolish to ignore when budgeting and negotiating for your next Storage Array.

Silos will prevent Tier 1 Apps reaching the Cloud

On a recent excursion to a tech event I had the pleasure of meeting a well-known ‘VM Guru’, (who shall remain nameless). Having read some of this individual’s material I was excited and intrigued to know his thoughts on how he was tackling the Storage challenges related to VMware especially with Fibre Channel SANs.

“Storage, that’s nothing to do with me, I’m a VirtGuy”, he proudly announced.

To which I retorted, “yes but if there are physical layer issues in your SAN fabric, or poorly configured Storage etc. it will affect the performance of your Virtual Machines and their applications, hence surely you also need some visibility and understanding beyond your Server’s HBAs?”

Seemingly annoyed with the question, he answered, “Why? I have SAN architects and a Storage team for that, it’s not my problem. I told you I’m a VirtGuy, I have my tools so I can check esxtop, vCenter etc…” as he then veered off into glorious delusions of grandeur of how he’d virtualized more servers than I’d had hot dinners. As fascinating as it was to hear him, it was at this point that my mind was side tracked into realizing that despite all the industry talk of ‘unified platforms’, ‘Apps, Servers & Storage as a Service’ i.e. the Cloud, the old challenge of bridging the gap between silos still had a long way to go.

Let's face it, Virtualization and the Cloud have brought unprecedented benefits, but they've also brought challenges. One such challenge that is dangerously being overlooked is that of the silos that exist within most IT infrastructures. Indeed, it's the silos that have led to the new phenomenon coined 'The Virtual Stall'. The Virtual Stall was never an issue several years ago, as Virtualization was happily adopted by Application owners to consolidate many of their 'Crapplications' that meant little or nothing to them and certainly didn't carry the burden of an SLA. Storage teams were none the wiser as VM admins requested large capacities of storage for their VMFS, and despite the odd performance problem no one was too bothered, as these VMs rarely hosted Tier 1 Apps. With the advent of VDI, large VM backups and critical applications such as Exchange and SQL being virtualized, the ordeal of maintaining performance took root, resulting in the inevitable 'blame game' between silos. Fast forward to today and, despite all the talk of Private Clouds, the fear of potential performance degradation resulting from the virtualization of mission critical applications has led to the 'Virtual Stall'.

Business and Management have been convinced of the benefits of consolidation, reduction in data footprint, power/cooling etc. that they initially saw with the virtualization of their low-tier applications. This has led them to want more of the same for higher-end applications, leading to what many organizations are terming a 'VMware First' policy. Under pressure from them, the silo of the application owners still doesn't have a true understanding of server virtualization and hence is reluctant for its Tier 1 apps to be migrated from their physical platforms. At best they may accept two mission critical VMs on a physical server. Under pressure to prove the Application owners wrong and maintain the performance of virtualized applications, the silo of the VMware administrators will often over-provision from their pool of Memory, CPU and storage resources. Furthermore, the VM Admin silo also lacks a real understanding of Storage and at best will think in terms of capacity for their VMFS stores, while Storage Admins will think in terms of IOPS. As this lack of understanding and communication between the silos exists and grows, so too do the challenges of making the most of the benefits of server virtualization.

One of the key mistakes is that it's often overlooked that, whether on a virtualized or non-virtualized platform, application performance is heavily affected by its underlying storage infrastructure. The complexity of correctly configuring storage in accordance with application demands can range from deciding the right RAID level, number of disks per LUN and array cache sizes to the correct queue depth and fan-in / fan-out ratio. These and other variables can drastically influence how I/O loads are handled and ultimately how applications respond. With virtualized environments the situation is no different, with storage-related misconfigurations often being at the root of the VMware infrastructure problems that inadvertently affect performance.

Even with the option of Raw Device Mapping, the alternative VMware storage configuration, VMFS is often preferred due to its immediate advantages in terms of provisioning and zoning. With this method several Virtual Machines are able to access the same LUN or a pool of LUNs, which is far simpler than the one-to-one mapping of a LUN to each Virtual Machine that the RDM option requires. Additionally, this makes backups far easier, as only the VMFS for the given Virtual Machines need be dealt with instead of numerous individual LUNs mapped to many Virtual Machines. VMFS volumes can be as big as 2TB and, with the concatenation of additional partitions, termed VMFS extents, can grow as large as 64TB, i.e. 32 extents. A Storage Admin unaware of such distinctions within VMware can easily also be unaware of the best practices with extents, such as creating them on new physical LUNs to provide additional LUN queues and avoid throughput congestion. Coupled with this, if extents are not assigned the same RAID and disk type you quickly fall into a quagmire of horrendous performance problems. In fact, it can be pointed out that the majority of VMware performance problems are actually initiated at the beginning of the provisioning process, or even earlier at the design phase, and are a result of the distance between the silos.

As mentioned already, application owners will pressure VM administrators to overprovision Memory and CPU to avoid any potential application slowdowns, while the VM administrator will falsely think of Storage along the lines of capacity for their VMFS. At best a VM Admin may request the RAID level and the type of Storage, e.g. 15K RPM FC disks, but it is here that the discrepancy arises for the Storage administrator. The Storage Admin, used to provisioning LUNs on the basis of application requirements, will be thinking not of capacity but rather in terms of IOPS and RAID levels. Eventually though, as there is no one-to-one mapping and the requested LUN is merely to be added to a VMFS, the storage administrator, not wishing to be the bottleneck of the process, will proceed to add the requested LUN to the pool. Herein lies the source of a lot of eventual performance problems, as overly busy LUNs begin to affect all of their associated virtual machines as well as those that share the same datastore. Moreover, if the LUN is part of a very busy RAID group on the back end of the storage array, such saturated I/O will impact all of the related physical spindles and hence all of the LUNs that share them. What needs to be appreciated is that the workload of individual applications presented to individual volumes is significantly different from that of multiple applications consolidated onto a single VMFS volume. The combined I/Os of multiple applications, even if individually sequential, will push the Storage array to deal with the requests as random, thus requiring different RAID level, LUN layout and cache capacity considerations than those for individual applications.

Once these problems exist, there is a customary troubleshooting procedure that VM and Storage administrators often follow, drawing on the metrics found in vCenter, esxtop, vscsiStats, IOMeter, Solaris IOSTAT, PerfMon and the Array management tool. This somewhat laborious process usually includes measuring the effective bandwidth and resource consumption between the VM and storage, moving to and using other paths between the VMs and storage, and even reconfiguring cache and RAID levels. To have even got to this point, days if not weeks would have been spent checking for excessive LUN and RAID group demands, understanding the VMFS LUN layout on the back end of the storage's physical spindles, and investigating the array's front end, cache and processor utilization as well as bottlenecks on the ESX host ports. Some may even go to the lengths of playing around with the Queue Depth settings, which without accurate insight is at best a guessing game based on rules of thumb. Despite all of these measures there is still no guarantee that the performance issues will be identified or eliminated, leaving VMware to be erroneously blamed as the cause, or the application to be deemed 'unfit' to be virtualized. Ironically, so many of these problems could have been proactively avoided had there been better understanding and communication between the silos in the design and provisioning phases.

While it could be argued that Application, Server / VM and Storage teams all have their own expertise and should stick to what they know, in today's unified Cloud-driven climate, remaining in a bat cave of ignorance justified by the knowledge that you're an expert in your own field is nothing short of disastrous. Application owners, VMware and Storage Admins have to sit and communicate with each other and destroy the first barrier erected by silos, i.e. the lack of knowledge sharing. This does not require that a Storage Admin set up a DRS cluster or a VM Admin start provisioning LUNs, but it does mean that as projects roll out, a common understanding of the requirements and the challenges should be established. As the technology brings everything into one stack, with vStorage APIs, VAAI and terminology such as 'orchestration' describing single management panes that allow you to provision VMs and their Storage with a few clicks, the need for the 'experts' of each field to sit and share their knowledge has never been greater. Unless the challenge of breaking the silos is addressed, we could be seeing Kate Bush's premonition of Cloudbursting sooner than we think.


The True Optimum Queue Depth for VMware / vSphere

An array's Queue Depth, in its most basic terms, is the physical limit of exchanges that can be open on a storage port at any one time. The Queue Depth setting on the HBA specifies how many exchanges can be sent to a LUN at one time. Generally, most VM Admins leave their Queue Depth settings at the manufacturer's default, with only the requirement to accommodate a small number of I/O-intensive VMs/servers leading them to make an increase. Yet changing, or indeed failing to change, Queue Depths to their optimum can have severe detrimental effects on performance, as outstanding I/O queuing causes bottlenecks. For example, if Queue Depth settings are set too high, the Storage ports will quickly become overrun or congested, leading to poor application and VM performance or, even worse, data corruption or loss. Alternatively, if Queue Depth settings are set too low, the Storage ports become underutilized, leading to poor SAN efficiency. On the other hand, should the Queue Depth be correctly optimized, the performance of VMs and their corresponding LUNs can be vastly improved; hence a methodology to accurately determine this is an imperative.

Generally VM Admins use esxtop to check I/O Queue Depths and latency, with the QUED column showing the queuing levels. With VirtualWisdom, though, end users are empowered with the only platform that can measure real-time aggregated queue depth regardless of storage vendor or device, i.e. in a comprehensive manner that takes into consideration the whole process from Initiator to Target to LUN. VirtualWisdom's unique ability to do this ensures that storage ports are accurately optimized for maximum application health, performance and SAN efficiency.


The esxtop QUED column


So, to begin with, it is important to prevent the storage port from being overrun by considering both the number of servers connected to it and the number of LUNs it has available. By knowing the number of exchanges that are pending at any one time, it is possible to manage the storage Queue Depths.

In order to properly manage the storage Queue Depths one must consider both the configuration settings at the host bus adapter (HBA) in a server and the physical limits on the storage arrays. It is important to determine what the Queue Depth limits are for each storage array. All of the HBAs that access a storage port must be configured with this limit in mind. Some HBA vendors allow setting HBA and LUN level Queue Depths, while some allow HBA level setting only.
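
A useful back-of-the-envelope check, before any monitoring is in place, is the worst case: every attached host driving every one of its LUNs at the full configured queue depth. A sketch of that rule of thumb (the port limit shown is an example only; check your array's specification):

```python
# Worst-case outstanding exchanges a storage port can receive: every
# attached host driving every LUN at its full HBA/LUN queue depth.
def worst_case_outstanding(hosts: int, luns_per_host: int, lun_queue_depth: int) -> int:
    return hosts * luns_per_host * lun_queue_depth

TARGET_PORT_LIMIT = 2048      # example figure only - consult your array's spec sheet

demand = worst_case_outstanding(hosts=16, luns_per_host=8, lun_queue_depth=32)
print(demand)                 # 4096 - double this example port's limit
if demand > TARGET_PORT_LIMIT:
    print("risk of port overrun: lower the HBA queue depths or the fan-in ratio")
```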

The default value for the HBA can vary a great deal by manufacturer and version and is often set higher than what is optimal for most environments. If you set the queue depths too low on the HBA, it can significantly impair the HBA's performance and lead to underutilization of the capacity on the storage port (i.e. underutilizing storage resources). This occurs both because the network will be underutilized and because the storage system will not be able to take advantage of the caching and serialization algorithms that greatly improve its performance. Queue Depth settings on HBAs can also be used to throttle servers so that the most critical servers are allowed greater access to the necessary storage and network bandwidth.

To deal with this, the initial step should be to baseline the Virtual environment to determine which servers already have their optimal settings and which are set either too high or too low. Using VirtualWisdom, real-time Queue Depth utilization can be reported for a given period. Such a report will show all of the initiators and the maximum queue depths recorded during the recording period. This table can be used to compare the settings on the servers to the relative values of the applications they support. The most critical systems should be set to higher Queue Depths than those that are less critical; however, Queue Depth settings should still be within the vendor-specified range. Unless Storage ports have been dedicated to a server, VirtualWisdom often shows that optimum Queue Depth settings should be in the range of 2-8, despite industry defaults tending to be between 32 and 256. To explain this further, VirtualWisdom can drill off a report showing, in descending order, the Maximum Pending Exchanges and their corresponding initiators and server names. The Maximum Pending Exchanges are not only the maximum number of exchanges pending during the interval being recorded but also the exchanges opened in previous intervals that have not yet closed.

Report on Pending Exchanges i.e. HBA Queue Depth

So for example if a report such as this was produced for 100 ESX servers it’s important to consider whether your top initiators are hosting your highest priority applications and whether your initiators with low queue depth settings are hosting your lowest priority applications. Once the appropriate Queue Depth settings have been determined, an alarm can be created for any new HBAs that are added to the environment, especially any HBA that violates the assigned Queue Depth policy.

Once this is established, the VirtualWisdom dashboard can then be used to ensure that the combined Pending Exchanges from all of the HBAs are well balanced across the array and SAN fabric.

Making SAN Cheaper than NAS

Are you making the most of your FC SAN?

There is a common myth that FC SAN is expensive, difficult to manage and troubleshoot. Coupled with this there are heavily marketed agendas to move customers away from FC SAN to new and allegedly more cost effective solutions.
But what if there was a way to…

• Reduce the amount of physical adapters, FC cables, SAN ports and Storage ports while concurrently improving your application response time, availability and SLAs?

• Simplify server FC I/O provision, enabling a more agile, scalable and dynamic deployment model?

• Gain the insight that most FC SAN and Storage ports are greatly underutilized, averaging 5-10% usage with only a minority needing their full bandwidth, enabling you to drive higher levels of resource utilization and performance out of an already deployed solution?

• Know your FC SAN Storage related problems before they occur enabling you to transform your environment from a reactive to proactive one?

• Reduce the time it takes for troubleshooting your FC SAN Storage environment from days to minutes?

• Reduce the OPEX and CAPEX of your existent FC SAN Storage infrastructure such that it’s deemed not only the most reliable and secure but also the most cost effective solution of your environment?

I'm absolutely delighted to have the opportunity to discuss and answer some of these issues with ex-EMC vSpecialist and VM Guru Stephen Spellicy in an upcoming webinar on the 30th of March. I welcome all to partake in the Q and A session and discussion as Stephen and I showcase the way to a complete solution that optimizes and consolidates your FC connectivity while ensuring the reliability and performance you have come to know and love with Fibre Channel. Find out how to also improve resource utilization, deployment and ongoing management of your FC connectivity while easily accessing and analyzing critical application-to-storage transaction data that allows you to pinpoint latency down to the millisecond.

Registration is free and available here:
http://info.virtualinstruments.com/webinar-fc-san-myths.html


CRC Errors, Code Violation Errors, Class 3 Discards & Loss of Sync - Why Storage isn't Always to Blame!


Storage is often automatically pinpointed as the source of all problems. From System Admins, DBAs and Network guys to Application owners, all are quick to point the finger at SAN Storage given the slightest hint of any performance degradation. Not really surprising, though, considering it's the common denominator amongst all silos. On the receiving end of this barrage of accusation is the SAN Storage team, who are then subjected to hours of troubleshooting only to prove that their Storage wasn't responsible. So the circle goes, until the Storage team are faced with a problem that they can't absolve themselves of blame for, even though they know the Storage is working completely fine. With array-based management tools still severely lacking in their ability to pinpoint and solve storage network related problems, and with server-based tools doing exactly that, i.e. looking at the server, there really is little if anything available to prove that the cause of latency is a slow-draining device such as a flapping HBA, a damaged cable or a failing SFP. Herein lies the biggest paradox: 99% of the time, when unidentifiable SAN performance problems do occur, they are usually linked to trivial issues such as a failing SFP. In a 10,000 port environment, the million dollar question is 'where do you begin to look for such a minuscule needle in such a gargantuan haystack?'

To solve this dilemma it's imperative to know what to look for and to have the right tools to find it, enabling your SAN storage environment to be proactive rather than a reactive fire-fighting / troubleshooting circus. So what are some of the metrics and signs to look for when the Storage array, application team and servers all report everything as fine, yet you still find yourself embroiled in performance problems?

First, to understand the context of these metrics / signs and the makeup of FC transmissions, let's use the analogy of a conversation: the Frames would be the words, the Sequences the sentences and an Exchange the conversation they are all part of. With that premise, it is important to first address the most basic of physical layer problems, namely Code Violation Errors. Code Violation Errors are the consequence of bit errors caused by corruption occurring in the sequence, i.e. any character corruption. A typical cause would be a failing HBA that starts to suffer from optic degradation prior to its complete failure. At one site I also recently came across Code Violation Errors where several SAN ports had been left enabled after their servers had been decommissioned. Some might ask what the problem is if nothing is connected to them. In fact, this scenario was creating millions of Code Violation Errors, causing a CPU overhead on the SAN switch and subsequent degradation. With mission critical applications connected to the same SAN switch, performance problems became rife, and without the identification of the Code Violation Errors this could have led to weeks of troubleshooting with no success.

The build-up of Code Violation Errors becomes even more troublesome as it eventually leads to what is referred to as a Loss of Sync. A Loss of Sync is usually indicative of incompatible speeds between points, and again this is typical of optic degradation in the SAN infrastructure. If an SFP is failing, its optic signal will degrade and hence will no longer run at, for example, the 4Gbps it is set at. Case in point: a transmitting device such as an HBA is set at 4Gbps while the receiving end, i.e. the SFP (unbeknownst to the end user), has degraded down to 1Gbps. Severe performance problems will occur as the two points constantly struggle with their incompatible speeds. Hence it is imperative to be alerted to any Loss of Sync, as ultimately it is also an indication of an imminent Loss of Signal, i.e. when the HBA or SFP is flapping and about to fail. This leads to the nightmare scenario of an unplanned path failure in your SAN storage environment and, worse still, a possible outage if failover cannot occur.

One of the biggest culprits, and a sure-fire hit for resolving performance problems, is to look for what are termed CRC errors. CRC errors usually indicate some kind of physical problem within the FC link and are indicative of code violation errors that have led to consequent corruption inside the FC data frame. Usually caused by a flapping SFP or a very old / bent / damaged cable, once a CRC error is detected by the receiver, the request is rejected, leaving the Frame to be resent. As an analogy, imagine a newspaper delivery boy who, while cycling to his destination, loses some of the pages of the paper prior to delivery. Upon delivery the receiver would request that the newspaper be redelivered with the missing pages, entailing the delivery boy cycling back to find the missing pages and bringing back the newspaper as a whole. In the context of a CRC error, a Frame that should typically take only a few milliseconds to deliver could take up to 60 seconds to be rejected and resent. Such response times can be catastrophic to a mission critical application and its underlying business. By gaining insight into CRC errors and their root cause, one can immediately pinpoint which bent cable or old SFP is responsible and proactively replace it long before it starts to cause poor application response times or, even worse, a loss to your business.
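
Catching these physical layer errors early is largely a matter of watching counter deltas rather than totals, since port error counters only ever grow. A hedged sketch of the idea - the counter names here are illustrative, not any particular switch vendor's API:

```python
# Poll per-port error counters and alert on the change since the last
# poll; the first poll of a port only establishes its baseline.
previous: dict[str, dict[str, int]] = {}

def check_port(port: str, counters: dict[str, int], threshold: int = 5) -> list[str]:
    alerts = []
    if port in previous:
        for name, value in counters.items():
            delta = value - previous[port].get(name, 0)
            if delta > threshold:
                alerts.append(f"{port}: {delta} new {name} errors - inspect SFP/cable")
    previous[port] = dict(counters)
    return alerts

check_port("switch1/port7", {"crc": 120, "code_violation": 4000, "loss_of_sync": 2})
print(check_port("switch1/port7", {"crc": 180, "code_violation": 9000, "loss_of_sync": 3}))
```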

The other FC SAN gremlin is what is termed a Class 3 discard. Of the various classes of data transport defined by the Fibre Channel ANSI standard, the most commonly used is Class 3. Ideal for high throughput, Class 3 is essentially a datagram service based on frame switching and is a connectionless service. Class 3's main advantage comes from not requiring an acknowledgement that a frame has been rejected or busied by a destination device or Fabric. The benefits of this are that it significantly reduces the overhead on the transmitting device and allows more bandwidth to remain available for transmission. Furthermore, the lack of acknowledgements removes the potential delays between devices caused by round trips of information transfers. As for data integrity, Class 3 leaves this to higher-level protocols such as TCP, since Fibre Channel does not itself check for corrupted or missing frames; any discovery of a corrupted packet by the higher-level protocol on the receiving device instantly initiates a retransmission of the sequence. All of this sounds great until the non-acknowledgement of rejected frames starts to bring about Class 3's disadvantage: inevitably a Fabric will become busy with traffic and will consequently discard frames, hence the name Class 3 discards. The receiving device's higher-level protocol must then request retransmission of the sequences, degrading the device and fabric throughput.

Another cause of Class 3 discards is zoning conflicts, where a frame has been transmitted but cannot reach its destination, resulting in the SAN initiating a Class 3 discard. This arises from legacy configurations or zoning mistakes, where for example a decommissioned Storage system was not unzoned from a server, or vice versa, leading to frames being continuously discarded and throughput degraded as sequences are retransmitted. This then results in performance problems, potential application degradation and automatic finger-pointing at the Storage System for a problem that can't automatically be identified. By resolving the zoning conflict and spreading the SAN throughput load across the right ports, the heavy traffic or zoning issues that cause the Class 3 discards can be quickly removed, bringing immediate performance and throughput improvements. By gaining insight into the occurrence and number of Class 3 discards, huge performance problems can be quickly remediated before they take hold - another reason why the Storage shouldn't automatically be blamed.

These are just some of the metrics / signs to look for that can ultimately save you from weeks of troubleshooting and guessing. By acknowledging these metrics, identifying when they occur and proactively eliminating their causes, a SAN storage environment will quickly evolve into a healthy, proactive and optimized one. Furthermore, by eliminating each of these issues you also eliminate their consequent problems, such as application slowdown, poor response times, unplanned outages and long drawn-out troubleshooting exercises that eventually lead to finger-pointing fights. Ideally, a paradigm shift will occur where, instead of application owners complaining to the Storage team, the Storage team proactively identifies problems before they are felt. Here lies the key to making the 'always blame the Storage' syndrome a thing of the past.