The Rise of the x86 Mainframe

I'm delighted to have been asked by BrightTalk to present a webinar for their upcoming Enterprise Storage Summit. I hope you can join and look forward to your feedback! 




If it weren’t for the cost and complexity of Mainframes, would the industry ever have shifted towards Open Systems? The performance, maintenance and security advantages of Mainframes were never disputed, yet running them required a unique expertise while the cost could rarely be justified except for high scale data warehousing. But what if it was possible to apply the advantages of the Mainframe model to an x86 infrastructure? What if that cost and complexity could be replaced by simplicity and CAPEX & OPEX reduction? These are the questions being asked by businesses across the globe as they reassess how they approach their IT with decreasing budgets and increasing demands. The once common debates over the speeds and feeds of different components are now giving way to discussions on how IT can quickly deliver a suitable service to the business. It is this debate that is leading the industry to an inflection point and the consequent rise of the x86 Mainframe. Join Archie Hendryx as he discusses how a new approach to IT infrastructure is needed to solve the incumbent challenges being faced within the industry.

 

Vblock, VSPEX, FlexPod, PureSystems, CloudSystem Matrix...Converged Infrastructure or Reference Architectures?


This last fortnight there’s been a cacophony of hyperbole and at times marketing fluff from vendors and analysts with regards to Reference Architectures and Converged Infrastructures. As IBM launched PureSystems, NetApp & Cisco decided it was also a good time to reiterate their strong partnership with FlexPod. In the midst of this, EMC decided to release their new and rather salaciously titled VSPEX. From the remnants and ashes of all these new product names and fancy launch conferences, the resultant war blogs and Twitterati battles ensued. As I poignantly watched on from the trenches in an almost Siegfried Sassoon moment, it quickly became evident that there was now an even more ambiguous understanding of what distinguishes a Converged Infrastructure from a Reference Architecture, what its relation is to the Private Cloud and, more importantly, whether you, the end user, should even care.
IBM's PureSystems - Converged Infrastructure or a well marketed Single Stack?

There’s a huge and justified commotion in the industry over Private Cloud because, with lower costs, reduced complexity and greater data center agility, the advantages are compelling for any business looking to streamline and optimize its IT. In the pursuit of attaining such benefits and ensuring a successful Private Cloud deployment, one of the most critical components to consider is the infrastructure and its underlying resource pools. With resource pools being the foundation of rapid elasticity and instantaneous provisioning, a Private Cloud’s success ultimately depends on the stability, reliability, scalability and performance of its infrastructure. With existing datacenters commonly accommodating legacy servers that require a refresh, or new multiprocessor servers entrenched within an old and insufficient network infrastructure, one of the main challenges of a Private Cloud deployment is how to upgrade it without introducing risk. With this challenge and the industry’s pressing need for an economically viable answer, the solution was quickly conceived and baptized as “Converged Infrastructure”. Sadly, like all great ideas and concepts, competition and marketing fluff quickly tainted the lucidity of such an obvious solution by introducing other terms such as “Reference Architectures” and “Single Stack Solutions”. Even more confusing was the launch of vendor products that used such terms synonymously, together or as separate distinct entities. So what exactly differentiates these terms and which is the best solution to meet the infrastructure challenge of a Private Cloud deployment?

EMC's new VSPEX - a Reference Architecture with a variety of component options

Reference Architectures for all intents and purposes are essentially just whitepaper-based solutions that are derived from previously successful configurations. Using various vendor solutions and leveraging their mutual partnerships & alliances, Reference Architectures are typically integrated and validated platforms built from server, network and storage components with an overlying hypervisor. NetApp’s FlexPod and EMC’s VSPEX fall into this category and both invariably point to their flexibility as a major benefit as they enable end users to mix and match as long as there remains a resemblance to the reference. With open APIs to various management tools, Reference Architectures are cleverly marketed as a quick, easy to deploy and risk free infrastructure solution for Private Clouds. Indeed Reference Architectures are a great solution for a low budget SMB that is looking to introduce itself to the world of Cloud. As for a company that is either in or bordering on the Enterprise space and looking to seriously deploy their workloads onto a Private Cloud, it's important to remember that sometimes things that are great on paper can still end up being a horrible mess in reality – anyone who's watched Lynch's Dune can pay testament to that.

The difficulty with Reference Architectures is that fundamentally they still have no hardened solution configuration parameters and, ironically, what they term an advantage, i.e. flexibility, is actually their main flaw, as their piece-by-piece approach of using solutions from many different vendors merely masks the same old problems. Due to being whitepaper solutions, integration of specific components is only documented as a high level overview, with component ‘a’ being detailed as compatible with component ‘c’. As for the specifics of how these components integrate in detail, these are simply not available or realized until the Reference Architecture is cobbled together by the end user, who ultimately assumes all of the risk and financial obligation to ensure it not only works correctly but is also performing at optimum levels. This haphazard trial and error approach is counterproductive to the accelerated, pre-integrated, pretested and optimized model that is required by the infrastructure of a Private Cloud.

Furthermore, Reference Architectures are based on static sizing and architecture deployments that typically have little relation to the end user's actual environment or needs, posing a problem whenever reconfiguration or resizing is required. With end users being left to resize and consequently reconfigure and reintegrate their solution, they also have to constantly find a way to integrate their existing toolsets with the open APIs. This subsequently eliminates a lot of the benefits associated with “quick time to value” as many deployment projects get caught up in the quagmire of such trivialities. Added to this, once you’ve begun resizing or customizing your architecture, you’ve actually made changes that deviate from the proposed standard and hence are no longer recognizable to the original reference. This leads to the other complication with Reference Architectures, namely support issues.

NetApp's FlexPod Reference Architecture uses Cisco UCS Blades & VMware 

With more than 90% of support calls being related to logical configuration issues, these are more often than not caused by bugs or incompatibility issues. When the vendor has no responsibility for or knowledge of that logical build, based on the fact that they met your “requirements” to be flexible, the situation doesn’t bode any better than a traditional infrastructure deployment. Vendor finger pointing is one of the most frustrating experiences you inevitably have to face when deploying an IT infrastructure in the traditional way. Being on a 4am conference call during a Priority 1 with the different organizational silos and the numerous vendors that make up the infrastructure is a painful experience I’ve personally had to face. It’s not a pretty sight when you’re impatiently waiting for a resolution while the networking company blames the firmware on the Storage and the Storage vendor blames the bugs with the servers, all while you are sitting there watching your CEO’s face turn into a tomato as the vein in his neck throbs incessantly. When you log a support call for your reference architecture, who is actually responsible? Is it the company you bought it from or one of the many manufacturers that you used to assemble your self-built masterpiece? Furthermore, which of those manufacturers or vendors will take full responsibility when you’ve ended up building, implementing and customizing the architecture yourself? Even at the point of deployment, the Reference Architecture carries elements of ambiguity for the end user, ranging from which software and firmware releases to run to who is responsible for the regression testing of the logical build. For instance, what if you decide to proactively update to one of your components’ latest firmware releases and then find out it’s not compatible with another of your components? Who owns the risk? Also, for example, if you buy a “flexible” Reference Architecture from vendor X, how will vendor X be able to distinguish what it is you’ve actually deployed and how it’s configured without having to spend an aeon on the phone doing a fact finding session, all while your key applications are down? Reference Architectures are great for a test environment or a simple cheap and cheerful solution, but using them as a platform to take key applications to the Cloud reeks of more 4am conference calls and exploding tomatoes.

Oracle Exalogic - Virtualization with OracleVM not VMware 
Single Stack Infrastructures on the other hand, while also sometimes marketed as a Converged Infrastructure or a “flexible” Reference Architecture (or sometimes both!), are another completely distinct offering in the market. These solutions are typically marketed as “All-in-one” solutions and come in a number of guises. Products such as Oracle’s Exadata and Exalogic, Dell’s vStart, HP’s CloudSystem Matrix and IBM’s PureSystems are all examples of the Single Stack solution, where the vendors have tightly defined software stacks above the virtualization layer. Such solutions will also combine bundled infrastructure and service offerings, making them potential “Clouds in a Box”. While at the outset these seem ideal and quick to deploy and manage, there are actually a number of challenges with the Single Stack solution. The first challenge is that the Single Stack will always provide you its own inherent components regardless of whether they are inferior to other products in the market. So for example, instead of having network switches from the well established Cisco or Brocade, if you opt for the HP solution you’re looking at HP’s ProCurve, 3Com, H3C and TippingPoint. Worse still, if you go with the Oracle stack you’re condemned to have OracleVM as opposed to the market leading and technically superior VMware. Another challenge is that you’re also tied down to that one vendor and are now a victim of vendor lock-in. Instead of just having infrastructure that will fit your existing software toolset and service management, you will inevitably have to rip and replace these with the Single Stack’s product set. Additionally, these complex and non-integrated software and hardware stacks require significant time to deploy and integrate, reducing a considerable amount of the value that comes from an accelerated deployment.

HP's CloudSystem Matrix - A Single Stack that will also bundle in HP's Service offerings with the Infrastructure


A true converged infrastructure is one that is not only pretested and preconfigured but also, and more importantly, pre-integrated; in other words it ships out as a single SKU and product to the customer. While it may use different components from different vendors, they are still components that are from market leaders and are well established in the Enterprise space. Furthermore, while it may not have the “flexibility” of a Reference Architecture, it’s the rigidity and adherence to predefined standards that make the Converged Infrastructure the ideal fit for serious contenders who are looking for a robust, scalable, simply supported and accelerated Private Cloud infrastructure. The only solution on the market that fits that category is VCE's Vblock. By being built, tested, pre-integrated and configured before being sent to the end user as a single product, the Converged Infrastructure for the Amsterdam datacenter will be exactly the same as the deployment in Bangalore, Shanghai, Dubai, New York and London. In this instance the shipped Converged Infrastructure merely requires the end user to plug in and supply network connectivity.
VCE's unique Vblock 700LX - A true Converged Infrastructure that ships out as a pre-tested & pre-integrated solution
With such a model, support issues are quickly resolved and vendor finger-pointing is eliminated. For example, the support call is with one vendor (the Converged Infrastructure manufacturer) and they alone are the owner of the ticket because the Converged Infrastructure is their product. Moreover, once a product model of a converged infrastructure has been shipped out, problems that may potentially be faced by a customer in Madrid can easily be replicated and tested on a like-for-like lab with the same product in London, rapidly resolving performance issues or trouble tickets.

Deploying a preconfigured, pretested and pre-integrated standardized model can also quickly eliminate issues with firmware updates and patching. With traditional deployments, keeping patches and firmware up to date across multiple vendors, components and devices can be an operational role by itself. You would first have to assess the criticality of each patch and its relevance to each platform, as well as validate firmware compatibility with other components. Additionally, you’d also need to validate the patches by creating ‘mirrored’ Production Test Labs and then also have to figure out what your rollback mechanism is if there are any issues. By having a pre-integrated Converged Infrastructure, all of this laborious and tedious complication is removed. All patches and firmware can be pretested and validated on standardized platforms in labs that are exactly the same as the standardized platforms that reside in your datacenter. Instead of a multitude of updates from a multitude of vendors each year, a converged infrastructure offers the opportunity to have a single matrix that upgrades the infrastructure as a whole, risk free.

A Converged Infrastructure offers a standardized model making patching & firmware upgrades seamless regardless of location or number
The other distinctive feature of a Converged Infrastructure is its accelerated deployment. By being shipped to the customer as a ready assembled, logically configured product and solution, typical deployments can take only 30-45 days from procurement to production. In contrast, other solutions such as Reference Architectures could take twice as long if not longer, as the staging, racking and logical build are still required once delivered to the customer. It’s this speed of deployment which makes the Converged Infrastructure the ideal solution for Private Cloud deployments and an immediate reduction in your total cost of ownership, especially when the business or application owners demand an instant platform for their new projects.

The other benefit of having a company that continuously builds standardized and consistent infrastructures that are configured and deployed for key applications such as Oracle, SAP or Exchange is that you end up with an infrastructure that not only consolidates your footprint and accelerates your time to deployment but also optimizes and in most cases improves the performance of your key apps. I’ve recently seen a customer gain a 300% performance improvement with their Oracle databases once they decided to migrate them off their Enterprise Storage Arrays, SPARC servers and SAN switches in favour of a Converged Infrastructure, i.e. the Vblock. Of course there were a number of questions, head scratching and pontifications over what was seemingly inexplicable; “how could you provide such performance when we’ve spent months optimizing our infrastructure?” The answer is straightforward: regardless of how good an engineering team you have, it is rare that they are solely focused on building a standardized infrastructure on a daily basis that is customized for a key application and factors in all of the components comprehensively.

To elaborate, typically customers will have an in house engineering department where they’ll have a Storage team, a Server team, a Network team, an Apps team, a SAN team etc. All of these silos then need to share their expertise and somehow correlate it together prior to building the infrastructure. Compare this to VCE and the Converged Infrastructure approach, where instead there are dedicated engineering teams for each step of the building process whose expertise is centred and focused upon a single enabling platform, i.e. the Vblock. Firstly there’s the engineering team that does the physical build (including thermals, power efficiency, cooling, cabling, equipment layout for upgrade paths etc.). This is then passed on to another dedicated engineering team that takes that infrastructure and certifies the software releases as well as tests the logical build configurations all the way through to the hypervisor. There’s then another engineering organization whose sole purpose is to test applications that are commonly deployed on these Vblock infrastructures such as Oracle, SAP, Exchange, VDI etc. This enables the customer that orders, for example, an “Oracle Vblock” to have an infrastructure that was specifically adapted both logically and physically to not only meet the needs of their Oracle workloads but also optimize their performance. This is just a glimpse of the pre-sales aspect; post sales you have a dedicated team responsible for the product roadmap of the entire infrastructure, ensuring that software or component updates are checked and advised to customers once they are deemed suitable for a production environment. The list of dedicated teams goes on but the common denominator is that they are all part of a seamless process that aims at delivering and supporting an infrastructure designed and purpose built for mission critical application optimization.

So whether you’re feeling Pure, Flexy or Spexy, the key thing is to distinguish between Reference Architectures, Single Stack Solutions and the Vblock, i.e. a Converged Infrastructure, and align the right solution to the right business challenge. For fun and adventure I'd always purchase a kit car over a factory built car. I'd have great fun building it from all the components available to me and have it based on my Reference handbook. I could even customize my kit car with a 20 inch exhaust pipe, Dr. Dre hydraulics and fluffy dice because it's flexible, just like a Reference Architecture. Alternatively, because I love Audi so much, I could buy an Audi car that has all of its components made by Audi. So that means ripping out the Alpine CD player for an Audi one, the BOSE speakers for Audi ones and even removing the Michelin tyres for some new Audi ones, regardless of whether they're any good or if they’re just OEM’d from a budget manufacturer - just like a Single Stack Solution. Ultimately, if I'm serious about performance and reliability I'll just buy a manufactured Audi S8 that's pre-integrated and deployed from the factory with the best of breed components. Sure I can choose the colour, I can decide on the interior etc., but it's still built to a standard that's designed and engineered to perform. Much like a Converged Infrastructure, while I may choose to have a certain amount of CPU for my Server blades and a certain amount of IOPS and capacity for Storage, I still have a standardized model that's designed and engineered to perform and scale at optimum levels. For a Private or Hybrid Cloud infrastructure that successfully hosts and optimizes critical applications as well as de-risks their virtualization, the solution can only mean one thing - it's Converged.


Storage According to the VMware Admin: SDRS, SIOC, VASA & Storage vMotion


System Admins were generally the early embracers and end users of VMware ESX as they immediately recognized the benefits of virtualization. Having been bogged down with the pains of running physical servers, such as downtime for maintenance, patching and upgrades, they were the natural adopters of the bare metal hypervisor. The once Windows 2003 system admin was soon configuring virtual networks and VLANs as well as carving up Storage datastores, quickly becoming the master of this new domain that was revolutionizing the datacenter. As the industry matured in its understanding of VMware, so did VMware’s recognition that the networking, security and storage expertise should be broadened to those that had been involved in such work in the physical world. Along came features such as the Nexus 1000v and vShield that enabled the network and security teams to also plug into the ‘VM world’, enabling them to add their expertise and participate in the configuration of the virtual environment. With vSphere 5, VMware took the initiative further by bridging the Storage and VMware gap with new features that Storage teams could also take advantage of. Despite this, terms such as SIOC, Storage DRS, VASA and Storage vMotion still seem to draw blanks from most Storage folk or are looked down upon as ‘a VMware thing’. So what exactly are these features and why should Storage as well as VM admins take note of them and work together to take full advantage of their benefits?

Firstly there’s Storage DRS (SDRS), in my opinion the most exciting new feature of vSphere 5. SDRS enables the initial placement and ongoing space & load balancing of VMs across datastores that are part of the same datastore cluster. Simply put, think of a datastore cluster as an aggregation of multiple datastores into a single object, with SDRS balancing the space and I/O load across it.

In the case of space utilization, this takes place by ensuring that a set threshold is not exceeded. So should a datastore exceed, say, a 70% space utilization threshold, SDRS will migrate VMs via Storage vMotion to other datastores in the cluster to balance out the load.
Storage DRS based on Space utilisation
The other balancing feature which is load balancing based on I/O metrics, uses the vSphere feature Storage I/O Control (SIOC). In this instance SIOC is used to evaluate the datastores in the cluster by continuously monitoring how long it takes an I/O to do a round trip and then feeds this information to Storage DRS. If the latency value for a particular datastore is above a set threshold value for a period of time, then SDRS will rebalance the VMs across the datastores in the cluster via Storage vMotion. With many Storage administrators operating ‘dynamic tiering’ or ‘fully automated tiering’ at the backend of their storage arrays, it’s vital that a co-operative design and decision is made to ensure that the right features are utilized at the right time.
Storage DRS based on I/O latency
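To make the two balancing triggers concrete, here is a minimal, hypothetical Python sketch of the kind of decision logic described above: a space utilisation threshold and a sustained latency threshold per datastore in the cluster. The threshold values, datastore names and data structures are illustrative assumptions, not VMware's implementation or API.

```python
# Illustrative sketch of SDRS-style rebalancing decisions across a datastore
# cluster; all values and structures are assumptions, not VMware's algorithm.

SPACE_THRESHOLD = 0.70       # e.g. 70% space utilisation
LATENCY_THRESHOLD_MS = 15    # sustained I/O latency trigger (SIOC-observed)

datastore_cluster = [
    {"name": "ds01", "used_gb": 750, "capacity_gb": 1000, "latency_ms": 9},
    {"name": "ds02", "used_gb": 300, "capacity_gb": 1000, "latency_ms": 22},
    {"name": "ds03", "used_gb": 400, "capacity_gb": 1000, "latency_ms": 6},
]

def needs_rebalancing(ds):
    """Flag a datastore whose space use or sustained latency exceeds a threshold."""
    space_used = ds["used_gb"] / ds["capacity_gb"]
    return space_used > SPACE_THRESHOLD or ds["latency_ms"] > LATENCY_THRESHOLD_MS

def least_loaded(candidates):
    """Pick the destination with the lowest space utilisation and latency."""
    return min(candidates, key=lambda d: (d["used_gb"] / d["capacity_gb"], d["latency_ms"]))

for ds in datastore_cluster:
    if needs_rebalancing(ds):
        target = least_loaded([d for d in datastore_cluster if d is not ds])
        print(f"SDRS recommendation: Storage vMotion VMs from {ds['name']} to {target['name']}")
```

In practice SDRS of course weighs far more than two numbers, but the sketch captures the essential point: the datastore cluster is treated as one pool and Storage vMotion is the lever used to keep it balanced.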
While most are aware of vMotion’s capability to seamlessly migrate VMs across hosts, Storage vMotion is a slightly different feature that allows the migration of running VMs from one datastore to another without incurring any downtime. In vSphere 5.0, Storage vMotion has been improved so that the operation completes much more quickly.
It does this by using a new Mirror Driver mechanism that keeps blocks on the destination synchronized with any changes made to the source after the initial copy. The migration process does a single pass of the disk, copying all the blocks to the destination disk. If any blocks change during this copy, the mirror driver synchronises them from the source to the destination. It’s this single pass block copy that enables Storage vMotion to complete much more quickly, enabling the end user to reap the benefits immediately.
Storage vMotion & the new Mirror Driver
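As a rough illustration of the mirror driver idea, the hypothetical sketch below copies a "disk" in a single pass while mirroring any writes that land during the copy straight to the destination. Block counts and the simulated write pattern are invented purely for illustration and are not the actual ESXi mechanism.

```python
# Simplified sketch of a single-pass copy with a mirror driver keeping the
# destination in sync with writes that occur mid-migration.

import random

source = [f"block-{i}" for i in range(8)]    # source datastore blocks
destination = [None] * len(source)           # destination datastore

def mirror_write(index, data):
    """Mirror driver: a write issued to the source during the migration is
    applied to both source and destination, so no further copy pass is needed."""
    source[index] = data
    destination[index] = data

# Single pass over the disk, copying every block to the destination.
for i, block in enumerate(source):
    destination[i] = block
    # The running VM keeps writing while the copy is in flight; the mirror
    # driver synchronises those writes as they happen.
    if random.random() < 0.3:
        mirror_write(i, f"block-{i}-updated")

assert source == destination   # source and destination converge after one pass
print("Storage vMotion copy complete after a single pass")
```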
As for the new feature named VASA (vSphere Storage APIs for Storage Awareness), this focuses on providing insight and information to the VM admin about the underlying storage. In its simplest terms, VASA is a new set of APIs that enables storage arrays to provide vCenter with visibility into the storage’s configuration, health status and functional capabilities. VASA also allows the VM admin to see the features and capabilities of their underlying physical storage, such as the number of spindles for a volume, the number of expected IOPS or MB/s, the RAID levels, whether the LUNs are thick or thin provisioned, or even deduplication and compression details. SDRS can also leverage the information provided by VASA to make its recommendations on space and I/O load balancing. Basically VASA is a great feature that ensures VM admins can quickly provision the storage that is most applicable to their VMs.
This leads onto the feature termed Profile Driven Storage. Profile Driven Storage enables you to select the correct datastore on which to deploy your VMs based on that datastore’s capabilities. Building a Storage Profile can happen in two ways: either the storage device has its capabilities associated automatically via VASA, or the storage device’s capabilities are user-defined and manually associated.
VASA & Profile Driven Storage
With the User-Defined option you can apply labels to your storage, such as Bronze, Silver & Gold, based on the capabilities of that Storage. So for example, once a profile is created and the user-defined capabilities are added to a datastore, you can then use that profile to select the correct storage for a new VM. If the datastore's capabilities satisfy the profile's requirements the VM is said to be compliant; if they do not, the VM is said to be non-compliant. So while VASA and Profile Driven Storage are still new features, their potential is immense, especially in the future, as Storage admins can potentially work alongside VM admins to help classify and tier their data.
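The compliance check itself is conceptually simple; the hedged sketch below models a VM storage profile as a set of required capabilities and a datastore as a set of advertised capabilities (whether surfaced automatically via VASA or user-defined). All names and capability labels here are hypothetical.

```python
# Minimal sketch of the Profile Driven Storage compliance idea: a placement is
# compliant when the datastore offers every capability the profile requires.

datastores = {
    "ds-gold":   {"Gold", "Replicated", "SSD"},
    "ds-silver": {"Silver", "Thin Provisioned"},
}

vm_profile = {"name": "oracle-db-01", "required_capabilities": {"Gold", "Replicated"}}

def compliant(datastore_caps, profile):
    """True when the datastore advertises all capabilities the profile needs."""
    return profile["required_capabilities"].issubset(datastore_caps)

for ds_name, caps in datastores.items():
    state = "compliant" if compliant(caps, vm_profile) else "non-compliant"
    print(f"{vm_profile['name']} on {ds_name}: {state}")
```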
As mentioned before, Storage I/O Control or SIOC is a feature that enables you to configure rules and policies to help specify the business priority of each VM. It does this by dynamically allocating I/O resources to your critical application VMs whenever I/O congestion is detected. Furthermore, by enabling SIOC on a datastore you trigger the monitoring of device latency as observed by the hosts. As SIOC takes charge of I/O allocation to VMs, it also by default ignores Disk.SchedNumReqOutstanding (DSNRO). Typically it’s DSNRO that sets the Queue Depth at the hypervisor layer, but once SIOC is enabled it takes on this responsibility, basing its judgements on the I/O congestion and policy settings. This offloads a significant amount of performance design tasks from the admins but ultimately still requires the involvement of the Storage team to ensure that I/O contention is not falsely coming from poorly configured Storage and highly congested LUNs.

SIOC ignores Disk.SchedNumReqOutstanding to set the Queue Depth at the hypervisor level
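Conceptually, SIOC behaves something like the simplified sketch below: while observed datastore latency stays below a congestion threshold nothing is throttled, and once latency crosses the threshold the available device queue slots are divided between VMs in proportion to their configured shares. The figures and data structures are assumptions for illustration only, not VMware's actual algorithm.

```python
# Simplified sketch of shares-based I/O throttling triggered by latency,
# loosely modelled on the SIOC concept described above.

CONGESTION_THRESHOLD_MS = 30   # e.g. a 30 ms latency trigger
DEVICE_QUEUE_DEPTH = 64        # total outstanding I/Os allowed to the datastore

vms = {
    "critical-oracle": 2000,   # VM name -> configured I/O shares
    "test-web":        500,
    "dev-batch":       500,
}

def allocate_queue_slots(observed_latency_ms):
    """Throttle per-VM queue depth by share weight only while congested."""
    if observed_latency_ms < CONGESTION_THRESHOLD_MS:
        return {vm: DEVICE_QUEUE_DEPTH for vm in vms}   # no throttling needed
    total_shares = sum(vms.values())
    return {vm: max(1, DEVICE_QUEUE_DEPTH * shares // total_shares)
            for vm, shares in vms.items()}

print(allocate_queue_slots(observed_latency_ms=12))   # uncongested: full queue for all
print(allocate_queue_slots(observed_latency_ms=45))   # congested: shares take effect
```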
So while these new features are ideal for the SMB, they may still not be the sole answer to every Storage or VMware problem related to virtualizing mission critical applications. As with any new feature or technology, their success relies on correct planning, design and implementation, and for that to happen a siloed VM-only or Storage-only approach needs to be avoided.

Cisco's UCS - The Prime Choice for Cloud & Big Data Servers?

Back in March 2009, when Cisco announced the launch of their UCS platform and subsequent intention to enter the world of server hardware, eyebrows were raised, including my own. There was never any disputing that the platform would be adopted by some customers, certainly after seeing how Cisco successfully gatecrashed the SAN market and initially knocked Brocade off their FC perch. We’d all witnessed how Cisco used its IP datacenter clout and ability to propose deals that packaged both SAN MDS and IP switches with a consequent single point of support to quickly take a lead in a new market. Indeed it was only after Brocade’s 2007 acquisition of McData and when Cisco started to focus on FCoE that Brocade regained their lead in FC SAN switch sales. Where mine and others’ doubts lay was whether the UCS was going to be good enough to compete with the already proven server platforms of HP, IBM and Dell. Well, roll on three years and the UCS now boasts 11,000 customers worldwide and an annual run rate of £822m, making it the fastest growing product in Cisco’s history. Amazingly, Cisco is already third in worldwide blade server market share with 11%, closely behind HP and IBM. So now, with this week’s launch of the UCS’ third generation and its integration of the new Intel Xeon E5-2600 processor, it’s time to accept that all doubts have been swiftly erased.

Unlike other server vendors, Cisco’s UCS launch came from a greenfield approach that recognized the industry’s shift towards server virtualization and consolidation. Not tied down by legacy architectures, Cisco entered the server market at the same time Intel launched their revolutionary Intel Xeon 5500 processors and immediately took advantage with their groundbreaking memory extension feature. By creating a way to map four distinct physical memory modules (DIMMs) to a single logical DIMM that would be seen by the processor’s memory channel, Cisco introduced a way to have 48 standard slots as opposed to the 12 found in normal servers. With the new B200 M3 blade server, there’s now support for up to 24 DIMM slots for memory running up to 1600 MHz, up to 384 GB of total memory and 80 Gbits per second of I/O bandwidth. This is even more impressive when you factor in that, with the Cisco UCS 5108 Chassis able to accommodate up to eight of these blades, scalability can go up to a remarkable 320 blades per Cisco Unified Computing System. Added to this, Cisco took convergence further by making FCoE the standard with Fabric Interconnects that not only acted as the brains of their servers but also helped centralize management. With the ability to unite up to 320 servers as a single system, they also supported line-rate, low latency, lossless 10 Gigabit Ethernet as well as FCoE. This enabled a unified network connection for each blade server with just a wire-once 10 Gigabit Ethernet FCoE downlink, reducing cable clutter and centralizing network management via the UCS Manager GUI. Now with the newly launched UCS 6296UP, the Fabric Interconnect doubles the switching capacity of the UCS fabric from 960Gbps to 1.92Tbps as well as the number of ports from 48 to 96.
Cisco UCS' Memory Extension
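The arithmetic behind the extended-memory claim is straightforward, and the short sketch below simply multiplies it out; the DIMM capacity used is an illustrative assumption, not a statement of what any particular blade ships with.

```python
# Back-of-the-envelope view of the memory extension described above: four
# physical DIMMs presented to the CPU's memory channel as one logical DIMM.

logical_slots_per_server = 12      # slots a comparable server exposes natively
physical_dimms_per_logical = 4     # the extended-memory mapping
dimm_size_gb = 8                   # illustrative DIMM capacity (assumption)

physical_slots = logical_slots_per_server * physical_dimms_per_logical
print(f"Physical DIMM slots: {physical_slots}")                 # 48
print(f"Total memory: {physical_slots * dimm_size_gb} GB")      # 384 GB
```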

Other features such as FEX introduced the ability to ease management. FEX (Fabric Extenders) are platforms that act almost like remote line cards for the parent Cisco Nexus switches. Hence the Fabric Extenders don’t perform any switching and are managed as an extension of the fabric interconnects. This enables the UCS to scale to many chassis without increasing the number of switches, as switching is removed from the chassis. Furthermore, there is no need for separate chassis management modules as the fabric extenders, alongside the fabric interconnects, manage the chassis’ fans, power supplies etc. This means there’s no requirement to individually manage each FEX as everything is inherited from the upstream switch, therefore allowing you to simply plug in and play a FEX for a rack of pre-cabled servers. Whether configuring policies, upgrading or deploying new features, only a change on the upstream switch is required, because the FEX inherits from the parent switch, leading everything to be automatically propagated across the racks of servers.

The UCS components

With the aforementioned B200 M3 blade, there are also two mezzanine I/O slots, one of which is used by the newly launched 1240 virtual interface card. The VIC 1240 provides 40 Gbps of capacity which can of course be sliced up into virtual interfaces, delivering flexible bandwidth to the UCS blades. Moreover, with a focus on virtualization and vSphere integration, the VIC 1240 implements Cisco’s VM-FEX and supports VMware's VMDirectPath with vMotion technology. The concept of VM-FEX is again centered on the key benefits of consolidation, this time around the management of both virtual and physical switches. With physical 10Gb links now the standard, VM-FEX enables end users to move away from the complexity of managing standard vSwitches, a feature that was designed and introduced when 1Gb links were the norm. It does this by providing VM virtual ports on the actual physical network switch, hence avoiding the hypervisor’s virtual switch. The VM’s I/O is therefore sent directly to the physical switch, making the VM’s identity and positioning information known to the physical switch and eliminating local switching from the hypervisor. Unlike the common situation where trunking of the physical ports was a requirement to enable traffic between VMs on different physical hosts, the key point here is that the network configuration is now specific to that port. That means once you’ve assigned a VLAN to the physical interface, there is no need for trunking and you’ve also ensured network consistency across your ESX hosts. The VM-FEX feature also has two modes, the first being emulated mode, where the VM’s traffic is passed through the hypervisor kernel. The other ‘high-performance’ mode utilizes VMDirectPath I/O and bypasses the hypervisor kernel, going directly to the hardware resource associated with the VM.

High Performance Mode utilises VMware's VMDirectPath I/O feature
Interestingly the VMDirectPath I/O feature is another key vSphere technology that often gets overlooked but one that adds great benefit by allowing VMs to directly access hardware devices. First launched in vSphere 4.0, one of its limitations was that it didn’t allow you to vMotion the VM, which may explain its lack of adoption. Now though with vSphere 5.0 and the UCS, vMotion is supported. Here the VIC sends the VM’s I/O directly to the UCS fabric interconnect, which then offloads the VM’s traffic switching and policy enforcement. By interoperating with VMDirectPath the VIC transfers the I/O state of a VM as well as its network properties (VLAN, port security, rate limiting, QoS) to vCenter as it vMotions across ESX servers. So while you may not get an advantage on throughput, where VMDirectPath I/O’s advantage lies is in its ability to save on CPU workloads by freeing up CPU cycles that were needed for VM switching, making it ideal for very high packet rate workloads that need to sustain their performance. Of course you can also now transition the device from one that is paravirtualized to one that is directly accessed and the other way around. VM-FEX basically merges the virtual access layer with the physical switch, empowering the admin to now provision and monitor from a consolidated point.

As well as blade servers, Cisco are also serving up (excuse the pun) new rack servers which update their C-class range; the 1U C220 M3 and the 2U C240 M3 server. With the announcement that the UCS Manager software running in the Fabric Interconnect will now be able to manage both blade and rack servers as a common entity, there is also news that this will eventually scale out as a single management domain for thousands of servers. Currently under the moniker of “Multi-UCS Manager”, the plan is to expand the current management domain limit of 320 servers to up to 10,000 servers spread across data centers around the world, empowering server admin to centrally deploy templates, policies, and profiles as well as manage and monitor all of their servers. This would of course bring huge dividends in terms of OPEX savings, improved automation and orchestration setting the UCS up as a very hard to ignore option in any new Cloud environment.
A single management pane for up to 10,000 servers

As well as Cloud deployments, the UCS is also being set up to play a key role in the explosion of big data. With the recent announcement that Greenplum and Cisco are finally teaming together to utilize the C-class rack servers, there is already talk of pre-configured Hadoop stacks. With Greenplum’s MR Hadoop distribution integrating with Cisco's C-class rack servers, it’s pretty obvious that the C-class UCS servers will also quickly gain traction in the market much like their B-series counterparts.

Incredibly, it was not long ago that Cisco was just a networking company whose main competitor was Brocade. Fast forward to March 2012 and Brocade’s CEO Mike Klayko is stating "If you can run Cisco products then you can run ours" to justify Brocade's IP credentials. When their once great competitor inadvertently admits they’re entering the IP world as a reaction to Cisco rather than a perceived demand from the market, it really does showcase how far Cisco has come. It also speaks volumes that, conversely, Cisco proactively entered the server world when no perceived demand existed within that market. Three years later, with 11% market share and groundbreaking features built for the Cloud and Big Data, Cisco has moved far beyond its networking competitors and is well placed to be a mainstay powerhouse in the server milieu.

Disaster Recovery Monitoring

The key to a Disaster Recovery investment is being able to test and fail over, i.e. check that it actually works. Hence it is vital that the SAN being used for this replication is optimized and provides an RTO that meets the business’ demands. While there are sufficient tools to monitor the IP or DWDM links for cross site replication, it is still best practice to TAP the replication links of your Disaster Recovery infrastructure to incorporate the monitoring of the FC SAN.

Here's my final video for Virtual Instruments, which quickly explains how you can proactively take the "Disaster" out of Disaster Recovery....


Understanding IOPS


IOPS is commonly recognized as a standard measurement of performance, whether measuring the Storage Array's backend drives or the performance of the SAN. In its most basic terms, IOPS is the number of operations issued per second, whether reads, writes or other, and admins will typically use their Storage Array tools or applications such as Iometer to monitor IOPS.

IOPS will vary based on a number of factors that include a system's balance of read and write operations, whether the traffic is sequential, random or mixed, the storage drivers, the OS background operations or even the I/O block size.

Block size is usually determined by the application with different applications using different block sizes for various circumstances. So for example Oracle will typically use block sizes of 2 KB or 4 KB for online transaction processing and larger block sizes of 8 KB, 16 KB, or 32 KB for decision support system workload environments. Exchange 2007 may use an 8KB Block size, SQL a minimum of 8KB and SAP 64KB or even more.
IOPS and MB/s both need to be considered

Additionally, it is standard practice that when IOPS is considered as a measurement of performance, the throughput, that is to say MB/sec, is also looked at. This is due to the different impact each has on performance. For example, an application with only 100MB/sec of throughput but 20,000 IOPS may not cause bandwidth issues, but with so many small commands the storage array is put under significant exertion as its front end processors have an immense workload to deal with. Alternatively, if an application has a low number of IOPS but significant throughput, such as long sustained reads, then the exertion will occur upon the SAN's links.
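A quick back-of-the-envelope calculation makes the point: the same throughput figure can imply radically different I/O profiles depending on the IOPS behind it. The first figure below is the example quoted above; the second is an invented counter-example for contrast.

```python
# Average I/O size implied by a given throughput and IOPS figure.

def avg_io_size_kb(throughput_mb_per_sec, iops):
    return throughput_mb_per_sec * 1024 / iops

print(avg_io_size_kb(100, 20000))   # ~5 KB: many small commands hammering the array's front end
print(avg_io_size_kb(100, 800))     # ~128 KB: fewer, larger I/Os stressing the SAN links instead
```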

Despite this, MB/s and IOPS are still not a good enough measure of performance if you don't take into consideration the Frames per second. To elaborate, referring back to the FC Frame, a standard FC Frame has a data payload of up to 2112 bytes, i.e. a 2K payload. So in the example below where an application has an 8K I/O block size, this will require 4 FC Frames to carry that data portion. In this instance 1 I/O would equate to 4 Frames, and subsequently 100 IOPS would equate to 400 Frames per second. Hence to get a true picture of utilization, looking at IOPS alone is not sufficient because there exists a magnitude of difference between particular applications and their I/O sizes, some ranging from 2K to even 256K, with applications such as backups having even larger I/O sizes and hence more Frames.
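The frame arithmetic above can be sketched in a few lines; the 2112-byte payload is the standard FC maximum, while the I/O sizes are simply the examples just discussed.

```python
# Translate an I/O size and IOPS figure into the frame load seen on the fabric.

import math

FC_PAYLOAD_BYTES = 2112   # maximum data payload of a standard FC frame

def frames_per_io(io_size_kb):
    return math.ceil(io_size_kb * 1024 / FC_PAYLOAD_BYTES)

def frames_per_sec(iops, io_size_kb):
    return iops * frames_per_io(io_size_kb)

print(frames_per_io(8))            # 4 frames per 8 KB I/O
print(frames_per_sec(100, 8))      # 400 frames/sec for 100 IOPS at 8 KB
print(frames_per_sec(100, 256))    # the same 100 IOPS at 256 KB: far more frames
```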

Frames per second give you a better insight of demand and throughput


Looking at a metric such as the ratio of MB/sec to Frames/sec, as displayed below, we actually get a better picture and understanding of the environment and its performance.
To elaborate, the MB/sec to Frames/sec ratio is different to the IOPS metric. So with reference to this graph of the MB/sec to Frames/sec ratio, the line should never fall below 0.2 on the y-axis, i.e. the 2K data payload.

If the ratio falls below this, say at the 0.1 level, we can identify that data is not being passed efficiently despite the throughput being maintained (MB/sec). 
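One hedged way to express this check programmatically is to work out the average payload carried per frame and compare it against the ~2 KB maximum. The exact scaling of the graph's y-axis is an assumption here; the point is simply that roughly half the expected payload per frame signals frames flowing without much data.

```python
# Compare the average bytes carried per frame against the ~2 KB FC payload.

FC_PAYLOAD_BYTES = 2112

def avg_payload_per_frame(throughput_mb_per_sec, frames_per_sec):
    return throughput_mb_per_sec * 1024 * 1024 / frames_per_sec

def efficient(throughput_mb_per_sec, frames_per_sec):
    """Flag links where frames are flowing but carrying little data."""
    return avg_payload_per_frame(throughput_mb_per_sec, frames_per_sec) >= FC_PAYLOAD_BYTES * 0.9

print(efficient(100, 50_000))    # ~2 KB per frame: data is being passed efficiently
print(efficient(100, 100_000))   # ~1 KB per frame: throughput maintained, payload halved
```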

Given a situation where you have the common problem of slow draining devices, the case that MB/s and IOPS alone are not sufficient is even more compelling, as you can actually be misled in terms of performance monitoring.
To explain, slow draining devices are devices that are requesting more information than they can consume and hence cannot cope with the incoming traffic in a timely manner. This is usually because the devices, such as an HBA, have slower link rates than the rest of the environment, or the server or device is being overloaded in terms of CPU or memory and is thus having difficulty dealing with the data requested. To avoid performance problems it is imperative to proactively identify them before they impact the application layer and consequently propagate to the business’ operations.

Slow Draining devices - requesting more information than they can consume


In such a situation, looking again at the MB/sec to Frames/sec ratio graph below, we can now see that the ratio is at the 0.1 level; in other words we are seeing high throughput but minimal payload. This enables you to proactively identify whether a number of management frames are being passed instead of data, as they busily report on the physical device errors that are occurring.


Management Frames being passed can mislead 


So to conclude, without taking Frames per second into consideration and having insight into this ratio, it is easy to fall into the trap of falsely believing that everything is OK and data is being passed, because you see lots of traffic as represented by MB/s, when in actuality all you are seeing is management frames reporting a problem.

Here's an animated video to further explain the concept: