Monitoring the SAN shine with Virtual Instruments

It was about three months ago that one of my friends had informed me he was leaving HDS to join a company named Virtual Instruments. ‘Virtual Instruments?’ I asked myself, trying to fathom if I’d have heard of them before only to realize that I had once seen a write up on their SAN monitoring solution, which was then termed NetWisdom. I was then inadvertently asked to mention Virtual Instruments in one of my blogs - nice try pal but I had made it clear several times before to vendors requesting the same that I didn’t want my blog to become an advertising platform. Despite this though I was still intrigued by what could have persuaded someone to leave a genuinely stable position at HDS to a company I hadn’t really had much exposure to myself. Fast forward a few months, several whitepapers and numerous discussions and I find myself writing a blog about the very said company.

Simple fact is it’s rare to find a solution or product in the storage and virtualization market that can truly be regarded as unique. More often than not most new developments fall victim to what I term the ‘six month catch up’ syndrome in which a vendor brings out a new feature only for its main competitor to initially bash it and then subsequently release a rebranded and supposedly better version six months later. The original proponents of thin provisioning, automated tiered storage, deduplication, SSD flash drives etc. can all pay testament to this. It is hence why I have taken great interest in a company that currently occupies a niche in the SAN monitoring market and as yet doesn’t seem to have worthy competitor, namely Virtual Instruments.

My own experience of Storage monitoring has always been a pain in the sense that nine times out of ten it was a defensive exercise in proving to the applications, database or server guys that the problem didn’t lie with the storage. Storage most of the time is fairly straightforward, wherein if there are any performance problems with the storage system they’ve usually stemmed from any immediate change that may have occurred. For example provision a write intensive LUN to an already busy RAID group and you only have to count the seconds before your IT director rings your phone on the verge of a heart attack at how significantly his reporting times have increased. But then there was always the other situation when a problem would occur with no apparent changes having been made. Such situations required the old hat method of troubleshooting supposed storage problems by pinpointing whether the problem was between the Storage and the SAN fabric or between the Server and the SAN but therein dwelled the Bermuda Triangle at the centre of it all i.e. the SAN. Try to get a deeper look into the central meeting point of your Storage Infrastructure and to see what real time changes have occurred on your SAN fabric and you’d subsequently enter a labyrinth of guesses and predictions.

Such a situation occurred to me when I was asked to analyze and fix an ever-slowing backup of an Oracle database. Having bought more LTO4 tapes, incorporating a destaging device, spending exorbitant amounts of money on man days for the vendor’s technical consultants, playing around with the switches buffer credits and even considering buying more FC disks, the client still hadn’t resolved the situation. Now enter yours truly into the labyrinth of guesses and predictions. Thankfully I was able to solve the issue by staying up all night and running a Solaris IOSTAT, while simultaneously having the storage system up on another screen. Eventually I was able to pinpoint (albeit with trial and error tactics) the problem to rather large block sizes and particular LUNs that were using the same BEDs and causing havoc on their respected RAID groups. With several more sleepless nights to verify the conclusion, the problem was finally resolved. Looking back surely there was a better, cost effective and more productive way to have solved this issue. There was but I just wasn’t aware of it.

Furthermore ask any Storage guy that’s familiar with SAN management/monitoring software such as HDS’ Tuning Manager, EMC’s ControlCenter, HP’s Storage Essentials and their like and they’ll know full well that despite all the SNIA SMI-S compliancy they still fail to provide metrics beyond the customary RAID group utilization, historic IOPS/sec, cache hit rate, disk response times etc. in other words from the perspective of the end-user there really is little to monitor and hence troubleshoot. Frustratingly such solutions still fail to provide performance metrics from an application to storage system view and thus also fail to allow the end user to verify if they are indeed meeting the SLAs for that application. Put this scenario in the ever growing virtual server environment and you are further blinded by not knowing the relation between the I/Os and the virtual machines from which they originated.

Moreover Storage vendors don’t seem to be in a rush to solve this problem either and the pessimist in me says this is understandable when such a solution would inevitably lead to a non-procurement of unnecessary hardware. With a precise analysis and pinpointing of performance problems/degradation and you have the consequent annulment of the haphazard ‘let’s throw some more storage at it’, ‘let’s buy SSDs’ or ‘let’s upgrade our Storage System’ solutions that are currently music to the ears of storage vendor sales guys. So amidst these partial viewing vendor provided monitoring tools, which lack that essential I/O transaction-level visibility, Virtual Instruments (VI) pushes forth it’s solution, which boldly claims to encompass the most comprehensive monitoring and management of end-to-end SAN traffic. From the intricacies of a virtual machine’s application to the Fibre Channel cable that’s plugged into your USPV, VMax etc. VI say they have an insight. So looking back had I had VI’s ability to instantly access trending data on metrics such as MB/sec, CRC errors, log ins and outs etc. I could have instantly pinpointed and resolved many of the labyrinth quests I had ventured through so many times in the past.

Looking even closer at VI, there are situations beyond the SAN troubleshooting syndrome in which it can benefit an organization. Like most datacenters if you have one of the Empire State Building-esque monolithic storage systems it is more than likely being under utilized with the majority of its residing applications not requiring the cost and performance of such a system. So while most organizations are aware of this and look to saving costs by tiering their infrastructure onto cheaper storage via the alignment of their data values to the underlying storage platform, it’s seldom a seen reality due to the headaches and lack of insight related to such operations. Tiering off an application onto a cheaper storage platform requires the justification from the Storage Manager that there will be no performance impact to the end users but due to the lack of precise monitoring information, many are not prepared to take that risk. In an indirect acknowledgement to this problem, several storage vendors have looked at introducing automated tiering software for their arrays which in essence merely looks at the LUN utilization before migrating them to either higher-performance drives or cheaper SATA drives. In reality this is still a rather crude way of tiering an infrastructure when you consider it ignores SAN fabric congestion or improper HBA queue depths. In such a situation a monitoring tool that tracks I/Os across the SAN infrastructure without being pigeonholed to a specific device is axiomatic in the enablement of performance optimization and the consequent delivery of Tier I SLAs with cheaper storage – cue VI and their VirtualWisdom 2.0 solution.

In the same way that server virtualisation exposed the under utilization of physical server CPU and Memory, the VirtualWisdom solution is doing the same for the SAN. While vendors are more than pleased to further sell more upgraded modules packed with ports for their enterprise directors, it is becoming increasingly apparent that most SAN fabrics are significantly over-provisioned with utilization rates often being less than 10%. While many SAN fabric architects seem to overlook fan in ratios and oversubscription rates in a rush to finish deployments within specified project deadlines, underutilized SAN ports are now an ever-increasing reality that in turn bring with them the additional costs of switch and storage ports, SFPs and cables.

Within the context of server virtualisation itself, which has undoubtedly brought many advantages with it, one irritating side affect has been the rapid expansion of FC traffic to accommodate the increased number of servers going through a single SAN switch port and the complexity now required to monitor it. Then there’s the virtual maze which starts with applications within the Virtual Machines that are in turn running on multi-socket and multi-core servers, which are then connected to a VSAN infrastructure only to finally end up on storage systems which also incorporate virtualization layers whether that be with externally attached storage systems or thinly-provisioned disks. Finding an end-to-end monitoring solution in such a cascade of complexities seems an almost impossibility. Not so it seems for the team at Virtual Instruments.

Advancing upon the original NetWisdom premise, VI’s updated Virtual Wisdom 2.0 has a virtual software probe named ProbeV. The ProbeV collects the necessary information from the SAN switches via SNMP and on a port to port basis metrics on information such as the number of frames and bytes are collated alongside potential faults such as CRC errors synchronization loss, packet discards or link resets /failures. Then via the installation of splitters (which VI name TAPs - Traffic Access Points) between the storage array ports and the rest of the SAN, a percentage of the light from the fibre cable is then copied to a data recorder for playback and analysis. VI’s Fibre-Channel probes (ProbeFCXs) then analyze every frame header, measuring every SCSI I/O transaction from beginning to end. This enables a view of traffic performance whether related to the LUN, HBA, read/write level, or application level, allowing the user to instantly detect application performance slowdowns or transmission errors. The concept seems straightforward enough but it’s a concept no one else has yet been able to put in practice, despite growing competition from products such as Akorri's BalancePoint, Aptare's StorageConsole or Emulex's OneCommand Vision.

Added to this VI’s capabilities can also provide a clear advantage in preparing for a potential virtualization deployment or dare I fall for the marketing terminology – a move to the private cloud. Lack of insight of performance metrics has evidently led to the stagnation of the majority of organizations virtualising their tier 1 applications. Server virtualization has reaped many benefits for many organizations, but ask those same organizations how many of them have migrated their IO intensive tier 1 applications from their SPAARC based physical platforms to an Intel based virtual one and you’re probably looking at a paltry figure. The simple reason is risk and fear of performance degradation, despite logic showing that a virtual platform with resources set up as a pool could potentially bring numerous advantages. Put this now in the context of a world where cloud computing is the new buzz as more and more organizations look to outsource many of their services and applications and you then have even fewer numbers willing to launch their mission critical applications from the supposed safety and assured performance of the in-house datacenter to the unknown territory of the clouds. It is here where VirtualWisdom 2.0 has the potential to be absolutely huge in the market and at the forefront of the inevitable shift of tier 1 applications to the cloud. While I admittedly I find it hard to currently envision a future where a bank launches it’s OLTP into the cloud based on security issues alone, I’d be blinkered to not realize that there is a future where some mission-critical applications will indeed take that route. With VirtualWisdom’s ability to pinpoint virtualized application performance bottlenecks in the SAN, it’s a given that the consequences will lead to an instantly significant higher virtual infrastructure utilization and subsequent ROI.

The VI strategy is simple in that by recognizing I/O as the largest cause of application latency, VirtualWisdom’s inclusion of baseline comparisons of I/O performance, bandwidth utilization and average I/O completions comfortably provide the necessary insight fundamental to any major virtualization or cloud considerations an organization may be planning for. With its ProbeVM, a virtual software probe that collects status from VMware servers via vCenter, the data flow from virtual machine through to the storage system can be comprehensively analyzed with historical and real-time performance dashboards leading to an enhanced as well as accurate understanding of resource utilization and performance requirements. With a predictive analysis feature based on real production data the tool also provides the user the ability to accurately understand the effects of any potential SAN configuration or deployment changes. With every transaction from Virtual Machine to LUN being monitored, latency sources can quickly be identified whether it’s from the SAN or the application itself, enabling a virtual environment to be easily diagnosed and remedied should any performance issues occur. With such metrics at their disposal and the resultant confidence given to the administrator, the worry of meeting SLAs could quickly become a thing of the past while also rapidly hastening the shift towards tier 1 applications being on virtualized platforms. So despite growing attention being given to other VM monitoring tools such as Xangati or Hyperic, they’re solutions still lack the comprehensive nature of VI.

The advantages to blue-chip, big corporate customers are obvious and as their SAN and virtual environments continue to grow, an investment into a VirtualWisdom solution should soon become compulsory for any end of year budget approval. In saying that though, the future of VI also quite clearly lies beyond the big corporates with benefits which include the enablement of an organization to have real- time proactive monitoring and alerting, consolidation, preemptive analysis of any changes within the SAN or Virtual environment and comprehensive trend analysis of application, host HBA, switches, virtualization appliances, storage ports and LUN performance. Any company therefore looking to either consolidate their costly over-provisioned SAN, accelerate troubleshooting, improve their VMware server utilization & capacity planning, implement a tiering infrastructure or migrate to a cloud would find the CAPEX improvements that come with VirtualWisdom a figure too hard to ignore. So while Storage vendors don’t seem to be in any rush to fill this gap, they too have an opportunity to undercut their competitors by working alongside VI by promoting its benefits as a complement to their latest hardware, something which EMC, HDS, IBM and most recently Dell have cottoned on to having signed an agreement to sell the VI range as part their portfolio. Despite certain pretenders claiming to take its throne, FC is certainly here to stay for the foreseeable future. If the market/customer base is allowed to fully understand and recognize its need, then there’s no preventing a future when just about every SAN fabric comes part and parcel with a VI solution ensuring its optimal use. Whether VI eventually get bought out by one of the large whales or continue to swim the shores independently, there is no denying that companies will need to seriously consider the VI option if they’re to avoid drowning in the apprehensive nature of virtual infrastructure growth or the ever increasing costs of under-utilized SAN fabrics.