Posted by: Dan Israel
Today’s post is one you don’t often find in the blogosphere: a collaborative effort initiated by me, Chad Sakac (EMC), with contributions from Andy Banta (VMware), Vaughn Stewart (NetApp), Eric Schott (Dell/EqualLogic), Adam Carter (HP/Lefthand), David Black (EMC), and various other folks at each of the companies.
Together, our companies make up the large majority of the iSCSI market, all make great iSCSI targets, and we (as individuals and companies) all want our customers to have iSCSI success.
I have to say, I see this one often – customers struggling to get high throughput out of iSCSI targets on ESX. Sometimes they are OK with that, but often I hear this comment: "…My internal SAS controller can drive 4-5x the throughput of an iSCSI LUN…"
Can you get high throughput with iSCSI with GbE on ESX? The answer is YES. But there are some complications, and some configuration steps that are not immediately apparent. You need to understand some iSCSI fundamentals, some Link Aggregation fundamentals, and some ESX internals – none of which are immediately obvious…
If you’re interested (and who wouldn’t be interested with a great topic and a bizarro-world “multi-vendor collaboration”... I can feel the space-time continuum collapsing around me :-), read on...
We could start this conversation by playing a trump card – 10GbE – but we’ll save that topic for another discussion. Today 10GbE is relatively expensive per port and relatively rare, and the vast majority of iSCSI and NFS deployments are on GbE. 10GbE is supported by VMware today (see the VMware HCL here), and all of the vendors here either have, or have announced, 10GbE support.
10GbE can support the ideal number of cables from an ESX host – two. This reduction in port count can simplify configurations, reduce the need for link aggregation, provide ample bandwidth, and even unify FC using FCoE on the same fabric for customers with existing FC investments. We all expect to see rapid adoption of 10GbE as prices continue to drop. Chad has blogged on 10GbE and VMware here.
This post is about trying to help people maximize iSCSI on GbE, so we’ll leave 10GbE for a followup.
If you are serious about iSCSI in your production environment, it’s valuable to do a bit of learning, and it’s important to do a little engineering during design. iSCSI is easy to connect and begin using, but like many technologies that excel in their simplicity, the default options and parameters may not be robust enough to provide an iSCSI infrastructure that can support your business.
With that in mind, this post is going to start with sections called “Understanding” which walk through protocol details and ESX Software Initiator internals. You can skip them if you want to jump straight to the configuration options, but a bit of learning goes a long way toward understanding the WHY of the HOWs (which I personally think makes them easier to remember).
Understanding your Ethernet Infrastructure
Do you have a “bet the business” Ethernet infrastructure? Don’t think of iSCSI (or NFS datastore) use here as “it’s just on my LAN”, but as “this is the storage infrastructure that is supporting my entire critical VMware environment”. IP storage needs the same sort of design thinking that is applied to FC infrastructure. Here are some things to think about:
Are you separating your storage and network traffic on different ports? Could you use VLANs for this? Sure. But is that “bet the business” thinking? Do you want a temporarily busy LAN to swamp your storage (and vice-versa) for the sake of a few NICs and switch ports? If you’re using 10GbE, sure – but GbE?
Think about Flow-Control (should be set to receive on switches and transmit on iSCSI targets)
Enable spanning tree protocol with either RSTP or portfast enabled
Filter / restrict bridge protocol data units on storage network ports
If you want to squeeze out the last bit of performance, configure jumbo frames – always end-to-end, otherwise you will get fragmented gobbledygook (there’s a quick verification sketch a few paragraphs below).
Use Cat6 cables rather than Cat5/5e. Yes, Cat5e can work – but remember – this is “bet the business”, right? Are you sure you don’t want to buy that $10 cable?
You’ll see later that things like cross-stack Etherchannel trunking can be handy in some configurations.
Ethernet switches also vary in their internal architecture – for mission-critical, network-intensive purposes (like VMware datastores on iSCSI or NFS), the amount of port buffering and other internals matter – so it’s a good idea to know what you are using.
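A quick way to sanity-check the “end-to-end” part of the jumbo frame advice above is a don’t-fragment ping sized so that it only fits inside a jumbo frame. Here is a minimal sketch, assuming a Linux-style ping run from a host on the storage VLAN (the target address is a placeholder); from the ESX host itself, the equivalent check is vmkping -d -s 8972 <target>:

```python
# Illustrative sketch (not from the original post): verify end-to-end jumbo
# frames by sending a don't-fragment ICMP echo that only fits in a jumbo frame.
# A 9000-byte MTU leaves 8972 bytes of ICMP payload (9000 - 20 IP - 8 ICMP).
import subprocess
import sys

def jumbo_path_ok(target_ip, payload_bytes=8972):
    """Return True if a don't-fragment ping of payload_bytes gets through."""
    rc = subprocess.call([
        "ping",
        "-M", "do",                # set the Don't Fragment bit
        "-s", str(payload_bytes),  # ICMP payload size
        "-c", "3",                 # three probes is plenty
        target_ip,
    ])
    return rc == 0

if __name__ == "__main__":
    target = sys.argv[1]  # e.g. the IP of your iSCSI target's network portal
    print("jumbo frames end-to-end: %s" % ("OK" if jumbo_path_ok(target) else "BROKEN"))
```

If this fails while a normal ping succeeds, something in the path (a vSwitch, a physical switch, or the array port) is not configured for the larger MTU.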
If performance is important, have you thought about how many workloads (guests) you are running? Both individually and in aggregate, are they typically random or streaming? Random I/O workloads put very little throughput stress on the SAN network; conversely, sequential, large-block I/O workloads place a much heavier load on it.
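To put rough numbers on that last point, here is a quick bit of arithmetic – the IOPS and block sizes below are made-up but typical illustrative figures, not measurements from any of our arrays:

```python
# Illustrative arithmetic: why random small-block I/O rarely stresses a GbE
# link, while a sequential large-block stream can saturate one.
GBE_USABLE_MB_S = 110.0  # rough usable payload of a single GbE link, MB/s

def throughput_mb_s(iops, block_kb):
    """Convert an IOPS figure at a given block size into MB/s."""
    return iops * block_kb / 1024.0

random_mb_s = throughput_mb_s(2000, 8)       # a busy random workload: ~15.6 MB/s
sequential_mb_s = throughput_mb_s(500, 256)  # one sequential stream: ~125 MB/s

for name, value in (("random", random_mb_s), ("sequential", sequential_mb_s)):
    print("%-10s %6.1f MB/s (%.0f%% of one GbE link)"
          % (name, value, 100.0 * value / GBE_USABLE_MB_S))
```

Two thousand random 8 KB IOPS is only about 16 MB/s – a fraction of a single GbE link – while one large-block sequential stream can fill the pipe on its own.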
In the same vein, be careful running single stream I/O tests if your environment is multi-stream / multi-server. These types of tests are so abstract they provide zero data relative to the shared infrastructure that you are building.
In general, don’t view “a single big LUN” as a good test – all arrays have internal threads handling I/Os, and so does the ESX host itself (for VMFS and for NFS datastores). In general, in aggregate, more threads are better than fewer. You increase threading on the host by driving more operations against that single LUN (or file system), and while every vendor’s internals are slightly different, in general more internal array objects are better than fewer – because there are more threads behind them.
Not an “Ethernet” thing, but while we’re on the subject of performance and not skimping: there’s no magic in the brown spinny things – you need enough array spindles to support the I/O workload. Often the problem is not enough drives in total, or an under-configured sub-group of drives. Every vendor does this differently (aggregates / RAID groups / pools), but all have some sort of “disk grouping” out of which LUNs (and, in some cases, file systems) get their collective IOPs.
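To make the “enough spindles” point concrete, here is a hedged back-of-envelope sketch. The per-disk IOPS figure and the RAID write penalties below are generic rules of thumb (and the function is our illustration, not any vendor’s sizing tool) – use your vendor’s real sizing guidance for actual designs:

```python
# Back-of-envelope spindle sizing (illustrative assumptions only).
def spindles_needed(total_iops, read_fraction, disk_iops, write_penalty):
    """Estimate the spindle count behind a host-side IOPS target.

    total_iops     - host-visible IOPS the workload generates
    read_fraction  - fraction of I/Os that are reads (0.0 - 1.0)
    disk_iops      - rough IOPS per spindle (e.g. ~180 for a 15K drive)
    write_penalty  - back-end I/Os per host write (RAID 10 ~2, RAID 5 ~4)
    """
    reads = total_iops * read_fraction
    writes = total_iops * (1.0 - read_fraction)
    backend_iops = reads + writes * write_penalty
    return int(-(-backend_iops // disk_iops))  # round up to a whole spindle

# Example: 5,000 IOPS, 70% read, 15K drives, RAID 5 -> roughly 53 spindles
print(spindles_needed(5000, 0.70, 180, 4))
```

The exact numbers vary by array and drive type; the point is simply that the read/write mix and the protection scheme change the answer a lot, so size the disk grouping deliberately.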
Understanding: iSCSI Fundamentals
We need to begin with some prerequisite nomenclature to establish a starting point. If you really want the “secret decoder ring”, start here: http://tools.ietf.org/html/rfc3720
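Before the diagram, it may help to see how the basic RFC 3720 terms hang together. Here is a minimal sketch (our own illustration with made-up names, not text from the RFC): initiators and targets are named by IQNs, a network portal is an IP address plus a TCP port, and a session between an initiator and a target can carry one or more connections:

```python
# Minimal, illustrative model of the core RFC 3720 nomenclature.
from dataclasses import dataclass, field
from typing import List

@dataclass
class NetworkPortal:
    ip_address: str
    tcp_port: int = 3260          # the default iSCSI port

@dataclass
class Connection:
    """One TCP connection between an initiator portal and a target portal."""
    initiator_portal: NetworkPortal
    target_portal: NetworkPortal

@dataclass
class Session:
    """The initiator-target nexus; it can carry one or more connections (MC/S)."""
    initiator_name: str           # an IQN, e.g. "iqn.1998-01.com.vmware:esx1"
    target_name: str              # an IQN, e.g. "iqn.2000-01.com.example:array1"
    connections: List[Connection] = field(default_factory=list)
```

Keep those terms in mind as you look at the picture below.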
This diagram is chicken scratch, but it gets the point across. The red numbers are explained below.