iSCSI storage arrays
It is well documented by now that the ESXi 5.0 GA code has an issue that causes excessively long boot times on hosts connected to iSCSI storage arrays (see VMware KB 2007108).
What isn’t well documented is another issue with the ESXi 5.0 GA code and software iSCSI, one with much more serious symptoms than long boot times: with a particular configuration, the host can lose access to its storage.
If you are using ESXi 5.0 with software iSCSI that is connected to an array with iSCSI targets on separate network fabrics (specifically separate subnets), then you must use 2 different vSwitches for the vmkernel port groups to avoid losing access to the storage.
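As a rough illustration of the condition above, here is a minimal Python sketch that flags any vSwitch whose vmkernel ports sit in more than one IP subnet. The function name, the dictionary layout, and the IP addresses are all hypothetical and made up for illustration; this does not query a real host, it only checks a list you type in by hand.

```python
import ipaddress
from collections import defaultdict

def vswitches_spanning_subnets(vmk_ports):
    """Return {vswitch: [subnets]} for every vSwitch whose vmkernel
    ports sit in more than one IP subnet (the risky layout on 5.0 GA)."""
    subnets_by_vswitch = defaultdict(set)
    for port in vmk_ports:
        # Derive the subnet from the port's address and prefix length
        network = ipaddress.ip_interface(f"{port['ip']}/{port['prefix']}").network
        subnets_by_vswitch[port["vswitch"]].add(network)
    return {
        vswitch: sorted(str(n) for n in nets)
        for vswitch, nets in subnets_by_vswitch.items()
        if len(nets) > 1
    }

# Hypothetical addressing: two iSCSI vmkernel ports on one vSwitch,
# each on its own subnet (addresses are invented for this example)
ports = [
    {"name": "iSCSI1", "vswitch": "vSwitch1", "ip": "10.0.1.11", "prefix": 24},
    {"name": "iSCSI2", "vswitch": "vSwitch1", "ip": "10.0.2.11", "prefix": 24},
]
print(vswitches_spanning_subnets(ports))
```

An empty result means no vSwitch carries vmkernel ports from more than one subnet, which is the layout this post recommends for unpatched hosts.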
An example configuration with screenshots is below. It shows a single ESXi 5.0 GA host with 2 physical NICs dedicated to iSCSI storage traffic, vmnic3 and vmnic7. These NICs are connected to separate physical switches for redundancy and multipathing. Each array storage processor is also connected to each iSCSI switch, which creates a fully redundant fabric. vmnic3 and vmnic7 are each bound to their own dedicated vmkernel port group, iSCSI1 and iSCSI2. Both vmkernel port groups belong to the same vSwitch (vSwitch1), and we ensure that iSCSI1 uses only vmnic3 and iSCSI2 uses only vmnic7 by overriding the vSwitch failover settings in the properties of each vmkernel port group. Diagrams below:
To the left is a quick Visio diagram I drew to help illustrate how the system is physically configured. Note that the drawing shows only the relevant storage network infrastructure and iSCSI NICs; this host has other NICs that are used for other things. The screenshot above shows the vSwitch and vmkernel configuration.
Below are screenshots of the vmkernel port group properties, which show that we have pinned each port group to the correct corresponding physical NIC.
Both vmkernel port groups are bound to the same iSCSI software initiator, in this case vmhba38, and all 4 array front-end ports are configured as iSCSI targets.
Typically this is a sound configuration, one that I have set up in many 4.x environments, but this was the first time I had used it in a vSphere 5 environment. At first everything was fine: I was able to connect to the 2 volumes I had configured on the storage, format them as VMFS, and start creating virtual machines.
After a few hours the host lost access to the storage. The only ways I found to get it back were to restart the host or to go into the vmkernel port group properties and remove/reset the “override switch failover order” settings.
The only place on the internet where I found anything that comes close to mentioning this is the vSphere blog, here, which discusses issues with using different subnets for iSCSI but does not specifically say there could be disconnect issues after startup.
As it turns out, the GA build of ESXi 5 (build 469512) has an issue that causes the disconnect when this configuration is used (confirmed by VMware support).
The FIX is to apply the same patch that fixes the long iSCSI boot times, Express Patch 1, found here. The build number for the express patch is build 515841.
The workaround (if you have not deployed the patch) is to break the vmkernel port groups out into 2 separate vSwitches.
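To decide between the fix and the workaround across several hosts, a build number comparison can be sketched in Python. This is only a sketch: the two constants are the build numbers quoted in this post, the function name is hypothetical, and it assumes that any ESXi 5.0 build numbered at or above the patched build also carries the fix.

```python
GA_BUILD = 469512       # ESXi 5.0 GA build named in this post
PATCHED_BUILD = 515841  # build number this post gives for the patch

def needs_iscsi_patch(build):
    """True if an ESXi 5.0 build number is older than the patched build.

    Assumption: later 5.0 builds include the fix as well.
    """
    return build < PATCHED_BUILD

for build in (GA_BUILD, PATCHED_BUILD):
    status = "apply patch or split vSwitches" if needs_iscsi_patch(build) else "patched"
    print(build, status)
```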