VMware iSCSI Issues

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
Hello,

I'm facing some iSCSI issues that I just can't seem to figure out and really need to better understand. I have a FreeNAS server hosting iSCSI storage for my VMware lab. I shut down all of the VMs running on that storage to apply an update (11.2-U4 to U5), which seemed to run without issue. After the FreeNAS server came back online, I noticed right away that I couldn't get the datastores to come back online.

I rebooted one of my hosts, and it now sees the two LUNs again, but they both show as not consumed and I can't get the host to recognize the existing datastores. Am I missing some configuration that could have broken the datastores or corrupted the LUNs when I rebooted the NAS?

Rebooted host showing available unused LUNs:
Empty Target.PNG


Non-rebooted hosts still showing LUNs offline:
Target Missing.PNG
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'm reminded of this other thread, where a user is reporting issues with iSCSI and missing extents after an upgrade to 11.2-U5:

https://www.ixsystems.com/community...rading-to-freenas-11-2-u5-from-11-1-u7.78522/

Can you try rolling back/booting your previous 11.2-U4 environment and see if the problem resolves?
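If you haven't done this before, a rough sketch, assuming the default boot environment setup: the older environment can be selected from the boot loader menu, from the GUI under System > Boot, or from the shell with beadm. The environment name below is just an example; use whatever beadm list shows for U4.

beadm list
beadm activate 11.2-U4    # example name - substitute the actual U4 environment from the list
reboot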

If it doesn't, and you rebooted your FreeNAS unit without unmounting the datastores, the VMware hosts may have flagged it for PDL (Permanent Device Loss) or APD (All Paths Down) and put it into a bad state.

Check the following VMware KB:
https://kb.vmware.com/s/article/2004684
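Before working through the KB, a quick check from an ESXi host shell will show how the host currently sees the device (sketch - the naa identifier below is a placeholder for your actual LUN):

esxcli storage core device list -d naa.xxxxxxxxxxxxxxxx
# The Status line (e.g. 'on' vs 'dead') indicates whether the host still considers the device usable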

As another note, have you forced sync=always on the ZVOLs? If not, the unexpected loss of the devices might have resulted in uncommitted/lost metadata from VMware's perspective. If all VMs were shut down they shouldn't be damaged, but you may need to unregister them, unmount and reimport the datastores, and then register the .vmx files again.
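If you do end up forcing it, it's a one-liner per zvol from the FreeNAS shell (sketch - tank/vmware-zvol is a placeholder for your actual pool/zvol path):

zfs set sync=always tank/vmware-zvol
zfs get sync tank/vmware-zvol    # should now report 'always' with SOURCE 'local'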
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
@HoneyBadger My build does not have the sync=always flag set, but that's because I'm using a dedicated log drive instead. Moving forward, I was inspecting the vmkernel log files, and the last entry says:
2019-08-28T04:07:22.158Z cpu0:2097490)StorageApdHandlerEv: 117: Device or filesystem with identifier [naa.6589cfc00000082cb1794194b7f74361] has exited the All Paths Down state.

So on that note, it doesn't appear that it put the drive into a PDL state.
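For anyone following along, the APD/PDL events for that device can be pulled out of the log with a plain grep (sketch, run from the ESXi shell):

grep -iE 'apd|pdl' /var/log/vmkernel.log | grep naa.6589cfc00000082cb1794194b7f74361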
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
I was trying to perform some file system repair operations as noted in this VMware article, and received "Error: Connection timed out during read on /dev/disks/naa.6589cfc0000001a8daa5f8580f01aebc" and another "Error: Connection timed out during write..." on the same volume while trying to update the partition header.
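For reference, the kind of repair steps involved were along these lines (sketch only - the article linked above has the exact procedure, and the ':1' partition suffix is an assumption):

partedUtil getptbl /vmfs/devices/disks/naa.6589cfc0000001a8daa5f8580f01aebc
voma -m vmfs -f check -d /vmfs/devices/disks/naa.6589cfc0000001a8daa5f8580f01aebc:1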

You mentioned I should roll back to U4, but I am unfamiliar with the process. Is there a published guide to reverting updates somewhere that I can reference?
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
I'm also noticing that the console is being flooded with a generic message about ping availability; however, I am able to ping the addresses in question without issue from the FreeNAS shell.

Error.PNG
result.PNG
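Default-size pings go through fine (as in the screenshot above), so one way to rule out an MTU mismatch on the storage path would be a don't-fragment ping at full jumbo payload from the FreeNAS shell (sketch - 8972 assumes a 9000-byte MTU, and the target address is just an example):

ping -D -s 8972 192.168.10.5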
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
Finally got it fixed. The timeout error was a typo in the MTU size on a single interface in one of the paths; once that was corrected, I was able to repair the file systems. I still need to figure out what actually happened and how to prevent this in the future/before rolling back to U5, but at least I'm back up!
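For anyone who hits the same thing, the end-to-end check is a don't-fragment ping at full jumbo payload from the ESXi side (sketch - vmk1 and the target address are examples; 8972 leaves room for the IP/ICMP headers on a 9000-byte MTU):

vmkping -d -s 8972 -I vmk1 192.168.10.5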
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Glad to hear it got sorted (although I'd have expected the MTU size issue to have cropped up before) and that things are working again, but just to sidebar for a second:

@HoneyBadger My build does not have the sync=always flag set, but that's because I'm using a dedicated log drive instead.

The sync=always property is still required in order to make ZFS actually use that separate log (SLOG) device - having it attached to the pool makes it available as an option, but doesn't enforce its usage. At this point you aren't actually using sync writes, which can be verified via zilstat, which will show all zeros.
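Both are easy to confirm from the FreeNAS shell (sketch - 'tank' is a placeholder for your pool name):

zilstat    # samples the ZIL once per second; all zeros means no sync writes are being issued
zfs get sync tank    # 'standard' means only client-requested sync writes are honored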

Before you make that switch, what is the exact vendor, model, and size of the log device? SLOGs have certain requirements in order to provide adequate performance, and I'd rather you ensure you have the right hardware before forcing sync writes and causing your performance to absolutely crater.
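Both details can be pulled from the shell if that's easier (sketch - 'tank' is a placeholder pool name):

zpool status tank     # the 'logs' section shows which device is attached as the SLOG
nvmecontrol devlist   # lists NVMe controllers and model strings, if the SLOG is NVMe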

The alternative, of course, is that you don't run sync writes, set up periodic ZFS snapshots, and accept the potential of having to roll back and lose some data.
 

HANDLEric

Dabbler
Joined
May 6, 2019
Messages
47
@HoneyBadger I went to check, and it looks like the pool itself has sync=always set in the UI, which is pushed down to the ZVOLs that have inheritance enabled. The SLOG device I'm using is an Intel DC P3700.
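(For reference, the inheritance can be double-checked from the shell - sketch, with 'tank' standing in for the pool name:)

zfs get -r -t volume sync tank    # each zvol should show 'always' with SOURCE 'inherited from tank'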

@jgreco Thanks for the info. The "no ping reply" message turned out to be a single port with a bad MTU setting, which was catching some of the traffic and causing that issue. After correcting that port, it was resolved.
 