ISCSI and VMware performance issues. Any ideas?

squidgy
Here to help

So we replaced our Catalyst switch with an MS210. We replicated the previous switch settings (IP addresses, VLAN IDs and so on) to reduce any likelihood of issues, except for the MTU size: it was MTU 9000, and it's now the Meraki default of MTU 9578.

 

Anyway, we're getting the below warnings logged in vmkwarning.log on our ESXi hosts:

2021-09-20T13:18:23.456Z cpu11:2101770)WARNING: E1000: 4325: End of the next rx packet
2021-09-20T17:38:19.540Z cpu27:2151116)WARNING: SVM: 5761: scsi0:0 VMX took 1094 msecs to send copy bitmap for offset 12884901888. This is greater than expected latency. If this is a vvol disk, check with array latency.
2021-09-20T18:08:29.233Z cpu8:2098079)WARNING: ScsiDeviceIO: 1564: Device eui.85c313515621eb466c9ce900f866ee86 performance has deteriorated. I/O latency increased from average value of 1222 microseconds to 24462 microseconds.
2021-09-20T18:08:32.069Z cpu9:2098079)WARNING: ScsiDeviceIO: 1564: Device eui.85c313515621eb466c9ce900f866ee86 performance has deteriorated. I/O latency increased from average value of 1222 microseconds to 72442 microseconds.
2021-09-20T18:08:35.337Z cpu3:2098079)WARNING: ScsiDeviceIO: 1564: Device eui.85c313515621eb466c9ce900f866ee86 performance has deteriorated. I/O latency increased from average value of 1275 microseconds to 150626 microseconds.
2021-09-20T18:08:44.014Z cpu2:2098079)WARNING: ScsiDeviceIO: 1564: Device eui.85c313515621eb466c9ce900f866ee86 performance has deteriorated. I/O latency increased from average value of 1324 microseconds to 302286 microseconds.
2021-09-20T18:12:40.401Z cpu15:2098077)WARNING: ScsiDeviceIO: 1564: Device eui.85c313515621eb466c9ce900f866ee86 performance has deteriorated. I/O latency increased from average value of 2811 microseconds to 125025 microseconds.
2021-09-20T20:19:55.415Z cpu27:2154604)WARNING: SVM: 5761: scsi0:0 VMX took 1552 msecs to send copy bitmap for offset 2147483648. This is greater than expected latency. If this is a vvol disk, check with array latency.
2021-09-20T20:19:59.314Z cpu27:2154604)WARNING: SVM: 5761: scsi0:0 VMX took 1440 msecs to send copy bitmap for offset 17179869184. This is greater than expected latency. If this is a vvol disk, check with array latency.

 

Does anyone have any suggestions on how we can address this? I've logged it with VMware support, but I can't help thinking it's a problem with the switch; it was flawless prior to the change.

 

Ultimately there are three issues here:

 

1. VMX copy bitmap offset warnings (SVM: 5761)

2. Deteriorated I/O latency

3. E1000: 4325: End of the next rx packet

 

It's impacting PostgreSQL performance, and our ERP is running slower as a result. Please ask any questions if you need more info.
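If it helps, I can also pull live per-device latency from the hosts. A rough sketch of what I'd capture (this assumes shell access to the hosts; the sample count, interval and output path are just examples):

# esxtop batch mode: 12 samples at 5-second intervals; then review DAVG/cmd (array/fabric time) vs KAVG/cmd (host-side time) per device
esxtop -b -d 5 -n 12 > /tmp/esxtop-samples.csv

# or interactively: run esxtop, press u for the disk device view, and watch DAVG/cmd and KAVG/cmd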

 

thanks

8 Replies
PhilipDAth
Kind of a big deal

If you watch the port throughput in the Meraki Dashboard, is it flat-lining at all, as in maxing the port out?
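You can also cross-check from the host side. Something like this will show errors and drops on the uplink itself (vmnic5 is just an example name here, use whichever uplinks carry the iSCSI VLAN):

# per-NIC counters: look for receive/transmit errors and dropped packets climbing over time
esxcli network nic stats get -n vmnic5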

PhilipDAth
Kind of a big deal

I think this is a very low probability, but you could also have a play with QoS and mark the iSCSI traffic as a higher priority.

 

https://documentation.meraki.com/MS/Other_Topics/MS_Switch_Quality_of_Service_Defined 

PhilipDAth
Kind of a big deal

I assume you have checked all the speed/duplexes and made sure they have all detected 1000/full?

 

I'd make sure everything is set to auto/auto. You especially don't want to mix locked speed/duplex with auto/auto.
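On the ESXi side, this will show what each uplink actually negotiated (nothing assumed beyond shell access to the host):

# lists Admin/Link status, negotiated Speed and Duplex, and MTU for every vmnic
esxcli network nic list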

Brash
Kind of a big deal

That's some decent latency.

Nothing right off the bat but to confirm a few things:
 - Is the iSCSI data running over L2 or L3?

 - Regarding the MTU, did you make MTU changes on the host/storage side, or are you just noting that the Meraki MTU is higher than what the previous switch had configured? (A do-not-fragment vmkping, sketched at the end of this post, will show the effective path MTU either way.)

 

As @PhilipDAth mentioned, definitely check for any layer 1 issues as well (speed/duplex, drops, CRCs, etc.).
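For the MTU question, a do-not-fragment vmkping from the ESXi shell will confirm whether jumbo frames survive the new switch end to end (vmk1 and 10.0.0.10 are placeholders for your iSCSI vmkernel port and a SAN target IP, substitute your own):

# 8972-byte payload = 9000 MTU minus 28 bytes of IP/ICMP headers; -d sets don't-fragment
vmkping -I vmk1 -d -s 8972 10.0.0.10

If that fails while a normal-sized vmkping works, something in the path is still dropping or fragmenting jumbo frames.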

squidgy
Here to help

Morning Everyone. 

 

We tried a couple of things last night to resolve this. I'll explain those, then answer your questions. Thanks for taking an interest and for your assistance!

 

So, we changed the MTU size on the Meraki to 9000; this was the lesser of two evils. We also realised there is a very specific way the SAN should be connected to the stacked switches, and 2 of the 8 cables needed a swap.

 

It didn't sort it, though. Continuing to review everything, we saw the gateway IP for the vNIC adapters was incorrect (it's a valid gateway IP on our network, but it's on a different VLAN). Do you think that would make much difference?
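In case it's useful, the addressing and routing each vmkernel port is actually using can be dumped from the shell like this (interface names will differ per host):

# IPv4 address and netmask per vmkernel interface
esxcli network ip interface ipv4 get

# routing table: shows which gateway, if any, each network is using
esxcli network ip route ipv4 list

With the iSCSI traffic staying on L2, the gateway shouldn't be in the data path, but it's something we want correct regardless.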

 

Anyway - questions. 

 

@Brash - iSCSI data is running over L2.

Yes, I was just noting it, but since last night the MTU has been changed across the network to 9000.

 

@PhilipDAth No, not in the slightest. The peaks are overnight when the backups are running, and even then we're not seeing more than 450 Mb/s; the office-hours peak is about 90 Mb/s.

 

We'll take a look at QoS, thanks for the suggestion. I concur it's worth checking, but I doubt that's the cause. You never know sometimes!

 

Duplexes are all auto/auto and have negotiated 1000/full.

 

Interestingly, when checking the port, it shows you the device connected to it, which is one of the NICs on the SAN, but it doesn't resolve the IP address right away. It shows a 169.254 address first, then resolves the IP.

 

Latency warnings after making the changes:

 

2021-09-21T23:09:51.551Z cpu3:2218975)WARNING: SVM: 5761: scsi0:1 VMX took 1145 msecs to send copy bitmap for offset 154618822656. This is greater than expected latency. If this is a vvol disk, check with array latency.
2021-09-21T23:13:51.162Z cpu14:2218975)WARNING: SVM: 5761: scsi0:1 VMX took 1780 msecs to send copy bitmap for offset 204010946560. This is greater than expected latency. If this is a vvol disk, check with array latency.
2021-09-21T23:16:20.894Z cpu1:2218975)WARNING: SVM: 5761: scsi0:1 VMX took 1449 msecs to send copy bitmap for offset 242665652224. This is greater than expected latency. If this is a vvol disk, check with array latency.
2021-09-22T00:05:29.144Z cpu16:2097955)WARNING: vmnic5:TxI:461/461
2021-09-22T00:45:17.018Z cpu2:2097206)WARNING: ScsiDeviceIO: 1564: Device eui.560509d099ed7da86c9ce900f866ee86 performance has deteriorated. I/O latency increased from average value of 9011 microseconds to 180891 microseconds.
2021-09-22T02:04:49.023Z cpu16:2098107)WARNING: ScsiDeviceIO: 1564: Device eui.85c313515621eb466c9ce900f866ee86 performance has deteriorated. I/O latency increased from average value of 2423 microseconds to 59700 microseconds.
2021-09-22T05:52:54.503Z cpu17:2097959)WARNING: vmnic7:TxI:219/219
2021-09-22T06:04:24.672Z cpu17:2097959)WARNING: vmnic7:TxI:445/440
2021-09-22T06:14:44.691Z cpu19:2097955)WARNING: vmnic5:TxI:33/33

 

 

squidgy
Here to help

Bit of an update.

 

It looks like the MTU was part of the problem. ESXi has a max supported MTU of 9000, and the E1000 warning has gone, which was expected. Performance is starting to get back to normal on the Postgres DB. I still want to resolve the latency warnings, obviously.
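For completeness, this is the sort of check that confirms 9000 is actually applied on the host side as well (assuming standard vSwitches; names will vary per setup):

# MTU per standard vSwitch
esxcli network vswitch standard list

# MTU per vmkernel interface; the iSCSI vmk ports need to show 9000 too
esxcli network ip interface list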

Brash
Kind of a big deal

Good to hear you were able to make some progress.

 

Dropping the MTU on the Meraki switch shouldn't have made a difference. It just needs to be the same as or higher than the MTU at the source and destination endpoints.

When working with MTU, also make sure to check whether the value to be entered includes the Ethernet header or not. From memory, ESXi takes the payload size (9000), but some products will expect payload plus header (9216).

 

One other test you can do is path isolation. Do the EUIs in the latency alerts indicate a specific destination or path? I don't remember off the top of my head whether they're path-specific or device-specific identifiers.

If you have the ability to do so, you can isolate down to a single path and then work your way up re-enabling additional links/paths until you hit issues.
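If they do turn out to be device identifiers (I believe the eui.* names are per-LUN rather than per-path), you can still map a device to its paths and see which adapter and target each one uses. A rough sketch using one of the devices from the logs above:

# which paths (vmhba/target) this device currently has, and their states
esxcli storage core path list -d eui.85c313515621eb466c9ce900f866ee86

# the path selection policy and current working paths for the same device
esxcli storage nmp device list -d eui.85c313515621eb466c9ce900f866ee86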

squidgy
Here to help

@Brash

 

I like the idea and understand what you're saying, but I'm going to need to research it and see if I can work it out.

 

 
