Hello there dear community,
I'm seeing odd behaviour with the card in the title and would like to ask whether anyone has had the same experience or has an idea.
Description of the setup:
The card is a Mellanox ConnectX-3 Pro 40G card, jumbo frames enabled. The server is an Intel R2208WFTZS.
The card's firmware is on the latest stable version from Mellanox.
The vSAN vmk interface is connected to one of the ports of the 40G card, with jumbo frames enabled.
I'm using a non-customised ESXi 6.7 image with the nmlx4_en native driver and all the latest patches installed.
Problem:
The setup works well most of the time. Occasionally, however, and apparently at random, the vmk interface can't reach the other vSAN nodes in the same cluster; it can't ping any of them.
When I ran a packet capture on the vmk interface, I could see the ARP requests from the other nodes arriving at the vmk interface and the interface responding to them, but the responses never made it out of the server. It looked similar to a unidirectional link.
If I reload the vmnic that carries the vSAN traffic, everything returns to normal: traffic from the vSAN vmk starts to flow and the ARP responses go through.
The logs don't say much, apart from this interesting part:
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6c:23:xx
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:880: Vlanmtuchk EVENT: DV port 408 uplink vmnic2MTU check RESULT_CHANGED
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:898: Vlanmtuchk EVENT: DV port 408 uplink vmnic2VLAN check RESULT_CHANGED
It looks like the RX queue used for the vSAN traffic is being destroyed, but I'm not sure why that is happening. As you can see, the DVS alarms come on shortly after that.
Other than these lines, there are no driver- or firmware-related log entries or crashes.
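To check whether the queue teardown precedes the DVS check change every time, something like the following could be run against a copy of the vmkernel log. It is only a rough sketch based on the few lines quoted above: the function names, the 60-second window and the assumption that every relevant line has the same shape as in my excerpt are mine, not anything documented by VMware or Mellanox.

```python
import re
from datetime import datetime

# Timestamp and message of a vmkernel log line, modelled only on the
# lines quoted in this post (an assumption about the general format).
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s+cpu\d+:\d+\)(.*)$"
)

SAMPLE = """\
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2294) MAC RX filter (class 1) at index 0 is removed from
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_QueueRemoveFilter - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2301) RX ring 1, QP[0x49], Mac address 00:50:56:6c:23:xx
2018-08-13T20:26:23.187Z cpu25:2097436)<NMLX_INF> nmlx4_en: vmnic2: nmlx4_en_RxQFree - (vmkdrivers/native/BSD/Network/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 1 is freed
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:880: Vlanmtuchk EVENT: DV port 408 uplink vmnic2MTU check RESULT_CHANGED
2018-08-13T20:26:39.594Z cpu43:2098358)VLANMTUCheck: NMVCCheckUplinkResultDetail:898: Vlanmtuchk EVENT: DV port 408 uplink vmnic2VLAN check RESULT_CHANGED
"""


def parse_events(lines):
    """Return (timestamp, kind, message) tuples for the lines of interest."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        msg = match.group(2)
        if "nmlx4_en_RxQFree" in msg:
            events.append((ts, "rx_queue_freed", msg))
        elif "nmlx4_en_QueueRemoveFilter" in msg:
            events.append((ts, "rx_filter_removed", msg))
        elif "VLANMTUCheck" in msg and "RESULT_CHANGED" in msg:
            events.append((ts, "dvs_check_changed", msg))
    return events


def correlate(events, window_seconds=60):
    """Pair each freed RX queue with DVS check changes that follow it
    within window_seconds. Returns (freed_ts, dvs_ts, delta_seconds)."""
    freed = [ts for ts, kind, _ in events if kind == "rx_queue_freed"]
    changed = [ts for ts, kind, _ in events if kind == "dvs_check_changed"]
    pairs = []
    for f_ts in freed:
        for c_ts in changed:
            delta = (c_ts - f_ts).total_seconds()
            if 0 <= delta <= window_seconds:
                pairs.append((f_ts, c_ts, delta))
    return pairs


if __name__ == "__main__":
    # Replace SAMPLE.splitlines() with open("vmkernel.log") to scan a
    # log file copied off the host.
    for freed_ts, dvs_ts, delta in correlate(parse_events(SAMPLE.splitlines())):
        print(f"RX queue freed at {freed_ts}, DVS check changed {delta:.3f}s later")
```

If the pairing shows up for every outage on both servers, that would at least suggest the freed RX queue is tied to the problem rather than a coincidence.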
It has already happened on two identical servers, so I'm afraid it's not specific to one machine. I don't really know where to start or how to dig deeper.
Any idea or suggestion?
Thank you very much!
Zoltan