Hi all,
I'm new to Cisco and Meraki; I started a sysadmin job a bit over a month ago that is heavy on networking, which is not something I've been exposed to before. We have a peculiar problem that we can't figure out yet, and I hope you can point me in the right direction.
We have a VMware ESXi server on the network that most of the other computers on the network cannot reach on port 443: the connection is immediately dropped with "The connection was reset" / ERR_CONNECTION_RESET. (Pinging and SSH-ing into it work fine; it's just port 443.) Dozens of other ESXi hosts on the network with identical network configuration have no issues. Just this one.
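To pin down the failure mode, here is a minimal Python sketch of the kind of test I've been running from one of the machines that can't reach the host (the IP is a placeholder): it distinguishes an active RST from a silent drop, and shows whether 443 dies during the TCP handshake or only once TLS starts.

```python
#!/usr/bin/env python3
# Minimal probe: does 443 on the problem host fail with an active reset
# or a silent drop, and does it fail at the TCP or the TLS stage?
# The IP below is a placeholder - replace it with the host's address.
import socket
import ssl

HOST = "192.0.2.50"  # placeholder for the problem ESXi host

for port in (22, 443):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((HOST, port))
        print(f"port {port}: TCP handshake OK")
        if port == 443:
            # If TCP connects but TLS dies, the reset is happening after
            # the connection is established, not at the SYN.
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            with ctx.wrap_socket(s, server_hostname=HOST) as tls:
                print(f"port {port}: TLS handshake OK ({tls.version()})")
    except ConnectionRefusedError:
        print(f"port {port}: RST in reply to the SYN (actively refused)")
    except ConnectionResetError:
        print(f"port {port}: RST after connecting (dropped mid-session)")
    except socket.timeout:
        print(f"port {port}: timeout (packets silently dropped)")
    except OSError as e:
        print(f"port {port}: {e}")
    finally:
        s.close()
```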
Question: on a network where all switchgear is Meraki (MS225-48FP, MS250-48FP, MX100-HW, MS220-8-HW, MX64, etc.), how do we figure out where that connection is dropped, i.e. which device and which policy?
Some context:
- Only port 443 (i.e. https://<IP address>) appears to be affected. ICMP and SSH (port 22) work fine from the same machine that cannot access port 443 on the target.
- Some hosts on the network can access port 443 on the target; some (most) can't.
- Changing the target IP address (in case there's a policy selectively blocking that IP from various parts of the network) didn't help.
- Moving the target to different ports on the switch it's wired to (in case the ports are misconfigured or have a blocking policy we somehow can't see) didn't help.
- Resetting the network configuration on the ESXi host (via "reset to factory defaults", then setting the static IP, gateway, etc. back to what they were before) didn't help; the behavior is exactly the same.
- My usual MO in a case like this would be to put the host in a known-good subnet and test there. I haven't tried it yet as it requires some help from my fellow IT people, but my hunch is that the box itself is fine and the issue is outside of it, likely in the switch or other network configuration.
- We have dozens of identical ESXi hosts on identical hardware connected to identical switches with identical policies, and none of them have connectivity issues. There is even an identical ESXi host physically right next to the problematic one, on the same subnet, wired to the same switch, with an identical DNS and network configuration (with the exception of the actual IP address, of course) - no issues. (See the probe sketch right after this list for the kind of side-by-side comparison I mean.)
- The problem is quite critical: our VCSA (vCenter appliance) can't connect to the ESXi host over port 443, so the host is effectively offline as far as vCenter is concerned.
- Per my IT team, the problem has existed for a while. They had other things to work on, and I was basically handed this issue knowing very little about the network configuration, what can cause such an issue, and how to troubleshoot it. I.e. "here is an issue; go fix it".
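To make the host-to-host comparison above concrete, this is roughly the side-by-side probe I have in mind (a rough sketch; both IPs are placeholders), run once from a machine that can reach the target and once from one that can't:

```python
#!/usr/bin/env python3
# Side-by-side probe of the problem ESXi host and the healthy one next to
# it, on the ports that matter.  Run it from a machine that can reach the
# target and from one that can't, then compare.  IPs are placeholders.
import socket

HOSTS = {
    "problem ESXi": "192.0.2.50",  # placeholder
    "healthy ESXi": "192.0.2.51",  # placeholder
}
PORTS = (22, 443)

def probe(ip, port, timeout=5):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((ip, port))
        return "open"
    except ConnectionRefusedError:
        return "reset (RST)"
    except socket.timeout:
        return "timeout (silent drop)"
    except OSError as e:
        return f"error: {e}"
    finally:
        s.close()

for name, ip in HOSTS.items():
    for port in PORTS:
        print(f"{name} {ip}:{port} -> {probe(ip, port)}")
```

If the same source machine gets different results for the two IPs, whatever is dropping 443 is doing it selectively for this one host, which is what makes me suspect a policy somewhere in the path rather than the box itself.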
Tools available to me:
- Meraki dashboard with read-only access to the entire network configuration including individual ports.
- I can ask my network admins to make changes (but can't make them myself).
- Physical access to the rack where the hardware is located.
Questions:
- Where in the Meraki dashboard could I see events or log entries indicating that a connection to port 443 on the target was dropped (assuming one of the Meraki devices or policies is responsible for it)? If this is the right question to ask, would you be willing to help me craft the right search criteria to identify those events?
- What are the usual best practices for troubleshooting this type of issue?
- What are the usual tools for pinpointing which device or policy is responsible for dropping a connection to a known-good target? E.g. I am trying to access port 443 on a specific IP from my desktop, there are 4 switches in between, and the connection is dropped; if I connect directly to the target (with a crossover cable or something), it's fine. How do I figure out where on the network the connection is dropped, i.e. what is responsible for dropping it? (Traceroute, ICMP, and SSH indicate no issues with the network path; the closest DIY idea I've come up with is the TTL-stepping sketch after this list, but it doesn't name the device.)
- If none of the above questions is the right one to ask, what would be the right question(s)? 🙂
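For reference, that TTL-stepping idea: send SYNs to port 443 with an increasing TTL and watch how far they get before something answers. A rough sketch, assuming a Linux source machine (which generally surfaces the hop's ICMP time-exceeded as a connect error); the target IP is a placeholder, and it still doesn't name the Meraki device, which is really what I'm asking how to do:

```python
#!/usr/bin/env python3
# Rough TCP "traceroute": send SYNs to port 443 with increasing TTL and
# see how far they get.  Assumes a Linux source machine; the target IP is
# a placeholder.  It shows roughly how many hops the SYN survives, but it
# does not identify the device - hence the questions above.
import socket

TARGET = "192.0.2.50"  # placeholder for the problem ESXi host
PORT = 443

for ttl in range(1, 11):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
    s.settimeout(3)
    try:
        s.connect((TARGET, PORT))
        print(f"TTL {ttl}: connected - the SYN reached the target")
        break
    except ConnectionRefusedError:
        print(f"TTL {ttl}: RST came back - something answered the SYN")
        break
    except socket.timeout:
        print(f"TTL {ttl}: no reply (SYN or its reply silently dropped)")
    except OSError as e:
        # On Linux, an ICMP time-exceeded from the hop at this TTL usually
        # shows up here (e.g. "No route to host"), i.e. the SYN made it at
        # least this many hops before expiring.
        print(f"TTL {ttl}: {e}")
    finally:
        s.close()
```

If the RST shows up at a TTL smaller than the number of hops to the target, whatever is sending it sits roughly that many hops away, which would at least narrow down which device to look at.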
Thank you!