I think I am getting closer, but I am going to give support a call to verify. It looks like a combination of things, and after many tests, reboots, and port enable/disables, I have narrowed it down to something with the Access Policies.
I started seeing this after I tried to make changes to a switch I installed a week prior. I'd make a change (enable/disable an unused port, for example) and pings to the switch would jump from 30-40ms to 1000+ms at the exact time of the configuration change. Then the switch would stop responding to pings and potentially soft reboot, causing all types of issues. This has happened twice, and after about two hours with support each time, they concluded it was a power supply issue and issued an RMA.
I wasn't seeing this in my test lab, but I was only using a single PoE phone and Thin Client for testing. I went back to my test lab and made a change. While I didn't see packet loss, I did see my ping latency jump to 1000+ms for at least 2 pings. So I ramped up my device count and now have 10 PoE phones connected. I disabled one port and saw the 1000ms pings, but no loss. I then disabled 28 ports and BOOM! The issue was recreated! I saw my latency jump to 1000+ms, then the pings were dropped for at least 15 pings, and then it came back up. What is interesting is that the phones say "network disconnected" during the loss, but power remains. It isn't until the pings return that the devices actually turn off and the ports are disabled.
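For anyone repeating this test, here is a rough Python sketch of how the ping results could be summarized so runs are easier to compare. The function name summarize_pings, the 1000ms spike threshold, and the sample values are my own assumptions for illustration, not actual captures from the lab.

```python
from typing import Dict, List, Optional


def summarize_pings(rtts_ms: List[Optional[float]],
                    spike_threshold_ms: float = 1000.0) -> Dict[str, int]:
    """Summarize one ping run.

    rtts_ms holds one entry per ping, in order; None means no reply.
    spike_threshold_ms is an assumed cutoff for a "latency spike".
    """
    spikes = 0            # replies at or above the spike threshold
    lost = 0              # pings with no reply
    longest_loss_run = 0  # longest streak of consecutive lost pings
    current_run = 0

    for rtt in rtts_ms:
        if rtt is None:
            lost += 1
            current_run += 1
            longest_loss_run = max(longest_loss_run, current_run)
        else:
            current_run = 0
            if rtt >= spike_threshold_ms:
                spikes += 1

    return {
        "sent": len(rtts_ms),
        "lost": lost,
        "longest_loss_run": longest_loss_run,
        "spikes": spikes,
    }


# Illustrative values only (not real measurements): normal pings,
# two spikes, 15 dropped pings, then recovery.
sample = [35.0, 32.0, 1040.0, 1100.0] + [None] * 15 + [38.0, 34.0]
print(summarize_pings(sample))
```

Comparing longest_loss_run between a run with the Access Policy enabled and a run with an open policy would show whether the policy is what turns a latency spike into an outage.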
Things I have tried to resolve / pin down the issue:
1. I was using SFP ports with SFP copper modules and Meraki's automated STP to connect the MS to the MX. Ports 3 & 4 on the MX are connected to ports 51 & 52 on the switch.
2. I configured two native Ethernet ports (45 & 46) on the MS, moved the uplinks to them, and got the same results.
3. I went down to just Port 3 on the MX to port 46 on the MS to take STP out of the mix. Still seeing the same issues.
4. I connected my laptop to the MX directly to remove potential VPN issues, and I still am seeing the same issues.
5. I rebooted the MS, MX, and the ISP for good measure multiple times and I still see the same results when disabling ports.
6. I updated the MS from 12.28 to 12.33, and even to 14.16, and I see the issue on every firmware version. I will note that 12.33 & 14.16 handle the pings returning to normal a lot better than 12.28.
7. I changed the Access Policy (set as Multi-Domain with my RADIUS configuration on Cisco ISE) to an open policy and that seemed to remove the ping loss! Getting closer. I tested this a few times and I still saw the 1000ms pings, but did not see the loss.
8. I re-enabled my Access Policy and performed the test again, and it had both the 1000+ms pings and the dropped pings. Recreation of the issue successful!
This is where I am so far. I'll keep everyone up to date based on what support says, but let me know if you have any other ideas.
If anyone else has a test lab and wants to set this up to try, here is my setup:
MG with Verizon 4G for my ISP (Same results on any ISP, but this is just for testing convenience).
MX65 as the Security Appliance Router
MS120-48FP
ISE RADIUS server back at the HUB network for 802.1X EAP-TLS authentication.
10 PoE phones (Mitel 6920s)
2 thin clients or PCs daisy-chained to a couple phones.
CMNO, CCNA R+S