Tuesday 22nd September 2020

Virtualization Unexpected Reboot of host "de-fra2-host3.level66.network" / IPv6 ND problems / Packet loss on the Proxmox Hosts

The virtualization host "de-fra2-host3.level66.network" suffered an unexpected reboot. The system has been stable for about an hour; we've restarted all virtual machines and are monitoring the situation.

Update (16:00 CEST): We're still experiencing problems with some BGP sessions on our "de-fra2-edge1.as209844.net" router. We're not yet sure of the exact cause, but currently only IPv6 sessions are affected.

Update (21:00 CEST): As we're still having problems with the router "de-fra2-edge1.as209844.net", we'll remove the system from our routing and update it to the latest patch level. Once that is done, we'll troubleshoot the error in more depth.

Update (22:15 CEST): We have to reboot the whole host once again. The VMs will be unavailable for around 5-10 minutes.

Update (23:00 CEST): The problem is solved. All VMs and BGP sessions are back up and running. In the end, the problem was a faulty bridge on the Proxmox VE host. We'll run some checks on the host over the next few days to find the reason for the unexpected reboot.

Update (23.09.2020, 07:00 CEST): The error occurred again during the night. We'll keep the IPv6 BGP sessions on the router disabled until we've found the problem. We'll investigate further to find out why the router keeps losing its IPv6 neighbors.

Update (25.09.2020, 10:00 CEST): We tried moving all VMs to another host, without success: the error keeps returning. As a last resort, we'll try rebooting the switches in the datacenter this evening.

Update (26.09.2020, 13:45 CEST): Sadly, the reboot of the switch did not change the behavior: the IPv6 neighbor entries kept flapping during the night. We've now tuned the Proxmox configuration regarding IPv6 multicast and continue to monitor the problem (https://forum.proxmox.com/threads/ipv6-connection-lost.62932/#post-288135). During the last 15 minutes we have not seen any flapping IPv6 ND entries, and we hope this has finally solved the problem.
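
For readers who want to watch for the same symptom, here is a minimal sketch of how flapping IPv6 ND entries could be counted from successive snapshots of `ip -6 neigh show` output. It is an illustration only, not the tooling we used: the function names, the interface name "vmbr0" and the addresses in the comments are hypothetical, and a "flap" is assumed to mean a neighbor moving from a healthy state into FAILED or INCOMPLETE.

```python
def parse_neighbors(output):
    """Map each neighbor address to its state from `ip -6 neigh show` text.

    Example lines (addresses and interface are made up):
        2001:db8::1 dev vmbr0 lladdr aa:bb:cc:dd:ee:ff router REACHABLE
        2001:db8::2 dev vmbr0 FAILED
    """
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # The address is the first token, the neighbor state the last one.
        states[fields[0]] = fields[-1]
    return states


def count_flaps(snapshots, bad_states=("FAILED", "INCOMPLETE")):
    """Count, per neighbor, the transitions from a good state to a bad one.

    `snapshots` is a list of `ip -6 neigh show` outputs in time order.
    Assumption: a neighbor seen for the first time is never counted.
    """
    flaps = {}
    previous = {}
    for snapshot in snapshots:
        current = parse_neighbors(snapshot)
        for addr, state in current.items():
            if addr not in previous:
                continue
            was_bad = previous[addr] in bad_states
            is_bad = state in bad_states
            if is_bad and not was_bad:
                flaps[addr] = flaps.get(addr, 0) + 1
        previous = current
    return flaps
```

The snapshots could be collected by running `ip -6 neigh show` at a fixed interval and passing the captured text to `count_flaps`; an empty result over a long enough window would suggest the neighbor table is stable.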

Update (30.09.2020, 12:00 CEST): Sadly, the issue persists. We're currently preparing to downgrade all affected hosts to the old Proxmox version, as it looks like the issue is caused by the latest Proxmox update. We're working closely with the Proxmox team to find the source of the issue.

Update (01.10.2020, 12:00 CEST): We're currently reinstalling the affected nodes and moving the VMs to the downgraded nodes.

Update (02.10.2020, 16:00 CEST): We've finished moving all the machines and the performance looks way better now. We'll continue to monitor the situation and update this post once there are any changes.

Update (02.10.2020, 20:00 CEST): The problem seems to be the kernel version 5.4.65 used on our Proxmox hosts in combination with the 5.4.8 kernel version on our software routers. Since we downgraded both systems to the older kernel versions, we no longer see any packet loss on the systems.