Sorry to hear about the outage. Dealing with a DC power failure mid-migration is a difficult spot to be in.
Based on my experience building production data centers for critical financial institutions, I highly recommend auditing the rack's power distribution. Ideally, every server should have dual power supplies (PSUs) connected to independent "A" and "B" PDU feeds.
For any legacy or specialized hardware with only a single power supply, you can bridge the gap by installing a Rack-Mount ATS (Automatic Transfer Switch). The ATS pulls from both PDUs and ensures the server never sees an interruption, even if one side of the rack's power fails completely.
If you need a quick second set of eyes on the redundancy architecture once the DC provides their RCA (Root Cause Analysis), feel free to reach out.
@sraby thanks for your feedback. All our servers have dual PSUs. The picture of what happened this morning is not clear yet.
There were two things for sure:
The DC had BGP connection problems twice: once for 15 minutes, then again for a few minutes. This should have caused only a momentary loss of access to the website and should not have affected our servers.
Several servers rebooted for an as-yet-unknown reason. We thought it was a power loss to one of the racks; now we are not so sure. One critical server did not come back, which caused a backlog in Ceph, which then froze up again.
Could they be related? Unknown right now. Investigation is ongoing. I guess getting punched in the face twice is shaking out the problems we have with rebooting.
The recent 15-minute outage is concerning because BGP is specifically designed to be resilient; it should automatically reroute traffic around failures within seconds (or, at worst, a few minutes if it has to wait for default hold timers to expire), not a quarter of an hour. To understand why this failed, we should ask the DC provider for a post-mortem and review our contract on these points:
Failover Logic: Since BGP is built for automatic path redundancy, why was there a 15-minute "black hole" instead of a near-instant switch to a secondary provider?
Infrastructure Diversity: Does the DC have truly "multi-homed" ISP connections and diverse hardware, or is there a single point of failure (like one edge router) undermining BGP’s native resilience?
SLA Compliance: Does our contract's definition of "resiliency" include a guaranteed Time to Repair (TTR), and does this event trigger an SLA credit for the service gap?
Well, it turned out that it was 100% our fault. The DC was not at fault.
We were provisioning a new Ceph pool for the AI workload so it has room to grow, but we miscalculated just how much space we had. Ceph got to 95% full and all sorts of bad things started happening. The PVE HA watchdog forced several nodes to restart. This behavior is called "fencing" and is intentional for data integrity. But then one node didn't restart, etc, etc.
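For anyone planning a similar pool, here is a rough sketch of a headroom check before provisioning. It assumes a replicated pool and Ceph's default nearfull/backfillfull/full ratios of 0.85/0.90/0.95 (check your own with `ceph osd dump | grep full_ratio`). The cluster sizes and the function names are made up for illustration, not taken from our setup.

```python
from dataclasses import dataclass

# Ceph's default thresholds; assumed here, your cluster may override them.
NEARFULL_RATIO = 0.85
BACKFILLFULL_RATIO = 0.90
FULL_RATIO = 0.95


@dataclass
class Cluster:
    raw_total_tib: float  # total raw capacity across all OSDs
    raw_used_tib: float   # raw capacity already consumed


def projected_fill(cluster: Cluster, new_data_tib: float, replica_count: int) -> float:
    """Return the cluster-wide raw fill ratio after storing new_data_tib
    of user data in a replicated pool with replica_count copies."""
    raw_needed = new_data_tib * replica_count
    return (cluster.raw_used_tib + raw_needed) / cluster.raw_total_tib


def classify(ratio: float) -> str:
    """Map a fill ratio to the Ceph threshold it would cross."""
    if ratio >= FULL_RATIO:
        return "full"
    if ratio >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if ratio >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


if __name__ == "__main__":
    # Hypothetical numbers: 100 TiB raw, 60 TiB used, 3x replication.
    cluster = Cluster(raw_total_tib=100.0, raw_used_tib=60.0)
    for new_data in (5.0, 9.0, 12.0):
        ratio = projected_fill(cluster, new_data, replica_count=3)
        print(f"{new_data} TiB of new data -> {ratio:.0%} raw fill ({classify(ratio)})")
```

Note that this is a cluster-wide average. The thresholds are enforced per OSD, so with uneven data distribution the fullest OSD can hit a limit well before the average does, and real planning should leave extra margin for that and for recovery after a node failure.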
The BGP issue is a red herring somehow caused by the mayhem that was happening.
Very sorry about this. We are trying to fix things from last week and make everything more resilient.