Problem this morning

Hello,

We had a power issue in one of our racks this morning that caused several servers to reboot. We're still investigating this with the DC.

Unfortunately, we were in the middle of tuning our cloud infrastructure after last week's problem, so the reboot process did not go smoothly.

Very sorry about this. Let us know if any services you use are not back up.

Cannot rebalance live portfolios: “Date 12/16/2025 not loaded on server RANK300D:23002”

Same error for me

same. “Cannot open connection to RANK300A:23002”

ClientException: API authentication failed: <html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

same.

The API is back up. The rank servers are currently loading and should be back up in around 30-45 minutes.

still getting same API error…

That’s due to the rank servers still loading.
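
For anyone scripting against the API while the servers load, retrying with an increasing delay is gentler than hammering the endpoint. Below is a minimal sketch, not an official client: `ServiceUnavailable`, `call_with_backoff`, and the `request` callable are hypothetical names, and you would need to raise `ServiceUnavailable` yourself when your client sees a 503.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error: raise this when the API answers 503."""


def call_with_backoff(request, attempts=6, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call request() until it succeeds or the attempts run out.

    `request` is any zero-argument callable (an assumption of this sketch)
    that raises ServiceUnavailable while the service is still loading.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except ServiceUnavailable as exc:
            last_error = exc
            if attempt == attempts - 1:
                break  # no point sleeping after the final attempt
            # Double the wait each time, capped at max_delay seconds.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise last_error
```

With the defaults this waits 1, 2, 4, 8, and 16 seconds between tries. For a 30-45 minute load you would want a larger `attempts` value, or simply try again later.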

I can’t run any screens. Getting “ERROR: No server available for request. Please try again later.”

Hi Marco,

Sorry to hear about the outage—dealing with a DC power failure while mid-migration is a difficult spot to be in.

Based on my experience building production data centers for critical financial institutions, I highly recommend auditing the rack's power distribution. Ideally, every server should have dual power supplies (PSUs) connected to independent "A" and "B" PDU feeds.

For any legacy or specialized hardware with only a single power supply, you can bridge the gap by installing a Rack-Mount ATS (Automatic Transfer Switch). The ATS pulls from both PDUs and ensures the server never sees an interruption, even if one side of the rack's power fails completely.
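
As a rough illustration of that audit, here is a minimal sketch that flags any server whose power supplies do not reach two distinct feeds. The `Server` record, the feed labels, and the inventory are hypothetical; a real audit would read this from your asset database or the PDU outlet maps.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    feeds: tuple  # PDU feed label per PSU, e.g. ("A", "B") (hypothetical data)


def lacks_redundancy(server):
    """True if the server's PSUs do not reach at least two distinct feeds."""
    return len(set(server.feeds)) < 2


def audit(servers):
    """Return the names of servers that would go dark if one feed failed."""
    return [s.name for s in servers if lacks_redundancy(s)]


if __name__ == "__main__":
    inventory = [
        Server("db1", ("A", "B")),   # dual PSU on independent feeds
        Server("legacy1", ("A",)),   # single PSU: candidate for a rack-mount ATS
        Server("web1", ("A", "A")),  # dual PSU, but both cabled to the same feed
    ]
    print(audit(inventory))
```

The third case is the one that tends to slip through: the hardware is dual-PSU, but both cords end up on the same PDU.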

If you need a quick second set of eyes on the redundancy architecture once the DC provides their RCA (Root Cause Analysis), feel free to reach out.

Best regards,

Stéphane Raby CMC, MBA, P.Eng., CISSP Management Consultant, Senior Cyber Advisor, Compliance and Risk (613) 762-6343 | me@stephaneraby.ca

And a UPS too, so it can shut down properly in the worst case?

How common are these technical issues? I have been a member for 2 weeks and this is already the second outage I have experienced.

been here awhile, quite rare.

@sraby thanks for your feedback. All our servers have dual PSUs. The picture of what happened this morning is not clear yet.

There were two things for sure:

  • The DC had BGP connection problems twice: once for 15 minutes, then again for a few minutes. This should have caused only a momentary loss of access to the website, without affecting our servers.
  • Several servers rebooted for an as-yet-unknown reason. We thought it was a power loss to one of the racks; now we are not so sure. One critical server did not come back, which caused a backlog in Ceph, which then froze up again.

Could they be related? Unknown right now. The investigation is ongoing. I guess getting punched in the face twice is shaking out the problems we have with rebooting.

Our racks have dual PDUs (A+B), and we rely on the DC for UPS. We are not sure yet whether the reboot is power related. We're still gathering the facts.

Hi Marco,

The recent 15-minute outage is concerning because BGP is specifically designed to be resilient; it should automatically reroute traffic around failures in seconds, not minutes. To understand why this failed, we should ask the DC provider for a post-mortem and review our contract on these points:

  • Failover Logic: Since BGP is built for automatic path redundancy, why was there a 15-minute "black hole" instead of a near-instant switch to a secondary provider?

  • Infrastructure Diversity: Does the DC have truly "multi-homed" ISP connections and diverse hardware, or is there a single point of failure (like one edge router) undermining BGP’s native resilience?

  • SLA Compliance: Does our contract's definition of "resiliency" include a guaranteed Time to Repair (TTR), and does this event trigger an SLA credit for the service gap?

Well, it turned out that it was all 100% our fault. DC was not at fault.

We were provisioning a new Ceph pool for the AI workload so it has room to grow, but we miscalculated just how much space we had. Ceph got to 95% full and all sorts of bad things started happening. The PVE HA watchdog forced several nodes to restart. This behavior is called "fencing" and is intentional for data integrity. But then one node didn't restart, etc, etc.
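
For the curious, the kind of headroom check we should have run looks roughly like the sketch below. It is an illustration, not our actual tooling: the function names are made up, it assumes replicated pools with 3 copies, and it assumes Ceph's default thresholds (nearfull 0.85, backfillfull 0.90, full 0.95) are unchanged.

```python
# Ceph's default fullness ratios (assumed unchanged in this sketch).
NEARFULL, BACKFILLFULL, FULL = 0.85, 0.90, 0.95


def projected_ratio(raw_total_tb, raw_used_tb, new_data_tb, replicas=3):
    """Raw fill ratio after writing new_data_tb of logical data.

    Each logical TB consumes `replicas` TB of raw space in a replicated pool.
    """
    return (raw_used_tb + new_data_tb * replicas) / raw_total_tb


def max_new_data_tb(raw_total_tb, raw_used_tb, replicas=3, ceiling=NEARFULL):
    """Logical TB that can still be written before raw usage hits `ceiling`."""
    headroom = raw_total_tb * ceiling - raw_used_tb
    return max(0.0, headroom / replicas)


def status(ratio):
    """Map a fill ratio to the threshold it has crossed."""
    if ratio >= FULL:
        return "full"
    if ratio >= BACKFILLFULL:
        return "backfillfull"
    if ratio >= NEARFULL:
        return "nearfull"
    return "ok"


if __name__ == "__main__":
    # Hypothetical numbers: 100 TB raw, 40 TB raw used, 10 TB of new data.
    ratio = projected_ratio(100, 40, 10)
    print(status(ratio), max_new_data_tb(100, 40))
```

Note that this works on cluster-wide averages. Ceph applies the ratios per OSD, so with uneven data distribution a single OSD can hit the full threshold well before the average does, and the real safe limit is lower than what this returns.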

The BGP issue is a red herring somehow caused by the mayhem that was happening.

Very sorry about this. We are trying to fix things from last week and make everything more resilient.
