Problem this morning

marco · December 17, 2025, 1:31pm

Hello,

We had a power issue in one of our racks this morning that caused several servers to reboot. We're still investigating this with the DC.

Unfortunately, we were in the middle of tuning our cloud infrastructure after last week's problem, so the reboot process did not go smoothly.

Very sorry about this. Let us know if any services you use are not back up.

Superlopez · December 17, 2025, 1:35pm

Cannot rebalance live portfolios: “Date 12/16/2025 not loaded on server RANK300D:23002”

fmarek · December 17, 2025, 1:38pm

Same error for me

robertaxx99 · December 17, 2025, 1:38pm

same. “Cannot open connection to RANK300A:23002”

philjoe · December 17, 2025, 1:41pm

ClientException: API authentication failed: <html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

portfoliologic · December 17, 2025, 1:48pm

same.

valmarv · December 17, 2025, 1:51pm

The API is back up. The rank servers are currently loading and should be back up in around 30-45 minutes.

philjoe · December 17, 2025, 1:55pm

still getting same API error…

valmarv · December 17, 2025, 2:02pm

That’s due to the rank servers still loading.

travhopp2 · December 17, 2025, 2:06pm

I can’t run any screens. Getting “ERROR: No server available for request. Please try again later.”

sraby · December 17, 2025, 2:28pm

Hi Marco,

Sorry to hear about the outage—dealing with a DC power failure while mid-migration is a difficult spot to be in.

Based on my experience building production data centers for critical financial institutions, I highly recommend auditing the rack's power distribution. Ideally, every server should have dual power supplies (PSUs) connected to independent "A" and "B" PDU feeds.

For any legacy or specialized hardware with only a single power supply, you can bridge the gap by installing a Rack-Mount ATS (Automatic Transfer Switch). The ATS pulls from both PDUs and ensures the server never sees an interruption, even if one side of the rack's power fails completely.

If you need a quick second set of eyes on the redundancy architecture once the DC provides their RCA (Root Cause Analysis), feel free to reach out.

Best regards,

Stéphane Raby CMC, MBA, P.Eng., CISSP Management Consultant, Senior Cyber Advisor, Compliance and Risk (613) 762-6343 | me@stephaneraby.ca

philjoe · December 17, 2025, 2:29pm

And a UPS too, so it can shut down properly in the worst case?

robertaxx99 · December 17, 2025, 2:35pm

How common are these technical issues? I have been a member for 2 weeks and this is already the second outage I have experienced.

philjoe · December 17, 2025, 2:36pm

been here awhile, quite rare.

marco · December 17, 2025, 2:52pm

@sraby thanks for your feedback. All our servers have dual PSUs. The picture of what happened this morning is not clear yet.

There were two things for sure:

The DC had BGP connection problems twice: once for 15 minutes, then again for a few minutes. This should have caused only a momentary loss of access to the website and should not have affected our servers.
Several servers rebooted for a reason that is still unknown. We thought it was a power loss to one of the racks, but now we are not so sure. One critical server did not come back, which caused a backlog in Ceph, which then froze up again.

Could they be related? Unknown right now. The investigation is ongoing. I guess getting punched in the face twice is shaking out the problems we have with rebooting.

marco · December 17, 2025, 3:23pm

Our racks have dual PDUs (A+B), and we rely on the DC for UPS. We are not sure yet if the reboot is power related. We're still gathering the facts.

sraby · December 17, 2025, 6:52pm

Hi Marco,

The recent 15-minute outage is concerning because BGP is specifically designed to be resilient; it should automatically reroute traffic around failures in seconds, not minutes. To understand why this failed, we should ask the DC provider for a post-mortem and review our contract on these points:

Failover Logic: Since BGP is built for automatic path redundancy, why was there a 15-minute "black hole" instead of a near-instant switch to a secondary provider?
Infrastructure Diversity: Does the DC have truly "multi-homed" ISP connections and diverse hardware, or is there a single point of failure (like one edge router) undermining BGP’s native resilience?
SLA Compliance: Does our contract's definition of "resiliency" include a guaranteed Time to Repair (TTR), and does this event trigger an SLA credit for the service gap?

marco · December 17, 2025, 7:32pm

Well, it turned out that it was all 100% our fault. DC was not at fault.

We were provisioning a new Ceph pool for the AI workload so it has room to grow, but we miscalculated just how much space we had. Ceph got to 95% full and all sorts of bad things started happening. The PVE HA watchdog forced several nodes to restart. This behavior is called "fencing" and is intentional for data integrity. But then one node didn't restart, etc, etc.
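As a rough illustration of the kind of headroom check that can catch this sort of miscalculation before a new pool is filled, here is a minimal sketch. It is not the tooling used here. The function names, the capacity figures, and the 3x replication factor are all hypothetical. The three thresholds are Ceph's default nearfull, backfillfull, and full ratios, which a given cluster may have changed. Ceph also applies these ratios per OSD, so an unevenly balanced cluster can hit them earlier than this cluster-wide average suggests.

```python
# Ceph's default fullness thresholds, as fractions of raw capacity.
# Assumption: the cluster has not overridden these defaults.
NEARFULL_RATIO = 0.85      # health warning is raised
BACKFILLFULL_RATIO = 0.90  # backfill onto the OSD is refused
FULL_RATIO = 0.95          # client writes are blocked


def projected_fill(raw_total_tb, raw_used_tb, new_data_tb, replicas=3):
    """Return the projected raw fill fraction after adding new_data_tb of
    logical data to a replicated pool that stores `replicas` copies."""
    if raw_total_tb <= 0:
        raise ValueError("raw_total_tb must be positive")
    # Every logical terabyte consumes `replicas` terabytes of raw space.
    raw_needed_tb = new_data_tb * replicas
    return (raw_used_tb + raw_needed_tb) / raw_total_tb


def classify(fill):
    """Map a fill fraction to the most severe threshold it reaches."""
    if fill >= FULL_RATIO:
        return "full"
    if fill >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if fill >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


if __name__ == "__main__":
    # Hypothetical numbers: 100 TB raw, 65 TB already used,
    # and 10 TB of new logical data stored with 3 replicas.
    fill = projected_fill(100, 65, 10, replicas=3)
    print(f"projected fill: {fill:.0%} -> {classify(fill)}")
```

The point of the sketch is that replication multiplies the raw space a pool consumes, so a pool that looks small in logical terms can still push the cluster past the full ratio.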

The BGP issue is a red herring somehow caused by the mayhem that was happening.

Very sorry about this. We are trying to fix things from last week and make everything more resilient.