IMPORTANT: Please delete your old AI Factors

Dear All,

AI Factor is working again, but we need your help

We had to ditch our distributed network storage (for now) to get AI Factor functional again. We tried many things, but we still have issues using distributed CephFS network storage from inside a VM (we think that's the root cause). And this is with a 100% healthy Ceph cluster, updated to a recent stable build. It will take longer to figure out why, so today we switched to a more traditional file server, backed by RAID, and everything is working now.

We are running out of space

We have copied all your data and all looks good. But we are running out of space on the new file server, with limited ways to increase it. If we run out of space, using existing models should still work, but you will no longer be able to create new datasets or train new models.

Please delete unused AI Factors

We need you to go through your AI Factors, sort them by resource units, and delete what you can. Tomorrow we will see where we are, but if space is still a problem, we will have to increase the cost of storage for AI Factors, which will deplete your resource units.

Sorry for the good/not-so-good news, and thank you for your patience.

Marco

4 Likes

Currently the new interim file server is 96% full. Please do not load a new dataset or train a new predictor without first deleting some old stuff.

Thanks

6 Likes

Thank you Marco for this news. I deleted all unused AI Factors.

1 Like

I deleted 4 AI Factors and I will only need to create one. Thank you Marco!!

1 Like

Thanks for the update, Marco. I deleted what I could.

However, one of the validations I attempted to start last night is currently hanging.

I tried to cancel it, but the status is stuck on "Cancel Requested" and won't clear.

How can I delete AI Factors that I created during the AI Factor Beta testing?
I do not need them anymore, but I cannot delete them because I do not have access to AI Factor.

Hi @pitmaster

Enabling AI Factor does not cost anything. It only costs extra if you train or have the prediction add-on. I enabled it for you, so you can now view your AI Factors and delete them. Thanks

2 Likes

Thanks Marco, I deleted all 62 AI Factors.

Thanks for the quick fix. Working like a charm again. I just deleted 19 unused AI Factors to help with the temporary storage problem.

Is there a way to see how much storage each user is using? Can you set limits per user, so that all users get the same "fair" share of storage and no single user ends up using, say, 50% of it?
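To make the question concrete, here is a minimal sketch of what a per-user report could look like. It is only an illustration: the function names, the usage numbers, and the "equal split of capacity" rule are all assumptions on my part, not how the site actually tracks or limits storage.

```python
from typing import Dict, List, Tuple


def storage_shares(usage_bytes: Dict[str, int]) -> Dict[str, float]:
    """Return each user's fraction of the total storage currently in use."""
    total = sum(usage_bytes.values())
    if total == 0:
        return {user: 0.0 for user in usage_bytes}
    return {user: used / total for user, used in usage_bytes.items()}


def over_fair_share(
    usage_bytes: Dict[str, int], capacity_bytes: int
) -> List[Tuple[str, int]]:
    """List users above an equal split of capacity, largest excess first.

    The equal split (capacity divided by number of users) is a hypothetical
    policy chosen for illustration only.
    """
    if not usage_bytes:
        return []
    quota = capacity_bytes // len(usage_bytes)
    excess = [(user, used - quota) for user, used in usage_bytes.items() if used > quota]
    return sorted(excess, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up numbers, in arbitrary units.
    usage = {"alice": 50, "bob": 30, "carol": 20}
    print(storage_shares(usage))
    print(over_fair_share(usage, capacity_bytes=90))
```

A real limit would probably need to weigh subscription level or resource units rather than a flat split, but a report like this would at least show who is using what.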

I will also take the time to delete unused AI Factors.

We bought ourselves some time now. We are also readying another "traditional" bare-metal file server, with 50% more capacity than the one we spun up yesterday, so space should not be an issue for a while.

We do want to go back to CephFS, which is easily expandable, more robust, self-healing, etc. When used from bare-metal machines, it works like a charm. But when used from inside a VM, some expert-level tuning and testing is required. We just assumed it worked.

This past week was a crazy combination of problems all happening at the same time. Here is the sequence:

  • We plugged in an innocent looking hub to the network
  • A shit storm of packets flooded the network
  • Two servers crashed and their bootloaders were corrupted. They never came back.
  • The main Ceph pool froze to protect itself (which it amazingly did). This killed the website for 24h.
  • Then, out of nowhere, the AI Factor filesystem started randomly hanging (we think this problem always existed, just not as bad)

This last one was very frustrating because things "kind of worked". When things "kind of work" it's hard to pinpoint the problem. We poked everything: network layer, OS versions, VM versions, Ceph versions, Ceph setup, mounting technology, other things I forget now.

The downtime could have been shorter if we knew exactly what to do from the start, but such is tech, always changing and capricious. But through it all we did not lose data, so we have a renewed confidence in the choices we made.

Cheers.

6 Likes

Investigating, thanks

I deleted over 2000 units' worth. While trying to delete more, I got stuck on "Cancel Requested".