Unfortunately, we are still experiencing problems with the AI Factor filesystem (CephFS). We seem to have identified the root cause: the Ceph MDS server hangs when we simply ask for a recursive list of directories in the AI Factor filesystem (we have one directory per AI Factor in the filesystem).
This operation is very low-level and has nothing to do with our software. It all points to a problem with the version of the Ceph MDS server that we have. We are trying to upgrade it to a newer version.
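To make the failing operation concrete, here is a minimal sketch of a recursive directory listing with a time limit. It is not P123's actual tooling: the function name `list_tree`, the timeout value, and the mount path in the usage comment are all hypothetical. It only shows that the operation is a plain filesystem walk, and how one might detect that it stalled rather than wait forever.

```python
import os
import threading
import time


def list_tree(root, timeout_s=30.0):
    """Recursively list directories under root.

    Returns (directories_seen, elapsed_seconds, finished).
    finished is False if the walk did not complete within timeout_s,
    which is how a hanging metadata server would show up here.
    """
    found = []
    done = threading.Event()

    def walk():
        # os.walk issues ordinary directory-listing calls; nothing
        # application-specific is involved.
        for dirpath, _dirnames, _filenames in os.walk(root):
            found.append(dirpath)
        done.set()

    start = time.monotonic()
    # Daemon thread: a walk stuck in a blocked filesystem call cannot be
    # cancelled, but it will not keep the interpreter alive on exit.
    worker = threading.Thread(target=walk, daemon=True)
    worker.start()
    finished = done.wait(timeout_s)
    return list(found), time.monotonic() - start, finished


# Hypothetical usage (the mount point is an assumption, not from this thread):
# dirs, elapsed, finished = list_tree("/mnt/aifactor", timeout_s=60.0)
# print(len(dirs), "directories in", round(elapsed, 1), "s; finished:", finished)
```

If `finished` comes back `False` on a tree that is known to be small, the stall is in the filesystem layer and not in the application code, which matches the diagnosis above.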
We're also not sure why this problem surfaced after the recent problem, but it was likely always there. Several times we experienced timeouts, either when rendering lift charts or when viewing the logs of a training model, and did not investigate further.
Thanks for your patience.
P.S. The weekend update ran and rebalances were run, but models with AI factors likely failed.
Message for new Portfolio123 users:
The issues the service has faced recently are an exceptional, extremely rare case. Under normal circumstances, the platform maintains an exceptionally high uptime. Just give it a little time, and you’ll see the outstanding quality of the service and the team behind it!
As a new user myself (~1.5 months) I can say despite the recent issues I remain very impressed with P123’s capabilities and performance. The provided statistics for models are excellent in that they allow a pretty good judgment of the historical performance - very rare to see in a retail setting as many websites like this will selectively give you stats that obscure the true performance history. And all my rebalances and account updates with IBKR have gone off without a hitch. All in all, quite happy.
Thank you for the support. We made significant progress today, and the Ceph storage no longer has errors, just warnings. It has been scrubbing and backfilling for several hours.
We are not out of the woods yet, since:
There are still some version mismatches.
We are still seeing the MDS freeze up.
We are seeing some spurious write/access errors on the AI filesystem.
On the positive side, we have been able to run several training jobs, but the consistency is still not there. Let's see how it is once the scrubbing is done.
Yes, correct. We are getting hiccups in the distributed storage filesystem for AI models.
We're removing the banner for now since AI Factor is generally working. We are still investigating the hiccups, and we will be crediting back any training credits used since last week and until it's all 100% resolved.
Unfortunately, no change for me since this morning. Manual rebalancing of my live strategies that incorporate AIFactor results in a timeout 90% of the time (it worked briefly 6-7 hours ago). Also, simulated strategies incorporating ranking systems with AIFactorValidation() get stuck at 0% and block the queue for quite some time.
Just informing you. I am sure it will be resolved soon.
Not seeing any improvement compared to this morning. Screens & strategies using saved AI factor predictions are not running. Just wanted to share. Thank you! @marco
Yeah, was just going to ask if anyone else was noticing more timeouts trying to load model recommendations/predictions today. I know it’s all still being worked on, just kind of found it strange it seems worse in that regard than yesterday (to me at least, I can barely get anything to load).
I can see the inconvenience of not having AI Factor working 100% for simulation/rebalancing. Fortunately, my portfolio is not affected since I haven’t picked up AI Factor yet.
Hope everything goes back to normal and 100% soon.
Don't get me wrong. IMO no strategy should be so fragile that downtime of days to weeks will kill it. I always keep this in mind when building strategies, and this situation is the perfect reminder. I'm actually impressed by how stably P123 works 99.9% of the time. I am also fully committed to AI Factor and won't go back.
I am sure the team will fix it soon. I just hope that updates from users like me on how the errors/bugs evolve will help the team in doing so.
The gremlin in the AI CephFS filesystem is still there. It's very easy to see the problem. Just go to an AI factor with many validations and scroll through your lift charts. At some point one of the charts will hang. But when I inspect the directory, the data is there.
So the problem appears to be a glitchy network path, which is very hard to troubleshoot.
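One way to narrow down a symptom like this (data present on disk, but a read occasionally hanging) is to read every file under a directory with a per-file time limit and record which ones stall. The sketch below is an illustration under assumptions, not the team's diagnostic: `timed_read`, `probe`, the timeout, and the example path are hypothetical names and values.

```python
import threading
import time
from pathlib import Path


def timed_read(path, timeout_s=5.0):
    """Read one file in a daemon thread.

    Returns (status, seconds) where status is "ok", "error", or "timeout".
    """
    result = {}

    def read():
        try:
            result["size"] = len(Path(path).read_bytes())
        except OSError as exc:
            result["error"] = exc

    start = time.monotonic()
    worker = threading.Thread(target=read, daemon=True)
    worker.start()
    worker.join(timeout_s)  # give up waiting; the stuck thread is abandoned
    elapsed = time.monotonic() - start
    if "size" in result:
        return "ok", elapsed
    if "error" in result:
        return "error", elapsed
    return "timeout", elapsed


def probe(directory, timeout_s=5.0):
    """Return (path, status, seconds) for every file whose read was not clean."""
    problems = []
    # Note: the directory listing itself can also block on a stuck
    # metadata server; this sketch only times the file reads.
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            status, elapsed = timed_read(path, timeout_s)
            if status != "ok":
                problems.append((str(path), status, elapsed))
    return problems


# Hypothetical usage (the path is an assumption):
# for path, status, seconds in probe("/mnt/aifactor/some_factor"):
#     print(status, round(seconds, 2), path)
```

Running a probe like this repeatedly from different client machines could help show whether the stalls follow particular files or particular network paths.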
We have a contingency plan to go back to a traditional filesystem on a single server with RAID, which should eliminate all the complexities of a large, multi-server, fault-tolerant network storage system. This would be a temporary fix, since it introduces a single point of failure, which we can mitigate in the interim with traditional backups.
Very sorry about this. I will leave the banner on the website until we resolve this.
Thanks, Marco. Still impressed with how stable AI Factor has been generally. I am sure the cause was an exponential increase in the amount of files and data stored. Looking forward to continuing with P123 AI Factor soon.
An affirmation of confidence in the P123 team based on 20 years of P123 history.
I first subscribed to P123 in 2005, and over the last 20 years I have observed a very competent team gradually improving their product year by year while competitors faded away. I’ve been impressed that they have accomplished what they have with very few overall problems in the past. I just wanted to reassure new members, as others already have, that the system is in good hands. I’m 100% confident this problem is a very temporary situation.