Today, my customer called to alert me that their SSO infrastructure was "not stable". We had an OAuth2 Provider configured for this customer, and authentication by OAuth2 clients was failing with unauthorized errors. A little while later, he called again to tell me the Login Page had gone into an infinite loop.
But ... but ... everything went back to normal after a while.
OpenAM and OpenDJ are very stable products in our experience, which dates back to the Sun Microsystems days. They don't just crash all of a sudden.
In the end, we found out that the network team was tech-refreshing the network. Wow! During office hours? Yes. Well done!
Anyway, before we found the root cause, the first thing I did was check the ELK stack we have set up for this customer. Nothing unusual.
I have blogged about how my team uses the ELK platform to monitor the ForgeRock Open Identity Stack here and here.
"Invalid Password Server Trend Live" - This dashboard tracks invalid password events for users.
In my previous blog, I wrote that we noticed a huge spike in Invalid Password events and subsequently traced it to a malfunctioning automated application.
Below is a trend for the past 30 days. It was obvious the application went haywire on 29th April and was subsequently rectified on 11th May.
I went to the same dashboard and selected the trend for the past 7 days. As observed, on a normal day the maximum number of Invalid Password events should be around 65 (per OpenAM server).
This is something we are working on right now. With this threshold in mind, we are building a Notification Service to send proactive alerts to the Ops team.
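To make the idea concrete, here is a minimal sketch of the threshold check such a Notification Service could run. It is not our actual implementation: the function names, the server names and the shape of the input (a dictionary of per-server Invalid Password counts, which would come from ELK in practice) are all assumptions for illustration. Only the threshold of 65 comes from the dashboard above.

```python
from typing import Dict, List, Optional

# Assumed threshold, based on the ~65 maximum seen per OpenAM server on a normal day.
INVALID_PASSWORD_THRESHOLD = 65


def servers_over_threshold(counts: Dict[str, int],
                           threshold: int = INVALID_PASSWORD_THRESHOLD) -> List[str]:
    """Return the names of servers whose count exceeds the threshold, sorted."""
    return sorted(name for name, count in counts.items() if count > threshold)


def build_alert(counts: Dict[str, int],
                threshold: int = INVALID_PASSWORD_THRESHOLD) -> Optional[str]:
    """Build an alert message for the Ops team, or None if everything is normal."""
    offenders = servers_over_threshold(counts, threshold)
    if not offenders:
        return None
    details = ", ".join(f"{name} ({counts[name]})" for name in offenders)
    return f"Invalid Password count above {threshold} on: {details}"


if __name__ == "__main__":
    # Hypothetical counts; in practice these would be queried from ELK.
    sample = {"openam1": 42, "openam2": 310}
    message = build_alert(sample)
    if message is not None:
        print(message)
```

How the message gets delivered (email, chat, paging) is deliberately left out. The point is only that a fixed per-server threshold is cheap to evaluate on every polling interval.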