I encountered a Priority 1 case at a customer site two weeks ago, involving an OpenSSO deployment for one of our government ministries.
What really happened was that the Single Sign-On infrastructure became unstable after a Solaris Patch Cluster was applied, especially under high load.
No one noticed the issue right after patching. A round of internal testing was even conducted, and the administrator gave the system the go-ahead.
But on the first working day after the patching, the help-desk received numerous calls that authentication was "sometimes OK, sometimes not OK". The team suspected the servers in the OpenSSO farm were restarting one at a time, yet no one had touched the OpenSSO servers at all.
We observed that CPU consumption was on the high side and connection timeouts were frequent.
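For readers in a similar spot: one quick way to turn "sometimes OK, sometimes not OK" into hard numbers is a simple availability probe against the login page. The sketch below is hypothetical (the URL, timeout, and interval are assumptions, not our actual setup); it uses only the Python standard library to log latency and timeouts over time.

#!/usr/bin/env python3
"""Minimal availability probe (hypothetical URL and settings; adjust to taste)."""
import time
import urllib.request

# Assumption: a generic OpenSSO login page, not the customer's real URL.
URL = "http://sso.example.gov:8080/opensso/UI/Login"
TIMEOUT_SECS = 5
INTERVAL_SECS = 10

while True:
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECS) as resp:
            print(f"{time.strftime('%H:%M:%S')}  HTTP {resp.status}  "
                  f"{time.time() - start:.2f}s")
    except OSError as exc:  # URLError and socket timeouts both subclass OSError
        print(f"{time.strftime('%H:%M:%S')}  FAILED after "
              f"{time.time() - start:.2f}s: {exc}")
    time.sleep(INTERVAL_SECS)

Left running from a client machine, a log like this makes an intermittent failure pattern obvious at a glance instead of relying on anecdotal help-desk calls.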
I was consulted. Since I had been helping another customer with tuning in an environment where recently patched Solaris servers also showed poor performance, I thought rolling back the Solaris Patch Cluster was a good bet.
As the fire was on the Production side, a decision was made to perform the roll-back on the servers in the Production environment first. That solved the issue!
Root cause? I'll let the Oracle experts tell us; they are paid to support us, after all.
Once the fire was put out, we subsequently rolled back the patch on our QAT environment as well, and we were back in business.
The lesson learnt is that OS patching does have an impact on the performance of the software installed on top of it. Do not ignore this fact. Load testing after patching is ideal if time permits. (Well, I would say no one does this 90% of the time. :> )
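If a full load test is out of reach, even a rough post-patch sanity check helps. The following is a sketch, not the tooling we used on site: it fires concurrent requests at a hypothetical login URL and reports the failure rate. Nowhere near a proper load test, but "under load" is exactly where this kind of regression shows up.

#!/usr/bin/env python3
"""Rough post-patch load check (a sketch; URL and numbers are assumptions)."""
import concurrent.futures
import urllib.request

URL = "http://sso.example.gov:8080/opensso/UI/Login"  # hypothetical endpoint
CONCURRENCY = 50      # parallel workers
REQUESTS = 500        # total requests to fire
TIMEOUT_SECS = 5

def hit(_: int) -> bool:
    """Return True if the login page answers with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECS) as resp:
            return resp.status == 200
    except OSError:  # connection errors and timeouts both count as failures
        return False

def main() -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(hit, range(REQUESTS)))
    ok = sum(results)
    print(f"{ok}/{REQUESTS} succeeded, {REQUESTS - ok} failed or timed out")

if __name__ == "__main__":
    main()

Run it once before patching to get a baseline failure rate, then again right after; a jump in timeouts is your cue to roll back before the first working day does the testing for you.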