How did it happened?
Background
In my customer's site, we have a couple of sites. Within each site, we have multiple OpenAM servers and we have AM Session Failover configured.
Now, each OpenAM server stores its configuration data in the embedded OpenDJ server.
In total, we have 8 OpenAM servers. Thus, we have 8 corresponding embedded OpenDJ servers. And we know that Multi-Master Replication (MMR) is auto-magically configured during installation. (It is automatically taken care of by the OpenAM installation and configuration script)
OpenAM 6 running port 8180 - Server ID 09
InternetSSO site - Server ID 10
OpenAM 7 running port 8180 - Server ID 11
OpenAM 8 running port 8180 - Server ID 12
What was encountered?
When the 3 final OpenAM servers for Internet was added and InternetSSO site was created, all OpenAM servers were restarted one after another.
When I performed a log-in via the load-balancer (https://www.abc.com/openam/UI/Login), I would intermittently encountered "Session Activation Failed" error!
So, I shut down OpenAM 6, 7 and 8. And started them one at a time. I realized the error came from OpenAM 6 and 7. Everything was OK with OpenAM 8.
This is super strange!
Analysis
This took me quite a while and it was my colleague who found out the root cause. By the way, the logging level has to be switched to MESSAGE. Otherwise, there's no way to detect the problem.
On OpenAM 6 server, the Session debug log was showing :
amSession:05/15/2013 12:18:53:118 PM ICT: Thread[main,5,jboss]
UserName=amuser : clusterServerList=09 : connectionMaxWaitTime=5000 :jdbcDriverClass=com.iplanet.dpro.session.jdbc.HADBConnectionImpl : jdcbURL=172.2.3.4:7676,172.2.3.3:7676 : minPoolSize=8 : maxPoolSize=32
amSession:05/15/2013 01:34:11:710 PM ICT: Thread[main,5,jboss]
UserName=amuser : clusterServerList=09 11 : connectionMaxWaitTime=5000 :jdbcDriverClass=com.iplanet.dpro.session.jdbc.HADBConnectionImpl : jdcbURL=172.2.3.4:7676,172.2.3.3:7676 : minPoolSize=8 : maxPoolSize=32
amSession:05/15/2013 02:45:42:465 PM ICT: Thread[main,5,jboss]
UserName=amuser : clusterServerList=09 11 : connectionMaxWaitTime=5000 :jdbcDriverClass=com.iplanet.dpro.session.jdbc.HADBConnectionImpl : jdcbURL=172.2.3.4:7676,172.2.3.3:7676 : minPoolSize=8 : maxPoolSize=32
The same is observed on OpenAM 7 server.
However, on OpenAM 8 server (the good behaving server), the Session debug log was showing:
However, on OpenAM 8 server (the good behaving server), the Session debug log was showing:
amSession:05/15/2013 12:19:58:447 PM ICT: Thread[main,5,jboss]
UserName=amuser : clusterServerList=09 11 12 : connectionMaxWaitTime=5000 :jdbcDriverClass=com.iplanet.dpro.session.jdbc.HADBConnectionImpl : jdcbURL=172.2.3.4:7676,172.2.3.3:7676 : minPoolSize=8 : maxPoolSize=32
Something is really strange. The configuration data on OpenAM 6 and 7 did not seem to be in-sync with OpenAM 8. But wasn't MMR auto-magically configured?
Resolution
A few more restart of OpenAM 6 and 7 did the trick!
The Session debug logs were finally showing the correct clusterServerList!!
UserName=amuser : clusterServerList=09 11 12 : connectionMaxWaitTime=5000 :jdbcDriverClass=com.iplanet.dpro.session.jdbc.HADBConnectionImpl : jdcbURL=172.2.3.4:7676,172.2.3.3:7676 : minPoolSize=8 : maxPoolSize=32
The Session debug logs were finally showing the correct clusterServerList!!
UserName=amuser : clusterServerList=09 11 12 : connectionMaxWaitTime=5000 :jdbcDriverClass=com.iplanet.dpro.session.jdbc.HADBConnectionImpl : jdcbURL=172.2.3.4:7676,172.2.3.3:7676 : minPoolSize=8 : maxPoolSize=32
Finally the configuration data were in-sync.
Luckily, we are moving away from embedded OpenDJ servers next week. We'll have 2 dedicated standalone OpenDJ servers (with MMR configured) to serve the 8 OpenAM servers (and more to be added few months down the road).
*phew*
No comments:
Post a Comment