Saturday, April 13, 2013

Connection issue from OpenAM to MS Active Directory

I have customers who are using Microsoft Active Directory as the authoritative authentication store, as well as user data store. So do we internally (of course, ours is meant for a live POC/showcase).

We know that the connection to MS AD is via the LDAP Authentication Module. It has been there for years. However, it has its own shortcomings which I blogged here 3 years ago

Starting in OpenAM 10.0, the LDAP Authentication Module has been revamped to add support for the latest LDAP Behera password policy standards (See OPENAM-613). The shortcomings mentioned in my blog previously was resolved in this revamp. However, it brings with it another issue - somehow, the connection pool wasn't working as expected (See OPENAM-590, OPENAM-627OPENAM-1787). 

A fix (in fact, a few iterations in total) was released. The latest being 3.0.0-OPENAMp3. (See maven repository here)  (I'm not too sure about 3.0.0-Xpress1. This must be something new which I have not tested yet.)

A customer of ours in Thailand initially encountered connection timeout issue whenever the OpenAM is left "idle" for a period of time. e.g. the 1st user who attempted to log-in in the morning. It was an easy fix for this customer by replacing with opendj-ldap-sdk-3.0.0-OPENAMp3.jar. The system has been running superbly well since last December.

No good news for our internal SSO infrastructure though. :( In one of our test environment which we simulate the setup in Thailand, everything was running fine. No connection time out issue at all - for days, for weeks. 

However, in our own production system, no luck. On and off, we'll get connection time out issue. It usually happens to the 1st staff who came into the office in the morning. So we started to zoom in and I must say it's super time-consuming. 

When the problem occurred, we run jstack to determine the potential root cause. Pretty lucky here! 

"http-apr-" daemon prio=10 tid=0x00007f8b88010800 nid=0x1841 waiting on condition [0x00007f8b03e73000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007d92a3430> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos( at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos( at java.util.concurrent.LinkedBlockingQueue.poll( at org.forgerock.opendj.ldif.ConnectionEntryReader.getNextResponse( at org.forgerock.opendj.ldif.ConnectionEntryReader.hasNext( at com.sun.identity.authentication.modules.ldap.LDAPAuthUtils.searchForUser( at com.sun.identity.authentication.modules.ldap.LDAPAuthUtils.authenticateUser( at com.sun.identity.authentication.modules.ldap.LDAP.process( at com.sun.identity.authentication.spi.AMLoginModule.wrapProcess( at com.sun.identity.authentication.spi.AMLoginModule.login( at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( at sun.reflect.DelegatingMethodAccessorImpl.invoke( at java.lang.reflect.Method.invoke( at com.sun.identity.authentication.jaas.LoginContext.invoke( at com.sun.identity.authentication.jaas.LoginContext.login( at com.sun.identity.authentication.service.AMLoginContext.runLogin( at com.sun.identity.authentication.server.AuthContextLocal.submitRequirements( at com.sun.identity.authentication.AuthContext.submitRequirements( at com.sun.identity.authentication.AuthContext.submitRequirements( at org.apache.jsp.getServerInfo_jsp._jspService( at org.apache.jasper.runtime.HttpJspBase.service( at javax.servlet.http.HttpServlet.service( at org.apache.jasper.servlet.JspServletWrapper.service( at org.apache.jasper.servlet.JspServlet.serviceJspFile( at org.apache.jasper.servlet.JspServlet.service( at javax.servlet.http.HttpServlet.service( at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( at org.apache.catalina.core.ApplicationFilterChain.doFilter( at org.forgerock.openam.validation.ResponseValidationFilter.doFilter( at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( at org.apache.catalina.core.ApplicationFilterChain.doFilter( at com.sun.identity.setup.AMSetupFilter.doFilter( at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( at org.apache.catalina.core.ApplicationFilterChain.doFilter( at org.apache.catalina.core.StandardWrapperValve.invoke( 

We zoomed into, in particular the searchForUser method. This will attempt to query a user from the Active Directory via a LDAP connection from the Admin Connection Pool - adminPool.

So, this is how the original code section looks like :

 if (adminPool) {
                            BindRequest bindRequest = Requests.newSimpleBindRequest(bindingUser, bindingPwd.toCharArray());

                            connFactory = Connections.newAuthenticatedConnectionFactory(connFactory, bindRequest);

We suspected the connection object is stale within the Admin Connection Pool. Somehow, the fix in 3.0.0-OPENAMp3 resolves the stale connections if the backend is a OpenDJ. It was not tested rigorously against a MS AD backend. ( Our guess here. Do not point a gun on us. :> ) Well, it's named opendj-ldap-sdk-3.0.0-OPENAMp3.jar for a reason right? 

So, how do we resolve this issue? Firstly, we do not want to touch opendj-ldap-sdk-3.0.0-OPENAMp3.jar for sure. This to us is external to OpenAM core. 

The safest approach is to add a 1-liner fix to

 if (adminPool) {
                            BindRequest bindRequest = Requests.newSimpleBindRequest(bindingUser, bindingPwd.toCharArray());
                            connFactory = Connections.newHeartBeatConnectionFactory(Connections.newAuthenticatedConnectionFactory(connFactory, bindRequest),10, TimeUnit.SECONDS);

The 1-liner fix attempts to perform a heartbeat on each connection object every 10 seconds. ( Of course, 10 seconds is debatable. You can choose your own magic number )

Our production OpenAM has been running smoothly since.  I think about a month already.

PS: Is this a bug? I do not think so. That's why I never raise a bugster. Strange thing is against some MS AD, no problem. Against another set of MS AD, problem occurs. This has been haunting me for a while since last year after OpenAM 10.0 was released.


No comments:

Post a Comment