Wednesday, May 25, 2016

Stable Release

Some customers have the notion that they should deploy the latest stable release as soon as it is made available, even to the extent of performing an upgrade mid-way through a project implementation.

These tend to be customers with no operational experience. They also happen to be the group with very little product development background.

On the other hand, I have a customer who is still running OpenAM 10.0.2. We talked about upgrading last November. I told him OpenAM 13.0.0 was scheduled to be released in December. 

Straight away, he decided to request an upgrade budget from management, but the upgrade job is only to be carried out this August. Why?

"Never go for the initial release of a major version. I am going to wait for the next minor release after 13.0.0." Wise man! I like working with this type of customer; we are on the same wavelength. Some customers question or doubt you when you advise them based on field experience. :)

By the way, OpenAM 13.5 is on the way soon.... 

  • Summer 2016 
  • Push Authentication offering password-less operation 
  • Further Admin Console ease of use work 
  • Stateless OAuth2/OIDC


Tuesday, May 24, 2016

ELK . OpenAM . OpenDJ - Part 3

Today, my customer called to alert me that their SSO infrastructure was "not stable". We had an OAuth2 Provider configured for this customer, and OAuth2 client authentication was hitting unauthorized errors. A little while later, he called me again to tell me the login page had gone into an infinite loop.

But ... but ... everything went back to normal after a while. 

OpenAM/OpenDJ are very stable products in our experience, which dates back to the Sun Microsystems days. They do not suddenly go into crash mode.

In the end, we found out that the network team was doing a tech refresh of the network. Wow! During office hours? Yes. Well done!

Anyway, before we found the root cause, the first thing I did was quickly check the ELK stack we had set up for this customer. Nothing unusual.

I blogged about how my team uses the ELK platform to monitor the ForgeRock Open Identity Stack here and here.

"Invalid Password Server Trend Live" - This tracks the user invalid password events.

In my previous blog post, I wrote that we noticed a huge Invalid Password spike and subsequently identified that it was due to a malfunctioning automated application.

Below is the trend for the past 30 days. It is obvious the application went haywire on 29th April and was rectified on 11th May.

I went to the same dashboard and selected the trend for the past 7 days. As observed, on a normal day the maximum Invalid Password count should be in the 65 range (per OpenAM server).

This is something we are working on right now. With this threshold in mind, we are building a Notification Service to trigger proactive alerts to the Ops team.
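The check itself can be as simple as comparing a per-server event count against that baseline. A minimal shell sketch, using a made-up sample log (the log format, file name and threshold here are illustrative assumptions, not our production values):

```shell
# Sketch: alert when Invalid Password events exceed a per-server baseline.
# THRESHOLD and the sample log lines are illustrative assumptions.
THRESHOLD=65
LOG=amAuthentication.error.sample

# Create a tiny sample log so the sketch is runnable end-to-end.
printf '%s\n' \
  '"2016-05-10 13:00:43" "Invalid Password" uid=u1 AUTHENTICATION-201' \
  '"2016-05-10 13:01:10" "Invalid Password" uid=u2 AUTHENTICATION-201' \
  '"2016-05-10 13:02:55" "Login Failed" uid=u3 AUTHENTICATION-200' \
  '"2016-05-10 13:03:21" "Invalid Password" uid=u1 AUTHENTICATION-201' \
  > "$LOG"

# Count Invalid Password events and compare against the baseline.
COUNT=$(grep -c 'Invalid Password' "$LOG")
if [ "$COUNT" -gt "$THRESHOLD" ]; then
  echo "ALERT: $COUNT invalid password events (baseline $THRESHOLD)"
else
  echo "OK: $COUNT invalid password events"
fi
```

In production this would run per OpenAM node and feed the Notification Service rather than echo to a terminal.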


Saturday, May 21, 2016

OpenAM Security Advisory #201604

I slept at almost 2:00 AM last night; I was busy upgrading OpenAM 11.0.2 to OpenAM 11.0.3 for a customer. I woke up to an email from ForgeRock, sent at 6:47 AM, regarding OpenAM Security Advisory #201604.


This advisory affects OpenAM 11.0.3, and I'm supposed to patch it! Bang my head. I could have delayed the upgrade till next week (since today is Vesak Day). :>

Friday, May 20, 2016

Why Do IAM Projects Fail? Part II

I woke up this morning to an email from SailPoint, which Gartner named the leader in the identity governance and administration (IGA) market for Q2 2016.

By the way, the underlying engine of SailPoint is Waveset from Sun Microsystems. My team is still supporting a customer who has not yet migrated off Waveset. We have a long history of implementing products from Sun Microsystems.

I'm fairly surprised Dell One Identity solutions are in the Leaders quadrant; they are not gaining traction in the Asia South region. My team is trained in this product as well. Still waiting for a deal. Ha!

But then, so what if a product is in the top quadrant? Does that guarantee implementation success?

I said before and I am repeating now:

In the market, there is really not much difference between the various IDM/IGA products. I can safely say their features are 85-90% similar. It's the implementers and the customers' key stakeholders ("People") that make the difference between a successful and a failed IAM project.

Simply put, the Magic Quadrant is usually used by top decision-makers to cover their backsides.

I talked about the pain points previously.
Again (coincidentally?), the biggest pain is People:
  • No ownership/main driver (no full-time PM)
  • Not trained
  • Not really knowing what they need (constantly changing requirements)
  • Not enough support from application teams
  • No well-thought-out test plans & not following test plans
If I may, I would like to add two points which I recently observed.

  • Design documents were signed off with no intention of following through (this happens especially frequently in this region & usually causes huge delays)
  • Test plans were not vetted, and tests were carried out without support from the internal team

These are caused by poor leadership. When a system is not well tested, especially the edge cases, things will break in production. And it's common. It is easy to blame the product, the implementers and the testers. But remember: when you point a finger at others, at least three other fingers are pointing back at yourself.

On the second point - "Not trained" - this could also mean the implementer is not well trained. Oh really? Yes, it does happen. And it happened to us recently!

But being a responsible implementer, do we want it to happen? No. And do we want to redeem ourselves if given a chance? Definitely.

I still remember, many years back, being tasked by Sun Microsystems to debug a Sun Access Manager (the grandfather of OpenAM, by the way) issue for a telco in Vietnam. I was given 22 days. In the end, it took me 4 long months! I was new to Sun Access Manager then. I was exploring. I'm not stupid, by the way; when you are new to a product, you need time and patience.

Most importantly, my customer believed in me. He did not doubt me. He did not complain to Sun Microsystems behind my back. I felt bad about the delay, but he was understanding. With help from the support & product teams, the issue was finally resolved. Usually, that's what the support & product teams require: eyes onsite to provide accurate feedback to them.

I keep repeating this story to my team these days, especially now that morale is low.

In computing, there is no such thing as "not fixable". Any bug can be fixed; it's just a matter of time. But of course, customers need to give you the chance.


To my team-mates: I handpicked you and I trained you, so I believe in you.


Thursday, May 19, 2016

ELK . OpenAM . OpenDJ - Part 2

I blogged about how we use ELK to monitor trends, especially abnormal ones, in the ForgeRock Identity Stack.

"Login Failed Server Trend Live" - This tracks the user login failure events. 

Just a few days ago, we observed that the number of Login Failed events had increased.

So we zoomed in and found many lines with the following error:

"2016-05-17 00:09:00"   "Login Failed|module_instance|Application"      "Not Available" golfdigest 202.xx.xx.xx   INFO  AUTHENTICATION-268     "cn=dsameuser,ou=DSAME Users,"    "Not Available" Application     202.xx.xx.xx

I know that if module_instance is Application, this is not a user authentication; most likely, it is a Policy Agent in action. By the way, every Policy Agent needs to authenticate with OpenAM in order to pull its policies.

So, based on the IP address (202.xx.xx.xx), we found the owner of the application. Ah! It is a defunct site. The SSO administrator had already removed the "golfdigest" Policy Agent profile from OpenAM as part of the sunsetting process, but the network team had not yet disabled/removed the Policy Agent on the Apache web server.
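For cases like this, a quick group-by on the audit log surfaces the noisy source before anyone needs to pick up the phone. A hedged sketch over a made-up sample (the sample lines and field positions are assumptions; real amAuthentication.error lines carry more columns):

```shell
# Sketch: count "Login Failed" agent events per (agent, IP) pair.
LOG=amAuthentication.error.sample

# Made-up sample lines; the last two fields are the agent name and source IP.
printf '%s\n' \
  'Login Failed|module_instance|Application AUTHENTICATION-268 golfdigest 202.10.10.10' \
  'Login Failed|module_instance|Application AUTHENTICATION-268 golfdigest 202.10.10.10' \
  'Login Failed|module_instance|DataStore AUTHENTICATION-200 alice 10.0.0.5' \
  > "$LOG"

# Keep only agent (Application) failures, then count per agent/IP;
# the noisiest source floats to the top.
TOP=$(grep 'module_instance|Application' "$LOG" \
  | awk '{print $(NF-1), $NF}' | sort | uniq -c | sort -rn | head -n 1)
echo "Noisiest source: $TOP"
```

From there it is one lookup in the IP inventory to find the application owner.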

We have many more useful dashboards in Kibana, which we use for operational purposes. I'll share more later.


Wednesday, May 18, 2016

Ansible . OpenAM . OpenDJ . OpenIDM

For some of our larger deployments of the ForgeRock Identity Stack, we usually request to install Ansible on the development node.

For a pure Linux environment, it works like a charm once SSH keys are exchanged during initial setup. (Windows is supported too, but slightly more complicated to set up.)

What do we use Ansible for? Almost every operational task.

Changing configuration files; updating custom code; updating JSP/UI pages; deploying patches from ForgeRock; restarting Apache/Tomcat servers ... anything.
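As an illustration, a task like "restart Tomcat on all OpenAM nodes" becomes a short playbook pushed from the development node. This is only a sketch: the "openam" inventory group and "tomcat" service name are assumptions about the environment, and since actually running it requires Ansible plus SSH access to the nodes, the snippet below only writes the playbook file:

```shell
# Sketch: generate a minimal playbook to restart Tomcat on all OpenAM nodes.
# "openam" (inventory group) and "tomcat" (service name) are assumed names.
cat > restart-tomcat.yml <<'EOF'
---
- hosts: openam
  become: yes
  tasks:
    - name: Restart Tomcat
      service:
        name: tomcat
        state: restarted
EOF

# From the development node, it would then be pushed with:
#   ansible-playbook -i inventory restart-tomcat.yml
echo "Playbook written: restart-tomcat.yml"
```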

We used to make occasional human errors when managing a farm of over 10 OpenAM and 6 OpenDJ servers. Now that everything is pushed from the development node, there is hardly any human error.

I just saw Ansible Tower on the Ansible website.

There is no urgent need for it; our current Ansible setup already has logging/auditing in place.

Anyway, once a playbook is tested, there is hardly any error when it is executed.


Tuesday, May 17, 2016

ELK . OpenAM . OpenDJ

We deployed ELK (Elasticsearch, Logstash and Kibana) some time back for a long-time OpenAM/OpenDJ customer of ours. The idea is not new; similar solutions have been deployed by other ForgeRock folks/partners.

We know ELK can only keep trends, not send notifications. (OK, OK - Elastic does offer Watcher in its commercial version.) What we intend to do is add a Notification Service alongside ELK. No, we do not intend to keep all data from OpenAM/OpenDJ in Elasticsearch and trigger alerts from there. Some data is not useful to keep in Elasticsearch (e.g. the total entry count from each OpenDJ, used to determine whether replication is operating optimally). We just need a simple cache layer (e.g. Ehcache) to hold this kind of "data-in-transit" in order to trigger alerts to administrators/operators.

I'll talk more about this next time.

But so far, how useful has ELK been to the customer? The feedback is pretty good.

"Login Failed Server Trend Live" - This is a live trend whereby Logstash agents ship "live" data from all OpenAM servers by monitoring the amAuthentication.error logs. It tracks user login failure events.

If the login failure count is high for a particular day on a particular OpenAM node, we can zoom into the amAuthentication.error log to find out more.
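For reference, the shipping side is just a Logstash file input tailing that log. A minimal configuration sketch, assuming a typical install path (the path, tag and grok pattern are assumptions about the deployment, not our exact config):

```conf
# Sketch of a Logstash pipeline for the OpenAM auth error log.
# Path, tag and grok pattern are illustrative assumptions.
input {
  file {
    path => "/opt/openam/openam/log/amAuthentication.error"
    start_position => "beginning"
    tags => ["openam-auth-error"]
  }
}
filter {
  # Pull out the timestamp and event name (e.g. "Login Failed").
  grok {
    match => { "message" => "\"%{TIMESTAMP_ISO8601:ts}\"\s+\"?%{DATA:event}\"?\s+" }
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```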

"Invalid Password Server Trend Live" - This tracks the user invalid password events. 

This trend is different from the previous one. An Invalid Password event happens when the user ID is correct but the password is invalid.

"2016-05-10 13:00:43"   "Invalid Password"      "Not Available" UID=ntustc001,ou=XXXX,   INFO  AUTHENTICATION-201      "cn=dsameuser,ou=DSAME Users,"       "Not Available" LDAP

A quick zoom into the amAuthentication.error log revealed that a particular user was attempting to log in with an invalid password.

[amuser@f1]$ cat amAuthentication.error.20160510 | wc -l

[amuser@f1]$ cat amAuthentication.error.20160510 | grep -i ntustc001 | wc -l

A total of >18k invalid login attempts. That's quite unusual.

This is where the customer service personnel can call up their paying customer to find out what exactly happened and whether he/she requires a password reset.

Proactive customer engagement model!

By the way, if you look at amAuthentication.error in depth, you might see some Chinese characters, like 登录失败 (Login Failed) and 无效密码 (Invalid Password). These entries come from browsers with a Chinese locale.

"2016-05-10 07:00:33"   登录失败        "Not Available" "Not Available"  INFO  AUTHENTICATION-200      "cn=dsameuser,ou=DSAME Users,"    "Not Available" LDAP

"2016-05-10 08:56:22"   无效密码        "Not Available" uid=A480,ou=xxx,    INFO  AUTHENTICATION-201      "cn=dsameuser,ou=DSAME Users,"    "Not Available" LDAP

Thursday, May 12, 2016

LDIF Delta Utility

Recently, one of our customers ran into a replication out-of-sync issue. The root cause was that one of the 2 OpenDJ replication servers was under stress. The data in the 4 OpenDJ directory servers (connected to the 2 replication servers) amounted to about half a million entries.

Reinitialization of 500k+ entries took too long, and the customer could not wait. So I did a quick LDIF export (export-ldif) from a good directory server and imported it into the out-of-sync directory server. That was much faster - around 5 minutes.

After the incident, I continued to monitor the replication status of the 4 directory servers.

Total Entries in each OpenDJ 
+++++++++++++++++++++ : 551852 : 551853 : 551851 : 551853

Hmm... 1-2 entries are still not catching up. But exactly which entries differ among the 4 directory servers? And which directory server should I trust as the master now, if I decide to re-sync all 4 OpenDJ servers again?

I had no idea until I discussed the issue with a colleague. He had just been reading up on CA Directory, as he was installing CA SiteMinder (which uses CA Directory as its configuration store) for a customer of ours.

He told me there is a utility I can use: ldifdelta.

Use the ldifdelta tool to calculate the change, or delta, between two LDIF files. The ldifdelta program is an offline directory synchronization tool based on the LDAP directory interchange format. You can use ldifdelta to fully or partially synchronize directories.

Bingo! Exactly what I need.
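That said, ldifdelta is a CA Directory utility, not something that ships with OpenDJ. For a delta this small, even comparing the DNs of two sorted exports gets you the offending entries. A runnable sketch on two tiny made-up LDIF files (real exports would come from export-ldif and should be normalized the same way first):

```shell
# Sketch: find entries present in one LDIF export but missing from another.
# serverA.ldif / serverB.ldif are tiny made-up exports for illustration.
cat > serverA.ldif <<'EOF'
dn: uid=alice,ou=people,dc=example,dc=com
uid: alice

dn: uid=bob,ou=people,dc=example,dc=com
uid: bob
EOF

cat > serverB.ldif <<'EOF'
dn: uid=alice,ou=people,dc=example,dc=com
uid: alice
EOF

# Compare just the DNs; comm -23 prints DNs present only in serverA.ldif.
grep '^dn:' serverA.ldif | sort > a.dns
grep '^dn:' serverB.ldif | sort > b.dns
MISSING=$(comm -23 a.dns b.dns)
echo "Missing from serverB: $MISSING"
```

Once the stray DNs are known, you can decide which server to trust and re-sync just those entries.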


Tuesday, May 10, 2016

Software Support

Recently, we hit a production issue that we were not able to resolve by ourselves, so we raised a support ticket on behalf of our customer. The support engineer needed a core dump from when the issue resurfaced, so he suggested using Process Explorer or the Debug Diagnostic tool.

Sure, no problem. But he added: "... If you encounter problem on dump capture, can you please engage Microsoft to assist as both process explorer and debug diagnostic tool are provided by Microsoft to capture dump."

Oh well, you were the one who asked the customer to use tools from Microsoft to help you with debugging. Now, if those suggested tools have issues, we need to raise a separate ticket with Microsoft?

A few weeks ago, we raised another ticket with the same engineer. By the way, he is the support engineer for Product A in this company. The ticket was about how to integrate Product A with Product B from the same company.

To our astonishment, he responded with the following:

"It seems like some configuration issue on Product B. I suggest you open a ticket with Product B to check the configuration. Beside, from Product A documentation that you pointed out, I didn't find the OAuth provider can be Product B (Facebook and Google are the OAuth Provider mentioned). Therefore, I'm not sure if this can be done with Product B. If you have any additional documentation or details on how to integrate Product A with Product B as OAuth Provider, please share with me."

Hello, who is the customer here? If you are not sure, you jolly well walk over (or Skype/email) to the Product B support team and find out more from them. You are asking the customer to bridge the communication gap between your own two support teams? This is embarrassing.

You better wake up!

Side note: Besides ForgeRock products, my team delivers IAM products from other principals as well.

Wednesday, May 4, 2016

OpenAM & Facebook Business Manager

We know OpenAM supports OAuth 2.0/OpenID Connect authentication modules. OpenAM provides a wizard for configuring common OAuth 2.0/OpenID Connect authentication providers, such as Facebook, Google, and Microsoft.

In most use cases - Facebook, for example - customers go to Facebook Developers and create a new app for the company.

The whole company then uses a common Facebook App ID.

Now, what if a company has multiple customer-facing websites? Each business unit owner would like to have their own Facebook Analytics for Apps to understand how people access their website.

This is where Facebook Business Manager comes in handy.


Two things to change:
1. Create one OAuth 2.0/OpenID Connect authentication module for each sub-account.
2. Change the FB Login icon on each website to authenticate with the appropriate authentication module created in Step 1.