Sunday, January 10, 2010

Sun Gathering Debug Data

When we deploy Sun Cluster and hit an issue, we usually run Sun Explorer. Its output lets the backend engineer analyze what has gone wrong.

When we deploy Sun Software Suite, we use Sun Gathering Debug Data (Sun GDD or GDD) instead.

GDD tools provide the right approach to problem resolution by leveraging proactive actions and best practices to help you gather the debug data needed for further analysis. For each product covered, GDD tools provide documentation and scripts which detail the relevant data the Sun Technical Support Center requires for analyzing your problem. The tools gather 90% of the debug data frequently requested by the Sun Technical Support Center, including data for the more common critical situations such as memory, start/stop, installation, hang, and crash issues. By collecting this data before you initiate a Service Request, you can substantially reduce the time needed to analyze and resolve the problem.

Saturday, January 9, 2010

Fault Monitoring of resource

I'm still in Senegal busy with the Sun Cluster for Oracle HA UAT.

(Pictures below are of Novotel Dakar. A very nice, cozy hotel right beside the sea.)

Besides configuring the Oracle database for HA, we are also responsible for monitoring the customer's applications via Sun Cluster.

There are 2 ways to configure Fault Monitoring for Generic Data Service (GDS):
1. Port monitoring (default)
2. Probe command monitoring

Port monitoring is fairly straightforward. It assumes your application is listening on a particular port. If Sun Cluster detects that this port is down, it will assume that your application has faulted. It will then attempt to restart the resource automatically.

The application for this telco is pretty complicated. There are times when the port is still alive, but the application has hung. This is exactly what happened here!

So port monitoring is not reliable in this case, at least for this application.

We need to use probe command monitoring instead. Probe command monitoring requires us to write a shell script that returns values such as 0 (success), 100 (complete failure) and 201 (immediate failover).
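A minimal sketch of what such a probe script could look like. The health-check command and its exit-code convention (2 meaning "this node is unusable") are assumptions for illustration, not something the application here actually provides; only the 0, 100 and 201 return values come from the description above.

```shell
#!/bin/sh
# Sketch of a GDS probe script. The health-check command passed in and
# its exit-code convention are hypothetical; replace them with a check
# that really exercises the application end to end.

# Run the health check given as arguments and translate its exit status
# into the probe return values described above.
probe_app() {
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    case $rc in
        0) return 0 ;;     # application answered correctly
        2) return 201 ;;   # assumed convention: node unusable, fail over now
        *) return 100 ;;   # anything else counts as a complete failure
    esac
}

# In a real probe script the last two lines would be something like
# (the path is a placeholder):
#   probe_app /opt/app/bin/healthcheck
#   exit $?
```

The point of the design is that the check talks to the application itself instead of just looking at the port, so a hung process with an open port still produces a failure.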

Now, there is an issue: port monitoring is turned on by default. If you add probe command monitoring, port monitoring keeps running. As a result, even if the probe command returns 100, Sun Cluster still treats the resource as alive as long as the port is alive.

This is no good. We need to disable port monitoring and rely totally on probe command monitoring.

How do we achieve that? By setting the GDS extension property Network_aware to FALSE:

-x Network_aware=FALSE
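For context, here is a sketch of where that flag could sit in a full command, assuming the older scrgadm syntax (where -x sets an extension property) and that SUNW.gds is already registered. The resource group, resource name and script paths are hypothetical. The function only prints the command so it can be reviewed before running it on a cluster node.

```shell
#!/bin/sh
# Build (but do not run) an scrgadm command line that creates a GDS
# resource with port monitoring disabled. All names passed in are
# placeholders chosen by the caller.
gds_create_cmd() {
    rg=$1       # resource group
    rs=$2       # resource name
    start=$3    # start script for the application
    probe=$4    # probe script returning 0 / 100 / 201

    printf 'scrgadm -a -j %s -g %s -t SUNW.gds' "$rs" "$rg"
    printf ' -x Start_command="%s"' "$start"
    printf ' -x Probe_command="%s"' "$probe"
    # Network_aware=FALSE is the setting from the post: rely on the
    # probe command only, not on the port check.
    printf ' -x Network_aware=FALSE\n'
}
```

For example, `gds_create_cmd app-rg app-rs /opt/app/bin/start /opt/app/bin/probe` prints a single scrgadm line ending in `-x Network_aware=FALSE`.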

Friday, January 8, 2010

LUN, Device Group and Mount Points are correlated

I flew 30 hours to reach Senegal for a Sun Cluster for Oracle HA UAT. The customer is a telco in Senegal.

The following is what I have implemented for them due to a constraint I encountered at the Storage level:

Oracle and App are now residing on the same node at the same time. Always.

Ideally, one would like to have Oracle active on 1 node and App active on the 2nd node. That would be best in terms of performance.

However, only 1 LUN was created by the storage engineer before the whole setup was shipped from China. No one here in Senegal knows how to reconfigure the LUN.

What's the implication of 1 LUN?

Now, if one is a software person like the customer, it would be very hard to appreciate the issue. I need to explain it in layman's terms.

Let's look at the table below:

A software person can only follow up to column 3 ("Mount Point on Solaris OS"). Beyond that, it would be fairly difficult to grasp terms like Device Group, LUN, etc.

In order to understand, we need to read from right-to-left (yes, the ancient Chinese way of reading).

If there is 1 LUN, there can only be 1 Device Group. If we only have 1 Device Group, then all mount points have to be defined under this group.

Since all mount points always have to stay together, all applications (Oracle and App) always have to stay together as well.

The ideal architecture will be the one shown below:

Oracle on 1 node; App on the other node

In order to have such a setup, we need to define at least 2 LUNs at the Storage level.
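The chain from LUN to application can also be sketched in a few lines of shell. This is only an illustration of the reasoning in this post: the application, mount point, device group and LUN names below are made up, and the rule "applications whose mount points share a device group move together" is the one described above.

```shell
#!/bin/sh
# Each line: application, mount point, device group, LUN.
# All names are hypothetical.
layout_one_lun() {
    cat <<'EOF'
oracle /oradata dg1 lun0
app /appdata dg1 lun0
EOF
}

layout_two_luns() {
    cat <<'EOF'
oracle /oradata dg1 lun0
app /appdata dg2 lun1
EOF
}

# Read a layout on stdin and print one line per device group, listing
# the applications that have to move together because their mount
# points live in that group.
colocated_apps() {
    awk '{
             if ($3 in apps) apps[$3] = apps[$3] " " $1
             else apps[$3] = $1
         }
         END { for (dg in apps) print dg ": " apps[dg] }' | sort
}
```

Running `layout_one_lun | colocated_apps` lists both applications under the single device group, while `layout_two_luns | colocated_apps` gives each application its own group, which is what allows Oracle and App to run on different nodes.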