Hanh Nguyen

Data Collection for Troubleshooting Oracle Clusterware (CRS or GI) and Real Application Clusters (RAC)

PURPOSE

This note lists what to collect for different types of Oracle Clusterware and Real Application Clusters issues. It is not mandatory to upload all the files to open an SR; however, uploading all relevant information up front will speed up the resolution.


File Formats for Data Uploaded to Oracle Support

Oracle Support requests that you upload compressed files grouped together by node and labeled as such in a standard format, such as .tar, .gz, .Z or .zip.

Older runs of diagcollection or other stale files (e.g., if diagcollection was run a few days or weeks earlier) may not provide current log information, which can delay the resolution.
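For example, a minimal sketch of packaging one node's collected files into a single labeled archive (the directory and archive names below are placeholders; adjust them to your environment):

$ cd /tmp/collected_logs_`hostname`          ## directory holding this node's files (placeholder)
$ tar cvf crs_data_`hostname`.tar *          ## group everything for this node into one tar file
$ gzip crs_data_`hostname`.tar               ## compress before uploading to the SR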

TROUBLESHOOTING STEPS


1. Data Gathering for All Oracle Clusterware Issues

Provide current diagcollection output from all nodes in the cluster.

Note 330358.1 – CRS 10gR2/ 11gR1/ 11gR2 Diagnostic Collection Guide
Note 272332.1 – CRS 10gR1 Diagnostic Collection Guide
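As a rough illustration only (the script location and options vary by version; follow the notes above for the exact syntax), diagcollection is typically run as root on every node, for example:

# cd /tmp
# /u01/app/11.2.0/grid/bin/diagcollection.pl --collect     ## Grid home path is an assumption; adjust to your environment

The collected archives are typically written to the current working directory; gather them from every node before uploading.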


2. Data Gathering for Node Reboot/Eviction

Provide files in Section “Data Gathering for All Oracle Clusterware Issues” and the following:

  1. Approximate date and time of the reboot, and the hostname of the rebooted node

  2. OSWatcher archives that cover the reboot time, collected at an interval of 20 seconds with private network monitoring configured (a start-up sketch follows this list).

Note 301137.1 – OS Watcher User Guide
Note 433472.1 – OS Watcher For Windows (OSWFW) User Guide
  3. For pre-11.2, zip of /var/opt/oracle/oprocd/* or /etc/oracle/oprocd/*

  4. For pre-11.2, OS logs – refer to Appendix B

  5. For 11gR2+, zip of /etc/oracle/lastgasp/* or /var/opt/oracle/lastgasp/*

  6. CHM/OS data that covers the reboot time, for platforms where it is available – refer to the section “How do I collect the Cluster Health Monitor data” in Note 1328466.1

  7. If vendor clusterware is being used, upload the vendor clusterware logs
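For item 2 above, a minimal sketch of starting OSWatcher at a 20-second interval (the install directory is an assumption; see Note 301137.1 for the exact setup, including the private.net file used for private network monitoring):

$ cd /u01/oswbb                       ## OSWatcher install directory (assumption)
$ nohup ./startOSWbb.sh 20 48 gzip &  ## 20-second snapshots, keep 48 hours of archives, compress with gzip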


3. Data Gathering for All Real Application Cluster Issues

From all nodes:

  1. Provide the instance alert_${ORACLE_SID}.log, lmon, lmd*, lms*, ckpt, lgwr, lck*, dia*, lmhb (11g only), and all other traces that were modified around the incident time. A quick way to identify all such traces and tar them up is to use the incident time, as in the following example:


$ grep "2010-09-02 03" *.trc | awk -F: '{print $1}' | sort -u |xargs tar cvf trace.`hostname`.`date +%Y%m%d%H%M%S`.tar$ gzip trace*.tar
For pre-11g, execute the command in bdump and udump to identify the list of files.
For 11g+, execute the command in ${ORACLE_BASE}/diag/rdbms/$DBNAME/${ORACLE_SID}/trace to identify the list of files
  2. Incident files/packages referenced in the alert.log at the time of the incident

  3. If ASM is involved, provide the same set of files for the ASM instances (see the sketch after this list)

  4. OS logs – refer to Appendix B
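For example, a sketch of running the same trace-identification command for an 11g+ database and its ASM instance (the database name orcl and instance names orcl1 / +ASM1 are placeholders; the ASM diagnostic directory is typically under the grid user's ORACLE_BASE):

$ cd $ORACLE_BASE/diag/rdbms/orcl/orcl1/trace
$ grep "2010-09-02 03" *.trc | awk -F: '{print $1}' | sort -u | xargs tar cvf /tmp/db_trace.`hostname`.tar
$ cd $ORACLE_BASE/diag/asm/+asm/+ASM1/trace
$ grep "2010-09-02 03" *.trc | awk -F: '{print $1}' | sort -u | xargs tar cvf /tmp/asm_trace.`hostname`.tar
$ gzip /tmp/*_trace.`hostname`.tar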


4. Data Gathering for Real Application Cluster Performance/Hang Issues

Provide files in Section “Data Gathering for All Real Application Cluster Issues” and the following:

  1. systemstate and hanganalyze – refer to Appendix C

  2. AWR, ADDM and ASH reports, each covering a period of no more than 60 minutes (a generation sketch follows this list)

  3. OSWatcher archives that cover the hang time

Note 301137.1 – OS Watcher User Guide
Note 433472.1 – OS Watcher For Windows (OSWFW) User Guide

  4. CHM/OS data that covers the hang time, for platforms where it is available – refer to the section “How do I collect the Cluster Health Monitor data” in Note 1328466.1
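For item 2 above, a minimal sketch of generating the reports from SQL*Plus (each script prompts for the snapshot or time range; choose a range that keeps each report under 60 minutes):

SQL> @?/rdbms/admin/awrrpt.sql    ##..AWR report for one instance
SQL> @?/rdbms/admin/addmrpt.sql   ##..ADDM report for the same snapshot range
SQL> @?/rdbms/admin/ashrpt.sql    ##..ASH report for the hang window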


5. Data Gathering for Oracle Clusterware Installation Issues


5.1. Failure before executing root script:

For 11gR2: Note 1056322.1 – Troubleshoot 11gR2 Grid Infrastructure/RAC Database runInstaller Issues

For pre-11.2: Note 406231.1 – Diagnosing RAC/RDBMS Installation Problems


5.2. Failure while or after executing root script

Provide files in Section “Data Gathering for All Oracle Clusterware Issues” and the following:

  1. Screen output of the root script (root.sh or rootupgrade.sh); a capture sketch follows this list

  2. For 11gR2: provide a zip of $ORACLE_BASE/cfgtoollogs and $ORACLE_BASE/diag for the grid user.

  3. For pre-11.2: Note 240001.1 – Troubleshooting 10g or 11.1 Oracle Clusterware Root.sh Problems
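For example, a sketch of capturing the root script screen output and zipping the 11gR2 logs (the log and zip file names are placeholders; run the zip command as the grid user so that $ORACLE_BASE resolves to the grid user's base):

# script /tmp/rootscript_`hostname`.out   ## record the terminal session to a file
# ./rootupgrade.sh                        ## or ./root.sh for a new installation
# exit                                    ## stop recording; upload the .out file
$ zip -r cfgtool_diag_`hostname`.zip $ORACLE_BASE/cfgtoollogs $ORACLE_BASE/diag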


Appendix A. RDA

It is recommended to provide the latest RDA output for all issues, from all nodes in the cluster.

Note 314422.1 – Remote Diagnostics Agent (RDA)


Appendix B. OS logs

OS logs are in the following directory depending on platform:

Linux: /var/log/messages

AIX: /bin/errpt -a (redirect this to a file called messages.out)

Solaris: /var/adm/messages

HP-UX: /var/adm/syslog/syslog.log

Tru64: /var/adm/messages

Windows: save Application Log and System Log as .TXT files using Event Viewer

Note: From 11gR2, OS logs are part of diagcollection on Linux, Solaris, HP-UX.
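As an illustration, saving the OS log on two of the platforms above (the target directory is a placeholder; root privileges may be required to read the logs):

$ mkdir -p /tmp/oslogs_`hostname`
$ cp /var/log/messages* /tmp/oslogs_`hostname`/         ## Linux
$ /bin/errpt -a > /tmp/oslogs_`hostname`/messages.out   ## AIX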


Appendix C. systemstate and hanganalyze in RAC

To collect hanganalyze and systemstate in RAC, execute the following on one instance to generate cluster wide dumps:

a – Connect to sqlplus as sysdba: “sqlplus / as sysdba”; if this does not work, use “sqlplus -prelim / as sysdba”

b – Execute the following commands:

  1. For 11g+

SQL> oradebug setospid <ospid of diag process>
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
##..Wait about 2 minutes
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 258

If possible, take another systemstate dump at level 266 instead of 258. If the SGA is large or the fix for bug 11800959 (fixed in 11.2.0.2 DB PSU5, 11.2.0.3 and above) is not applied, level 266 can take a very long time, generate a huge trace file, and may not finish for hours.
  2. For 10g

SQL> oradebug setospid <ospid of diag process>
SQL> oradebug unlimit
SQL> oradebug -g all dump systemstate 266
##..Wait about 2 minutes
SQL> oradebug -g all dump systemstate 266

Please upload the *diag* trace from either the bdump or trace directory.
  3. If the diag trace is huge or the “oradebug -g all …” command is hanging, collect a system state dump from each instance individually at around the same time:

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3
##..Wait about 2 minutes
SQL> oradebug hanganalyze 3
SQL> oradebug dump systemstate 258
SQL> oradebug tracefile_name

Please upload the trace file listed above.
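The <ospid of diag process> used in the “oradebug setospid” steps above can be found with a query such as the following (a sketch; run as SYSDBA on the instance being dumped):

SQL> select p.spid from v$process p, v$bgprocess b
  2  where b.paddr = p.addr and b.name = 'DIAG';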
