top of page
Writer's pictureHanh Nguyen

Troubleshoot Grid Infrastructure Startup Issues

Scope

This document is intended for Clusterware/RAC Database Administrators and Oracle support engineers.

Details

Start up sequence:

In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd asm etc), and crsd starts agents that start user resources (database, SCAN, listener etc).

For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1

Cluster status To find out cluster and daemon status:

$GRID_HOME/bin/crsctl check crs CRS-4638: Oracle High Availability Services is online CRS-4537: Cluster Ready Services is online CRS-4529: Cluster Synchronization Services is online CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init ——————————————————————————– NAME           TARGET  STATE        SERVER                   STATE_DETAILS ——————————————————————————– Cluster Resources ——————————————————————————– ora.asm 1        ONLINE  ONLINE       rac1                  Started ora.crsd 1        ONLINE  ONLINE       rac1 ora.cssd 1        ONLINE  ONLINE       rac1 ora.cssdmonitor 1        ONLINE  ONLINE       rac1 ora.ctssd 1        ONLINE  ONLINE       rac1                  OBSERVER ora.diskmon 1        ONLINE  ONLINE       rac1 ora.drivers.acfs 1        ONLINE  ONLINE       rac1 ora.evmd 1        ONLINE  ONLINE       rac1 ora.gipcd 1        ONLINE  ONLINE       rac1 ora.gpnpd 1        ONLINE  ONLINE       rac1 ora.mdnsd 1        ONLINE  ONLINE       rac1

For 11.2.0.2 and above, there will be two more processes:

ora.cluster_interconnect.haip 1        ONLINE  ONLINE       rac1 ora.crf 1        ONLINE  ONLINE       rac1

To start an offline daemon – if ora.crsd is OFFLINE:

$GRID_HOME/bin/crsctl start res ora.crsd -init

Case 1: OHASD does not start As ohasd.bin is responsible to start up all other cluserware processes directly or indirectly, it needs to start up properly for the rest of the stack to come up. If ohasd.bin is not up, when checking it’s status, CRS-4639 (Could not contact Oracle High Availability Services) will be reported; and if ohasd.bin is already up, CRS-4640 will be reported if another start up attempt is made; if it fails to start, the following will be reported:

CRS-4124: Oracle High Availability Services startup failed. CRS-4000: Command Start failed, or completed with errors.

Automatic ohasd.bin start up depends on the following:

1. OS is at appropriate run level:

OS need to be at specified run level before CRS will try to start up.

To find out at which run level the clusterware needs to come up:

cat /etc/inittab|grep init.ohasd h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

Above example shows CRS suppose to run at run level 3 and 5; please note depend on platform, CRS comes up at different run level.

To find out current run level:

who -r


“init.ohasd run” is up

On Linux/UNIX, as “init.ohasd run” is configured in /etc/inittab, process init (pid 1, /sbin/init on Linux, Solaris and hp-ux, /usr/sbin/init on AIX) will start and respawn “init.ohasd run” if it fails. Without “init.ohasd run” up and running, ohasd.bin will not start:

ps -ef|grep init.ohasd|grep -v grep root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run

Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab, rather, init.ohasd will be configured in upstart in /etc/init, however, the process “”/etc/init.d/init.ohasd run” should still be up.

If any rc Snncommand script (located in rcn.d, example S98gcstartup) stuck, init process may not start “/etc/init.d/init.ohasd run”; please engage OS vendor to find out why relevant Snncommand script stuck.

Error “[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started.” may be reported of init.ohasd fails to start on time.

3. Cluserware auto start is enabled – its enabled by default

By default CRS is enabled for auto start upon node reboot, to enable:

$GRID_HOME/bin/crsctl enable crs To verify whether its currently enabled or not:

$GRID_HOME/bin/crsctl config crs If the following is in OS messages file

Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled. Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr The reason is the file does not exist or not accessible, cause can be someone modified it manually or wrong opatch is used to apply a GI patch(i.e. opatch for Solaris X64 used to apply patch on Linux).

4. syslogd is up and OS is able to execute init script S96ohasd

OS may stuck with some other Snn script while node is coming up, thus never get chance to execute S96ohasd; if that’s the case, following message will not be in OS messages:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart. If you don’t see above message, the other possibility is syslogd(/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.

To find out whether OS is able to execute S96ohasd while node is coming up, modify S96ohasd:

From:

case `$CAT $AUTOSTARTFILE` in enable*) $LOGERR “Oracle HA daemon is enabled for autostart.” To:

case `$CAT $AUTOSTARTFILE` in enable*) /bin/touch /tmp/ohasd.start.”`date`” $LOGERR “Oracle HA daemon is enabled for autostart.” After a node reboot, if you don’t see /tmp/ohasd.start.timestamp get created, it means OS stuck with some other Snn script. If you do see /tmp/ohasd.start.timestamp but not “Oracle HA daemon is enabled for autostart” in messages, likely syslogd is not fully up. For both case, you will need engage System Administrator to find out the issue on OS level. For latter case, the workaround is to “sleep” for about 2 minutes, modify ohasd:

From:

case `$CAT $AUTOSTARTFILE` in enable*) $LOGERR “Oracle HA daemon is enabled for autostart.” To:

case `$CAT $AUTOSTARTFILE` in enable*) /bin/sleep 120 $LOGERR “Oracle HA daemon is enabled for autostart.”

  1. File System that GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd is executed, following message should be in OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart. .. Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin “reboot” If you see the first line, but not the last line, likely the filesystem containing the GRID_HOME was not online while S96ohasd is executed.

6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid

ls -l $GRID_HOME/cdata/*.olr -rw——- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr

If the OLR is inaccessible or corrupted, likely ohasd.log will have similar messages like following:

.. 2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR 2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m’:failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory 2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory 2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device 2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26] 2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2] 2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26 2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry 2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR OR

.. 2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5 2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format. 2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted 2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device 2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26] 2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage 2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26 2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry 2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR OR

.. 2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user 2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user OR

ohasd.bin comes up but output of “crsctl stat res -t -init”shows no resource, and “ocrconfig -local -manualbackup” fails OR

.. 2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed 2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model 2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false: 2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE] 2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing… 2010-08-04 13:13:11.103: [ default][35] Dump State Starting … .. 2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS: :VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool

————- SERVER POOLS: Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED

2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents…:ICE operation count: 0 2010-08-04 13:13:11.113: [ default][35] Dump State Done. The solution is to restore a good backup of OLR with “ocrconfig -local -restore <ocr_backup_name>”.

By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.

7. ohasd.bin is able to access network socket files:

2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))] 2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state In Grid Infrastructure cluster environment, ohasd related socket files should be owned by root, but in Oracle Restart environment, they should be owned by grid user, refer to “Network Socket File Location, Ownership and Permission” section for example output.

8. ohasd.bin is able to access log file location:

OS messages/syslog shows:

Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found. Refer to “Log File Location, Ownership and Permission” section for example output, if the expected directory is missing, create it with proper ownership and permission.

9. After node reboot, ohasd may fail to start on SUSE Linux after node reboot, refer to note 1325718.1 – OHASD not Starting After Reboot on SLES

10. OHASD fails to start, “ps -ef| grep ohasd.bin” shows ohasd.bin is started, but nothing in $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes, truss shows it is looping to close non-opened file handles:

.. 15058/1:         0.1995 close(2147483646)                               Err#9 EBADF 15058/1:         0.1996 close(2147483645)                               Err#9 EBADF .. Call stack of ohasd.bin from pstack shows the following:

_close  sclssutl_closefiledescriptors  main .. The cause is bug 11834289 which is fixed in 11.2.0.3 and above, other symptoms of the bug is clusterware processes may fail to start with same call stack and truss output (looping on OS call “close”). If the bug happens when trying to start other resources, “CRS-5802: Unable to start the agent process” could show up as well.

11. Other potential causes/solutions listed in note 1069182.1 – OHASD Failed to Start: Inappropriate ioctl for device

12. If ohasd still fails to start, refer to ohasd.log in <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log

Case 2: OHASD Agents does not start OHASD.BIN will spawn four agents/monitors to start resource:

oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)

If ohasd.bin can not start any of above agents properly, clusterware will not come to healthy state.

1. Common causes of agent failure are that the log file or log directory for the agents don’t have proper ownership or permission.

Refer to below section “Log File Location, Ownership and Permission” for general reference.

2. If agent binary (oraagent.bin or orarootagent.bin etc) is corrupted, agent will not start resulting in related resources not coming up:

2011-05-03 11:11:13.189 [ohasd(25303)]CRS-5828:Could not start agent ‘/ocw/grid/bin/orarootagent_grid’. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.

2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid 2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403 2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized .. 2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process 2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of ‘ora.diskmon’ on ‘racnode1’ failed

..

2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other :no exe permission, file [/ocw/grid/bin/cssdagent] 2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed .. 2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor] The solution is to compare agent binary with a “good” node, and restore a good copy.

3. Agent may fail to start due to bug 11834289 with error “CRS-5802: Unable to start the agent process”, refer to Section “OHASD does not start”  #10 for details. Case 3: OCSSD.BIN does not start Successful cssd.bin startup depends on the following:

1. GPnP profile is accessible – gpnpd needs to be fully up to serve profile

If ocssd.bin is able to get the profile successfully, likely ocssd.log will have similar messages like following:

2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling “ipc://GPNPD_rac1”, try 4 of 500… 2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0 2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is ‘6’=6 2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote “ipc://GPNPD_rac1” disco “” Otherwise messages like following will show in ocssd.log

2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1) 2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url “ipc://GPNPD_rac1” 2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can’t get GPnP service profile from local GPnP daemon 2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)


Voting Disk is accessible

In 11gR2, ocssd.bin discover voting disk with setting from GPnP profile, if not enough voting disks can be identified, ocssd.bin will abort itself.

2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*) .. 2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks 2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery 2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found 2010-02-03 22:37:22.228: [    CSSD][1145538880]################################### 2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread

ocssd.bin may not come up with the following error if all nodes failed while there’s a voting file change in progress:

2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0 The solution is to start ocssd.bin in exclusive mode with note 1068835.1

If the voting disk is located on a non-ASM device, ownership and permissions should be:

-rw-r—– 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1

3. Network is functional and name resolution is working:

If ocssd.bin can’t bind to any network, likely the ocssd.log will have messages like following:

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1) 2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default .. 2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39 2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39 2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp

If there’s connectivity issue on private network (including multicast is off), likely the ocssd.log will have messages like following:

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894 2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0 ..  >>>> after a long delay 2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254 2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870) 2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client. 2010-09-20 12:02:39.895: [    CSSD][1107286336]################################### 2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener 2010-09-20 12:02:39.895: [    CSSD][1107286336]################################### To validate network, please refer to note 1054902.1

4. Vendor clusterware is up (if using vendor clusterware)

Grid Infrastructure provide full clusterware functionality and doesn’t need Vendor clusterware to be installed; but if you happened to have Grid Infrastructure on top of Vendor clusterware in your environment, then Vendor clusterware need to come up fully before CRS can be started, to verify, as grid user:

$GRID_HOME/bin/lsnodes -n racnode1    1 racnode1    0 If vendor clusterware is not fully up, likely ocssd.log will have similar messages like following:

2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry 2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed 2010-08-30 18:28:14.208: [    CSSD][36]################################### 2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon

Before the clusterware is installed, execute the command below as grid user:

$INSTALL_SOURCE/install/lsnodes -v

Case 4: CRSD.BIN does not start Successful crsd.bin startup depends on the following:

1. ocssd is fully up

If ocssd.bin is not fully up, crsd.log will show messages like following:

2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16) 2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29 2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..


OCR is accessible

If the OCR is located on ASM, ora.asm resource (ASM instance) must be up and diskgroup for OCR must be mounted, if not, likely the crsd.log will show messages like:

2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI] [  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge ORA-15077: could not locate ASM instance serving a required diskgroup

2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7] 2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down 2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE. 2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable 2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL 2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device 2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL 2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26] 2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge ORA-15077: could not locate ASM instance serving a required diskgroup ] [7] 2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26 Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.

If the OCR is located on a non-ASM device, expected ownership and permissions are:

-rw-r—– 1 root  oinstall  272756736 Feb  3 23:24 ocr

If OCR is located on non-ASM device and its unavailable, likely crsd.log will show similar message like following:

2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory 2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device 2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26] 2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m’:failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory 2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory 2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device 2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26] 2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26

If the OCR is corrupted, likely crsd.log will show messages like the following:

2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26] 2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT 2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted 2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device 2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26] 2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26

If owner or group of grid user got changed, even ASM is available, likely crsd.log will show following:

2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG] [  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge ORA-01031: insufficient privileges

2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7] 2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down 2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE. 2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable 2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL 2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device 2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL 2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26] 2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge ORA-01031: insufficient privileges ] [7]

If oracle binary in GRID_HOME has wrong ownership or permission, even ASM shows it’s up and running, likely crsd.log will show following:

2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR] [  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge

2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact

2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7] 2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down 2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE. 2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable 2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL 2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326] 2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage. 2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required. 2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device 2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL 2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26] 2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26 The expected ownership and permission of oracle binary in GRID_HOME should be:

-rwsr-s–x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle

If OCR or mirror is unavailable (could be ASM is up, but diskgroup for OCR/mirror is unmounted), likely crsd.log will show following:

2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR] [  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295 ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295 ORA-15001: diskgroup “OCRMIR .. 2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6] 2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted 2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26] 2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR] 2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8) [  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8] [  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid .. 2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed 2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0) 2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0) 2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don’t match [  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid [  OCRAPI][19]procr_ctx_set_invalid: aborting… 2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting …


crsd.bin pid file exists and points to running crsd.bin process

If pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have similar like the following:

2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn’t exist. .. 2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/ 2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file [] 2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD 2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process 2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually as grid user with “touch” command and restart resource ora.crsd

If pid file does exist and the PID in this file references a running process which is NOT the crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have similar like the following:

2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid 2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of “cat /ocw/grid/crs/init/racnode1.pid” 2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) [  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9 2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD 2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon 2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535 2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE 2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5 2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED 2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED .. 2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING .. 2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535 .. 2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535 To verify on OS level:

ls -l /ocw/grid/crs/init/*pid -rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid cat /ocw/grid/crs/init/*pid 1535 ps -ef| grep 1535 root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> Note process 1535 is not crsd.bin The solution is to create an empty pid file and to restart the resource ora.crsd, as root:

# > $GRID_HOME/crs/init/<racnode1>.pid # $GRID_HOME/bin/crsctl stop res ora.crsd -init # $GRID_HOME/bin/crsctl start res ora.crsd -init


Network is functional and name resolution is working:

If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:

2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0 2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found. .. 2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44] 2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7] 2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44 Or:

2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9 2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master 2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet! 2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet! Or:

2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0) To validate the network, please refer to note 1054902.1

5. crsd executable (crsd.bin and crsd in GRID_HOME/bin) has correct ownership/permission and hasn’t been manually modified, a simply way to check is to compare output of “ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin” with a “good” node.

6. To troubleshoot further, refer to note 1323698.1 – Troubleshooting CRSD Start up Issue Case 5: GPNPD.BIN does not start

  1. Name Resolution is not working

gpnpd.bin fails with following error in gpnpd.log:

2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling “tcp://node2:9393”, try 1 of 3… 2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY 2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393) 2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url “tcp://node2:9393” In above example, please make sure current node is able to ping “node2”, and no firewall between them.

Due to bug 10105195, gpnp dispatch is single threaded and could be blocked by network scanning etc, the bug is fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above, refer to note 10105195.8 for more details.

Case 6: Various other daemons do not start

Common causes:

1. Log file or directory for the daemon doesn’t have appropriate ownership or permission

If the log file or log directory for the daemon doesn’t have proper ownership or permissions, usually there is no new info in the log file and the timestamp remains the same while the daemon tries to come up.

Refer to below section “Log File Location, Ownership and Permission” for general reference.

2. Network socket file doesn’t have appropriate ownership or permission

In this case, the daemon log will show messages like:

2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))


OLR is corrupted

In this case, the daemon log will show messages like (this is a case that ora.ctssd fails to start):

2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage. 2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19]. 2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19]. 2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19] 2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19]. 2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting

The solution is to restore a good copy of OLR note 1193643.1 Case 7: CRSD Agents do not start CRSD.BIN will spawn two agents to start up user resource -the two agent share same name and binary as ohasd.bin agents:

orarootagent: responsible for ora.netn.network, ora.nodename.vip, ora.scann.vip and  ora.gns oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resource etc

To find out the user resource status:

$GRID_HOME/crsctl stat res -t


If crsd.bin can not start any of the above agents properly, user resources may not come up.

1. Common cause of agent failure is that the log file or log directory for the agents don’t have proper ownership or permissions. Refer to below section “Log File Location, Ownership and Permission” for general reference.

2. Agent may fail to start due to bug 11834289 with error “CRS-5802: Unable to start the agent process”, refer to Section “OHASD does not start”  #10 for details. Network and Naming Resolution Verification CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.

To validate network and name resolution setup, please refer to note 1054902.1

Log File Location, Ownership and Permission Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.

In Grid Infrastructure cluster environment:

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owner rdbmsap and rdbmsar, here’s what it looks like under $GRID_HOME/log in cluster environment:

drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1 drwxr-x— 2 grid oinstall  4096 Dec  6 09:20 admin drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root -rw-rw-r– 1 root root     12931 Jan 26 21:30 alertrac1.log drwxr-x— 2 grid oinstall  4096 Jan 26 20:44 client drwxr-x— 2 root oinstall  4096 Dec  6 09:24 crsd drwxr-x— 2 grid oinstall  4096 Dec  6 09:24 cssd drwxr-x— 2 root oinstall  4096 Dec  6 09:24 ctssd drwxr-x— 2 grid oinstall  4096 Jan 26 18:14 diskmon drwxr-x— 2 grid oinstall  4096 Dec  6 09:25 evmd drwxr-x— 2 grid oinstall  4096 Jan 26 21:20 gipcd drwxr-x— 2 root oinstall  4096 Dec  6 09:20 gnsd drwxr-x— 2 grid oinstall  4096 Jan 26 20:58 gpnpd drwxr-x— 2 grid oinstall  4096 Jan 26 21:19 mdnsd drwxr-x— 2 root oinstall  4096 Jan 26 21:20 ohasd drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain drwxr-x— 2 grid oinstall  4096 Jan 26 20:57 srvm

Please note most log files in sub-directory inherit ownership of parent directory; and above are just for general reference to tell whether there’s unexpected recursive ownership and permission changes inside the CRS home . If you have a working node with the same version, the working node should be used as a reference.

In Oracle Restart environment:

And here’s what it looks like under $GRID_HOME/log in Oracle Restart environment:

drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1 drwxr-x— 2 grid oinstall  4096 Oct 31  2009 admin drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log drwxr-x— 2 grid oinstall  4096 Nov  2  2009 client drwxr-x— 2 root   oinstall  4096 Oct 31  2009 crsd drwxr-x— 2 grid oinstall  4096 Oct 31  2009 cssd drwxr-x— 2 root   oinstall  4096 Oct 31  2009 ctssd drwxr-x— 2 grid oinstall  4096 Oct 31  2009 diskmon drwxr-x— 2 grid oinstall  4096 Oct 31  2009 evmd drwxr-x— 2 grid oinstall  4096 Oct 31  2009 gipcd drwxr-x— 2 root   oinstall  4096 Oct 31  2009 gnsd drwxr-x— 2 grid oinstall  4096 Oct 31  2009 gpnpd drwxr-x— 2 grid oinstall  4096 Oct 31  2009 mdnsd drwxr-x— 2 grid oinstall  4096 Oct 31  2009 ohasd drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain drwxr-x— 2 grid oinstall  4096 Oct 31  2009 srvm

Network Socket File Location, Ownership and Permission Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle

When socket file has unexpected ownership or permission, usually daemon log file (i.e. evmd.log) will have the following:

2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))

2011-06-18 14:07:28.545: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lena042DBG_EVMD)) 2011-06-18 14:07:28.545: [  clsdmt][515]Terminating process 2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai And the following error may be reported:

CRS-5017: The resource action “ora.evmd start” encountered the following error: CRS-2674: Start of ‘ora.evmd’ on ‘racnode1’ failed .. The solution is to stop GI as root (crsctl stop crs -f), clean up socket files and restart GI.

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and clustername eotcs

In Grid Infrastructure cluster environment:

Below is an example output from cluster environment:

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle: drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 . srwxrwx— 1 grid oinstall    0 Feb  2 18:00 master_diskmon srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd -rw-r–r– 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid prw-r–r– 1 root  root        0 Feb  2 13:33 npohasd srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1 -rw-r–r– 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1 srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2 srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1 srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2 srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1 srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11 srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3 srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_ srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs -rw-r–r– 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock -rw-r–r– 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11 srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1 -rw-r–r– 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth

In Oracle Restart environment:

And below is an example output from Oracle Restart environment:

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle: srwxrwx— 1 grid oinstall 0 Aug  1 17:23 master_diskmon prw-r–r– 1 grid oinstall 0 Oct 31  2009 npohasd srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1 srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2 srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1 srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2 srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1 srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2 srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1 srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2 srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1 srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2 srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521 srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_ srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost -rw-r–r– 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock -rw-r–r– 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11 srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1 -rw-r–r– 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL

Diagnostic file collection If the issue can’t be identified with the note, as root, please run $GRID_HOME/bin/diagcollection.sh on all nodes, and upload all .gz files it generated in current directory.

0 views0 comments

Recent Posts

See All

12c Grid Infrastructure Quick Reference

This is a 12c Grid Infrastructure quick reference DETAILS “crsctl stat res -t” output example This example is from a 12c GI Flex Cluster...

Comments


bottom of page