由GATHER_STATS_JOB引发的堵塞

来源:互联网 发布:服装贴牌 知乎 编辑:程序博客网 时间:2024/06/10 16:30

一客户生产库于上午7点半左右出现拥堵,应用客户反映系统无法使用,像HANG住一样。现场工程师检查数据库发现大量的“cursor: pin S wait on X”和“library cache lock”等待事件。

前台应用堵塞,与“cursor: pin S wait on X”和“library cache lock”等待事件有一定关系,但本次故障是由计划任务GATHER_STATS_JOB所在会话自堵塞,使得资源未释放导致。下面是详细的分析过程:

正如下图红色字体所示,的确我们看到后台出现大量的librarycache lock、cursor: pin S wait on X等待事件。

SQL> set linesize 200set pagesize 100SQL> SQL> col SAMPLE_TIME for a30SQL> col event for a60SQL> col PROGRAM for a20SQL> select  2     event,count(*)  3  from  4  oracle_diag_hist  5  where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')  6  and SAMPLE_TIME <  to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')  7  group by event  8  order by count(*)  9  ;EVENT                                                          COUNT(*)------------------------------------------------------------ ----------enq: JS - wdw op                                                      1enq: TX - allocate ITL entry                                          1SQL*Net message to client                                             1enq: RO - fast object reuse                                           1direct path write temp                                                1ges global resource directory to be frozen                            1latch free                                                            1JS kill job wait                                                      1enq: TC - contention                                                  1ges remote message                                                    1gc current block congested                                            1enq: HW - contention                                                  1gc current block busy                                                 2local write wait                                                      2enq: XR - database force logging                                      2gc current multi block request                                        2enq: US - contention                                                  2enq: WL - contention                                                  2enq: FB - contention                                                  2SQL*Net more data from client                                         2gc current grant busy                                                 2SQL*Net more data to dblink                                           2resmgr:internal state change                                          3latch: KCL gc element parent latch                                    3direct path read                                                      3Backup: sbtclose2                                                     4gc current grant 2-way                                                4Backup: sbtremove2                                                    5Streams AQ: qmn coordinator waiting for slave to start                6gc current request                                                    6null event                                                            6gc cr block 2-way                                                     6enq: WF - contention                                                  7gc cr request                                                         9rdbms ipc reply                                                       9gc current block 2-way                                               10enq: MS - contention                                                 11os thread startup                                                    13gcs drm freeze in enter server mode                                  15SQL*Net more data from dblink                                        16gc cr block busy                                                     18gc cr grant 2-way                                                    19CGS wait for IPC msg                                                 23Backup: sbtbackup                                                    23control file parallel write                                          25gcs log flush sync                                                   25log file sync                                                        26row cache lock                                                       27reliable message                                                     28SQL*Net more data to client                                          30log file sequential read                                             31gc buffer busy                                                       38library cache pin                                                    41cr request retry                                                     54read by other session                                                63control file sequential read                                         83log file parallel write                                              98inactive session                                                    110Log archive I/O                                                     129gc cr multi block request                                           166db file parallel write                                              192direct path read temp                                               195db file parallel read                                               686kksfbc child completion                                             946DFS lock handle                                                    1160RMAN backup & recovery I/O                                         1274latch: library cache                                               1486db file scattered read                                             1581Backup: sbtwrite2                                                  1644db file sequential read                                            1818library cache lock                                                 4989                                                                  17397cursor: pin S wait on X                                           3874273 rows selected.SQL>

直接从这些等待事件入手查找原因是比较困难的,它们只所以等待是因为无法获取需要的handle、chunk等,这些都是内存中的东西,我们也无法理解。既然这样,换个查找方向。我们来看故障前后一段时间内,谁一直在阻塞其他会话?

select BLOCKING_SESSION,count(*)fromoracle_diag_histwhere SAMPLE_TIME > to_date('2013-07-03:04:00:00','yyyy-mm-dd:hh24:mi:ss')and SAMPLE_TIME <  to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')group by BLOCKING_SESSION;BLOCKING_SESSION   COUNT(*)---------------- ----------                      57743              73       2115            2112         84             517          3            1425          1             531         32             549          1             248          1             514          1             168          1             314          111 rows selected.SQL> 

通过上面的SQL,我们能看到session 73和2112阻塞其他会话的次数最多,很可疑。这些会话为什么会阻塞其他会话?难道他被阻塞了?那我们继续查谁阻塞了这些会话。

SQL> col SAMPLE_TIME for a40SQL> col event for a30SQL> select   2     session_id,SAMPLE_TIME,event,sql_id,blocking_session  3  from  4  oracle_diag_hist  5  where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')  6  and SAMPLE_TIME <  to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')  7  and session_id = &session_id  8  order by SAMPLE_TIME  9  ;SQL> /Enter value for session_id: 73old   7: and session_id = &session_idnew   7: and session_id = 73SESSION_ID SAMPLE_TIME                    EVENT                          SQL_ID        BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------        73 03-JUL-13 01.00.01.983 AM      direct path read temp          4bzt4j1n3kbg1                           1459945305 oracle@essdb5 (J002)        73 03-JUL-13 01.00.12.103 AM                                     4bzt4j1n3kbg1                           1459945305 oracle@essdb5 (J002)        73 03-JUL-13 01.00.22.213 AM                                     4bzt4j1n3kbg1                           1459945305 oracle@essdb5 (J002)        73 03-JUL-13 01.00.32.313 AM                                     4bzt4j1n3kbg1                           1459945305 oracle@essdb5 (J002)        73 03-JUL-13 01.00.42.423 AM                                     4bzt4j1n3kbg1                           1459945305 oracle@essdb5 (J002)        73 03-JUL-13 01.00.52.533 AM      direct path read temp          4bzt4j1n3kbg1                           1459945305 oracle@essdb5 (J002)::::::::::::::::::::::::::::::::::SESSION_ID SAMPLE_TIME                    EVENT                          SQL_ID        BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------        73 03-JUL-13 05.59.33.376 AM      latch: library cache                                                            0 oracle@essdb5 (J002)        73 03-JUL-13 05.59.43.476 AM                                     4y1y43113gv8f                           1333736285 oracle@essdb5 (J002)        73 03-JUL-13 05.59.53.586 AM      latch: library cache           b6usrg82hwsa3                                    0 oracle@essdb5 (J002)        73 03-JUL-13 06.00.03.706 AM      DFS lock handle                b6usrg82hwsa3                                    0 oracle@essdb5 (J002)        73 03-JUL-13 06.00.13.806 AM      latch: library cache                                                            0 oracle@essdb5 (J002)        73 03-JUL-13 06.00.23.916 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.00.34.022 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.00.44.122 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.00.54.222 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.01.04.342 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.01.14.442 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)SESSION_ID SAMPLE_TIME                    EVENT                          SQL_ID        BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------        73 03-JUL-13 06.01.24.552 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.01.34.662 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.01.44.762 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.01.54.862 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 06.02.04.982 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)::::::::::::::::::::::::::::::::::SESSION_ID SAMPLE_TIME                    EVENT                          SQL_ID        BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------        73 03-JUL-13 08.57.25.845 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.57.35.955 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.57.46.055 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.57.56.165 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.58.06.285 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.58.16.385 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.58.26.495 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.58.36.605 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.58.46.705 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.58.56.815 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.59.06.931 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)SESSION_ID SAMPLE_TIME                    EVENT                          SQL_ID        BLOCKING_SESSION SQL_PLAN_HASH_VALUE PROGRAM---------- ------------------------------ ------------------------------ ------------- ---------------- ------------------- --------------------        73 03-JUL-13 08.59.17.041 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)        73 03-JUL-13 08.59.27.141 AM      DFS lock handle                8r7dm0qvp2259               73                   0 oracle@essdb5 (J002)2847 rows selected.SQL>

我们可以看到,会话73 6:00之后被自己阻塞,经过检查这个自阻塞一直持续到数据库被重启,这是不正常的。但该会话在6:00之前是正常的,那我们再来看一下谁阻塞了会话2112。


SQL> col SAMPLE_TIME for a40SQL> col event for a30SQL> select   2     session_id,SAMPLE_TIME,event,sql_id,blocking_session  3  from  4  oracle_diag_hist  5  where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')  6  and SAMPLE_TIME <  to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')  7  and session_id = &session_id  8  order by SAMPLE_TIME  9  ;Enter value for session_id: 2112old   7: and session_id = &session_idnew   7: and session_id = 2112SESSION_ID SAMPLE_TIME                              EVENT                          SQL_ID        BLOCKING_SESSION---------- ---------------------------------------- ------------------------------ ------------- ----------------      2112 03-JUL-13 07.36.29.933 AM                db file sequential read        fyjpq29hka5bn      2112 03-JUL-13 07.46.56.595 AM                                               fbf927kwsussx      2112 03-JUL-13 07.47.06.705 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.47.16.804 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.47.26.904 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.47.37.014 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.47.47.114 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.47.57.234 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.48.07.344 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.48.17.444 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.48.27.554 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.48.37.654 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.48.47.754 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.48.57.874 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.49.07.984 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.49.18.082 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.49.28.182 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.49.38.282 AM                cursor: pin S wait on X        27qc21g5nsc87      2112 03-JUL-13 07.49.48.382 AM                cursor: pin S wait on X        27qc21g5nsc87::::::::SESSION_ID SAMPLE_TIME                              EVENT                          SQL_ID        BLOCKING_SESSION---------- ---------------------------------------- ------------------------------ ------------- ----------------      2112 03-JUL-13 08.52.08.121 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.52.18.221 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.52.28.331 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.52.38.441 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.52.48.541 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.52.58.661 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.53.08.771 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.53.18.881 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.53.28.979 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.53.39.089 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.53.49.209 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.53.59.309 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.54.09.419 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.54.19.529 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.54.29.629 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.54.39.739 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.54.49.859 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.54.59.959 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.55.10.069 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.55.20.179 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.55.30.277 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.55.40.387 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.55.50.507 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.56.00.607 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.56.10.717 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.56.20.827 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.56.30.937 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.56.41.047 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.56.51.167 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.57.01.267 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.57.11.377 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.57.21.487 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.57.31.585 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.57.41.695 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.57.51.815 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.58.01.915 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.58.12.025 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.58.22.135 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.58.32.235 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.58.42.345 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.58.52.475 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.59.02.575 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.59.12.685 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.59.22.795 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.59.32.895 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.59.43.004 AM                library cache lock             27qc21g5nsc87      2112 03-JUL-13 08.59.53.124 AM                library cache lock             27qc21g5nsc87435 rows selected.SQL>

从上图我们可以看到会话2112在7点36分出现等待事件“db file sequential read”,这是正常的。但从7点47分开始等待“cursor:pin S wait on X”直到8点52分转换成“library cache lock”,注意这里的sql_id始终没变,这是不正常的。

通过两个阻塞严重的会话73、2112可以看出,2112堵塞的时间晚于73,很有可能是会话73的堵塞使得2112堵塞。那我们再来看看73阻塞了谁?

SQL> col SAMPLE_TIME for a30SQL> col event for a30SQL> col PROGRAM for a20SQL> select  2     session_id,event,count(*)  3  from  4  oracle_diag_hist  5  where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')  6  and SAMPLE_TIME <  to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')  7  and blocking_session = &session_id  8  group by session_id,event  9  ;Enter value for session_id: 73old   7: and blocking_session = &session_idnew   7: and blocking_session = 73SESSION_ID EVENT                            COUNT(*)---------- ------------------------------ ----------       508 row cache lock                          3        73 DFS lock handle                      1063       168 library cache lock                    663       981 library cache lock                    386SQL> 

会话73自阻塞,为什么73会自阻塞呢?从上面我们可以看到73阻塞是从03-JUL-13 06.00.23.916开始的,并且73是JOB,那我们来检查一下系统JOB运行状况,见下图:

SQL> col state for a10SQL> SELECT owner,job_name,state,last_start_date,last_run_duration,failure_count  2  FROM dba_scheduler_jobs;OWNER      JOB_NAME                  STATE      LAST_START_DATE                      LAST_RUN_DURATION              FAILURE_COUNT---------- ------------------------- ---------- ------------------------------------ ------------------------------ -------------SYS        AUTO_SPACE_ADVISOR_JOB    SCHEDULED  03-JUL-13 10.00.02.060295 PM +08:00  +000000000 01:05:55.361158                 3SYS        GATHER_STATS_JOB          SCHEDULED  03-JUL-13 10.00.02.053605 PM +08:00  +000000000 07:59:58.933113                 0SYS        FGR$AUTOPURGE_JOB         DISABLED                                                                                   0SYS        PURGE_LOG                 SCHEDULED  04-JUL-13 03.00.00.175794 AM +08:00  +000000000 00:00:00.415187                 0ORACLE_OCM MGMT_STATS_CONFIG_JOB     SCHEDULED  01-JUL-13 01.01.01.554056 AM +08:00  +000000000 00:00:12.660441                 0ORACLE_OCM MGMT_CONFIG_JOB           SCHEDULED  03-JUL-13 10.00.01.997028 PM +08:00  +000000000 00:00:00.320530                 0EXFSYS     RLM$SCHDNEGACTION         SCHEDULED  04-JUL-13 05.18.07.107427 PM +08:00  +000000000 00:00:00.215880                 0EXFSYS     RLM$EVTCLEANUP            SCHEDULED  22-OCT-12 05.35.21.909816 PM +08:00  +000000000 00:02:13.685219                 18 rows selected.SQL> col job_name for a20SQL> col status for a20SQL> col last_start_date for a36SQL> SELECT log_id, job_name, status,  2        TO_CHAR (log_date, 'DD-MON-YYYY HH24:MI') log_date#,actual_start_date  3  FROM dba_scheduler_job_run_details  4  WHERE job_name = 'GATHER_STATS_JOB' order by actual_start_date;    LOG_ID JOB_NAME             STATUS               LOG_DATE#            ACTUAL_START_DATE---------- -------------------- -------------------- -------------------- ---------------------------------------------------------------------------    173057 GATHER_STATS_JOB     STOPPED              04-JUN-2013 06:00    03-JUN-13 10.00.12.702279 PM +08:00    173105 GATHER_STATS_JOB     SUCCEEDED            05-JUN-2013 05:12    04-JUN-13 10.00.03.553269 PM +08:00    173186 GATHER_STATS_JOB     STOPPED              06-JUN-2013 06:00    05-JUN-13 10.00.04.671089 PM +08:00    173246 GATHER_STATS_JOB     SUCCEEDED            07-JUN-2013 03:58    06-JUN-13 10.00.01.713362 PM +08:00    173307 GATHER_STATS_JOB     STOPPED              08-JUN-2013 06:00    07-JUN-13 10.00.07.335319 PM +08:00    173289 GATHER_STATS_JOB     SUCCEEDED            08-JUN-2013 07:49    08-JUN-13 06.00.11.991233 AM +08:00    173429 GATHER_STATS_JOB     STOPPED              11-JUN-2013 06:00    10-JUN-13 10.00.02.020636 PM +08:00    173459 GATHER_STATS_JOB     STOPPED              12-JUN-2013 06:00    11-JUN-13 10.00.10.844326 PM +08:00    173488 GATHER_STATS_JOB     STOPPED              13-JUN-2013 06:00    12-JUN-13 10.00.08.097010 PM +08:00    173518 GATHER_STATS_JOB     STOPPED              14-JUN-2013 06:00    13-JUN-13 10.00.02.130179 PM +08:00    173589 GATHER_STATS_JOB     STOPPED              15-JUN-2013 06:00    14-JUN-13 10.00.02.825754 PM +08:00    LOG_ID JOB_NAME             STATUS               LOG_DATE#            ACTUAL_START_DATE---------- -------------------- -------------------- -------------------- ---------------------------------------------------------------------------    173595 GATHER_STATS_JOB     SUCCEEDED            15-JUN-2013 07:33    15-JUN-13 06.00.12.296982 AM +08:00    173686 GATHER_STATS_JOB     STOPPED              18-JUN-2013 06:00    17-JUN-13 10.00.03.440121 PM +08:00    173736 GATHER_STATS_JOB     STOPPED              19-JUN-2013 06:00    18-JUN-13 10.00.01.361487 PM +08:00    173808 GATHER_STATS_JOB     STOPPED              20-JUN-2013 06:00    19-JUN-13 10.00.01.680778 PM +08:00    173836 GATHER_STATS_JOB     STOPPED              21-JUN-2013 06:00    20-JUN-13 10.00.04.455498 PM +08:00    173905 GATHER_STATS_JOB     STOPPED              22-JUN-2013 06:00    21-JUN-13 10.00.06.883596 PM +08:00    173886 GATHER_STATS_JOB     SUCCEEDED            22-JUN-2013 16:30    22-JUN-13 06.00.30.956798 AM +08:00    174012 GATHER_STATS_JOB     STOPPED              25-JUN-2013 06:00    24-JUN-13 10.00.01.468786 PM +08:00    174046 GATHER_STATS_JOB     STOPPED              26-JUN-2013 06:00    25-JUN-13 10.00.02.324651 PM +08:00    174114 GATHER_STATS_JOB     STOPPED              27-JUN-2013 06:00    26-JUN-13 10.00.01.509654 PM +08:00    174147 GATHER_STATS_JOB     STOPPED              28-JUN-2013 06:00    27-JUN-13 10.00.02.731738 PM +08:00    LOG_ID JOB_NAME             STATUS               LOG_DATE#            ACTUAL_START_DATE---------- -------------------- -------------------- -------------------- ---------------------------------------------------------------------------    174194 GATHER_STATS_JOB     STOPPED              29-JUN-2013 06:00    28-JUN-13 10.00.04.998374 PM +08:00    174208 GATHER_STATS_JOB     SUCCEEDED            29-JUN-2013 12:27    29-JUN-13 06.00.04.335794 AM +08:00    174341 GATHER_STATS_JOB     STOPPED              02-JUL-2013 06:00    01-JUL-13 10.00.01.313663 PM +08:00    174422 GATHER_STATS_JOB     STOPPED              03-JUL-2013 10:10    02-JUL-13 10.00.01.866622 PM +08:00    174469 GATHER_STATS_JOB     STOPPED              04-JUL-2013 06:00    03-JUL-13 10.00.01.981592 PM +08:0027 rows selected.

从上图红色字体标注位置可以看出GATHER_STATS_JOB运行了8小时才结束,且未为异常结束。我们都知道,这个job默认是从晚上10点开始,如果早上6点还没有执行完毕,系统会终止,从这个时间点来看,说明这个JOB是被系统终止的。

那么针对大量事件“cursor: pin S wait on X”和“library cache lock”的产生就有两种推论:

推论1,是会话73运行在7月2日晚上10点开始运行job GATHER_STATS_JOB,7月3日凌晨6点还没运行完,被系统终止,但此时会话73出现自阻塞现象,资源没有释放,导致后来的会话拥堵,出现“cursor: pin Swait on X”和“library cache lock”内存中一些资源无法释放。如果推论成立,那么这两个事件的出现时间应该晚于6点。

推论2,是系统在这个时间段本身就会有一些“cursor: pin Swait on X”和“library cache lock”事件,后因未即使处理堆积引起。

针对推论2,我查过7月1日、2日相同时间段的等待事件,未发现“cursor:pin S wait on X”和“library cache lock”等待事件,基本可以论证不是因为突发的事件堆积引起。

针对推论1,我们来检查这两个事件发生的时间点:

SQL> select        min(SAMPLE_TIME)  2    3  from  4  oracle_diag_hist  5  where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')  6  and SAMPLE_TIME <  to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')  7  and event = 'cursor: pin S wait on X'  8  ;MIN(SAMPLE_TIME)---------------------------------------------------------------------------03-JUL-13 07.08.46.880 AMSQL> SQL> col PROGRAM for a30SQL> select  2     SAMPLE_TIME,session_id,event,blocking_session,PROGRAM  3  from  4  oracle_diag_hist  5  where SAMPLE_TIME > to_date('2013-07-03:01:00:00','yyyy-mm-dd:hh24:mi:ss')  6  and SAMPLE_TIME <  to_date('2013-07-03:09:00:00','yyyy-mm-dd:hh24:mi:ss')  7  and event = 'library cache lock'  8  order by SAMPLE_TIME  9  ;SAMPLE_TIME                    SESSION_ID EVENT                          BLOCKING_SESSION PROGRAM------------------------------ ---------- ------------------------------ ---------------- ------------------------------03-JUL-13 01.00.12.103 AM             998 library cache lock                              datatofile@XHM-3ES- (TNS V1-V3)03-JUL-13 01.15.11.768 AM             527 library cache lock                              oracle@essdb6 (CJQ0)03-JUL-13 01.48.33.296 AM             574 library cache lock                              oracle@essdb1 (TNS V1-V3)03-JUL-13 01.49.03.590 AM            1406 library cache lock                              oracle@essdb1 (TNS V1-V3)03-JUL-13 01.49.03.590 AM            1526 library cache lock                              oracle@essdb1 (TNS V1-V3)03-JUL-13 01.49.03.636 AM             554 library cache lock                              oracle@essdb1 (TNS V1-V3)03-JUL-13 01.49.03.636 AM            2037 library cache lock                              oracle@essdb1 (TNS V1-V3)03-JUL-13 02.00.20.845 AM            1987 library cache lock                              rman@essdb5 (TNS V1-V3)03-JUL-13 02.17.11.497 AM             527 library cache lock                              oracle@essdb6 (CJQ0)03-JUL-13 03.05.52.789 AM            1058 library cache lock03-JUL-13 03.27.26.550 AM             210 library cache lock                              oracle@essdb3 (TNS V1-V3)03-JUL-13 03.34.00.749 AM             445 library cache lock                              oracle@essdb5 (J007)03-JUL-13 03.57.54.474 AM             527 library cache lock                              oracle@essdb6 (CJQ0)03-JUL-13 04.04.19.862 AM             726 library cache lock                              oracle@essdb1 (TNS V1-V3)03-JUL-13 04.04.19.862 AM            1425 library cache lock                              oracle@essdb1 (TNS V1-V3)03-JUL-13 04.11.12.802 AM            1527 library cache lock03-JUL-13 04.19.48.146 AM             527 library cache lock                              oracle@essdb6 (CJQ0)03-JUL-13 04.26.44.162 AM            1083 library cache lock                              sqlldr@XHM-3ES- (TNS V1-V3)03-JUL-13 04.57.13.688 AM              73 library cache lock                         1425 oracle@essdb5 (J002)03-JUL-13 05.09.21.366 AM            1785 library cache lock                              sqlldr@XHM-3ES- (TNS V1-V3)03-JUL-13 05.33.16.602 AM             750 library cache lock                              oracle@essdb3 (TNS V1-V3)03-JUL-13 06.01.51.732 AM             886 library cache lock03-JUL-13 06.02.01.832 AM             886 library cache lock03-JUL-13 06.02.11.932 AM             886 library cache lock03-JUL-13 06.02.22.032 AM             886 library cache lock03-JUL-13 06.02.32.132 AM             886 library cache lock03-JUL-13 06.02.42.242 AM             886 library cache lock03-JUL-13 06.02.52.362 AM             886 library cache lock03-JUL-13 06.03.02.462 AM             886 library cache lock:::::::03-JUL-13 07.07.01.599 AM             886 library cache lock03-JUL-13 07.07.11.697 AM             886 library cache lock03-JUL-13 07.07.15.933 AM             168 library cache lock                           73 JDBC Thin Client03-JUL-13 07.07.21.797 AM             886 library cache lock03-JUL-13 07.07.26.043 AM             168 library cache lock                           73 JDBC Thin Client03-JUL-13 07.07.31.897 AM             886 library cache lock03-JUL-13 07.07.36.153 AM             168 library cache lock                           73 JDBC Thin Client03-JUL-13 07.07.42.007 AM             886 library cache lock03-JUL-13 07.07.46.253 AM             168 library cache lock                           73 JDBC Thin Client03-JUL-13 07.07.52.127 AM             886 library cache lock03-JUL-13 07.07.56.353 AM             168 library cache lock                           73 JDBC Thin Client03-JUL-13 07.08.02.227 AM             886 library cache lock::::03-JUL-13 08.45.54.136 AM             759 library cache lock                              JDBC Thin Client03-JUL-13 08.45.54.136 AM             831 library cache lock                              JDBC Thin Client03-JUL-13 08.45.54.136 AM             886 library cache lock                         211203-JUL-13 08.45.54.136 AM             915 library cache lock                              JDBC Thin Client03-JUL-13 08.45.54.136 AM            1512 library cache lock                              JDBC Thin Client03-JUL-13 08.45.54.136 AM            1913 library cache lock                              JDBC Thin Client

从上图中红色字体标注部分可以看出事件“cursor: pin S wait on X”最早出现的时间为8点46分,晚于会话73的自堵塞时间点。事件“library cache lock”在5点33分之前就有,但都比较短暂,而在6点01分后才频繁出现,最后还出现一些会话又被会话73堵塞,频繁出现的时间点也晚于会话73的自堵塞时间点。因此推论1基本成立,本次故障是由会话73死亡后资源未释放,产生自阻塞导致。

我们都知道在ORACLE中清理死亡会话的工作是由PMON进程负责的,至于本次故障时刻,是否是会话73已死亡而PMON进程没有去清理该死亡进程资源未释放导致?还是会话73本身就是活动的,此时只是因为无法得到“DFS lock handle”而自锁导致,仅根据现有的信息是无法得知的,还需要在故障时刻收集一些额外的信息才可能理清原委。

从目前仅有的信息来看,该故障很可能是由会话73的自堵塞导致。

应急办法:

故障时刻尝试kill对应的job进程,例如:kill -9 “会话73对应的PID”,再观察数据库状态。