关于ORA-3136的一个问题分析和验证
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 关于ORA-3136的一个问题分析和验证
对于ora-3136这个错误,我相信大家都遇到过,因为太常见了。一客户也遇到这个问题,让进行分析。由于不方便贴日志;因此我进行了模拟并验证这个分析。
我这里使用最新的Oracle 20c版本进行测试验证。根据常规的诊断方法,我们应该参考这个mos文档;Troubleshooting Guide ORA-3136: WARNING Inbound Connection Timed Out (Doc ID 465043.1)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 |
APPLIES TO: Oracle Database - Enterprise Edition - Version 10.2.0.5 and later Oracle Net Services - Version 10.2.0.5 and later Information in this document applies to any platform. PURPOSE The default timeout for INBOUND CONNECT TIMEOUT at both the database and listener was introduced in version 10gR2 RDBMS. The Oracle Net parameters SQLNET.INBOUND_CONNECT_TIMEOUT and INBOUND_CONNECT_TIMEOUT_listener_name now default to 60 seconds. If any client fails to provide credentials or for any reason cannot complete the authentication process in 60 seconds or less, this error will be thrown in the alert.log: WARNING: inbound connection timed out (ORA-3136) TROUBLESHOOTING STEPS The "WARNING: inbound connection timed out (ORA-3136)" in the alert log indicates that the client was not able to complete the authentication process within the period of time specified by the parameter SQLNET.INBOUND_CONNECT_TIMEOUT. A default setting for this parameter was introduced with 10gR1 and is 60 seconds or 1 minute. In 11gR1 and newer, the ORA-3136 will usually be accompanied by timeout errors and may also include the client's IP and the client port used to establish the connection. For example: Fatal NI connect error 12170. VERSION INFORMATION: TNS for IBM/AIX RISC System/6000: Version 11.2.0.1.0 - Production TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.1.0 - Production Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.1.0 - Production Time: 24-AUG-2017 03:55:01 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12606 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.x.y.z)(PORT=63510)) <--Client's IP and ephemeral port recorded here WARNING: inbound connection timed out (ORA-3136) Note that when the client ip address and ephemeral port are provided in the alert log, we can usually use that information to determine when this connection was "processed" by the listener by searching the listener log for this snippet: (HOST=10.x.y.z)(PORT=63510)) This is the corresponding listener log entry for this connection: 24-AUG-2017 03:54:01 * (CONNECT_DATA=(SID=<instance_name>)(CID=(PROGRAM=<program_name>)(HOST=<myclient>)(USER=<myuser>))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.x.y.z)(PORT=63510)) * establish Note the timestamp at the listener log: 24-AUG-2017 03:54:01 Note the timestamp at the alert log: Time: 24-AUG-2017 03:55:01 We can clearly see that the inbound connect timeout setting is 1 minute in this case and this client failed to authenticate in that time. The connection was established at the listener exactly 1 minute before it was terminated at the database. Note: This timeout restriction was introduced to combat a situation where malicious clients attempt to flood database servers with connection requests using invalid credentials. The following are the most likely reasons for this error - Very often this error is thrown because a client or application is attempting to connect using the wrong userid or password. See the following: Note: 793259.1 ORA-1017 May Cause ORA-3136 WARNING : Inbound Connection Timed Out in Alert Log The server receives a valid client connection request but the client takes a long time to authenticate more than the default 60 seconds. The DB server is heavily loaded due to which it cannot finish the client logon within the timeout specified. To understand what is causing this issue, following checks can be done The default value of 60 seconds is good enough in most conditions for the database server to authenticate a client connection. If it is taking longer, then it's worth checking the following items before implementing the workaround: 1. Check whether local connection on the database server is successful & quick. 2. If local connections are quick ,then check for underlying network delay with the help of your network administrator. 3. Check whether your Database performance has degraded. 4. Check alert log for any critical errors for eg, ORA-600 or ORA-7445 and get them resolved first. These critical errors might have triggered the slowness of the database server. To set these parameters to use values higher than the default of 60 seconds, follow these instructions: Edit the server side sqlnet.ora file and add this parameter. It is important to note that in RAC the database will reference a sqlnet.ora file in its own home and not in Grid where the listener is running. So to be clear, add this line to the sqlnet.ora file in $RDBMS_HOME/network/admin: SQLNET.INBOUND_CONNECT_TIMEOUT=<n> Where <n> is the value in seconds. E.g.: SQLNET.INBOUND_CONNECT_TIMEOUT = 120 There is no need to restart either the listener or the database after this change unless the database is running in shared server mode. The next dedicated server process that gets spawned will pick up the new setting. To check whether database server SQLNET.INBOUND_CONNECT_TIMEOUT is active: Eg. SQLNET.INBOUND_CONNECT_TIMEOUT=120 a. For Dedicated server setup, enable the support level sqlnet server tracing will show the timeout value as below: niotns: Enabling CTO, value=120000 (milliseconds) <== 120 seconds niotns: Not enabling dead connection detection. niotns: listener bequeathed shadow coming to life... b. For shared Server setup, $ telnet <database server IP> <dispatcher port> Example. $ telnet <Database IP address> <port #> The telnet session should disconnect after 120 seconds which indicates that the sqlnet.inbound_connect_timeout is active. If you have further questions regarding this issue then please create a Service Request in My Oracle Support and provide the following information: a. Client and matching server traces generated at support level. Note 395525.1 How to Enable Oracle Net Client,Server,Listener,Kerberos and External procedure Tracing from Net Manager (netmgr): Note 374116.1 How to Match Oracle Net Client and Server Trace Files b. Upload sqlnet.ora, listener.ora Sqlnet.log, & Alert_<sid>.log from database server NOTE: In the <images or examples> <above or below> and/or the attached document, user details / company name / address / email / telephone number represent a fictitious sample (based upon made up data used in the Oracle Demo Vision instance). Any similarity to actual persons, living or dead, is purely coincidental and not intended in any manner. |
另外一篇文档是:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 |
APPLIES TO: Oracle Database - Enterprise Edition - Version 9.2.0.1 and later Oracle Net Services - Version 11.2.0.4 to 11.2.0.4 [Release 11.2] Oracle Database Cloud Schema Service - Version N/A and later Oracle Database Exadata Express Cloud Service - Version N/A and later Oracle Database Exadata Cloud Machine - Version N/A and later Information in this document applies to any platform. PURPOSE This troubleshooting guide contains information on how to diagnose issues with connection timeouts where the origin of the problem is suspected to be inside the Oracle Database. In such cases, a first investigation at the network layer or Oracle Net may have pointed to the Database as a possible location for the root cause of the timeout. TROUBLESHOOTING STEPS Connection Timeout errors ORA-3135 and ORA-3136 A connection timeout error can be issued when an attempt to connect to the database does not complete its connection and authentication phases within the time period allowed by the following: SQLNET.INBOUND_CONNECT_TIMEOUT and/or INBOUND_CONNECT_TIMEOUT_<listener_name> server-side parameters. Starting with Oracle 10.2, the default for these parameters is 60 seconds where in previous releases it was 0, meaning no timeout. On a timeout, the client program will receive the ORA-3135 (or possibly TNS-3135) error: ORA-3135 connection lost contact and the database will log the ORA-3136 error in its alert.log: ... Sat May 10 02:21:38 2008 WARNING: inbound connection timed out (ORA-3136) ... Authentication SQL When a database session is in the authentication phase, it will issue a sequence of SQL statements. The authentication is not complete until all these are parsed, executed, fetched completely. Some of the SQL statements in this list e.g. on 10.2 are: select value$ from props$ where name = 'GLOBAL_DB_NAME' select privilege#,level from sysauth$ connect by grantee#=prior privilege# and privilege#>0 start with grantee#=:1 and privilege#>0 select SYS_CONTEXT('USERENV', 'SERVER_HOST'), SYS_CONTEXT('USERENV', 'DB_UNIQUE_NAME'), SYS_CONTEXT('USERENV', 'INSTANCE_NAME'), SYS_CONTEXT('USERENV', 'SERVICE_NAME'), INSTANCE_NUMBER, STARTUP_TIME, SYS_CONTEXT('USERENV', 'DB_DOMAIN') from v$instance where INSTANCE_NAME=SYS_CONTEXT('USERENV', 'INSTANCE_NAME') select privilege# from sysauth$ where (grantee#=:1 or grantee#=1) and privilege#>0 ALTER SESSION SET NLS_LANGUAGE= 'AMERICAN' NLS_TERRITORY= 'AMERICA' NLS_CURRENCY= '$' NLS_ISO_CURRENCY= 'AMERICA' NLS_NUMERIC_CHARACTERS= '.,' NLS_CALENDAR= 'GREGORIAN' NLS_DATE_FORMAT= 'DD-MON-RR' NLS_DATE_LANGUAGE= 'AMERICAN' NLS_SORT= 'BINARY' TIME_ZONE= '+02:00' NLS_COMP= 'BINARY' NLS_DUAL_CURRENCY= '$' NLS_TIME_FORMAT= 'HH.MI.SSXFF AM' NLS_TIMESTAMP_FORMAT= 'DD-MON-RR HH.MI.SSXFF AM' NLS_TIME_TZ_FORMAT= 'HH.MI.SSXFF AM TZR' NLS_TIMESTAMP_TZ_FORMAT= 'DD-MON-RR HH.MI.SSXFF AM TZR' NOTE: The list of SQL above is not complete and does not represent the ordering of the authentication SQL . Differences may also exist from release to release. Hangs during Authentication The above SQL statements need to be Parsed, Executed and Fetched as happens for all SQL inside an Oracle Database. It follows that any problem encountered during these phases which appears as a hang or severe slow performance may result in a timeout. Symptoms of such hangs will be seen by the authenticating session as waits for: cursor: pin S wait on X latch: row cache objects row cache lock Other types of wait events are possible; this list may not be complete. The issue here is that the authenticating session is blocked waiting to get a shared resource which is held by another session inside the database. That blocker session is itself occupied in a long-running activity (or its own hang) which prevents it from releasing the shared resource needed by the authenticating session in a timely fashion. This results in the timeout being eventually reported to the authenticating session. Troubleshooting of Authentication hangs In such situations, we need to find out the blocker process holding the shared resource needed by the authenticating session in order to see what is happening to it. Typical diagnostics used in such cases are the following: Three consecutive systemstate dumps at level 266 during the time that one or more authenticating sessions are blocked. It is likely that the blocking session will have caused timeouts to more than one connection attempt. Hence, systemstate dumps can be useful even when the time needed to generate them exceeds the period of a single timeout e.g. 60 sec: $ sqlplus -prelim '/ as sysdba' oradebug setmypid oradebug unlimit oradebug dump systemstate 266 ...wait 90 seconds oradebug dump systemstate 266 ...wait 90 seconds oradebug dump systemstate 266 quit ASH reports covering e.g. 10-15 minutes of a time period during which several timeout errors were seen. If possible, Two consecutive queries on V$LATCHHOLDER view for the case where the shared resource being waited for is a latch. select * from v$latchholder; The systemstate dumps should help in identifying the blocker session. Level 266 will show us in what code it is executing which may help in locating any existing bug as the root cause. Examples of issues which can result in Authentication hangs Unpublished Bug 6879763 shared pool simulator bug fixed by patch for unpublished Bug 6966286 see Note 563149.1 Unpublished Bug 7039896 workaround parameter _enable_shared_pool_durations=false see Note 7039896.8 Other approaches to avoid the problem In some cases, it may be possible to avoid problems with Authentication SQL by pinning such statements in the Shared Pool soon after the instance is started and they are freshly loaded. You can use the following artcile to advise on this: Document 726780.1 How to Pin a Cursor in the Shared Pool using DBMS_SHARED_POOL.KEEP Pinning will prevent them from being flushed out due to inactivity and aging and will therefore prevent them for needing to be reloaded in the future i.e. needing to be reparsed and becoming susceptible to Authentication hang issues. |
针对上面两篇文档的方法我进行汇总一下;大致分为几种情况:
1、参数设置问题
2、防火墙
3、系统压力过大;比如出现一些latch等异常等到导致用户登录系统时进行权限等递归验操作验证失败而出现超时。
4、相关bug;可能影响share pool进而出现第三种情况。
还有一篇mos文档提到,不恰当的MTU设置也可能出现该问题。
实际上这几种问题都跟我这里讲的无关,否则我也没必要写这篇文章了。我这里需要讲是一个参数的问题:
ersion | Parameter | Type | Modifiable | Default Value | Comments | |
12.1.0.2 | sec_protocol_error_further_action | STRING Allows Multiple Values |
ALTER SESSION ALTER SYSTEM (IMMEDIATE) |
(DROP,3) | ||
11.2.0.4 | sec_protocol_error_further_action | STRING Allows Multiple Values |
ALTER SESSION ALTER SYSTEM (IMMEDIATE) |
CONTINUE | ||
11.1.0.7 | sec_protocol_error_further_action | STRING Allows Multiple Values |
ALTER SESSION ALTER SYSTEM (IMMEDIATE) |
N/A | ||
11.1.0.6 | sec_protocol_error_further_action | STRING Allows Multiple Values |
ALTER SESSION ALTER SYSTEM (IMMEDIATE) |
N/A |
上述参数在11.2.0.4版本之前虽然存在,然而没有默认值;实际上这是一个Oracle Net Service方面的增强;在mos文档SEC_PROTOCOL_ERROR_FURTHER_ACTION and Bad Packets (Doc ID 1666398.1)中有一定描述。本质上来讲是进行了一定安全方面增强。比如你将该参数设置为’delay,10’。
对于频繁连接的主机,Oracle可能会将该机器识别为存在潜在攻击威胁的client;进而延迟该主机过来的Connect;而此时这些会话都将报错ora-3136.
为了验证上面这个观点,我在自己的环境中进行了测试,得到日志就是上面的图片内容。
当然当我将参数修改为默认值后,经测试每秒20多个connect都没有问题。
这个问题感觉还是蛮有意思的;不是太多见,给大家分享一下吧。
Leave a Reply
You must be logged in to post a comment.