读书人

dbcp重连失败有关问题排查记录(timeou

发布时间: 2012-10-25 10:58:57 作者: rapoo

dbcp重连失败问题排查记录(timeout设置)
背景

??前几天很悲剧,公司的oracle小型机据说是网卡驱动有问题,导致数据库整整挂了1个多小时,后来切换到了备份数据库上,可用性一下子跌倒了3个9。

?

??在后续的某一天,凌晨3点进行了网卡升级,又从备份库重新切回主库的时,测试过程中发现dbcp的自动重连部分机器有问题。

问题描述

整个过程出现了两种错误日志:?

异常1:?

?

Caused by: java.sql.SQLException: The Network Adapter could not establish the connection        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:131)        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:197)        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:525)        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:413)        at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:508)        at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:203)        at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:33)        at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:510)        at com.alibaba.china.jdbc.SimpleDriver.connect(SimpleDriver.java:101)        at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)        at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)        at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148)        at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79)        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)        at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)        at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)        at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190)        ... 49 moreCaused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection        at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:328)        at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:421)        at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:630)        at oracle.net.ns.NSProtocol.connect(NSProtocol.java:206)        at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:966)        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:292)        ... 62 moreCaused by: java.net.ConnectException: Connection timed out        at java.net.PlainSocketImpl.socketConnect(Native Method)        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)        at java.net.Socket.connect(Socket.java:525)        at java.net.Socket.connect(Socket.java:475)        at java.net.Socket.<init>(Socket.java:372)        at java.net.Socket.<init>(Socket.java:186)        at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:127)        at oracle.net.nt.ConnOption.connect(ConnOption.java:126)        at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:306)        ... 67 more

异常2:

?

Caused by: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)        at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)        at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)        at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190)        ... 49 moreCaused by: java.util.NoSuchElementException: Timeout waiting for idle object        at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1134)        at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79)        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)        ... 52 more

?

下面再看一下,我们的dbcp配置: (关于dbcp的配置,可以先看下这两篇文章:?dbcp基本配置和重连配置?, ??解读dbcp自动重连那些事??)

?

<bean id="dataSource" destroy-method="close"> <property name="driverClassName" value="oracle.jdbc.OracleDriver" /><property name="url" value="${database.driver.url}" /><property name="username"><value>${database.driver.username}</value></property><property name="password" value="${database.driver.password}" /><property name="maxActive"><value>14</value></property><property name="initialSize"><value>1</value></property>        <property name="maxWait"><value>60000</value></property>        <property name="maxIdle"><value>14</value></property>        <property name="minIdle"><value>1</value></property>        <property name="removeAbandoned"><value>true</value></property>        <property name="removeAbandonedTimeout"><value>180</value></property><property name="timeBetweenEvictionRunsMillis"><value>60000</value></property><property name="minEvictableIdleTimeMillis"><value>1800000</value></property><property name="connectionProperties"><value>bigStringTryClob=true;clientEncoding=GBK;defaultRowPrefetch=50;serverEncoding=ISO-8859-1</value></property></bean>

?

这是oracle的配置,mysql配置时多了个validate配置。

?

问题分析?

?

1. 从日志分析来看,对应异常1的只在前面出现了百来次之后,后续的请求都直接抛出了异常2。

2. 从页面使用来看,每次页面返回502错误,一般是等待了60s左右。 这个和maxWait参数配置的60s接近。

3. 当时验证的两组集群,一组业务的机器没问题,全都自动重链ok的。 另一组中,有一台预发布机也能自动重连,但就几台提供正式访问的机器自动重连一直失败.

4. 针对自动重连失败的机器,netstat -nat | grep 1521查看链接为0

?

通过这几个现象,和查看了dbcp的源码,大致的一个猜想:?

1. ?假设说原先的应用中active的链接数为maxActive

2. ?在数据库进行关闭后,原先的连接池中的异常会拿到"Broken pipe"异常,从日志中分析也就只有几条,符合。

?? ? 如果正在使用的连接会抛出:java.sql.SQLException: 无法从套接字读取更多的数据。

?? ? 如果拿到链接正准备使用,会抛出:?java.sql.SQLException: OALL8 处于不一致状态

3. ?等这些链接,被淘汰完之后,新的请求会尝试调用makeObject()重新去创建链接,因为此时的数据库的服务已经关闭,(理论上应该是"Connection refused",直接是端口不可访问, 但这个估计和我们采用了虚拟ip做数据库热备有关,后面会进行虚拟ip到真实数据库的一个路由过程),所以最后这里抛出了“java.net.ConnectException: Connection timed out”

4. ?因为数据库重启过程,机器始终是对外提供服务,而且流量还不小。所以很容易堆积这样正在执行makeObject()创建链接的请求,一旦超过了一个阀值后,后续的链接就会是一个block的过程,block时间到了60s后就直接抛出异常。到时具体看一下dbcp的源码

?

所以总结一下:

1. ?数据库关闭一段时间后,请求阻塞在makeObject()创建链接过程,一直没有释放,导致后续的请求全都是处于block,根本没机会去makeObject()创建链接,即使这时数据库已经重启后恢复了,应用还是无法正常提供服务。

2. ?一般的数据库临时性维护,比如重启调整配置,热备数据库切换,网络瞬间断开。这一类问题,基本上在下一次makeObject()时就可以创建好新的链接,不会处于一个长期的阻塞状态,所以自动重连基本没啥影响。

?

按照这样的解释,因自己能力有限,不能搭建这么一套热备的数据库环境,所以很难再次重现这样的情况。而且基本上数据库维护也就是2这样的情况,基本重链没啥问题。

?

看一下代码分析:

dbcp使用了common-pools,看一下每次从连接池中获取链接的过程:

?

public Object borrowObject() throws Exception {        long starttime = System.currentTimeMillis();        Latch latch = new Latch();        byte whenExhaustedAction;        long maxWait;        synchronized (this) {            // Get local copy of current config. Can't sync when used later as            // it can result in a deadlock. Has the added advantage that config            // is consistent for entire method execution            whenExhaustedAction = _whenExhaustedAction;            maxWait = _maxWait;            // Add this request to the queue            _allocationQueue.add(latch);                                                                     // Work the allocation queue, allocating idle instances and            // instance creation permits in request arrival order            allocate();                                                                                      // 位置1        }        for(;;) {            synchronized (this) {                assertOpen();            }            // If no object was allocated from the pool above            if(latch.getPair() == null) {                // check if we were allowed to create one                if(latch.mayCreate()) {                    // allow new object to be created                } else {                    // the pool is exhausted                    switch(whenExhaustedAction) {                        case WHEN_EXHAUSTED_GROW:                            省略.....                        case WHEN_EXHAUSTED_FAIL:                            省略.....                        case WHEN_EXHAUSTED_BLOCK:                                       // 位置5                            try {                                synchronized (latch) {                                    // Before we wait, make sure another thread didn't allocate us an object                                    // or permit a new object to be created                                    if (latch.getPair() == null && !latch.mayCreate()) {                                        if(maxWait <= 0) {                                            latch.wait();                                        } else {                                            // this code may be executed again after a notify then continue cycle                                            // so, need to calculate the amount of time to wait                                            final long elapsed = (System.currentTimeMillis() - starttime);                                            final long waitTime = maxWait - elapsed;                                            if (waitTime > 0)                                            {                                                latch.wait(waitTime);                                            }                                        }                                    } else {                                        break;                                    }                                }                            } catch(InterruptedException e) {                                Thread.currentThread().interrupt();                                throw e;                            }                            if(maxWait > 0 && ((System.currentTimeMillis() - starttime) >= maxWait)) {                                synchronized(this) {                                    // Make sure allocate hasn't already assigned an object                                    // in a different thread or permitted a new object to be created                                    if (latch.getPair() == null && !latch.mayCreate()) {                                        // Remove latch from the allocation queue                                        _allocationQueue.remove(latch);                                    } else {                                        break;                                    }                                }                                throw new NoSuchElementException("Timeout waiting for idle object");     // 位置6                            } else {                                continue; // keep looping                            }                        default:                            throw new IllegalArgumentException("WhenExhaustedAction property " + whenExhaustedAction +                                    " not recognized.");                    }                }            }            xxxx }private synchronized void allocate() {        if (isClosed()) return;        // First use any objects in the pool to clear the queue        for (;;) {            if (!_pool.isEmpty() && !_allocationQueue.isEmpty()) {                                                 // 位置2                Latch latch = (Latch) _allocationQueue.removeFirst();                latch.setPair((ObjectTimestampPair) _pool.removeFirst());                _numInternalProcessing++;                synchronized (latch) {                    latch.notify();                }            } else {                break;            }        }        // Second utilise any spare capacity to create new objects        for(;;) {            if((!_allocationQueue.isEmpty()) && (_maxActive < 0 || (_numActive + _numInternalProcessing) < _maxActive)) {             // 位置3                Latch latch = (Latch) _allocationQueue.removeFirst();                latch.setMayCreate(true);                _numInternalProcessing++;                                                                                // 位置4                synchronized (latch) {                    latch.notify();                }            } else {                break;            }        }    }

?

?

说明:?

位置1: 每个数据库请求都会先调用allocate()方法

位置2: 先检查当前空闲的数据库连接,如果有直接拿一个返回,客户端就直接拿到了链接,因为我们没配置testOnBorrow参数,所以客户端会拿到一个Broken pipe的链接

位置3: 当原先没有空闲的链接,检查一下条件:?_numActive + _numInternalProcessing < _maxActive。 注意我们的maxActive设置了14。 注意这里的_numInternalProcessing参数

?

位置4: 增加_numInternalProcessing计数,表明当前正在处理中的连接数。后续在makeObject成功后,就会执行_numInternalProcessing--操作,numInternalProcessing的意即为当前正在处理的连接数,计算maxActive也将其计算在内

位置5:就是当位置3的条件无法满足时,说白了就是_numInternalProcessing = _maxActive = 14时,后续的请求都会进入WHEN_EXHAUSTED_BLOCK,进行maxWait时间的等待,超过时间也就出现了对应的异常2的内容。

?

?

看一下makeObject创建链接过程的一些timeout时间设置:

?

因为从上面的分析来看,造成无法自动重链的问题主要是makeObject进行了一个阻塞动作。是否可以设置合理的socket参数,一般我们知道的主要是两个参数: connectTimeout , ?readTimeout。

这一类timeout设置,都应该石由对应的链接driver驱动提供。

?

mysql : ?(com.mysql.jdbc.Driver) ?

方式1: 直接jdbc url中指定参数

?

jdbc:mysql://127.0.0.1:3066/test?connectTimeout=3000&socketTimeout=60000

?

方式2: 通过dbcp传递properties参数

?

<bean id="dataSource" destroy-method="close">         ..............<property name="connectionProperties"><value>connectTimeout=3000&socketTimeout=60000</value></property></bean>

?

具体参见文档:?http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html?

?

oracle : (oracle.jdbc.OracleDriver) , 没有找到对应的官方文档,通过反编译源码找到配置项

?

oracle.net.nt.ConnStrategy中定义: ?

?

public void createSocketOptions(Properties paramProperties)  {    String str1 = null;    String str2 = null;    int i = 0;    Enumeration localEnumeration = paramProperties.keys();    while (localEnumeration.hasMoreElements())    {      str1 = (String)localEnumeration.nextElement();      if (str1.equalsIgnoreCase("TCP.NODELAY"))      {        i = 1;        str2 = paramProperties.getProperty("TCP.NODELAY").toUpperCase();        if (str2.equals("NO"))          this.socketOptions.put(new Integer(0), "NO");        else          this.socketOptions.put(new Integer(0), "YES");      }      else if (str1.equalsIgnoreCase("oracle.net.READ_TIMEOUT"))      {        str2 = paramProperties.getProperty("oracle.net.READ_TIMEOUT");        this.socketOptions.put(new Integer(3), str2);      }      else if (str1.equalsIgnoreCase("oracle.net.CONNECT_TIMEOUT"))      {        str2 = paramProperties.getProperty("oracle.net.CONNECT_TIMEOUT");        this.socketOptions.put(new Integer(2), str2);      }      else if (str1.equalsIgnoreCase("oracle.net.ssl_server_dn_match"))      {        str2 = paramProperties.getProperty("oracle.net.ssl_server_dn_match");        this.socketOptions.put(new Integer(4), str2);      }      else if (str1.equalsIgnoreCase("oracle.net.wallet_location"))      {        str2 = paramProperties.getProperty("oracle.net.wallet_location");        this.socketOptions.put(new Integer(5), str2);      }      else if (str1.equalsIgnoreCase("oracle.net.ssl_version"))      {        str2 = paramProperties.getProperty("oracle.net.ssl_version");        this.socketOptions.put(new Integer(6), str2);      }      else if (str1.equalsIgnoreCase("oracle.net.ssl_cipher_suites"))      {        str2 = paramProperties.getProperty("oracle.net.ssl_cipher_suites");        this.socketOptions.put(new Integer(7), str2);      }      else if (str1.equalsIgnoreCase("javax.net.ssl.keyStore"))      {        str2 = paramProperties.getProperty("javax.net.ssl.keyStore");        this.socketOptions.put(new Integer(8), str2);      }      else if (str1.equalsIgnoreCase("javax.net.ssl.keyStoreType"))      {        str2 = paramProperties.getProperty("javax.net.ssl.keyStoreType");        this.socketOptions.put(new Integer(9), str2);      }      else if (str1.equalsIgnoreCase("javax.net.ssl.keyStorePassword"))      {        str2 = paramProperties.getProperty("javax.net.ssl.keyStorePassword");        this.socketOptions.put(new Integer(10), str2);      }      else if (str1.equalsIgnoreCase("javax.net.ssl.trustStore"))      {        str2 = paramProperties.getProperty("javax.net.ssl.trustStore");        this.socketOptions.put(new Integer(11), str2);      }      else if (str1.equalsIgnoreCase("javax.net.ssl.trustStoreType"))      {        str2 = paramProperties.getProperty("javax.net.ssl.trustStoreType");        this.socketOptions.put(new Integer(12), str2);      }      else if (str1.equalsIgnoreCase("javax.net.ssl.trustStorePassword"))      {        str2 = paramProperties.getProperty("javax.net.ssl.trustStorePassword");        this.socketOptions.put(new Integer(13), str2);      }      else      {        if (!str1.equalsIgnoreCase("ssl.keyManagerFactory.algorithm"))          continue;        str2 = paramProperties.getProperty("ssl.keyManagerFactory.algorithm");        this.socketOptions.put(new Integer(14), str2);      }    }    if ((i == 0) && (!this.reuseOpt))      this.socketOptions.put(new Integer(0), "YES");  }

?

?

指定参数:?

?

?

<bean id="dataSource" destroy-method="close">     ..............<property name="connectionProperties"><value>oracle.net.CONNECT_TIMEOUT=3000&oracle.net.READ_TIMEOUT=60000</value></property></bean>

?

?

最后

因为没法模拟热备的oracle环境,对虚ip机制也不了解。所以针对前面提到的makeObject针对数据库关闭时,不是返回"Connection refused",而是返回给我"Connection timed out"

最后看了下,PlainSocketImpl.c, socket的native的一些实现,苦于知识缺乏,以后有机会再研究一下。

异常列表:

?

if (connect_rv == JVM_IO_INTR) {    JNU_ThrowByName(env, JNU_JAVAIOPKG "InterruptedIOException",    "operation interrupted");} else if (errno == EPROTO) {    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ProtocolException",                           "Protocol error");} else if (errno == ECONNREFUSED) {    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException",                           "Connection refused");} else if (errno == ETIMEDOUT) {    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException",                           "Connection timed out");} else if (errno == EHOSTUNREACH) {            NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException",                           "Host unreachable");        } else if (errno == EADDRNOTAVAIL) {            NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException",                             "Address not available");        } else if ((errno == EISCONN) || (errno == EBADF)) {            JNU_ThrowByName(env, JNU_JAVANETPKG "SocketException",                            "Socket closed");} else {    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "SocketException", "connect failed");}return;    }

?

还有线上通过Btrace抓取了下,oracle驱动运行时的timeout参数设置:?

?

@BTracepublic class OracleTimeoutTracer {    @OnMethod(clazz = "oracle.net.nt.ConnStrategy", method = "createSocketOptions", location = @Location(value = Kind.ENTRY))    public static void strategy(@Self Object self, Properties paramProperties) {        println(str(paramProperties));    }    @OnMethod(clazz = "oracle.net.nt.TcpNTAdapter", method = "connect", location = @Location(value = Kind.ENTRY))    public static void adapter(@Self Object self) {        Field socketTimeoutField = field("oracle.net.nt.TcpNTAdapter", "sockTimeout");        Field hostField = field("oracle.net.nt.TcpNTAdapter", "host");        Field portField = field("oracle.net.nt.TcpNTAdapter", "port");        int scoketTimeout = (Integer) get(socketTimeoutField, self);        int port = (Integer) get(portField, self);        String host = (String) get(hostField, self);        println(strcat(strcat(strcat(strcat(strcat("host : ", host), ", port : "), str(port)), " , socketTimeout : "),                       str(scoketTimeout)));    }}

?

输出结果:

?

{user=alibaba, password=xxxxx, bigStringTryClob=true, defaultRowPrefetch=50, protocol=thin}host : 10.20.36.19, port : 1521 , socketTimeout : 0

貌似默认的超时时间为0,从oracle driver的代码来看,如果没有显示设置timeout参数,默认都是socket.connectTimeout = socket.setSoTimeout = 0;

socket connect连接时,对应的超时时间为0,按照文档描述是没有timeout限制。 但针对Ip,Port端口不可访问是立马就会知道是个"Connection refused",用telnet测试即可。

?

针对虚ip(ip漂移技术),下次有机会研究和验证下。

1 楼 khotyn 2011-10-19 问一个和本文不太相关的问题,你的数据库连接池里面配置了bigStringTryClob,这个属性的用处什么? 2 楼 afeiod 2012-07-27 不错,有深入

读书人网 >其他数据库

热点推荐