Tuesday, August 5, 2014

How to avoid abnormal connection drops in Oracle caused by a firewall


Recently, I worked on an application issue where connections were being dropped sporadically, causing business outages. Investigation showed that something between the application server and the database server was dropping the connections, and we suspected firewall timeouts. We wrote a script to run a small test from the application server using sqlplus and found that the timeouts occurred every 60 minutes, which confirmed our guess. The firewall was dropping the application's connections to Oracle whenever there was no activity between the application server and the database server for 60 minutes, i.e. no packets traveling in either direction. Things got worse when the network team refused to disable the firewall, because it had been introduced as part of the Drawbridge program to guard against malfunctioning programs from other networks.
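A test of this kind can be sketched as follows. This is not the exact script we used, only a minimal illustration: the helper name make_idle_probe_sql, the user app_user and the TNS alias ORCL are all hypothetical placeholders. The generated SQL*Plus script runs a query, leaves the session idle with HOST sleep, and then queries again. If something dropped the connection in the meantime, the second query should fail, typically with an error such as ORA-03113 or ORA-03135.

```shell
#!/bin/sh
# Print a SQL*Plus script that checks whether a session survives N idle seconds.
# make_idle_probe_sql is a hypothetical helper name, not an Oracle tool.
make_idle_probe_sql() {
    idle_seconds=$1
    cat <<EOF
WHENEVER SQLERROR EXIT FAILURE
SELECT 'before idle' AS marker FROM dual;
HOST sleep ${idle_seconds}
SELECT 'after idle' AS marker FROM dual;
EXIT
EOF
}

# Example run (assumed user and TNS alias; replace with your own):
#   for minutes in 30 55 65; do
#       make_idle_probe_sql $((minutes * 60)) > probe.sql
#       sqlplus -S app_user@ORCL @probe.sql || echo "dropped after ${minutes} idle minutes"
#   done
```

Trying idle periods on both sides of the suspected limit (here 55 and 65 minutes) is what narrows the timeout down to a specific value.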

So what would you do in this scenario? Here is how to handle such timeouts and connection drops caused by a firewall. There are two things you can implement:

1.) Dead Connection Detection (DCD): DCD is intended for environments where clients power down their systems, or client machines crash unexpectedly, without disconnecting from their Oracle database sessions. DCD cleans up the dead connections left behind by such abnormal client termination.

Another use for DCD is to keep database connections alive when an external firewall is configured to terminate idle connections.

This feature is configured on the Oracle database server by setting sqlnet.expire_time (a value in minutes) in the sqlnet.ora file located in $ORACLE_HOME/network/admin or $TNS_ADMIN. Once it is enabled, SQL*Net on the server sends a "probe" to the client at the interval set in sqlnet.expire_time. This probe is essentially an empty SQL*Net packet that carries no SQL*Net data. If the client connection is alive, the packet is discarded and the timer is reset. If the client has disconnected abnormally, the server receives an error from the send call issued for the probe, and SQL*Net on the server signals the operating system to release the connection's resources.

The probe packet also keeps an idle connection alive when a firewall is configured to terminate such connections. Because a probe is sent every time the timer set in sqlnet.ora expires, the firewall always sees activity between the database and client servers and does not terminate the connection, provided the interval is shorter than the firewall's idle timeout.
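Setting the parameter can be sketched like this. It is a minimal sketch under stated assumptions: set_expire_time is a hypothetical helper name, and the 10-minute value is only an example chosen to sit well below a 60-minute firewall timeout. The function removes any existing SQLNET.EXPIRE_TIME entry from the given sqlnet.ora and appends the new one.

```shell
#!/bin/sh
# Set SQLNET.EXPIRE_TIME (in minutes) in a sqlnet.ora file,
# replacing any existing entry. set_expire_time is a hypothetical helper.
set_expire_time() {
    sqlnet_file=$1
    minutes=$2
    tmp_file="${sqlnet_file}.tmp.$$"
    touch "$sqlnet_file"
    # Keep every line except an existing EXPIRE_TIME setting (case-insensitive).
    grep -iv '^[[:space:]]*SQLNET\.EXPIRE_TIME[[:space:]]*=' "$sqlnet_file" > "$tmp_file" || true
    echo "SQLNET.EXPIRE_TIME = ${minutes}" >> "$tmp_file"
    # Write back through the original file so its permissions are preserved.
    cat "$tmp_file" > "$sqlnet_file"
    rm -f "$tmp_file"
}

# Example (take a backup of sqlnet.ora first):
#   set_expire_time "$ORACLE_HOME/network/admin/sqlnet.ora" 10
```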

After enabling DCD, restart the listener for the change to take effect. The best way to determine whether DCD is enabled and functioning properly is to generate a server trace and search the file for the DCD probe packet. To generate a server trace, set TRACE_LEVEL_SERVER=16 and TRACE_DIRECTORY_SERVER=<path> in sqlnet.ora on the server. The resulting trace file will be named svr_<PID>.trc and will be located in the specified directory.
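The tracing step can be sketched as below. Both helper names are hypothetical, and the default search pattern is an assumption rather than a documented trace message. The exact text Oracle Net writes for DCD can differ between versions, so check a trace from your own system and pass the matching text as the second argument.

```shell
#!/bin/sh
# Append server-side trace settings to a sqlnet.ora file.
# enable_server_trace and find_dcd_traces are hypothetical helper names.
enable_server_trace() {
    sqlnet_file=$1
    trace_dir=$2
    {
        echo "TRACE_LEVEL_SERVER = 16"
        echo "TRACE_DIRECTORY_SERVER = ${trace_dir}"
    } >> "$sqlnet_file"
}

# List svr_<PID>.trc files in a directory whose contents match a pattern.
# The default pattern is an assumption, not a guaranteed trace message.
find_dcd_traces() {
    trace_dir=$1
    pattern=${2:-dead connection}
    # -i ignores case, -l prints only the names of matching files
    grep -il -- "$pattern" "$trace_dir"/svr_*.trc 2>/dev/null
}

# Example:
#   enable_server_trace "$ORACLE_HOME/network/admin/sqlnet.ora" /tmp/nettrace
#   find_dcd_traces /tmp/nettrace
```

Level 16 tracing is verbose, so remove the two trace lines again once you have confirmed that DCD is working.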

If you are on 11gR1, you may need to apply patch 6918493 before using DCD. Please refer to Metalink Doc ID 6918493.8 for details about the bug.


2.) Implement TCP keepalive at OS level: To enable this feature, set the parameters below at OS level. They only affect sockets on which the application has enabled keepalive (the SO_KEEPALIVE socket option).
 i.) tcp_keepalive_time: The interval between the last data packet sent and the first keepalive probe; once the connection has been marked as needing keepalive, this counter is not used any further.
ii.) tcp_keepalive_intvl: The interval between subsequent keepalive probes, regardless of what the connection has exchanged in the meantime.
iii.) tcp_keepalive_probes: The number of unacknowledged probes to send before considering the connection dead and notifying the application layer.

There are two ways to configure keepalive parameters inside the kernel via userspace commands:

  • procfs interface
  • sysctl interface

The procfs interface
This interface requires both sysctl and procfs to be built into the kernel, and procfs mounted somewhere in the filesystem (usually on /proc, as in the examples below). You can read the current values of the parameters by "catting" files in the /proc/sys/net/ipv4/ directory:

  # cat /proc/sys/net/ipv4/tcp_keepalive_time
  7200

  # cat /proc/sys/net/ipv4/tcp_keepalive_intvl
  75

  # cat /proc/sys/net/ipv4/tcp_keepalive_probes
  9
   
The first two parameters are expressed in seconds, and the last is a plain count. This means that the keepalive routines wait for two hours (7200 secs) before sending the first keepalive probe, and then resend it every 75 seconds. If no ACK response is received nine consecutive times, the connection is marked as broken.
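The arithmetic behind these values can be sketched in a few lines of shell. The function names are hypothetical, and the detection time is the usual approximation: idle time before the first probe plus one interval per unanswered probe. The second helper captures the point that matters for a firewall: the first probe has to go out before the firewall's idle timeout expires.

```shell
#!/bin/sh
# Approximate seconds from the last data packet until a dead peer is declared:
# idle time before the first probe plus one interval per unanswered probe.
keepalive_detection_seconds() {
    idle=$1
    intvl=$2
    probes=$3
    echo $((idle + intvl * probes))
}

# Succeeds when the first probe fires before the firewall's idle timeout.
probe_beats_firewall() {
    keepalive_time=$1
    firewall_idle_seconds=$2
    [ "$keepalive_time" -lt "$firewall_idle_seconds" ]
}

keepalive_detection_seconds 7200 75 9    # prints 7875
```

With the defaults, probe_beats_firewall 7200 3600 fails: a 60-minute firewall timeout expires long before the first two-hour probe is sent, which is why the defaults have to be lowered. A value such as 600 passes the check.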

Modifying these values is straightforward: you need to write new values into the files. Suppose you decide to configure the host so that keepalive starts after ten minutes of channel inactivity and then sends probes at intervals of one minute. Because of the high instability of our network trunk and the low value of the interval, suppose you also want to increase the number of probes to 20.
Here's how we would change the settings:

  # echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time

  # echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl

  # echo 20 > /proc/sys/net/ipv4/tcp_keepalive_probes
        
To be sure that everything succeeded, recheck the files and confirm that the new values appear in place of the old ones.
Remember that procfs handles special files, and not every file operation works on them, because they are just an interface into kernel space, not real files. Test your scripts before relying on them, and stick to simple access methods like those in the examples shown earlier.
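The write-then-recheck routine can be sketched as a small helper. The name set_and_verify and the KEEPALIVE_DIR variable are hypothetical. The variable exists so that the script can be tried against an ordinary directory before it is pointed at the live procfs tree, where writing requires root.

```shell
#!/bin/sh
# Write a keepalive setting and read it back to confirm it took effect.
# set_and_verify and KEEPALIVE_DIR are hypothetical names; override
# KEEPALIVE_DIR to do a dry run against a normal directory.
KEEPALIVE_DIR=${KEEPALIVE_DIR:-/proc/sys/net/ipv4}

set_and_verify() {
    name=$1
    value=$2
    target="$KEEPALIVE_DIR/$name"
    echo "$value" > "$target" || return 1
    actual=$(cat "$target")
    if [ "$actual" = "$value" ]; then
        echo "$name = $actual"
    else
        echo "$name: wanted $value, got $actual" >&2
        return 1
    fi
}

# Example (as root):
#   set_and_verify tcp_keepalive_time 600
#   set_and_verify tcp_keepalive_intvl 60
#   set_and_verify tcp_keepalive_probes 20
```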
You can also access the interface through the sysctl(8) tool, specifying what you want to read or write.

  # sysctl \
  > net.ipv4.tcp_keepalive_time \
  > net.ipv4.tcp_keepalive_intvl \
  > net.ipv4.tcp_keepalive_probes
  net.ipv4.tcp_keepalive_time = 7200
  net.ipv4.tcp_keepalive_intvl = 75
  net.ipv4.tcp_keepalive_probes = 9
        

Note that sysctl names are very close to procfs paths. Writes are performed using the -w switch of sysctl(8):

  # sysctl -w \
  > net.ipv4.tcp_keepalive_time=600 \
  > net.ipv4.tcp_keepalive_intvl=60 \
  > net.ipv4.tcp_keepalive_probes=20
  net.ipv4.tcp_keepalive_time = 600
  net.ipv4.tcp_keepalive_intvl = 60
  net.ipv4.tcp_keepalive_probes = 20
        

Note that sysctl(8) doesn't use the sysctl(2) syscall, but reads and writes directly in the procfs subtree, so you will need procfs enabled in the kernel and mounted in the filesystem, just as you would if you accessed the files within the procfs interface directly. sysctl(8) is just a different way to do the same thing.
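The relationship between the two naming schemes is simple enough to sketch: replace the dots in the sysctl name with slashes and prefix /proc/sys/. The function name sysctl_to_procfs is hypothetical, and the sketch assumes procfs is mounted on /proc.

```shell
#!/bin/sh
# Convert a sysctl name to the procfs path that sysctl(8) reads and writes.
# Assumes procfs is mounted on /proc.
sysctl_to_procfs() {
    echo "/proc/sys/$(echo "$1" | tr '.' '/')"
}

sysctl_to_procfs net.ipv4.tcp_keepalive_time
# prints /proc/sys/net/ipv4/tcp_keepalive_time
```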

The sysctl interface
There is another way to access kernel variables: the sysctl(2) syscall. It can be useful when you don't have procfs available, because the communication with the kernel is performed directly via the syscall and not through the procfs subtree. There is currently no program that wraps this syscall (remember that sysctl(8) doesn't use it). Note that sysctl(2) was long deprecated and has since been removed from Linux (as of kernel 5.5), so on current kernels the procfs route is the one to use.
