ProFTPD: KeepAlives

Overview of KeepAlive
Any discussion of so-called "keep-alive" functionality must start by answering the question: "What is does 'keep-alive' mean?" As one specification succintly states:

  A "keep-alive" mechanism periodically probes the other end of a connection
  when the connection is otherwise idle, even when there is no data to be sent.

Keepalive mechanisms appear in many different protocols, under various names. Protocols may be layered over the top of another protocol; for example, HTTPS consists of HTTP layered over SSL/TLS, itself layered over TCP/IP. Each protocol layer may have its own form of such keepalive functionality.

TCP KeepAlive
For TCP, the definitive specification for keepalive functionality is RFC 1122, Section 4.2.3.6. While RFC 1122 includes a good discussion of why TCP keepalives are meant to be off by default, in practice TCP keepalives do have value, especially when dealing with network equipment (such as routers, firewalls, NATs) between the client and the server; such equipment might terminate the connection when the connection has been idle for too long. The use of TCP keepalives can help to prevent such network equipment from breaking the connection needlessly.

How TCP KeepAlives Works
OK, so you want to use the TCP keepalive functionality in your program. The question is "How exactly does the TCP keepalive feature work?" Good question. Answering this requires three different numeric values: the idle time, the number of probes to send (the probe count), and the interval time between each probe. Remember, though, that the SO_KEEPALIVE TCP socket option must be enabled on the socket in order for the TCP keepalive feature to be used.

First, let's assume that you have created a TCP connection, and have transferred data back and forth on that connection. All of the data have been transferred, but you have not closed the connection, so now it is idle. How long does that connection sit idle, with no data transferred at the TCP layer, before one end of the connection or the other starts to wonder whether the connection is still alive? This amount of time where the connection sits idle is the TCP keepalive idle time; the default idle time is two hours (per RFC 1122).

Our TCP connection has been sitting idle now for amount of time given by the idle time value; what happens then? At this point, the end of the TCP connection with TCP keepalive enabled sends out a "probe". This probe is just a small TCP packet which requires a response from the other side. Once the probe has been sent, the amount of time given by the interval time value passes. If we hear nothing back from the remote peer within the interval time, we send another probe. This process repeats until either a) we receive a response back from the peer, or b) the probe count value has been reached.

Let us assume that our TCP connection was idle for so long that TCP keepalive probes were sent, and still no response was received. What happens then? At this point, the connection is broken. When the programs at either end of the connection next try to read or write data on that connection, the read/write attempts will fail.

When To Use TCP KeepAlive
When the TCP client is connected directly to the TCP server, it usually does not matter whether one end or the other uses TCP keepalive. As long as one of them does, a broken connection can be detected.

    client <-------------------------------> server

However, if the TCP client connects to the TCP server via proxies/routers/firewalls/NAT, the picture changes. When this happens (and it is the common scenario), then both sides may need to use TCP keepalive to learn when their side of the proxied connection is broken:

    client <-----------> NAT <-------------> server

In this situation, there are actually two different TCP connections involved: between the client and the NAT, and between the NAT and the server. Each TCP connection may break independently of the other, which is why both ends of the connection (client and server) may need to use TCP keepalives. Use of TCP keepalives also helps here because when the router/firewall/NAT receives the TCP keepalive probe, it may (depending on the network equipment in question) cause the router to reset any timers that were about to close the TCP connections on either side.

Why are TCP KeepAlives Useful for FTP?
"This is all very fascinating", you say, "but what does it have to do with proftpd and FTP?" If you have ever had an FTP download (or upload) take a very long time, only to have that transfer timed out in the middle, then TCP keepalives may prevent the timeout.

Consider what happens for FTP transfers which take a long time (either due to very large file(s) being transferred, or a slow connection): you have one TCP connection for the control connection, and a separate TCP connection for the data transfer connection. All of the bytes are being transferred over the data connection, so that data connection is certainly not idle -- but while the data transfer is occurring, the control connection is idle! And let's assume that your FTP connections are going through some NAT device in between the client and the server. That NAT may not be very smart; it may not know that the two different TCP connections of your FTP session are related to each other; it only sees one idle TCP connection, and one busy TCP connection. If that FTP control connection is idle for too long, then the NAT may close it (in order to keep valuable space in its state tables available for TCP connections that actually need to transfer bytes). (Some NATs have been known to close TCP connections that have been idle for only 5 minutes.) The FTP server sees that the FTP control connection is closed, and aborts the data transfer. What a mess!

If either the FTP server or the FTP client had used TCP keepalives on the control connection, then maybe that NAT would have seen the TCP keepalive probes, and not closed the idle control connection. So how can we make sure that either the client or the server has TCP keepalives enabled?

In proftpd-1.3.5rc1 and later, ProFTPD's SocketOptions directive supports a keepalive parameter for controlling whether the server uses TCP keepalives, e.g.:

  # Disable use of TCP keepalives
  SocketOptions keepalive off

  # Enable use of TCP keepalives (this is the default)
  SocketOptions keepalive on

In addition, on some Unix platforms, the SocketOptions directive's keepalive parameter can do finer-grained tuning of the TCP keepalive values:

  # Enable use of TCP keepalives, with the given idle/count/interval values
  SocketOptions keepalive 7200:9:75

In general, though, you should use the system-wide defaults unless you are running into data transfer timeout issues. If you are seeing timeouts, try using the keepalive parameter of SocketOptions to gradually reduce the idle timeout by small increments (e.g. 10-15 seconds), then if that does not help, increment the count by 1 at a time (remember that each probe is more extra data transfer), then if that still does not help, increase the interval time. Do not reduce the interval time, since that is the amount of time that you should wait to see if the other end responds, before sending another probe. Waiting less time before the other end responds means a greater chance of killing your TCP connection unnecessarily.

Not all TCP stacks let the application control the TCP keepalive timeout after which the first probe will be sent, or the total number of probes sent, or how much time between probes will be used. That is, many TCP stacks only allow enabling/disabling of TCP keepalive. If TCP keepalive is enabled, then the standard values of 2 hours for the idle timeout, a count of 9 probes, with 75 seconds between probes, will be used.

Since many platforms do not allow fine-grained tuning of TCP keepalive values, especially on a per-service basis, other means for checking whether the connection is still alive must be used. And that leads us to application-level keepalive mechanisms.

FTP KeepAlive
If tuning TCP keepalives does not work to keep your long-lasting data transfers from timing out, what can be done? Answer: use keepalive features at other layers in the protocol stack. FTP has its own ideas for doing keepalive checks, but they are not as elegant as that of TCP.

The main issue with FTP keepalives is that they are all initiated by the client. FTP is a request/response model, and it does not allow for the FTP server to send arbitrary unrequested data to the FTP client via the control connection. Fortunately, there are many FTP clients which implement some sort of FTP keepalive feature.

How FTP KeepAlive Works
Since the FTP server cannot do anything to test whether the FTP session is alive, the FTP client must do the tests. The easiest way to test whether an FTP session is alive is to send an FTP command. And fortunately, RFC 959, Section 4.1.3 defines the NOOP ("No Operation") command whose sole purpose is to elicit the "OK" response from the FTP server. This makes the NOOP command the ideal way to test whether the FTP server is still alive and listening to the FTP client.

Some firewalls/routers know about this NOOP trick, though, and may filter out/drop that FTP command. FTP clients, then, have been known to resort to a number of other FTP commands for use as FTP keepalives, including:

NOOP
LIST
STAT
CWD
REST 0
PWD
TYPE A

Some FTP clients even choose a command at random from the above list, just to keep any interfering router/firewall/NAT guessing!

Sadly, some FTP servers cannot handle receiving an FTP command on the control connection while they are in the middle of transferring data on the data connection (proftpd can handle this). But that may not matter, for the purposes of FTP keepalives; all that matters is that at the TCP level, the bytes were sent by the client and acknowledged by the server's TCP stack.

FTP Client-Specific KeepAlive Settings
Not every FTP client supports the FTP keepalive functionality. If you want to try out FTP clients which do support FTP keepalives, you might look into the ftp:nop-interval setting for lftp, or the control-timeout setting for ncftp.

SSH KeepAlive
The SSH2 protocol (and SFTP, which runs over SSH2) is more complex than FTP, and thus has much better support for application-level keepalive functionality. Either end of an SSH2 connection can send messages at any time.

How SSH KeepAlive Works
The usual mechanism used by SSH2 implementations for implementing an SSH2-level keepalive check is send either a CHANNEL_REQUEST or a GLOBAL_REQUEST message for a known unsupported command, and to request a response from the other side. It does not matter that the requested command is unsupported; all that matters is that the response comes back, signifying that the other end is still alive and listening. Both clients and servers use this technique.

In the case of ProFTPD's mod_sftp module, the way to configure SSH2 keepalives is the SFTPClientAlive directive. When configured, the mod_sftp module sends CHANNEL_REQUEST/GLOBAL_REQUEST messages for "keepalive@proftpd.org" in order to solicit a response from the connected client.

KeepAlive in Other Protocols
Many application protocols end up reinventing the keepalive feature in some way, usually as a "ping/pong" mechanism where a "ping" is sent every so often by one side, with a "pong" response expected from the other end of the connection.

HTTP
Most HTTP connections have no need of a keepalive mechanism since HTTP connections are usually short-lived, and since there are usually data flowing in one direction or the other on the HTTP connection (thus an HTTP connection is usually not idle for long enough time to warrant a keepalive feature). HTTP long polling (i.e. RFC 6202) is an exception; and for HTTP long polling connections, use of TCP keepalives may be needed. But the HTTP protocol itself does not specifically define a way for either end to arbitrarily send data across the connection for the purpose of determining whether the connection is still alive. (HTTP keepalive refers to a different concept, i.e. that of telling the server to not close the connection after sending its response so that the connection can be reused, thus "kept alive".)

LDAP
For long-lived LDAP connections, keepalive functionality can be implemented by using the Abandon operation, as described here. The idea is to have the client send a request that it knows the server will ignore/discard; the act of transmitting the request over the connection acts to keep any intermediaries on the network (router/NAT/firewall) from closing an "idle" connection prematurely.

WebSocket
The WebSocket protocol, defined in RFC 6455, does have need of a keepalive mechanism, since it establishes a long-lived connection. Thus does the RFC define ping/pong messages; see Section 5.5.2 (PING), and Section 5.5.3 (PONG).

Additional Reading

Frequently Asked Questions

Question: When should I use application-level keepalives like FTP or SSH2 keepalives instead of TCP keepalives? Or can I use them all at the same time?
Answer: There is no issue with using the different types of keepalives at the same time; remember that each type of keepalive functions at a different protocol layer.

That said, if you have to choose, I would recommend using the application-level keepalive feature when you can, as opposed to TCP keepalives. Why? First, you probably have more control over when the application-level keepalive feature is used than if you used TCP keepalives (e.g. you can use the application-level keepalive check more frequently than the TCP keepalive feature would detect a dead connection).

Second, TCP keepalives may not cause a NAT to keep the TCP connection on each side open, but an application-level keepalive should do so, since the application-level keepalive data must traverse both connections:

           <------------ SSH2 REQUEST ----------->
           -------------- FTP NOOP -------------->
    client <---------------> NAT <---------------> server
           <-- TCP probe -->     <-- TCP probe -->

Question: Why should I use SocketOptions to configure proftpd's TCP keepalive settings, as opposed to changing some of the sysctls on my machine that apply to TCP KeepAlives?
Answer: Using the SocketOptions directive means that your TCP keepalive tunings will only affect the FTP/SFTP connections to proftpd, instead of applying to all TCP connections to your machine. FTP/SFTP sessions, being long-lived, probably have different keepalive timing needs than from other TCP connections, so it is best to tune their settings separately, without impacting all connections to that machine.

Question: If TCP keepalives are so useful, why not tune them to be quite short? Why does RFC 1122 recommend a default of two hours before checking if the peer is still there?
Answer: There are a couple of reasons why you might not want to tune your TCP keepalive settings to be shorter.

First, keep in mind that TCP was designed to keep the TCP level connection "alive" even though various parts of the underlying hardware, such as routers, firewalls, etc might crash and be rebooted in the middle of things. This is why TCP retransmits lost packets, routes around unresponsive hops, etc. If, then, your TCP keepalive settings are aggressively short, your TCP connection may be shut down (due to not responding to TCP keepalive probes in time) because of a crashed network component. The recommended default of two hours allows for such network route changes without losing the TCP connection.

Second, every TCP keepalive probe counts as more data transferred over your network link. Some links may have exorbitantly high rates for transferred bytes (think satellite links), so on such links, keeping the number of bytes transferred down amounts to cost savings. Tuning TCP keepalive to use shorter times on such links means more data transferred, thus higher costs. On other links, where data transfer rates are cheap, the additional bytes transferred due to shorter TCP keepalive settings may be negligible.

Last, many people argue that use of TCP keepalives may consume unnecessary bandwidth. The question becomes "If no one is using the connection, who cares if the connection is still good?" (To be fair, this argument makes more sense when the connected peers are not separated by proxies, gateways, and NATs.)