keepalive 0.1 README
This product supplies a tool and some helpers for creating a "keepalive" configuration between a ZEO storage server and one or more clients (typically Zope application server processes).
This configuration is intended to keep "enough" traffic moving across each client-server connection to defeat TCP-violating "middlemen" (firewalls, routers, etc.), which have been observed to abort idle connections without doing proper TCP teardown on them.
The symptom in such a case is that one of the pair (the client) believes, even at the kernel level, that its connection to the other remains open; the other endpoint (the server) typically sees the connection close, and logs that. The hapless client usually ends up blocked on a read from the server which can never be satisfied, and must be manually restarted or "hupped" to recover.
The irony here is that ZEO's caching actually contributes to the problem: if the "working set" in an application server's cache is coherent with its usage patterns (reads), it doesn't need to send any packets to the storage server, and thus falls prey to the "idle timeout".
Defeating such hostile behavior at the application level is a bit of a kludge: essentially, we must create enough non-cacheable activity on each client to force periodic writes / reads to the storage server, with a frequency high enough to keep the connection from appearing idle.
The 'keepalive' product helps the site manager construct a configuration which generates some traffic, using the following components:
- A ZODB-based tool, which stores state for each client (minimally, a timestamp), based on a user-defined key. The ZEO traffic in the configuration will be primarily writes and reads to this per-client state.
- A backported version of the ZServer.Clockserver shipped with Zope 2.10.x. The clock server allows the site manager to configure the traffic across the ZEO connection without requiring an external trigger such as cron.
These two components together can be used to configure a "chatty" protocol between each ZEO client and the storage server.
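The per-client state described above can be pictured with a minimal sketch. It is only an illustration: the class name 'KeepaliveState' and its methods are hypothetical, and a plain dict stands in for the ZODB-backed storage that the real tool relies on to generate ZEO traffic.

```python
import time


class KeepaliveState:
    """Hypothetical stand-in for the tool's per-client state.

    A plain dict is used here; the real tool keeps its state in the
    ZODB, which is what forces traffic across the ZEO connection.
    """

    def __init__(self):
        # client key -> time of last update, or None if never updated
        self._timestamps = {}

    def add_client(self, key):
        # Register a key without a timestamp; existing entries are kept.
        self._timestamps.setdefault(key, None)

    def update_client(self, key, now=None):
        # Record a fresh timestamp; in a ZODB-backed tool this is the write.
        self._timestamps[key] = time.time() if now is None else now
        return self._timestamps[key]

    def last_seen(self, key):
        # Reading the state back is the other half of the traffic.
        return self._timestamps.get(key)


state = KeepaliveState()
state.add_client("beta")
state.update_client("alpha", now=1000.0)
```

Because every update changes stored state, it cannot be satisfied from the client's cache, which is the property the configuration depends on.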
In a hypothetical deployment, Joe, the site manager, has three application server instances, 'alpha', 'beta', and 'gamma'. 'alpha' and 'beta' are running on a single dual-core box, while 'gamma' runs on the same box as 'delta', the storage server.
In this environment, Joe has observed that connections between the clients and the storage seem to be aborted if they remain idle for longer than 5 minutes (300 seconds).
Joe adds the 'keepalive' product to the 'Products' directory of each application server's $INSTANCE_HOME. He then creates an instance of the 'ZEO Keepalive Tool' in the root of the ZMI on 'alpha', with the default ID ('keepalive_tool') and configures the tool's "Properties" tab as follows:
  'warning_interval' -- 90 (seconds)
  'error_interval' -- 180 (seconds)
  'refresh_interval' -- 0 (seconds)
The first two values are used only to highlight clients which are potentially lagged / blocked; they do not affect the protocol between the server and the clients. The third is used to generate a "meta refresh" on the tool page; the default value of '0' disables any auto-refresh.
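How the two highlighting thresholds might map onto the cell colors used in this walkthrough (white, orange, red, light blue) can be sketched as follows. The function name 'classify' is hypothetical, and the treatment of an age exactly equal to a threshold is an assumption; the tool's actual comparison may differ.

```python
def classify(age, warning_interval=90, error_interval=180):
    """Return the status-cell color for a client.

    `age` is the number of seconds since the client's last update,
    or None if the client has never updated.
    """
    if age is None:
        return "light blue"  # inactive: no timestamp yet ("n/a")
    if age >= error_interval:
        return "red"  # assumed boundary: at or past the error interval
    if age >= warning_interval:
        return "orange"  # assumed boundary: at or past the warning interval
    return "white"  # updated more recently than the warning interval
```

With Joe's settings, a client that last updated 30 seconds ago would be white, one at 100 seconds orange, and one at 200 seconds red.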
Joe then adds the following stanza to the zope.conf for 'alpha':
  %import Products.keepalive
  <clock-server>
    method /keepalive_tool/updateClient?key=alpha
    period 60
    user admin
    password qqq123
    host localhost
  </clock-server>
This stanza sets up a recurring call to the 'updateClient' method of the tool, passing a key identifying the current client. Each call will occur approximately 60 seconds after the previous one.
Because the tool's methods are protected by Zope security, Joe has to supply valid credentials for the mock requests; he could create a new user with only the privileges required to call the 'updateClient' method, but chooses to use his normal manager account while testing.
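For comparison, an external trigger (the cron-style approach the clock server makes unnecessary) would have to send an authenticated request of roughly the following shape. This is a sketch in modern Python 3, not a description of how the clock server works internally; the function name 'build_update_request' is hypothetical, HTTP Basic authentication is an assumption, and port 8081 is taken from the 'alpha' URL used in this walkthrough. The request is built but not sent.

```python
import base64
import urllib.request


def build_update_request(key, user, password, host="localhost", port=8081):
    """Build (but do not send) a request equivalent to one tick."""
    url = "http://%s:%d/keepalive_tool/updateClient?key=%s" % (host, port, key)
    # HTTP Basic credentials: base64 of "user:password" (assumed auth scheme).
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


request = build_update_request("alpha", "admin", "qqq123")
```

Passing the result to 'urllib.request.urlopen' would issue the call. A dedicated low-privilege user, as mentioned above, would be the safer choice of credentials for anything beyond testing.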
After restarting 'alpha', Joe sees the following line in the logfile, which tells him that the clock server is active:
2007-04-05T09:39:26 INFO ZServer Clock server for "/keepalive_tool/updateClient?key=alpha" started (user: admin, period: 60)
Joe then logs in as the 'admin' user and visits the tool's "Status" tab in the ZMI:

  http://localhost:8081/keepalive_tool/status.html

At first, the tool looks "empty", but once the first "tick" of the clock server has passed, Joe refreshes the page and sees an entry for the 'alpha' client, with a timestamp. The timestamp cell has a white background, indicating that the client has updated its state more recently than the 'warning_interval'.
Joe then adds similar stanzas to the zope.conf files for 'beta' and 'gamma', replacing the 'key=alpha' string as appropriate, and restarts the two servers. Impatient to see them appear, he then adds keys for each one in the tool's "Status" view. Initially, they show "n/a" for their timestamp, and have the cell colored light blue, indicating that they are inactive.
Joe tweaks the 'refresh_interval' on the tool's "Properties" tab to '10', and returns to the "Status" tab. There, he sees that 'beta' and 'gamma' are now active (white, with timestamps). Curious, Joe stops the application server on 'beta': after 90 seconds or so, its cell turns orange (the warning indicator). He restarts it, and then watches it return to its normal white state after the first tick.
Joe leaves the browser open to the tool's status window during the course of the day, and notes that the clients never seem to "hang": they stay "nominal" and white.
Joe decides to test his theory that the firewall is aborting idle connections after 5 minutes. In the zope.conf file for 'beta', he changes the period from 60 (one update per minute) to 600 (one update every ten minutes), and restarts the server. He then monitors the refreshing "Status" tab of the tool on the 'alpha' port, and sees the row for 'beta' turn first orange and then red. It returns to white after the first ten-minute "tick", because Joe has mistakenly been poking around in the ZMI of the 'beta' server, which kept its connection from going idle.
After leaving the 'beta' server idle for half an hour, he attempts to visit the tool's "Status" tab in the browser pointed at 'beta'; the request "hangs". Tailing the event log, he sees no evidence that the client believes itself to be disconnected from the server. However, he sees such evidence in the log for 'delta', which shows a disconnect from the client shortly after he quit working in it.
Joe reduces the clock server interval on 'beta' to 240 seconds, and restarts. He now sees that the storage server remains connected, even when he is not working actively on 'beta'.
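The reasoning behind Joe's choice of 240 seconds can be written down as a small check: the period must be shorter than the observed idle timeout, with some headroom. Both function names below are hypothetical, and the 20 percent margin is an assumption chosen so that the result matches Joe's value; it is not a recommendation from the product.

```python
def keeps_connection_alive(period, idle_timeout=300):
    """True if updates arrive more often than the idle timeout."""
    return period < idle_timeout


def suggested_period(idle_timeout, margin_percent=20):
    """Pick a period below the idle timeout, in whole seconds."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return idle_timeout * (100 - margin_percent) // 100
```

Under these assumptions, a period of 60 or 240 seconds passes the check against a 300-second timeout, while 600 does not, which matches what Joe observed.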
$Id: README.txt,v 1.2 2007/04/06 14:28:11 tseaver Exp $