As I mentioned in the previous entry (“DTrace TCP, UDP providers”), the DTrace tcp provider can be used to carry out a number of useful latency measurements. We have included some scripts that help with these measurements, and I’ll try to explain them here. Another useful reference is Bryan and Brendan’s CEC2006 presentation, though some minor details of the TCP provider arguments have changed since then – consult the updated docs when in doubt.

In the last entry, we used the tcp:::send and tcp:::receive probe points exclusively. There are also a number of probe points relating to TCP connection establishment:

tcp:::connect-request – initial SYN sent as part of three-way handshake to establish active connection
tcp:::connect-refused – initial SYN was refused and we receive a RST|ACK “reset” segment
tcp:::connect-established – we send a final ACK, completing establishment of the active connection
tcp:::accept-refused – initial SYN was received, but nothing is listening; a RST|ACK “reset” segment is sent
tcp:::accept-established – final ACK has been received and the connection is established passively for the listener
The connect-* probes relate to the initiating or active side of the TCP connection, while the accept-* probes are for the listening or passive side of the connection.
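As a quick way to see these probes in action, a one-liner along the following lines (a sketch – it assumes the tcpsinfo argument args[3], which the scripts below also use) counts established connections by probe name and remote address:

# dtrace -n 'tcp:::connect-established,tcp:::accept-established { @[probename, args[3]->tcps_raddr] = count(); }'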

Connection latency

The tcpconnlat.d script measures the latency incurred for the TCP three-way handshake to complete for outbound connections. It uses a quantized aggregation – in other words, connection latencies are grouped into power-of-two buckets: n to 2n, 2n to 4n, and so on. Below are the results of running the script in one window while running a number of wgets to various websites in another:

# dtrace -s /usr/demo/dtrace/tcpconnlat.d
dtrace: script '/usr/demo/dtrace/tcpconnlat.d' matched 2 probes
^C

  Connect Latency (ns)                                64.210.72.58
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
        67108864 |                                         0        

  Connect Latency (ns)                                74.125.45.103
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
        67108864 |                                         0        

  Connect Latency (ns)                                192.9.164.55
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
       134217728 |                                         0        

  Connect Latency (ns)                                192.9.164.72
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
       134217728 |                                         0        

The script works by measuring the time between the firing of the tcp:::connect-request probe (when the initial SYN of the three-way handshake is sent) and the tcp:::connect-established probe (when the final ACK is sent, indicating the connection has been established). The script uses the unique-per-connection connection ID (cs_cid) to record the starting timestamp for later comparison.

tcp:::connect-request
{
        start[args[1]->cs_cid] = timestamp;
}

tcp:::connect-established
/start[args[1]->cs_cid] /
{
        @latency["Connect Latency (ns)", args[3]->tcps_raddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}
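
Relatedly, the tcp:::connect-refused probe listed earlier could be used alongside this to count connection attempts that were refused, per remote host. A minimal sketch, assuming the tcpsinfo remote address (tcps_raddr) is available at that probe as it is at connect-established:

tcp:::connect-refused
{
        @refused["Connections refused", args[3]->tcps_raddr] = count();
}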

First-byte latency

First-byte latency is the time from connection establishment to when data is first received. The script, tcp1stbyte.d, is geared towards the initiating (active) side of the connection, since it times from tcp:::connect-established to tcp:::receive firing, but it could be rewritten to be more general by replacing “tcp:::connect-established {}” with “tcp:::state-change /args[3]->tcps_state == TCP_STATE_ESTABLISHED/ {}” – see the sketch following the script below.

# dtrace -s /usr/demo/dtrace/tcp1stbyte.d
dtrace: script '/usr/demo/dtrace/tcp1stbyte.d' matched 16 probes
^C

  1st Byte Latency (ns)                               89.207.56.140
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
        67108864 |                                         0        

  1st Byte Latency (ns)                               66.102.9.99
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
       134217728 |                                         0        

Again, the script is straightforward:

tcp:::connect-established
{
        start[args[1]->cs_cid] = timestamp;
}

tcp:::receive
/start[args[1]->cs_cid] && (args[2]->ip_plength - args[4]->tcp_offset) > 0/
{
        @latency["1st Byte Latency (ns)", args[2]->ip_saddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}
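
Here is a minimal sketch of the more general version suggested above: it simply substitutes the tcp:::state-change clause for the tcp:::connect-established clause, so that both actively and passively established connections are timed; the receive clause is unchanged:

tcp:::state-change
/args[3]->tcps_state == TCP_STATE_ESTABLISHED/
{
        start[args[1]->cs_cid] = timestamp;
}

tcp:::receive
/start[args[1]->cs_cid] && (args[2]->ip_plength - args[4]->tcp_offset) > 0/
{
        @latency["1st Byte Latency (ns)", args[2]->ip_saddr] =
            quantize(timestamp - start[args[1]->cs_cid]);
        start[args[1]->cs_cid] = 0;
}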

Round-trip latency

Here’s a script that is not included in the demo scripts; even though there are edge cases it probably won’t cover, it’s a useful demonstration of some additional features of the TCP provider. It measures the average round-trip time from the sending of a TCP segment to its acknowledgement, on a per-host basis. Here’s the output:

# dtrace -s tcp_rtt.d
^C
Remote host               TCP Avg RTT(ns) StdDev          NumBytes   NumSegs
127.0.0.1                 255635          162214          0          4
::1                       2718049         3127457         0          4
66.102.9.99               25552916        27326796        230        12
89.207.56.140             28347521        57559039        110        28
74.125.105.87             30258841        0               0          1

We see that the IPv4 localhost RTT is smallest, followed by IPv6 localhost, then the remote hosts. It is interesting that the IPv6 localhost RTT is so much greater. The script itself is a bit more complex:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
{
	snxt[args[1]->cs_cid, args[3]->tcps_snxt] = timestamp;
	@bytesrtt[args[2]->ip_daddr] =
	    sum(args[2]->ip_plength - args[4]->tcp_offset);
	@countrtt[args[2]->ip_daddr] = count();
}

tcp:::receive
/ snxt[args[1]->cs_cid, args[4]->tcp_ack] /
{
	@meanrtt[args[2]->ip_saddr] =
	    avg(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	@stddevrtt[args[2]->ip_saddr] =
	    stddev(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
	snxt[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}

END
{
	printf("%-25s %-15s %-15s %-10s %-10s\n",
	    "Remote host", "TCP Avg RTT(ns)", "StdDev", "NumBytes", "NumSegs");
	printa("%-25s %@-15d %@-15d %@-10d %@-10d\n",
	    @meanrtt, @stddevrtt, @bytesrtt, @countrtt);
}

We use an associative array, snxt, which stores timestamps keyed by a particular connection and the next sequence number that will be sent on it. The tcpsinfo value tcps_snxt is the next sequence number that will be used for sending, so at the tcp:::send probe point it is also the acknowledgement number we expect to receive back for that segment: recall that a TCP segment is acknowledged with the next sequence number the receiver expects. So when we receive a segment for that connection whose acknowledgement number matches, we can use the original send timestamp to determine the round-trip acknowledgement time. This is then averaged across all connections to the remote host to get a mean acknowledgement RTT.

Note that this scheme will break down for selective acknowledgement (among other cases), but it is a neat demonstration of the kind of measurement that is possible, and represents an easy way to find slow peer systems. Suspiciously high round-trip times would merit further investigation, especially where machines are on the same network segment and should therefore have broadly similar latencies. It should also be noted that lumping all the ports together (as this script does) leads to high standard deviations for the average round-trip times, so a drilldown by port would likely be merited.
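
As a rough sketch of such a drilldown – assuming the remote port (tcps_rport in the tcpsinfo argument) is the appropriate port to break out on – the aggregation could be keyed by remote port as well as remote address:

tcp:::send
{
        snxt[args[1]->cs_cid, args[3]->tcps_snxt] = timestamp;
}

tcp:::receive
/snxt[args[1]->cs_cid, args[4]->tcp_ack]/
{
        @meanrtt[args[2]->ip_saddr, args[3]->tcps_rport] =
            avg(timestamp - snxt[args[1]->cs_cid, args[4]->tcp_ack]);
        snxt[args[1]->cs_cid, args[4]->tcp_ack] = 0;
}

END
{
        printf("%-25s %-8s %s\n", "Remote host", "Port", "Avg RTT(ns)");
        printa("%-25s %-8d %@d\n", @meanrtt);
}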