Regular outages on CHI, 2000-3000ms latency spikes on DAL, where to next?

Started by Cabal696, October 11, 2012, 07:33:46 PM

Cabal696

Over the past couple weeks, I've had no shortage of issues with the two tunnel locations I've tried - CHI was dropping all packets for 1-2 minutes every 30-60 minutes, and DAL appears to be a little more reliable, but still has huge latency spikes (sometimes as high as 5000ms). My local IPv4 connectivity is low latency and has great reliability, and on a good day I'm 30-40ms away from DAL or CHI. Any suggestions on a tunnel endpoint that doesn't have these problems? I was thinking of trying DEN next, but would love to stop hopping. Thanks!

cholzhauer

I'm seeing the same problems on Chicago.  I submitted a ticket to ipv6@he.net but haven't received a response yet.  Are your problems at a certain time or randomly throughout the day?

Cabal696

I've since moved off of CHI to DAL, but I recall seeing it throughout the day (though probably more frequently during prime time). Here are some traceroutes from DAL this morning at 8:58 AM CDT. The issues with DAL definitely happen throughout the day and still seem to be ongoing; I've also been logging RTTs to pin down exactly when they hit (see the loop after the traceroutes).

$ traceroute tserv1.dal1.he.net
traceroute to tserv1.dal1.he.net (216.218.224.42), 64 hops max, 52 byte packets
1  my.ipv4.internal.ip (10.x.y.1)  0.519 ms  0.825 ms  0.972 ms
2  10.21.128.1 (10.21.128.1)  11.172 ms  12.082 ms  17.932 ms
3  pflucmts-a.tex.sta.suddenlink.net (173.219.254.100)  9.970 ms  11.461 ms  9.997 ms
4  tyrm-10g.tex.sta.suddenlink.net (173.219.254.80)  16.919 ms  14.491 ms  16.978 ms
5  dllsosr01-10gex1-1.tex.sta.suddenlink.net (66.76.30.30)  41.918 ms  43.518 ms  43.933 ms
6  10gigabitethernet3-1.core1.dal1.he.net (206.223.118.37)  39.956 ms  38.418 ms  40.046 ms
7  tserv1.dal1.he.net (216.218.224.42)  1271.919 ms  1181.312 ms  1111.936 ms

$ traceroute6 www.google.com
traceroute6 to www.google.com (2001:4860:4002:802::1013) from 2001:470:1f0f:4c9:ddf8:b5dc:638b:xxyy, 64 hops max, 12 byte packets
1  my.ipv6.internal.ip  1.072 ms  0.621 ms  1.079 ms
2  cabal696-1.tunnel.tserv8.dal1.ipv6.he.net  1130.162 ms  1147.000 ms  1134.909 ms
3  gige-g2-14.core1.dal1.he.net  2205.946 ms  2200.439 ms  2236.879 ms
4  10gigabitethernet5-4.core1.atl1.he.net  2249.905 ms  2235.772 ms  2254.987 ms
5  2001:4860:1:1::1b1b:0:15  2209.837 ms  2264.293 ms  2296.927 ms
6  2001:4860::1:0:489  2352.871 ms  2265.484 ms
   2001:4860::1:0:5db  2269.795 ms
7  2001:4860::8:0:2f03  2277.280 ms
   2001:4860::8:0:2f04  2258.171 ms  2309.173 ms
8  2001:4860::8:0:2c9c  2320.792 ms  2304.195 ms
   2001:4860::8:0:2c9d  2315.812 ms
9  2001:4860::1:0:57f  2286.128 ms  2353.240 ms  2289.924 ms
10  2001:4860:0:1::275  2347.948 ms  2394.118 ms  2343.947 ms
11  2001:4860:8000:b:92e6:baff:fe61:546a  2334.973 ms  2349.178 ms  2273.897 ms
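
For what it's worth, here's roughly the loop I've been leaving running so the drop/spike windows show up with timestamps. Just a rough sketch: substitute whichever tunnel server you're testing, and adjust the -W reply timeout for your platform.

# Log a timestamped RTT (or "timeout") every 10 seconds so outage windows are easy to spot in the log.
# -W is the reply timeout: 2 seconds on Linux; BSD/OS X ping takes milliseconds there, so use -W 2000 instead.
$ while true; do printf '%s ' "$(date '+%H:%M:%S')"; ping -c 1 -W 2 tserv1.dal1.he.net | grep 'time=' || echo timeout; sleep 10; done | tee dal-ping.log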

realdreams

Are you experiencing any issues other than ping? The router in your traceroute is likely to deprioritize ICMP replies.

kasperd

Quote from: realdreams on October 14, 2012, 05:37:50 PM
The router in your traceroute is likely to deprioritize ICMP replies
Such behaviour would not make any sense. When a router has to produce an ICMP packet itself (an error message or an echo reply), that usually costs CPU cycles, and some routers don't have many CPU cycles to spare for that.

But as long as a router just has to forward an ICMP packet, there is no additional cost compared to other types of packets. A router capable of forwarding packets in hardware can forward echo requests in one direction and TTL-expired messages in the other without ever realizing that those were ICMP packets.
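
If you want to rule out anything ICMP-specific, one quick check is to compare an ICMP ping with a plain TCP handshake over the tunnel (a rough sketch; ipv6.google.com is just a convenient target here, assuming it resolves for you):

# ICMP round trip over the tunnel
$ ping6 -c 5 ipv6.google.com
# TCP handshake time over the same path; if this is also up in the seconds, ICMP handling cannot be the explanation
$ curl -6 -s -o /dev/null -w 'TCP connect: %{time_connect}s\n' http://ipv6.google.com/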

The symptoms are not consistent with any sort of ICMP rate limiting. They are, however, consistent with a bufferbloat problem on the path to or from the tunnel server. But that is not the only possible explanation.
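
If you want more data points, letting mtr run against the tunnel server for a while will show where along the path the latency first appears and whether it comes with packet loss (again just a sketch; substitute your own tunnel server):

# 60 probe cycles; sustained multi-second RTT at the tserv hop with little or no loss points more toward queueing than toward rate limiting
$ mtr --report --report-cycles 60 tserv1.dal1.he.net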