Netalyzr says I have an IPv6 fragmentation problem.

broquea · February 19, 2012, 06:45:46 PM

I'm referring to the physical/tunnel interface setting, not the values that packets can get transported as via pmtud. If you end up with packets transported at 1464 mtu, the tunnel's interface setting isn't changed to 1464, it remains 1480. You are talking packets, not interfaces.

kasperd · February 19, 2012, 11:56:08 PM

Quote from: broquea on February 19, 2012, 06:45:46 PM
You are talking packets, not interfaces.

In the end, all that matters to the users is which packets the tunnel server is sending. What terminology is used on the tunnel server is not important. In my example the tunnel server was using an MTU of 1464 for processing the network traffic.

bicknell · February 27, 2013, 01:43:04 PM

I opened a ticket on this with Apple back when I originally posted here. Today I received an update from Apple:

Firmware version 7.6.3 has been released officially, please retest.

I'm actually up and running on an entirely different home gateway right now, so I'll have to both upgrade and put it back in service to test. Apple seems to think they have fixed something related to this, so perhaps fragments will pass properly now.

I'll try and report back by this weekend.

bimmerdriver · March 12, 2013, 10:55:05 PM

I ran this test as well and got the same result as the original poster:

Your system can not send or receive fragmented traffic over IPv6. The path between our system and your network has an MTU of 1480 bytes. The bottleneck is at IP address 2001:470:0:9b::2. The path between our system and your network does not appear to handle fragmented IPv6 traffic properly.

The MTU on my router (Sophos UTM) is 1500. My internet connection is VDSL2, not using PPPoE or PPPoA.

The test also gave this result:

Your browser successfully fetched a test image from our IPv6 server. Unfortunately, this is substantially slower than IPv4: it took 0.2 seconds longer to fetch the image over IPv6 compared to IPv4. Your browser prefers IPv6 over IPv4.

Is there anything I can or should do about either of these issues?

broquea · March 12, 2013, 11:11:43 PM

Your tunnel's MTU is 1480, which is the maximum for this tunnel service.

200ms slower rendering an image in a browser? Spotify tuned their system to start playing music within 285ms from pressing play because that is just above 250ms which is when the human brain processes sound as instantaneous (cool NPR story from years ago). I'd say an extra 200ms to render something will not be noticed.

bimmerdriver · March 13, 2013, 12:14:18 AM

Quote from: broquea on March 12, 2013, 11:11:43 PM
Your tunnel's MTU is 1480 which is maximum for using this tunnel service.

200ms slower rendering an image in a browser? Spotify tuned their system to start playing music within 285 seconds from pressing play because that is just above 250ms which is when the human brain processes sound as instantaneous (cool NPR story from years ago). I'd say an extra 200ms to render something will not be noticed.

Thanks very much.

kasperd · March 13, 2013, 08:33:46 AM

Quote from: broquea on March 12, 2013, 11:11:43 PM
Spotify tuned their system to start playing music within 285 seconds from pressing play

I assume you meant 285 ms.

Quote from: broquea on March 12, 2013, 11:11:43 PM
because that is just above 250ms which is when the human brain processes sound as instantaneous (cool NPR story from years ago).

When working on a sound processing project at university, we got some advice from a professional sound technician, who told us the threshold is 20ms.

When I am watching a DVD I can clearly feel something is wrong, if the sound is offset by 100ms. But I can't always pinpoint the direction in which the sound is off.

Quote from: broquea on March 12, 2013, 11:11:43 PM
I'd say an extra 200ms to render something will not be noticed.

My former colleagues would ridicule you for making such a statement. There we were working with deadlines around 200ms to complete the rendering of a webpage, including the time it took to download all resources needed to render the page.

200ms may not be enough to consciously notice that there was a delay. But subconsciously the users will notice the difference, and it will change their perception of the overall quality of the service.

broquea · March 13, 2013, 08:50:38 AM

Quote from: kasperd on March 13, 2013, 08:33:46 AM
I assume you meant 285 ms.

oops, I did, lemme go fix that

Quote
When working on a sound processing project at university, we got some advice from a professional sound technician, who told us the threshold is 20ms.

When I am watching a DVD I can clearly feel something is wrong, if the sound is offset by 100ms. But I can't always pinpoint the direction in which the sound is off.

Neat. VLC lets you tweak the audio delay, and I have definitely noticed some weirdness on a few videos out there, but it's almost always something absurd like 400-700ms off, which is far more obvious.

Quote
My former colleagues would ridicule you for making such a statement. There we were working with deadlines around 200ms to complete the rendering of a webpage, including the time it took to download all resources needed to render the page.

200ms may not be enough to consciously notice that there was a delay. But subconsciously the users will notice the difference, and it will change their perception of the overall quality of the service.

It probably depends on the age of the user as well :) A younger person firing on all neurons should be noticing it, but as we get older and more addled, everything will probably feel like it takes forever!

bicknell · March 20, 2013, 05:15:41 PM

I finally was able to upgrade to firmware 7.6.3 and rerun my tests. Same result, so I updated the Apple bug ticket with that information. I'm going to stay running on my Airport for the time being in case I need to retest again soon.

Whatever is going on here, I don't think it's fixed yet. I'm going to e-mail the Netalyzr folks and point them to this thread as well.

bicknell · March 25, 2013, 01:44:10 PM

I've been working with the Netalyzr folks on some testing, and we've already found some interesting details. Nothing specific to report back yet, but I think we have a complex interaction between various bits of infrastructure that are all working "in spec" but not in a way each likes.

I do want to point out one thing I found which is a bit of a surprise to me. It turns out most (all?) of the Linux kernels output UDP fragments in _reverse_ order. That is, say you had a 3500 byte packet to transmit which would become 1500 byte segment #1, 1500 byte segment #2, and 500 byte segment #3. They will go out on the wire 3, then 2, then 1.
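To make the example concrete, here is a minimal Python sketch of splitting a payload into fragments and emitting them last-first. It follows the simplified 1500/1500/500 split described above and ignores per-fragment headers and the 8-byte alignment that real IP fragmentation requires; the function names are made up for illustration.

```python
def split_into_fragments(total_len, max_frag_len):
    """Return (offset, length, more_fragments) tuples covering total_len bytes."""
    fragments = []
    offset = 0
    while offset < total_len:
        length = min(max_frag_len, total_len - offset)
        more = offset + length < total_len
        fragments.append((offset, length, more))
        offset += length
    return fragments


def reverse_send_order(fragments):
    """Order the fragments last-first, as observed on the wire in the test."""
    return list(reversed(fragments))


frags = split_into_fragments(3500, 1500)
wire_order = reverse_send_order(frags)
```

Here `wire_order` starts with the 500-byte fragment at offset 3000 and ends with the fragment at offset 0.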

At least with a couple of popular NAT implementations we tested that do in fact reassemble fragments, this causes them to be dropped. Segment #1 must be received first to create a state table entry for the rest of the packets.

I'm tempted to say that the emission of the fragments backwards is wrong, but the reality is even if they were sent in order there is the potential for them to be re-ordered during transport across the network. However, as a programmer I also get that having a NAT box store random fragments in the hopes the rest of the bits come in later is both a bit of a programming challenge and a potential DDOS vector.

I'm not a big Linux fan, so I'm wondering if anyone knows whether this reverse fragment behavior is pervasive across all kernels, or if there is any workaround.

kasperd · March 25, 2013, 02:58:33 PM

Quote from: bicknell on March 25, 2013, 01:44:10 PM
I do want to point out one thing I found which is a bit of a surprise to me. It turns out most (all?) of the Linux kernels output UDP fragments in _reverse_ order. That is, say you had a 3500 byte packet to transmit which would become 1500 byte segment #1, 1500 byte segment #2, and 500 byte segment #3. They will go out on the wire 3, then 2, then 1.

It's easier to reassemble packets that way. The last fragment is the only one, which can tell you how large the reassembled packet is going to be. So until you have received the last fragment, you cannot allocate memory for reassembly, and you'll have to keep packets separate.

As the packets are being reassembled, you need to keep track of which bytes of the final packet have been received, and which have not. Keeping track of that is much easier if you receive them in order. And since you need to start with the last, that order has to be reverse order. From that perspective, it makes a lot of sense to send fragments in reverse order.

For the sender it would actually be slightly simpler to send them in order: once you have sent the first fragment, you can overwrite the end of the first fragment with the IP header for the next fragment, which avoids using an extra buffer and doing additional copying.

Quote from: bicknell on March 25, 2013, 01:44:10 PM
At least with a couple of popular NAT implementations we tested that do in fact reassemble fragments this causes them to be dropped. Segment #1 must be received first to create a state table entry for the rest of the packets.

Fragmentation is known to be problematic. NAT is known to be problematic. Combining the two makes it even worse. What you are describing is not the only problem.

What you describe is understandable behaviour. After all, the port numbers are crucial to the operations a NAT device performs on packets, and the port numbers are only listed in the first fragment. However, the onus is on the NAT implementers to solve the problem, as fragmentation was standardized before NAT was invented. And a NAT is not allowed to reassemble the fragments. It will however need to store all the fragments in memory until it knows where to forward them to.

Quote from: bicknell on March 25, 2013, 01:44:10 PM
However, as a programmer I also get that having a NAT box store random fragments in the hopes the rest of the bits come in later is both a bit of a programming challenge and a potential DDOS vector.

That's something implementers of NAT devices have to deal with, if they don't want to be shipping a broken product. You can set aside a few MB of memory for storing fragments that cannot yet be forwarded due to the first fragment not having been seen yet. And a FIFO strategy would be sensible for discarding packets once memory is full, and in fact fragments must be discarded once they are a few minutes old, even if there isn't any memory pressure.
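A rough Python sketch of such a store, assuming a 4 MiB cap and a 120 second lifetime (both numbers are placeholders for "a few MB" and "a few minutes"); the `PendingFragments` class is hypothetical and the caller supplies the clock value:

```python
from collections import deque


class PendingFragments:
    """Bounded FIFO store for fragments that cannot be forwarded yet."""

    def __init__(self, max_bytes=4 * 1024 * 1024, max_age=120.0):
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.used = 0
        self.queue = deque()  # (arrival_time, payload), oldest at the left

    def add(self, payload, now):
        self.expire(now)
        self.queue.append((now, payload))
        self.used += len(payload)
        # FIFO eviction: drop the oldest fragments until we fit again.
        while self.used > self.max_bytes:
            _, old = self.queue.popleft()
            self.used -= len(old)

    def expire(self, now):
        """Discard fragments older than max_age, even without memory pressure."""
        while self.queue and now - self.queue[0][0] > self.max_age:
            _, old = self.queue.popleft()
            self.used -= len(old)


store = PendingFragments(max_bytes=3000, max_age=120.0)
store.add(b"x" * 1500, now=0.0)
store.add(b"y" * 1500, now=1.0)
store.add(b"z" * 1500, now=2.0)   # pushes out the fragment that arrived at 0.0
```

A real device would also index the stored fragments by source, destination and fragment ID so that they can be released once the matching first fragment shows up; that lookup is left out here.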

Quote from: bicknell on March 25, 2013, 01:44:10 PM
if there is any work around.

My best recommendation is to avoid NAT and avoid fragmentation.

The problem you described would be even harder to solve in IPv6 than it is in IPv4 due to the possibility of extension headers moving the port numbers to a different position. There isn't even a guarantee, that the port number is in the first fragment. If there are so many extension headers, that the transport header is in a later segment, you need all the packets from the very first until the one with the transport header in order to figure out the port number. You may even need all of those fragments to figure out what the protocol is.
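To illustrate why the port numbers are harder to find, here is a small Python sketch that walks the extension header chain. It assumes `data` holds the bytes following the fixed 40-byte IPv6 header of an unfragmented packet or of the fragment with offset 0, and it only knows the hop-by-hop, routing, fragment and destination options headers; `find_ports` is a made-up helper name.

```python
# Next Header values handled by this sketch.
HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS = 0, 43, 44, 60
TCP, UDP = 6, 17


def find_ports(first_next_header, data):
    """Return (protocol, src_port, dst_port), or None if the transport
    header is not reachable within this fragment."""
    next_header = first_next_header
    pos = 0
    while next_header in (HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS):
        if pos + 2 > len(data):
            return None  # chain continues in a later fragment
        if next_header == FRAGMENT:
            hdr_len = 8  # the fragment header has a fixed size
        else:
            hdr_len = (data[pos + 1] + 1) * 8  # length in 8-byte units, minus one
        next_header = data[pos]
        pos += hdr_len
    if next_header not in (TCP, UDP) or pos + 4 > len(data):
        return None
    src = int.from_bytes(data[pos:pos + 2], "big")
    dst = int.from_bytes(data[pos + 2:pos + 4], "big")
    return next_header, src, dst


# Destination options header (8 bytes, PadN padding) followed by a UDP header.
dest_opts = bytes([UDP, 0, 1, 4, 0, 0, 0, 0])
udp = (53).to_bytes(2, "big") + (50321).to_bytes(2, "big") + bytes(4)
ports = find_ports(DEST_OPTS, dest_opts + udp)
truncated = find_ports(DEST_OPTS, dest_opts)  # transport header not in this piece
```

When the chain is longer than the first fragment, the function returns `None`, which is exactly the case where a middlebox cannot decide anything from a single fragment.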

The good news is, that with IPv6 you don't need NAT. And IPv6 has improved the situation regarding a lot of other fragmentation related issues.

bicknell · March 26, 2013, 07:49:28 AM

Quote from: kasperd on March 25, 2013, 02:58:33 PM
Quote from: bicknell on March 25, 2013, 01:44:10 PM
I do want to point out one thing I found which is a bit of a surprise to me. It turns out most (all?) of the Linux kernels output UDP fragments in _reverse_ order. That is, say you had a 3500 byte packet to transmit which would become 1500 byte segment #1, 1500 byte segment #2, and 500 byte segment #3. They will go out on the wire 3, then 2, then 1.
It's easier to reassemble packets that way. The last fragment is the only one, which can tell you how large the reassembled packet is going to be. So until you have received the last fragment, you cannot allocate memory for reassembly, and you'll have to keep packets separate.

As the packets are being reassembled, you need to keep track of which bytes of the final packet have been received, and which have not. Keeping track of that is much easier if you receive them in order. And since you need to start with the last, that order has to be reverse order. From that perspective, it makes a lot of sense to send fragments in reverse order.

Your statement makes no sense to me.

The first frame is the only frame with an IP header, which includes the length field. The rest have an offset inside the packet. So in the hypothetical stream I mentioned above, the receiver would get:

Packet #3: Fragment, offset 3000, len 500.
Packet #2: Fragment, offset 1500, len 1500.
Packet #1: IP Header, length 3500, fragment, offset 0 len 1500.

You're suggesting the receiver can guess from packet #3 this is a 3500 byte frame, but that is not correct. Consider this stream for a 4000 byte packet:

Packet #4: Fragment, offset 3500, len 500.
Packet #3: Fragment, offset 3000, len 500.
Packet #2: Fragment, offset 1500, len 1500.
Packet #1: IP Header, length 4000, fragment, offset 0 len 1500.

Except that packet #4 is dropped in flight. The receiver receives the same packet #3, but cannot guess the memory size until packet #1 is received!

More importantly, from a security perspective, when fragments are received in reverse order the receiving host must store all of them in memory to see if a matching header comes in. This enables a trivial DDoS: send fragments to the host and it will run out of memory! If packet #1 is received first, its header can be matched against ACLs, including dynamic state entries, and allowed or discarded. Subsequent fragments can then be saved or discarded upon reception based on whether they match an initial packet that has already passed the security checks.

I believe from both a programming perspective and a security perspective things are significantly easier if the fragmented frames arrive in order, rather than any out of order sequence, including the reversed sequence Linux uses.

bicknell · March 26, 2013, 08:00:05 AM

I finally got off my duff and did some serious testing, and what I discovered is interesting. I've attached a diagram, but basically I inserted an ethernet switch between the Time Capsule and my cable modem so that I could capture both sides. I then ran this query on the EndHost:

EndHost:~ bicknell$ dig +norecurse +bufsize=2048 txt txtpadding-1800.netalyzr.icsi.berkeley.edu @ipv6-node.netalyzr.icsi.berkeley.edu
;; Warning: ID mismatch: expected ID 28893, got 25185
;; Warning: query response not set

; <<>> DiG 9.8.3-P1 <<>> +norecurse +bufsize=2048 txt txtpadding-1800.netalyzr.icsi.berkeley.edu @ipv6-node.netalyzr.icsi.berkeley.edu
;; global options: +cmd
;; connection timed out; no servers could be reached

The ID mismatch is the first interesting thing, but it's not actually the problem. First let's look at a tcpdump on the EndHost itself, to see how this makes it out on the wire:

09:20:34.195533 IP6 2001:470:e07d:1:54ea:2859:ef83:1f92.50321 > 2607:f740:b::f93.53: 28893 [1au] TXT? txtpadding-1800.netalyzr.icsi.berkeley.edu. (71)
09:20:34.250004 IP6 2607:f740:b::f93.53 > 2001:470:e07d:1:54ea:2859:ef83:1f92.50321: 25185 updateM [b2&3=0x6420] [14646a] [12596q] [8242n] [12336au][|domain]
09:20:39.196723 IP6 2001:470:e07d:1:54ea:2859:ef83:1f92.50321 > 2607:f740:b::f93.53: 28893 [1au] TXT? txtpadding-1800.netalyzr.icsi.berkeley.edu. (71)
09:20:44.197888 IP6 2001:470:e07d:1:54ea:2859:ef83:1f92.50321 > 2607:f740:b::f93.53: 28893 [1au] TXT? txtpadding-1800.netalyzr.icsi.berkeley.edu. (71)

We see the initial query, a packet triggering the ID mismatch, and then two repeats of the query. Note that there is no response to the query of any kind.

Moving on to the Sniffer, we get the rest of the details:

09:20:35.397019 IP 74-93-155-149-memphis-tn.hfc.comcastbusiness.net > tserv2.ash1.he.net: IP6 2001:470:e07d:1:54ea:2859:ef83:1f92.50321 > ipv6-node.netalyzr.icsi.berkeley.edu.domain: 28893 [1au] TXT? txtpadding-1800.netalyzr.icsi.berkeley.edu. (71)
09:20:35.444275 IP tserv2.ash1.he.net > 74-93-155-149-memphis-tn.hfc.comcastbusiness.net: IP6 ipv6-node.netalyzr.icsi.berkeley.edu > 2001:470:e07d:1:54ea:2859:ef83:1f92: frag (1448|360)
09:20:35.450971 IP tserv2.ash1.he.net > 74-93-155-149-memphis-tn.hfc.comcastbusiness.net: IP6 ipv6-node.netalyzr.icsi.berkeley.edu.domain > 2001:470:e07d:1:54ea:2859:ef83:1f92.50321: 25185 updateM [b2&3=0x6420] [14646a] [12596q] [8242n] [12336au][|domain]
09:20:40.398153 IP 74-93-155-149-memphis-tn.hfc.comcastbusiness.net > tserv2.ash1.he.net: IP6 2001:470:e07d:1:54ea:2859:ef83:1f92.50321 > ipv6-node.netalyzr.icsi.berkeley.edu.domain: 28893 [1au] TXT? txtpadding-1800.netalyzr.icsi.berkeley.edu. (71)
09:20:40.442996 IP tserv2.ash1.he.net > 74-93-155-149-memphis-tn.hfc.comcastbusiness.net: IP6 ipv6-node.netalyzr.icsi.berkeley.edu > 2001:470:e07d:1:54ea:2859:ef83:1f92: frag (1432|376)
09:20:40.443275 IP tserv2.ash1.he.net > 74-93-155-149-memphis-tn.hfc.comcastbusiness.net: IP6 ipv6-node.netalyzr.icsi.berkeley.edu > 2001:470:e07d:1:54ea:2859:ef83:1f92: frag (0|1432) domain > 50321: 28893*- 1/1/2 TXT[|domain]
09:20:45.399317 IP 74-93-155-149-memphis-tn.hfc.comcastbusiness.net > tserv2.ash1.he.net: IP6 2001:470:e07d:1:54ea:2859:ef83:1f92.50321 > ipv6-node.netalyzr.icsi.berkeley.edu.domain: 28893 [1au] TXT? txtpadding-1800.netalyzr.icsi.berkeley.edu. (71)
09:20:45.443485 IP tserv2.ash1.he.net > 74-93-155-149-memphis-tn.hfc.comcastbusiness.net: IP6 ipv6-node.netalyzr.icsi.berkeley.edu > 2001:470:e07d:1:54ea:2859:ef83:1f92: frag (1432|376)
09:20:45.444585 IP tserv2.ash1.he.net > 74-93-155-149-memphis-tn.hfc.comcastbusiness.net: IP6 ipv6-node.netalyzr.icsi.berkeley.edu > 2001:470:e07d:1:54ea:2859:ef83:1f92: frag (0|1432) domain > 50321: 28893*- 1/1/2 TXT[|domain]

These of course are the tunnel encapsulated packets. Here we see the query, the errant packet triggering the ID mismatch, but then a response! The response is fragmented, and the fragments are received in reverse order. We see first offset 1432 length 376, and second offset 0 length 1432.
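The `frag (offset|length)` notation in the capture can be checked mechanically. A small Python sketch, using the tails of the two response lines captured at 09:20:40; the regular expression is an assumption based on how these particular lines look, not a general tcpdump parser:

```python
import re

# Matches the "frag (offset|length)" part of the lines shown above.
FRAG_RE = re.compile(r"frag \((\d+)\|(\d+)\)")


def parse_frag(line):
    """Return (offset, length) from a tcpdump line, or None if not a fragment."""
    match = FRAG_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


captured = [
    "IP6 ipv6-node.netalyzr.icsi.berkeley.edu > 2001:470:e07d:1:54ea:2859:ef83:1f92: frag (1432|376)",
    "IP6 ipv6-node.netalyzr.icsi.berkeley.edu > 2001:470:e07d:1:54ea:2859:ef83:1f92: frag (0|1432) domain > 50321: 28893*- 1/1/2 TXT[|domain]",
]

frags = [parse_frag(line) for line in captured]
offsets = [offset for offset, _ in frags]
arrived_in_reverse = offsets == sorted(offsets, reverse=True)
total = max(offset + length for offset, length in frags)
```

The two fragments add up to 1808 bytes, which would be consistent with an 8-byte UDP header plus the 1800 bytes suggested by the `txtpadding-1800` name, although that last part is a guess.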

The good news here is that TunnelBroker is off the hook, the fragments are making it down my tunnel. :D

The bad news is that they are not making it past my Time Capsule. I'm working with the Netalyzr folks to see if there is anything we can do to get the fragments in order to see if that makes a difference before going back to update my bug report with Apple. I suspect though that many firewalls will block all fragments (very bad), and that many will block the fragments received out of order (somewhat bad). If people can replicate this test with different hardware it would be appreciated.

kasperd · March 26, 2013, 01:08:54 PM

Quote from: bicknell on March 26, 2013, 07:49:28 AM
Your statement makes no sense to me.

I assumed you knew how fragmentation works. My bad.

Each fragment has an IP header. The length field in the IP header always indicates the length of the fragment. Fragments are not numbered, but they carry an indication of their offset within the reassembled packet. Additionally there is one bit indicating if this is the last fragment. The first fragment is recognized by having offset 0.

A packet which is not fragmented is simply a fragment which is simultaneously the first and the last fragment. (In the case of IPv6, the fragment header can be left out on such a fragment, saving 8 bytes of space.)

No fragment contains a field indicating the size of the reassembled packet. The size is computed by adding fragment offset and fragment length of the last fragment.
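A minimal Python sketch of that bookkeeping, assuming each fragment is handed over as an offset, its payload, and the more-fragments flag. The `Reassembler` class is hypothetical and skips timeouts, memory limits and any real overlap policy.

```python
class Reassembler:
    """Collects the fragments of one packet."""

    def __init__(self):
        self.pieces = {}       # offset -> payload bytes
        self.total_len = None  # unknown until the last fragment arrives

    def add(self, offset, payload, more_fragments):
        self.pieces[offset] = payload
        if not more_fragments:
            # Only the last fragment reveals the reassembled size.
            self.total_len = offset + len(payload)

    def result(self):
        """Return the reassembled bytes, or None while anything is missing."""
        if self.total_len is None:
            return None
        data = bytearray()
        for offset in sorted(self.pieces):
            if offset != len(data):
                return None  # gap or overlap before this fragment
            data += self.pieces[offset]
        return bytes(data) if len(data) == self.total_len else None


r = Reassembler()
r.add(3000, b"c" * 500, more_fragments=False)
size_after_last_fragment = r.total_len
r.add(1500, b"b" * 1500, more_fragments=True)
r.add(0, b"a" * 1500, more_fragments=True)
packet = r.result()
```

With the last fragment arriving first, the total size is known immediately. If the last fragment is lost, `total_len` stays `None` however many other fragments arrive.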

Quote from: bicknell on March 26, 2013, 07:49:28 AM
More importantly, from a security perspective when received in reverse order the receiving host must store all fragments received in memory to see if a header comes in that matches. This enables a trivial DDOS, send fragments to the host and it will run out of memory!

This is a well-known issue, which you have to keep in mind when implementing an IP stack. Just limit the amount of memory used for reassembly and use a FIFO strategy to discard fragments when memory needs to be used for a newly arrived fragment.

Quote from: bicknell on March 26, 2013, 07:49:28 AM
If packet #1 is received first it's header can be matched against ACL's, including dynamic state entries, and allowed or discarded. Subsequent fragments can then be saved or discarded upon reception based on if they match an initial packet that has already passed the security checks.

Ordering was optimized for the receiver not the firewall. Additionally with IPv6 you can't always apply ACL's based on any one fragment. IPv6 packets can be constructed in a way, where even figuring out the port number being used requires all the fragments.

There is a simple solution though. Just let all the fragments pass through the firewall until you see the first fragment of the packet. Then when you see the first fragment you decide if the packet is permitted or not. If the packet is permitted, you let it through. If the packet is rejected, the firewall sends an ICMPv6 error code based on the first fragment. Other fragments, which have already passed through the firewall, will be discarded by the destination host.

Quote from: bicknell on March 26, 2013, 07:49:28 AM
I believe from both a programming perspective and a security perspective things are significantly easier if the fragmented frames arrive in order, rather than any out of order sequence, including the reversed sequence Linux uses.

I think your idea about what is easier would change, if you tried to implement fragment reassembly. As for the security implications, it is a mistake to consider what ordering is easiest to deal with. Your security needs to work regardless of which order an attacker sends packets in.

kasperd · March 26, 2013, 01:44:00 PM

Quote from: bicknell on March 26, 2013, 08:00:05 AM
basically I inserted an ethernet switch between the Time Capsule and my cable modem so that I could capture both sides.

That's not supposed to be possible to do with a switch. I keep an old hub around just in case I need to do that sort of debugging.

Quote from: bicknell on March 26, 2013, 08:00:05 AM
EndHost:~ bicknell$ dig +norecurse +bufsize=2048 txt txtpadding-1800.netalyzr.icsi.berkeley.edu @ipv6-node.netalyzr.icsi.berkeley.edu
;; Warning: ID mismatch: expected ID 28893, got 25185
;; Warning: query response not set

; <<>> DiG 9.8.3-P1 <<>> +norecurse +bufsize=2048 txt txtpadding-1800.netalyzr.icsi.berkeley.edu @ipv6-node.netalyzr.icsi.berkeley.edu
;; global options: +cmd
;; connection timed out; no servers could be reached

The ID mismatch is the first interesting thing, but it's not actually the problem.

That ID mismatch is a symptom of a quite mysterious bug on their side. Notice how the incorrect ID being received is always 25185. That packet is not DNS at all. What it contains is ASCII data. I captured one of those and found this 31 character string in the packet "bad 1496 2001:470:0:69::2 1480 ".
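This reading is easy to verify: interpreting the first 12 bytes of that string as a DNS header reproduces the numbers tcpdump printed for the stray packet (ID 25185, bytes 2&3 = 0x6420, 12596q, 14646a, 8242n, 12336au). A short Python sketch using only the string quoted above:

```python
import struct

payload = b"bad 1496 2001:470:0:69::2 1480 "

# A DNS header is six big-endian 16-bit fields:
# ID, flags, question count, answer count, authority count, additional count.
ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
    "!6H", payload[:12]
)

# The same 12 bytes shown as text, two characters per header field.
as_text = [payload[i:i + 2].decode("ascii") for i in range(0, 12, 2)]
```

The ID 25185 is simply the two characters "ba" (0x62, 0x61) read as a 16-bit number, and the other header fields are the following characters of the string.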

It showed up at the exact place in the sequence of packets, where the first fragment of the DNS reply would have been expected. The second fragment of the DNS reply had already been received. From the offset on the second fragment I can see the size of the first fragment, which would have been too large for the tunnel MTU, which explains why the first fragment did not arrive.
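The arithmetic can be sketched in Python, assuming the standard 40-byte IPv6 header and 8-byte fragment header with no other extension headers, and using the `frag (1448|360)` and `frag (1432|376)` lines from the capture earlier in the thread:

```python
IPV6_HEADER = 40      # fixed IPv6 header, bytes
FRAGMENT_HEADER = 8   # IPv6 fragment extension header, bytes
TUNNEL_MTU = 1480     # MTU of the tunnel


def first_fragment_packet_size(second_fragment_offset):
    """IPv6 packet size of a first fragment that ends where the second begins."""
    return IPV6_HEADER + FRAGMENT_HEADER + second_fragment_offset


first_attempt = first_fragment_packet_size(1448)  # from frag (1448|360)
retry = first_fragment_packet_size(1432)          # from frag (1432|376)
first_attempt_fits = first_attempt <= TUNNEL_MTU
retry_fits = retry <= TUNNEL_MTU
```

The first attempt works out to 1496 bytes and the retry to 1480 bytes, which lines up with the "1496" and "1480" in the ASCII string.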

Quote from: bicknell on March 26, 2013, 08:00:05 AM
The bad news is that they are not making it past my Time Capsule.

Then the Time Capsule is at fault. And Netalyzr is correct to report that you have a fragmentation problem.

Quote from: bicknell on March 26, 2013, 08:00:05 AM
I'm working with the Netalyzr folks to see if there is anything we can do to get the fragments in order to see if that makes a difference

I can hack together a DNS server sending replies in various orders, if you need to test that.
