[Openswan Users] UDP fragmentation in Linux

Marcus Leech mleech at nortel.com
Fri Mar 4 12:41:30 CET 2005


After my fiasco of last night (trying to use 2048-bit certs and having them utterly fail to make it across the network), I've started looking into Linux UDP fragmentation grossness.

It seems that even if you set the appropriate IP options (IP_MTU_DISCOVER to IP_PMTUDISC_DONT), UDP packets are getting badly munged if they exceed the local MTU.  It looks like they're simply getting *truncated*, which is so NOT according to spec that it makes me ill.  It's not like the Linux stack can't deal with sending fragments, either, since pings with sizes > local MTU get fragmented, sent across the internet, and apparently correctly reassembled at the other end.

But with UDP packets (NOT JUST PLUTO--I wrote some test code), the stack simply emits a single packet with the "more fragments" flag bit set in the IP header, the UDP length field set to the full datagram length, and the IP total length set to the MTU.  But the trailing fragment(s) never get emitted--just the first one.  This would cause a fragment reassembly timeout at the receiver.  This is so broken, I don't even know where to begin (splutter, grumble).  The behaviour goes back to at least 2.4.18, and is consistent in 2.6.11.  I'm surely not the first person to observe this behaviour and start ranting.

Another observation.  When I was testing this stuff purely-locally (on the same IP subnet), I could use long certificates, and nothing bad happened.  I can only assume that the Linux stack detects the "local subnettedness" and uses jumbograms--I don't have the patience/energy to go back and set it up again to run a tcpdump.

I'm suspecting that the IPTABLES code is screwing up in some way, since the kernel ip_output routines call NF_HOOK rather than passing directly to the routing-chosen hardware device.  Somewhere in all that netfilter goop, I think the output packet fragmentation code has become broken--at least for UDP.  Like I observed, ICMP ECHO packets get correctly fragmented when they exceed the local MTU.

I can't believe people put up with this.  It's so horribly, outrageously broken.  Now, I know that there are those who argue that IP fragmentation itself is *conceptually* broken, but the fact is that it's standard, and it largely works.  The exceptions are firewalls, which don't like to deal with reassembly, so they drop fragments on the floor as punishment.  But I think the community has slowly become confused about IP fragments--letting the poor behaviour of firewalls and similar IP machinery dictate a new, and profoundly-bad, de-facto standard.

I know that in IPv6 routers don't fragment at all--only the sending host can, via the Fragment extension header--but the minimum MTU is also larger (1280 bytes).

In the absence of app-layer fragmentation in IKE, how am I supposed to support larger (2048-bit) certificates?



