[Openswan Users] UDP fragmentation in Linux
Marcus Leech
mleech at nortel.com
Fri Mar 4 12:41:30 CET 2005
After my fiasco of last night (trying to use 2048-bit certs and having
them utterly fail to make it across
the network), I've started looking into Linux UDP fragmentation grossness.
It seems that even if you set the appropriate socket option
(IP_MTU_DISCOVER to IP_PMTUDISC_DONT), UDP
packets are getting badly munged if they exceed the local MTU. It
looks like they're simply getting *truncated*,
which is so NOT according to spec that it makes me ill. It's not like
the Linux stack can't deal with sending
fragments, either, since pings with sizes > local MTU get fragmented,
sent across the internet, and apparently
correctly reassembled at the other end.
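
For anyone wanting to reproduce this, here is a minimal sketch of the socket
setup described above. It is not the original test code. The numeric constants
are the Linux values from <linux/in.h>, written out by hand on the assumption
that Python's socket module may not export them. The helper names
open_udp_socket_allowing_fragmentation and send_datagram are made up for
illustration.

```python
import socket
import sys

# Linux values from <linux/in.h>; defined by hand because Python's socket
# module is not guaranteed to export them (assumption, Linux only).
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DONT = 0  # never set DF; the stack is expected to fragment


def open_udp_socket_allowing_fragmentation():
    """Return a UDP socket with path-MTU discovery turned off (on Linux)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    if sys.platform.startswith("linux"):
        sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DONT)
    return sock


def send_datagram(sock, dest, size):
    """Send one UDP datagram with `size` payload bytes; return bytes accepted."""
    payload = bytes(i % 251 for i in range(size))  # recognisable pattern
    return sock.sendto(payload, dest)
```

Pointing send_datagram at a host beyond the local subnet with a size above the
local MTU (say 3000 bytes on Ethernet), while watching the wire with tcpdump,
should show whether the trailing fragments are emitted. Over loopback the MTU
is normally far larger, so no fragmentation would be expected there.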
But with UDP packets (NOT JUST PLUTO--I wrote some test code), the stack
simply emits a single packet with
the "more fragments" flag bit set in the IP header, the UDP length
field set to the full (unfragmented) datagram length, and the IP length set to
the MTU. But the trailing fragment(s) never get emitted--just the
first one. This would cause a fragment reassembly
timeout at the receiver. This is so broken, I don't even know where
to begin (splutter, grumble). The behaviour goes back to at least
2.4.18, and is consistent in 2.6.11. I'm surely not the first person
to observe this behaviour and start ranting.
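
For comparison, here is a rough sketch of what a correctly fragmenting sender
would be expected to put on the wire. It assumes a 20-byte IP header with no
options and an 8-byte UDP header. The function name expected_fragments is
hypothetical, and this only does the arithmetic; it does not inspect real
traffic.

```python
def expected_fragments(udp_payload_len, mtu=1500, ip_header_len=20):
    """List (offset_bytes, ip_total_length, more_fragments) for each fragment."""
    udp_header_len = 8
    remaining = udp_header_len + udp_payload_len  # IP payload to carry
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the IP fragment-offset field counts in 8-byte units.
    per_fragment = (mtu - ip_header_len) // 8 * 8
    fragments = []
    offset = 0
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        remaining -= chunk
        fragments.append((offset, ip_header_len + chunk, remaining > 0))
        offset += chunk
    return fragments


# A 3000-byte UDP payload over a 1500-byte MTU:
# [(0, 1500, True), (1480, 1500, True), (2960, 68, False)]
print(expected_fragments(3000))
```

The behaviour described above corresponds to seeing only the first tuple
(offset 0, IP length equal to the MTU, "more fragments" set) and never the
other two.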
Another observation. When I was testing this stuff purely-locally (on
the same IP subnet), I could use long
certificates, and nothing bad happened. I can only assume that the
Linux stack detects the "local subnettedness"
and uses jumbograms--I don't have the patience/energy to go back and
set it up again to run a tcpdump.
I suspect that the iptables code is screwing up in some way, since
the kernel ip_output routines call
NF_HOOK, rather than passing directly to the routing-chosen hardware
device. Somewhere in all
that netfilter goop, I think that the output packet fragmentation code
has become broken--at least for UDP.
Like I observed, ICMP ECHO packets get correctly fragmented when they
exceed the local MTU.
I can't believe people put up with this. It's so horribly, outrageously
broken. Now, I know that there are
those that argue that IP fragmentation itself is *conceptually*
broken, but the fact is that it's standard,
and it largely works. The exceptions are firewalls, which don't like
to deal with reassembly, so they
drop fragments on the floor as punishment. But I think that the
community has slowly become confused
about IP fragments--letting the poor behaviour of firewalls and
similar IP machinery dictate a new, and
profoundly-bad de-facto standard.
I know that in IPv6, routers don't fragment at all (only the sending host
can). But the minimum MTU is also larger.
In the absence of app-layer fragmentation in IKE, how am I supposed to
support larger (2048-bit)
certificates?