[Openswan Users] openswan MTU problems

Wed Oct 19 12:41:27 CEST 2005

Hello.

We've been using openswan here for a project for the last few months and 
everything was fine until we had an incident last week.
After investigating for a few days, we've reconstructed a very precise 
sequence of events, but have been unable to determine what caused it. 
I'm posting the details hoping that someone has an idea or can suggest 
a line of investigation that we've missed.

We have two servers running jboss and pgpool (a postgres frontend used 
to replicate a postgres database across multiple servers) and a postgres 
database on each. We encrypt everything between them using transport 
mode ESP with a pre-shared key. The servers are directly connected via 
Ethernet. Everything was going fine until a certain point in time, when 
several things happened almost simultaneously:

1. Openswan rekeyed a SA on one server.
2. We lost communications in one direction: no new messages got through 
from the server that had just rekeyed, but old flows were still getting 
through.
3. After a time, the pgpool on the machine that was NOT rekeying 
detected the other node as dead, while the one rekeying did not notice 
anything (we think it was operating normally).
4. We began receiving messages like "Oct 12 19:19:26 bradbury kernel: 
pmtu discovery on SA ESP/e5fb8ff6/c0a85015" for approximately 15 minutes.
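
For anyone reading the log line above, here is a small sketch that 
decodes the SA identifier. It assumes the layout is 
protocol/SPI/IPv4 address, with both numbers printed in hexadecimal, 
which is how we read the message; the function name 
decode_pmtu_message is ours and purely illustrative.

```python
import ipaddress
import re

# Assumed layout of the kernel message: "<PROTO>/<SPI hex>/<IPv4 address hex>".
SA_PATTERN = re.compile(
    r"pmtu discovery on SA (\w+)/([0-9a-fA-F]{8})/([0-9a-fA-F]{8})"
)


def decode_pmtu_message(line):
    """Return (protocol, spi, address) from a 'pmtu discovery' line, or None."""
    match = SA_PATTERN.search(line)
    if match is None:
        return None
    proto, spi_hex, addr_hex = match.groups()
    # Interpret the last field as a 32-bit IPv4 address in network order.
    address = ipaddress.IPv4Address(int(addr_hex, 16))
    return proto, "0x" + spi_hex.lower(), str(address)


line = ("Oct 12 19:19:26 bradbury kernel: "
        "pmtu discovery on SA ESP/e5fb8ff6/c0a85015")
print(decode_pmtu_message(line))
```

Under that assumption, c0a85015 reads as 192.168.80.21, which can be 
matched against the SPIs and peers listed by the IPsec tooling.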

Seeing the log messages, we suspected MTU problems. We sent test 
packets hoping to find an ESP packet longer than the MTU. However, 
after sending test packets of every possible length from 1000 to 1500, 
we found nothing unusual (the sender started emitting two ESP packets 
when necessary). We tested with both ping and a custom UDP sending 
program, so that we could check both raw sockets and UDP.
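
To make the UDP test concrete, here is a minimal sketch of that kind of 
length sweep. It is not our actual program: the function name 
sweep_udp_sizes, its parameters and the payload format are made up for 
illustration, and it only sends datagrams, so the ESP packets still 
have to be observed with a sniffer on the wire.

```python
import socket


def sweep_udp_sizes(host, port, first=1000, last=1500):
    """Send one UDP datagram per payload size in [first, last].

    Returns a dict mapping each size to None on success, or to the
    errno reported by the local stack if the send failed.
    """
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for size in range(first, last + 1):
            # Start the payload with its own length so that a capture
            # on the receiving side shows which test packet it is.
            payload = str(size).encode("ascii").ljust(size, b".")
            try:
                sock.sendto(payload, (host, port))
                results[size] = None
            except OSError as exc:
                results[size] = exc.errno
    return results
```

Calling sweep_udp_sizes(peer_address, some_port) with the other node's 
address and an unused port covers the same 1000 to 1500 range. Note 
that these are UDP payload sizes; the IP and UDP headers (28 bytes for 
IPv4 without options) and the ESP overhead come on top.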

We're pretty sure it was a combination of improbable events that might 
even involve iptables and the kernel, but after lots of tests we're 
running out of hypotheses. FYI, we suspected a misuse of iptables but 
found no evidence of anyone executing anything related to iptables on 
those nodes.

We're using Debian Sarge Linux, kernel 2.6.8 (using Linux's own IPsec 
kernel modules) and openswan 2.2.0. If anybody has an idea regarding the 
causes of these events, or even some things to check, we'd like to hear 
about it. If you need logs to check any ideas you have, just ask and 
we'll post them.

Thanks in advance.

Ernesto Alvarez.
Network administrator.