[Openswan Users] Dead Peer Detection restart causes tunnel to be established, but afterwards cannot ping from either side
the1geekman at gmail.com
Tue Oct 11 07:59:43 EDT 2011
Some progress of sorts. I ended up abandoning my OpenSwan to OpenSwan
test, as it came with its own set of weird setbacks. I think this has
to do with both the testing and live servers sharing the same default
gateway and public address space - maybe.
What I decided to do instead was use a different device as the remote
end point. We just got in a Cisco SRP527W for R&D. I decided to use
that for testing, to see if it made a difference. I wasn't sure if it
would help, since I thought perhaps the IPSec implementation would be
about the same, rendering the results inconclusive.
However, upon replacing the RV042 in our office with this new Cisco
SRP device, the tunnel seemed much more stable. I was able to issue
/etc/init.d/ipsec restart from our OpenSwan box multiple times
(checked logs to ensure OpenSwan initiated the connection - main mode,
quick mode or whatever), and each time it came back without issues.
Furthermore, after each restart of IPSec I could ping the Cisco remote
peer, while I could not ping our RV042 remote peer; both tunnels
terminate at the same OpenSwan box. So it seems to me that the issue
is somehow specific to the IPSec implementation on the RV042.
Unfortunately, I don't have any other VPN devices to test this theory
with, but it seems pretty solid: the moment I switch back to the RV042
in our office, I can reproduce the issue again by restarting OpenSwan.
In any case, here's a summary of some of the testing I did:
* I rebooted the SRP and it initiated main mode when it came back.
* I issued /etc/init.d/ipsec restart, OpenSwan initiated main mode.
* I issued another ipsec restart 10 minutes later, OpenSwan initiated
main mode. Tunnel worked. The ping I was running from 192.168.15.100
to 172.16.0.4 at the time only timed out twice while OpenSwan
restarted, then resumed working fine.
* After each restart I tried to ping 192.168.10.254, which is the
RV042 remote peer for a second tunnel. It failed both times after I
restarted OpenSwan, while I was able to ping 192.168.15.1 (the Cisco
SRP) right away both times.
* Only got the RV042 to work after doing a replace and waiting for the
RV042 to respond to main mode (as usual). Both times that I also tried
--up to initiate main mode from OpenSwan after a replace, it did not
work, as usual.
Now my main problem is that I can't just ditch the RV042 and be done
with it, because we have 20+ of these RV042s deployed already at
various customer sites. It would be a big job to replace them.
Potentially we could end up having to terminate all of them against
our OpenSwan box (right now they're terminating against another
So I really need to try and figure out what to do about it. Is there
anything else I can do to narrow down the problem? I am considering
lodging a fault of some kind with Linksys/Cisco Small Business, but my
guess is that they will jump at the chance to blame OpenSwan, so I
need to give as much proof as I can. I know the RV042 in our office,
at least, is running the latest firmware, and it was still having the
Also, I should say thanks for your suggestion Erich. We have an
existing Nagios infrastructure that we're planning on using to send
pings down the tunnel to the remote peer, every 5 minutes as part of
our VPN monitoring, so we'll get notified in the case of failures like
this. But obviously, if it happens every hour or two (like the logs
indicated) then that's a problem for us (and our users).
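A minimal sketch of what such a Nagios ping check could look like, assuming a Linux box with the iputils ping command. The function names and messages are made up for illustration; only the exit codes (0 for OK, 2 for CRITICAL) follow the usual Nagios plugin convention.

```python
import subprocess

# Usual Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def ping_ok(host, count=3, timeout_s=2):
    """Return True if at least one echo reply comes back (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_tunnel(host, ping=ping_ok):
    """Map the ping result to an (exit_code, message) pair for Nagios."""
    if ping(host):
        return OK, "TUNNEL OK - %s answers ping" % host
    return CRITICAL, "TUNNEL CRITICAL - %s does not answer ping" % host
```

A wrapper script would print the message and exit with the code, for example by calling check_tunnel("192.168.10.254") for the RV042 peer. The ping argument exists only so the mapping can be tried without a live tunnel.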
It's interesting you say it works well for you - are the interruptions
noticeable? We are considering putting in a cron to issue a --replace
on a tunnel once it stops responding, to fix this issue. The problem
is, the nature of the issue means that our OpenSwan box can't be the
one to initiate - otherwise the tunnel doesn't work. So that would
mean at random hourly intervals, the tunnel would stop for a few
seconds till it times out on the other end who then re-initiates. And
for people using RDP etc, I imagine this would be noticeable and
Would you say it happens that often for you? Or considerably less? I
wouldn't mind that workaround so much if the chances of it occurring
for each tunnel every few hours weren't so high. Unfortunately for me,
it may prove to be our last hope of getting it working at all with the
Thanks everyone for your continued help. Really appreciate it. Please
let me know if there's any other information I can give.
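For reference, the cron workaround mentioned above could be sketched roughly like this. It assumes a Linux box with iputils ping and the Openswan "ipsec auto --replace" command; the function names and the shape of the tunnels mapping are hypothetical. It deliberately issues no --up, since initiating from the OpenSwan side is what fails with the RV042, and leaves the remote end to re-initiate.

```python
import subprocess


def ping_ok(host):
    """Single ping via Linux iputils ping; True if a reply arrived."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ) == 0


def replace_conn(conn):
    """Reload one connection definition without initiating it."""
    return subprocess.call(["ipsec", "auto", "--replace", conn]) == 0


def heal_tunnels(tunnels, ping=ping_ok, replace=replace_conn):
    """Replace every connection whose far side is silent; return their names.

    `tunnels` maps a connection name to a host behind that tunnel.
    No --up is issued: the remote peer is left to re-initiate main mode.
    """
    replaced = []
    for conn, host in sorted(tunnels.items()):
        if not ping(host):
            replace(conn)
            replaced.append(conn)
    return replaced
```

Run from cron, this would be called with something like heal_tunnels({"rv042-office": "192.168.10.254"}), where "rv042-office" is a placeholder connection name. Tunnels that still answer ping are left alone, so healthy sites see no interruption.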
On Tue, Oct 11, 2011 at 9:17 AM, Erich Titl <erich.titl at think.ch> wrote:
> on 10.10.2011 14:30, Geekman wrote:
>> Hi All,
>> Decided to run a ping to the other side of a tunnel for a few hours
>> today, because obviously if the tunnels stop working intermittently
>> due to this issue, then that's a big step from only happening on
>> restart. It seemed highly likely given the presentation of the issue
>> I've seen thus far.
>> And on it goes like that. This was without any sort of intervention --
>> no restarts or anything, obviously this is kind of a deal breaker.
>> Unless I was to add a cron job to issue "--replace" on all tunnels
>> every 15 minutes... which feels so dirty to me.
> It is, but at least in my case something outside IPSEC appears to work
> better than DPD. Mind you I am running code from the stone age.
> I am running an ICMP echo through the tunnel every n seconds. If the echo
> fails, I ping every second.
> If it fails consecutively for a defined number of retries, I try to
> determine if I have access to the default router, if not, I restart the
> interface and the next loop starts.
> Else I try to restart the tunnel and check again.
> This has proven to be quite effective.
> Still a good implementation of DPD should perform a lot better than
> this, but then....
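One pass of the loop Erich describes above could be sketched as follows. This is only an interpretation of his description, not his actual code: the four callables (ping the peer, ping the default router, restart the interface, restart the tunnel) are placeholders the caller has to supply, and the retry count is an assumed default.

```python
import time


def watchdog_step(ping_peer, ping_gateway, restart_interface, restart_tunnel,
                  retries=5, sleep=time.sleep):
    """Run one pass of the watchdog and return a string saying what was done."""
    if ping_peer():
        return "ok"
    # First echo failed: probe once per second for up to `retries` more tries.
    for _ in range(retries):
        sleep(1)
        if ping_peer():
            return "recovered"
    # Peer is still silent: find out whether the local uplink is the problem.
    if not ping_gateway():
        restart_interface()
        return "interface restarted"
    # Uplink is fine, so the tunnel itself is restarted.
    restart_tunnel()
    return "tunnel restarted"
```

The caller would invoke watchdog_step every n seconds in an outer loop, so that after an interface or tunnel restart the next pass checks the peer again.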