[Openswan dev] netlink_get: XFRM_MSG_ACQUIRE message

Thu Dec 9 13:49:16 EST 2010

On Thu, 9 Dec 2010, Avesh Agarwal wrote:

> I am seeing several messages related to "netlink_get: XFRM_MSG_ACQUIRE 
> message" in the log (a small part of the log given below). The ipsec 
> connection is in tunnel mode between openswan (as a client) and Cisco ipsec 
> gateway. When Openswan starts phase 2 renegotiation, and if it fails (may be 
> due to if network goes down), Openswan keeps getting XFRM_MSG_ACQUIRE 
> messages and keeps initiating phase 2. This becomes a major concern because 
> very soon the no. of state entries reaches to upto 25k, and in one case 
> (where a firmware related to a wireless card reloads in between, and network 
> goes down. There is also a bug related to this 658121), we see Openswan 
> crashing at  "st->st_new_iv_len = h->hash_digest_len;" in function 
> init_phase2_iv in the file ikev1_quick.c. I am using openswan-2.6.31, and I 
> saw some patches that went into  2.6.29 related to "NETKEY: Fix for spurious 
> %hold netlink-acquires" . Not sure if these things are related, but seems 
> similar at the first look, so I would appreciate if you have any thoughts 
> about it. I looked about these messages XFRM_MSG_ACQUIRE in google, but did 
> not find much info about them.

Yes. This is a bug. But in our opinion of the kernel NETKEY code. They
should not send pluto an aquire message every jiffy :P Note that KLIPS
has some exponential back-off for those and those a first+last packet
caching to mitigate these.

I believe klips also looks at which SA the packet was hitting and ensures
that subsequent packets that would hit the same SA would not trigger
more spurious duplicate aquires. This relates to the narrow/broad hold
code in kernel_netlink.c I believe.

and I have a suspicion we might get more then one per packet as well,
or else how would you end up with 25k ones.

I'm not sure if the crasher is due to a DoS against pluto by the kernel. Perhaps
we should build in some protection against that, but the ultimate solution is
to not bother pluto with 25k aquire messages within a few seconds in which we
cannot even complete a single SA re-establishment (even if there was no outage)

Paul

> ----------------------------------------------------
> Dec  9 01:36:19 localhost pluto[10555]: | *received kernel message
> Dec  9 01:36:19 localhost pluto[10555]: | netlink_get: XFRM_MSG_ACQUIRE 
> message
> Dec  9 01:36:19 localhost pluto[10555]: | add bare shunt 0x7fdab9eb9b78 
> 10.11.10.243/32:68 --17--> 172.16.52.28/32:67 => %hold 0    %acquire-netlink
> Dec  9 01:36:19 localhost pluto[10555]: initiate on demand from 
> 10.11.10.243:68 to 172.16.52.28:67 proto=17 state: fos_start because: acquire
> Dec  9 01:36:19 localhost pluto[10555]: | find_connection: looking for policy 
> for connection: 10.11.10.243:17/68 -> 172.16.52.28:17/67
> Dec  9 01:36:19 localhost pluto[10555]: | find_connection: conn "test" has 
> compatible peers: 10.11.10.243/32 -> 0.0.0.0/0 [pri: 16842760]
> Dec  9 01:36:19 localhost pluto[10555]: | find_connection: comparing best 
> "test" [pri:16842760]{0x7fdab9eae2b8} (child none) to "test" 
> [pri:16842760]{0x7fdab9eae2b8} (child none)
> Dec  9 01:36:19 localhost pluto[10555]: | find_connection: concluding with 
> "test" [pri:16842760]{0x7fdab9eae2b8} kind=CK_PERMANENT
> Dec  9 01:36:19 localhost pluto[10555]: | assign hold, routing was 
> prospective erouted, needs to be erouted HOLD
> Dec  9 01:36:19 localhost pluto[10555]: | eroute_connection replace %trap 
> with broad %hold eroute 10.11.10.243/32:0 --0-> 172.16.0.0/16:0 => %hold 
> (raw_eroute)
> Dec  9 01:36:19 localhost pluto[10555]: | raw_eroute result=1
> Dec  9 01:36:19 localhost pluto[10555]: | adding specific host-to-host bare 
> shunt
> Dec  9 01:36:19 localhost pluto[10555]: | delete narrow %hold eroute 
> 10.11.10.243/32:68 --17-> 172.16.52.28/32:67 => %hold (raw_eroute)
> Dec  9 01:36:19 localhost pluto[10555]: | raw_eroute result=1
> Dec  9 01:36:19 localhost pluto[10555]: | delete bare shunt 0x7fdab9eb9b78 
> 10.11.10.243/32:68 --17--> 172.16.52.28/32:67 => %hold 0    %acquire-netlink
> Dec  9 01:36:19 localhost pluto[10555]: | duplicating state object #1
> Dec  9 01:36:19 localhost pluto[10555]: | creating state object #453 at 
> 0x7fdab9ec3c88
> Dec  9 01:36:19 localhost pluto[10555]: | processing connection test
> Dec  9 01:36:19 localhost pluto[10555]: | ICOOKIE:  56 9d 20 b4  70 91 16 67
> Dec  9 01:36:19 localhost pluto[10555]: | RCOOKIE:  dd 29 c4 55  f2 2a 3a 24
> Dec  9 01:36:19 localhost pluto[10555]: | state hash entry 4
> Dec  9 01:36:19 localhost pluto[10555]: | inserting state object #453 on 
> chain 4
> Dec  9 01:36:19 localhost pluto[10555]: | inserting event EVENT_SO_DISCARD, 
> timeout in 0 seconds for #453
> Dec  9 01:36:19 localhost pluto[10555]: | event added at head of queue
> Dec  9 01:36:19 localhost pluto[10555]: | kernel_alg_esp_enc_ok(12,0): 
> alg_id=12, alg_ivlen=8, alg_minbits=128, alg_maxbits=256, res=0, ret=1
> Dec  9 01:36:19 localhost pluto[10555]: | 
> kernel_alg_esp_enc_keylen():alg_id=12, keylen=16
> Dec  9 01:36:19 localhost pluto[10555]: | kernel_alg_esp_auth_keylen(auth=2, 
> sadb_aalg=3): a_keylen=20
> Dec  9 01:36:19 localhost pluto[10555]: "test" #453: initiating Quick Mode 
> PSK+ENCRYPT+TUNNEL+PFS+UP+AGGRESSIVE+IKEv2ALLOW {using isakmp#1 
> msgid:401ded70 proposal=AES(12)_128-SHA1(2)_160 
> pfsgroup=OAKLEY_GROUP_MODP1024}
>