[Openswan dev] Problem and proposed bug fix to Openswan when the system's time jumps forward multiple decades
Antony Richards
arichards at cybertec.com.au
Thu Aug 4 03:44:45 EDT 2011
Hi,
See inline.
On 08/04/2011 04:37 PM, D. Hugh Redelmeier wrote:
> | From: Antony Richards<arichards at cybertec.com.au>
>
> | I came across a bug in Openswan. The box starts up. Openswan connects. The
> | date is set back to 1970, NTPD is started and sets the date to be 2011.
>
> That sounds like a bug in ntpd. How does the date get set to 1970?
>
>
I explicitly turned off ntpd and set the date back to 1970. The reason
was that it quickly reproduced an issue I observed when an Linux
embedded device I'm developing is first powered on with a pre-loaded
configuration.
> My recollection (unverified) that ntpd is supposed to use adjtime(3)
> and thus the time is monotonically increasing.
>
> | After this, the command "ipsec auto --status" blocks indefinitely.
>
> Did pluto log the message "time moved backwards %ld seconds" (from
> now())?
>
Yes.
> Pluto's now() tries to keep time monotonic. The method is to use the
> variable "delta" to accumulate the amount of backward time observed
> and to add "delta" to each subsequent result. It was expected that
> delta would be small.
>
> If the time goes backwards a whole bunch, and then formward a similar
> amount, I guess that adding delta could cause overflow.
>
> (32-bit time_t should be able to hold 136 years worth of time. It is
> signed, and 0 represents Jan 1, 1970. So the representation wraps
> arround (overflows) in 2038. If time flies backwards 40 years, and
> then forwards 40 years, the correction mechanism will add 40 years to
> the current time (2011) and thus overflow.)
>
> now() is called frequently. I imagine that there is a bound on how
> long it could have been since the last call. Proposed change: If the
> observed time change is way too large in the forward direction, use
> that to trim the value of delta.
>
> So: (currently) if time flies 40 years into the past, then delta += 40 years.
>
> (proposed) If time flies 40 years into the future, delta -= 40 years.
>
That would solve the problem also, but using uptime gives a simpler
solution.
What would the minimum jump in time be before adjusting delta?
> It would be good for now() to check if n + delta results in an
> overflow. What to do then??? passert failure?
>
>
If all references in timer.c use changes in time then (and the changes
are small, eg less than 1 year) then the overflow can be safely handled
by doing unsigned differences. Casting the result to a signed number
will be correct for before, now or after.
eg
unsigned int now;
unsigned int scheduled;
if (((int)(scheduled - now)) < 0)
{
act on the event.
}
I believe this works for signed numbers also (I just did a paper
exercise to prove it to myself).
So next_event() and handle_timer_event() need to handle time this way.
This should be done irrespective of a change to now().
I can create a new diff with this change in it - be about a day or so
(I'll ensure it works with timer overflow).
> | Analysis (based on 2.4 and verified in 2.6) is that in timer.c:next_event(),
> | the expression *evlist->ev_time - now()* is negative (ev_time is 40 years
> | ago). Hence next_event() returns 0, causing the event to be triggered.
> |
> | But when handle_timer_event() is called, the expression *now()< ev->ev_time*
> | evaluates to TRUE (due to long wrapping???), meaning that the event is not
> | removed from the event list.
> |
> | Looking at /server.c/, the code becomes stuck in the loop ... in
> | /call_server()/ next_event() returns 0. This then sets ndes to 0 (and
> | osw_select() is NOT called). But when handle_timer_event() is called, the
> | function believes that there are no events to handle, so the event queue does
> | not change. So when call_server() loops it calls next_event() which returns 0
> | again.
> |
> | This blocks the thread from calling the select() function to handle all the
> | file descriptors. Hence *ipsec auto --status* never gets a reply to its
> | request, and will block forever. Likewise, all communications into Openswan
> | never gets a reply.
> |
> | A proposed solution (that I have tested successfully in version 2.4) is to
> | change now() to use the systems uptime instead. This solution also has the
> | benefit that timing jitter is not introduced by people setting the time on a
> | device or programs such as ntpd.
>
> What is the cost of a sysinfo call? now() is called a lot.
>
Behinds the scene the timer is just a counter that would be copied
over. The man page shows that about 14 counters/variables would need to
be copied per call.
The kernel function do_sysinfo() does seem to have some overhead.
> sysinfo(2) is Linux-specific according to the manpage.
>
> The semantics of clock_gettime(3) using CLOCK_MONOTONIC might be
> better. The tv_sec field of the resulting timespec should do (it's a
> time_t).
Agree. Any timer that that cannot be set works. Looking into the
kernel code, it looks like less overhead than do_sysinfo()
I'll send a new diff file soon covering these comments (ie changing the
maths in timer.c and using sys_clock_gettime()).
Thanks,
Antony.
>
> _______________________________________________
> Dev mailing list
> Dev at openswan.org
> http://lists.openswan.org/mailman/listinfo/dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.openswan.org/pipermail/dev/attachments/20110804/a5468906/attachment-0001.html
More information about the Dev
mailing list