[Openswan dev] Problem and proposed bug fix to Openswan when the system's time jumps forward multiple decades

Thu Aug 4 03:44:45 EDT 2011

Hi,

See inline.

On 08/04/2011 04:37 PM, D. Hugh Redelmeier wrote:
> | From: Antony Richards<arichards at cybertec.com.au>
>
> | I came across a bug in Openswan.  The box starts up.  Openswan connects.  The
> | date is set back to 1970, NTPD is started and sets the date to be 2011.
>
> That sounds like a bug in ntpd.  How does the date get set to 1970?
>
>    
I explicitly turned off ntpd and set the date back to 1970.  The reason 
was that it quickly reproduced an issue I observed when an Linux 
embedded device I'm developing is first powered on with a pre-loaded 
configuration.

> My recollection (unverified) that ntpd is supposed to use adjtime(3)
> and thus the time is monotonically increasing.
>    
> | After this, the command "ipsec auto --status" blocks indefinitely.
>
> Did pluto log the message "time moved backwards %ld seconds" (from
> now())?
>    
Yes.

> Pluto's now() tries to keep time monotonic.  The method is to use the
> variable "delta" to accumulate the amount of backward time observed
> and to add "delta" to each subsequent result.  It was expected that
> delta would be small.
>
> If the time goes backwards a whole bunch, and then formward a similar
> amount, I guess that adding delta could cause overflow.
>
> (32-bit time_t should be able to hold 136 years worth of time.  It is
> signed, and 0 represents Jan 1, 1970.  So the representation wraps
> arround (overflows) in 2038.  If time flies backwards 40 years, and
> then forwards 40 years, the correction mechanism will add 40 years to
> the current time (2011) and thus overflow.)
>
> now() is called frequently.  I imagine that there is a bound on how
> long it could have been since the last call.  Proposed change: If the
> observed time change is way too large in the forward direction, use
> that to trim the value of delta.
>
> So: (currently) if time flies 40 years into the past, then delta += 40 years.
>
> (proposed) If time flies 40 years into the future, delta -= 40 years.
>    
That would solve the problem also, but using uptime gives a simpler 
solution.
What would the minimum jump in time be before adjusting delta?
> It would be good for now() to check if n + delta results in an
> overflow.  What to do then???  passert failure?
>
>    
If all references in timer.c use changes in time then (and the changes 
are small, eg less than 1 year) then the overflow can be safely handled 
by doing unsigned differences.  Casting the result to a signed number 
will be correct for before, now or after.

eg
unsigned int now;
unsigned int scheduled;

if (((int)(scheduled - now)) < 0)
{
   act on the event.
}

I believe this works for signed numbers also (I just did a paper 
exercise to prove it to myself).

So next_event() and handle_timer_event() need to handle time this way.  
This should be done irrespective of a change to now().

I can create a new diff with this change in it - be about a day or so 
(I'll ensure it works with timer overflow).

> | Analysis (based on 2.4 and verified in 2.6) is that in timer.c:next_event(),
> | the expression *evlist->ev_time - now()* is negative (ev_time is 40 years
> | ago).  Hence next_event() returns 0, causing the event to be triggered.
> |
> | But when handle_timer_event() is called, the expression *now()<  ev->ev_time*
> | evaluates to TRUE (due to long wrapping???), meaning that the event is not
> | removed from the event list.
> |
> | Looking at /server.c/, the code becomes stuck in the loop ... in
> | /call_server()/  next_event() returns 0.  This then sets ndes to 0 (and
> | osw_select() is NOT called).  But when handle_timer_event() is called, the
> | function believes that there are no events to handle, so the event queue does
> | not change.  So when call_server() loops it calls next_event() which returns 0
> | again.
> |
> | This blocks the thread from calling the select() function to handle all the
> | file descriptors. Hence *ipsec auto --status* never gets a reply to its
> | request, and will block forever.  Likewise, all communications into Openswan
> | never gets a reply.
> |
> | A proposed solution (that I have tested successfully in version 2.4) is to
> | change now() to use the systems uptime instead.  This solution also has the
> | benefit that timing jitter is not introduced by people setting the time on a
> | device or programs such as ntpd.
>
> What is the cost of a sysinfo call?  now() is called a lot.
>    
Behinds the scene the timer is just a counter that would be copied 
over.  The man page shows that about 14 counters/variables would need to 
be copied per call.

The kernel function do_sysinfo() does seem to have some overhead.
> sysinfo(2) is Linux-specific according to the manpage.
>
> The semantics of clock_gettime(3) using CLOCK_MONOTONIC might be
> better.  The tv_sec field of the resulting timespec should do (it's a
> time_t).

Agree.  Any timer that that cannot be set works.  Looking into the 
kernel code, it looks like less overhead than do_sysinfo()

I'll send a new diff file soon covering these comments (ie changing the 
maths in timer.c and using sys_clock_gettime()).

Thanks,
Antony.

>
> _______________________________________________
> Dev mailing list
> Dev at openswan.org
> http://lists.openswan.org/mailman/listinfo/dev
>    

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.openswan.org/pipermail/dev/attachments/20110804/a5468906/attachment-0001.html