[Openswan dev] Problem and proposed bug fix to Openswan when the system's time jumps forward multiple decades

Fri Aug 5 04:25:22 EDT 2011

Hi,

See inline.

On 08/05/2011 03:54 AM, D. Hugh Redelmeier wrote:
> | From: Antony Richards<arichards at cybertec.com.au>
>
> | On 08/04/2011 04:37 PM, D. Hugh Redelmeier wrote:
> |>  | From: Antony Richards<arichards at cybertec.com.au>
> |>
> |>  | I came across a bug in Openswan.  The box starts up.  Openswan connects.
> |>  | The
> |>  | date is set back to 1970, NTPD is started and sets the date to be 2011.
> |>
> |>  That sounds like a bug in ntpd.  How does the date get set to 1970?
> |>
> |>
> | I explicitly turned off ntpd and set the date back to 1970.  The reason was
> | that it quickly reproduced an issue I observed when an Linux embedded device
> | I'm developing is first powered on with a pre-loaded configuration.
>
> What were you observing in your embedded system?  Did pluto log the
> message about time moving backwards?  Is your system behaving
> reasonably?  Does it time-travel at great rates?
>
>    
Unfortunately the log files I got from the field did not include the 
Openswan logs.  All I knew was that new boxes, with the real time 
clock's capacitor probably flat, resulted in the command *ipsec auto 
--status* blocking.  My discussions internally lead me to the conclusion 
that the date must have been changed after Openswan was started.

The procedure above (changing the year from 2011 to 1970 then back 2011) 
caused the observed issue in the field to happen in my testbed.


> | That would solve the problem also, but using uptime gives a simpler solution.
>
> Yes.
>
> What I don't remember is whether we use results of now() in logging,
> implicitly assuming that the values are meaningful to the reader (i.e.
> now they are the best-guess at wall-clock time, but with another clock
> that might not be the case).
>
>    
I don't know the code base well enough to be able to comment on this.  A 
quick look at gtags-cscope shows that the function time() is called when 
clock time is needed.
> If you switch to a monotonic clock, the monotonicity-creating feature
> of now() ought to be replaced by a monotonicity-checking feature.
>
> | What would the minimum jump in time be before adjusting delta?
>
> If the time jumps past all pending event, logically it could be
> clamped at the time of the last event, I would think.  I think that
> there is almost always a pending event.  If there is not, then perhaps
> the "delta" could be zeroed anyway: nothing is affected by delta in
> that case.
>    
That would work, except the jump would also cause all the events to 
prematurely trigger.
> |>  It would be good for now() to check if n + delta results in an
> |>  overflow.  What to do then???  passert failure?
> |>
> |>
> | If all references in timer.c use changes in time then (and the changes are
> | small, eg less than 1 year) then the overflow can be safely handled by doing
> | unsigned differences.  Casting the result to a signed number will be correct
> | for before, now or after.
> |
> | eg
> | unsigned int now;
> | unsigned int scheduled;
> |
> | if (((int)(scheduled - now))<  0)
> | {
> |   act on the event.
> | }
> |
> | I believe this works for signed numbers also (I just did a paper exercise to
> | prove it to myself).
> |
> | So next_event() and handle_timer_event() need to handle time this way.  This
> | should be done irrespective of a change to now().
> |
> | I can create a new diff with this change in it - be about a day or so (I'll
> | ensure it works with timer overflow).
>
> I don't understand.
>
> The "official" type is time_t.  I'd not want to monkey with that
> without a good reason.  And int certainly isn't a good choice for a
> replacement.  int might be only 16 bits.
>
> If we were willing to play with types, using unsigned long instead of
> time_t gives us more headroom (136 years from 1970 instead of 68 years
> from 1970).  But that isn't the right fix.
>    
I was trying to give an example of how it is safe to allow timer 
overflow as long as signed differences between timestamps are always 
used.  Of course the variable for time has to remain time_t.

Maybe a better example is if you have an event list based on the second 
hand on your watch (ie from 0 to 59).  All calculations are using modulo 
60 arithmetic.  (In this example we can only schedule an event up to 30 
seconds in advance, and observe old events up to 30 seconds in the past).

If an event is scheduled at 5 seconds and it is now 55 seconds (ie -5 
modulo 60, if we are using signed number), should it be triggered?  As 
its 10 seconds in the future or 50 seconds in the past means that the 
event is not to be triggered yet (remember we only have a 30 second 
horizon).

If time is unsigned then the value of (/scheduled < now/) is (5 < /55/) 
which is true.  Hence do trigger the event. *Incorrect behavour.*

If time is signed then the value of  (/scheduled < now/) is (/5 < -5/) 
which is false. Hence don't trigger the event.  Correct behavour.

But the value of (/(signed)(scheduled - now) < 0/) is (/signed(5 - 55) < 
0/) is (/signed(-50) < 0/) is (/10 < 0/) which is false.  Hence don't 
trigger the event.  Correct behavour.


*Another example.* That is an event is scheduled at 29 seconds and it is 
now 31 seconds (-29).  The event should now be triggered.

If time is unsigned then the value of (/scheduled < now/) is (29 < /31/) 
which is true.  Hence do trigger the event.  Correct behavour.

If time is signed then the value of  (/scheduled < now/) is (/29 < -29/) 
which is false. Hence do not trigger the event. *Incorrect behavour*.

But the value of (/(signed)(scheduled - now) < 0/) is (/signed(29 - 31) 
< 0/) is (/signed(-2) < 0/) is (/-2 < 0/) which is true.  Hence do 
trigger the event.  Correct behavour.

So the two examples above show that an event scheduler can handle 
overflow as long as differences (cast to a signed value) are used. *But 
direct comparisons cannot be used.*

To confirm if this issue affected the issue, I modified 2.4 Openswan to 
have the original now() code, but it had the change to timer.c below.  I 
jumped time back and forth many times, and the command *ipsec auto 
--status* no longer blocked.

So the original infinite loop I observed has been broken.  But when 
doing big jumps in time, the items on the event loop will be either 
delays (if time jumps backwards) or triggered prematurely (time jumps 
forwards).  To solve this issue a monotonic increasing clock is needed.

I've appended a new diff to use clock_systime() and use differences when 
comparing times.

Regards,
Antony.




> | I'll send a new diff file soon covering these comments (ie changing the maths
> | in timer.c and using sys_clock_gettime()).
>
> I'm not yet 100% sold on this whole approach.
> _______________________________________________
> Dev mailing list
> Dev at openswan.org
> http://lists.openswan.org/mailman/listinfo/dev
>    

*New diff file.*

Can you check the Makefile changes, I needed to add -lrt when linking, 
but not sure if I got this part right.

diff --git a/lib/libopenswan/oswtime.c b/lib/libopenswan/oswtime.c
index 6440258..ec0bde5 100644
--- a/lib/libopenswan/oswtime.c
+++ b/lib/libopenswan/oswtime.c
@@ -34,18 +34,13 @@
  time_t
  now(void)
  {
-    static time_t delta = 0
-       , last_time = 0;
-    time_t n = time(NULL);
-
-    passert(n != (time_t)-1);
-    if (last_time > n)
+    struct timespec tp;
+    if (clock_gettime(CLOCK_MONOTONIC, &tp) != 0)
      {
-       openswan_log("time moved backwards %ld seconds", 
(long)(last_time - n));
-       delta += last_time - n;
+       passert(0);
+       return 0;
      }
-    last_time = n;
-    return n + delta;
+    return (time_t)tp.tv_sec;
  }

  /* Names of the months */
diff --git a/programs/Makefile.program b/programs/Makefile.program
index 800e0db..06fd400 100644
--- a/programs/Makefile.program
+++ b/programs/Makefile.program
@@ -37,6 +37,7 @@ endif
  CFLAGS+= ${NAT_DEFS}

  LIBS?=${PROGLIBS} ${OSWLOGLIB} ${OPENSWANLIB} ${CRYPTOLIBS}
+LIBS+= -lrt

  ifeq ($(USE_NAT_TRAVERSAL),true)
  CFLAGS+= -DNAT_TRAVERSAL
diff --git a/programs/pluto/Makefile.options 
b/programs/pluto/Makefile.options
index 99f0196..7c24f92 100644
--- a/programs/pluto/Makefile.options
+++ b/programs/pluto/Makefile.options
@@ -343,3 +343,5 @@ endif
  ifeq ($(USE_NM),true)
  DEFINES+=-DHAVE_NM
  endif
+
+LIBSPLUTO+= -lrt
diff --git a/programs/pluto/timer.c b/programs/pluto/timer.c
index 516aef3..2fa4ee5 100644
--- a/programs/pluto/timer.c
+++ b/programs/pluto/timer.c
@@ -400,7 +400,7 @@ handle_timer_event(void)
      type = ev->ev_type;
      tm = now();

-    if (tm < ev->ev_time)
+    if (tm - ev->ev_time < 0)
      {
         DBG(DBG_CONTROL, DBG_log("called while no event expired 
(%lu/%lu, %s)"
             , (unsigned long)tm, (unsigned long)ev->ev_time
@@ -418,7 +418,7 @@ handle_timer_event(void)
      /*
       * we can get behind, try and catch up all expired events
       */
-    while (ev && tm >= ev->ev_time) {
+    while (ev && (tm - ev->ev_time >= 0)) {

         handle_next_timer_event();



-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.openswan.org/pipermail/dev/attachments/20110805/f68bff0f/attachment.html