Page MenuHomePhabricator

review 30 lines of sclockadj inline C code
Closed, ResolvedPublic

Description

sclockadj has a strange 100% CPU bug after suspend and resume. That is tracked as T50.

Perhaps the issue lies in the 30 lines of inline C Code that sclockadj is being using?

TODO:

Details

Impact
Normal

Event Timeline

Patrick created this task.Mar 16 2017, 3:50 PM

I don't see any loop there and only very simple function calls, so I don't see how that would trigger such bug...

Have you tried attaching gdb that that process when it consume 100% CPU? If you have python-dbg installed, you should see python code in backtrace (bt command there).

Thanks for having a look!

sclockadj is written in ruby. Would python-dbg still work there?

Ah, it's ruby... So, python-dbg is irrelevant, but trying gdb may still be a good idea.

Can't find what debian package ship debug symbols, there is no -dbg package there: https://packages.debian.org/source/stretch/ruby2.3

s.sh added a comment.Mar 16 2017, 6:29 PM

The C code itself doesn't have any CPU intensive instruction, if the problem is really what is written as inline C, I suspect the issue is either with

clock_gettime(0, &tps)

or

clock_settime(0, &tps)

please report to me the output of this:

bash: echo $TZ

TZ is unset since it's started as a systemd unit file.

s.sh added a comment.EditedMar 16 2017, 7:37 PM

the clock_gettime and clock_settime functions are passing zero as their first parameters and that means CLOCK_REALTIME. Firstly the process needs root privilege to touch real time clock, secondly you need to somehow run it by strace like:

#strace <command>

to see when it makes CPU run at 100% what's exactly happening (i.e which system calls are repeatedly called at that time).
And at last check whether you can set TZ=:/etc/localtime or not.

Yes, sclockadj runs as root. Usually sclockadj is functional.

Just after suspend it takes 100% CPU. But not forever. If the suspend was short, the 100% CPU usage will not take forever. That's the only bug.

Perhaps it's more likely some bug in the ruby loop? Looks like it tries to catch up with all loop iterations it missed during the suspend.

Will take me a few days until I get to debug this.

JasonJAyalaP added a comment.EditedJun 3 2017, 12:43 AM

I've rewritten the whole thing in C. It was a simple matter since we no longer need the random interval algorithm. Patrick prefers that it imitates ntpd instead of trying to hide.

https://github.com/JasonJAyalaP/sclockadj/blob/master/sclockadj.c

I compiled it (gcc sclockadj) and replaced the ruby script (/usr/lib/sdwdate/sclockadj) with it. Inside sdwdate, I changed the invocation to:

cmd = [
    "sudo",
    "/usr/lib/sdwdate/sclockadj",
    str(self.newdiff_nanoseconds)]

I tested it (sudo ./sclockadj 5000000000 and sudo ./sclockadj -35253463463464634 a few times, then ./usr/bin/sdwdate) and my desktop clock jumps back and forth as expected. (Comment out the sleep(1); line for faster testing).

I suspect that this will fix the hanging

@marmarek Would you be willing to review the C? It's under 75 total lines.

What about using adjtime() syscall instead of all this? It would avoid trashing logs with Time has been changed every single second, and possibly other side effects.

It would avoid trashing logs with Time has been changed every single second, and possibly other side effects.

This btw will be done in debian version 10 codename buster (stretch + 1). Still a long time and worth fixing before that.

What about using adjtime() syscall instead of all this?

Yes, that would be awesome!

We would like to change the time as similar to an existing popular time synchronization mechanism as possible to avoid being fingerprinted as sdwdate (therefore most likely Whonix).

There is (also)...

Which one of these (or are there even others) is best suited for this task?

adjtimex/ntp_adjtime looks quite complex, but also allow precise control on how time should be adjusted. From those two, according to manual page ntp_adjtime is preferred.

Do we need precise control? The only requirement (other than working) is that it imitate an existing linux tool.

A popular existing linux tool.

JasonJAyalaP added a comment.EditedJun 8 2017, 12:02 AM

Using adjtime would be a simple matter (of programming), but only has microsecond precision.

EDIT: I think it maxes out at 30 minutes too.

Where is adjtime being used in existing time sync applications? NTP?
chrony? systemd-timesyncd?

Looks like at least NTP and chrony use ntp_adjtime/adjtimex

OK. It might strain my limited C knowledge, but I'll give it shot.

adjtimex, as far as I can tell, is for tuning the clock to stay accurate. It's not directly for setting a new time. I assume it's used by ntp to speed up and slow down the clock, with more code that checks on it and stops it when it reaches the right time. Reimplementing this is beyond my skill.

For example, if you know your machine is a little fast or a little slow, you can change the "tick" and "frequency" of the OS.

I'm just not sure how to change time without spamming the logs or reimplementing ntp.

Please create a new ticket for porting to some better C function.

JasonJAyalaP closed this task as Resolved.Jun 15 2017, 10:42 PM