
Clock Drift Detection
Open, NormalPublic

Description

A potential solution should be a part of sdwdate (or a separate component if you think it has multiple use cases).

ntpd does clock jump detection:

https://unix.stackexchange.com/a/118636


Problems we need to work around so it becomes possible:

  • On KVM Whonix at least, the hardware timer information is not updated in WS because kvm-clock and others are disabled.
  • Use of a guest agent to pass that kind of information from the host is not an option because it's unsafe.
  • Fetching and comparing remote data with the perceived time in the WS poses scalability, performance and bootstrapping problems if the guest time is way off.

Solution concept:

  • The information about the current time is available to code in the GW where kvm-clock is available (via hwclock).
  • Create a systemd service that runs constantly and queries the hwclock on GW. If the drift between system time and hwclock exceeds a threshold it would trigger syncing locally on the GW and send a simple packet pattern to the Whonix internal network.
  • A knockd server [0][1] that constantly monitors the internal network would trigger the iptables lockdown if it sees the magic knock sequence. Note that no ports need to be open on WS.

[0] http://www.zeroflux.org/projects/knock
[1] https://packages.debian.org/jessie/knockd
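
The drift check in the GW service described above could look roughly like the following. This is a minimal sketch, not a tested implementation. It assumes the hardware clock is exposed by the kernel at /sys/class/rtc/rtc0/since_epoch and keeps UTC. The function names and the 60-second threshold are made up for illustration.

```bash
#!/bin/bash

# Hypothetical threshold in seconds; not a value from this ticket.
DRIFT_THRESHOLD=60
# Assumption: the RTC is exposed by the kernel at this sysfs path.
RTC_EPOCH_FILE=/sys/class/rtc/rtc0/since_epoch

# Print the absolute difference between two Unix timestamps.
clock_drift() {
   local system_time="$1" hw_time="$2"
   local diff=$(( system_time - hw_time ))
   echo "${diff#-}"
}

# Succeed if the drift between both timestamps exceeds the threshold.
drift_exceeded() {
   [ "$(clock_drift "$1" "$2")" -gt "$DRIFT_THRESHOLD" ]
}

# One pass of the check; a systemd service would call this in a loop.
check_once() {
   local hw_time system_time
   hw_time="$(cat "$RTC_EPOCH_FILE")"
   system_time="$(date +%s)"
   if drift_exceeded "$system_time" "$hw_time"; then
      echo "drift detected"
      # Placeholder: trigger the local time sync and send the knock sequence here.
   fi
}

# Only run when explicitly requested, so the functions can be sourced on their own.
if [ "${1:-}" = "--run" ]; then
   check_once
fi
```

Keeping the comparison in small functions makes the threshold logic testable without access to a real hardware clock.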

Details

Impact
Normal

Event Timeline

The idea is complicated and the worrying part is the inter-VM communication of this information. Can this be done securely or will it create a massive attack surface on the GW?

All searches on piping commands to remote hosts involve using SSH in some way which I think we should never do...

If T550 cannot be realized in an easy and secure way then we can simply advise users to freshly reboot their WS machines as part of their workflow.

I think that for the GW at least, triggering post-suspend timesync is easier because it has access to an external clock source and doesn't communicate with other VMs.

HulaHoop updated the task description.

Update:

I thought of a way to get around the need for SSH. It's a concept of a simple signalling protocol that doesn't require access to open ports or a client program on the GW. It's similar to something I suggested when discussing alternatives to a CPF long ago but it makes much more sense in this situation.

Perhaps no gw/ws/host communication is required in the following concept.

I would not know why hwclock needs to be involved here. After suspend/resume both, hwclock and system clock should still be moved forward as if sdwdate was not involved? (Just the usual suspend/resume process for any system. They also do not have a messed up clock when using suspend/resume.)

After suspend/resume there would be a leap, a "clock jump": some time that a script could not have witnessed passing. That should be detectable. See the following prototype.

clock_jump_detector_monitor:

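#!/bin/bash

# Prototype: detects clock jumps by comparing the wall-clock time
# before and after a fixed sleep.

set -x
set -e

# Called whenever a jump is detected. The sdwdate restart is still
# commented out in this prototype.
clock_jump_detected() {
   #if sudo service sdwdate status ; then
      #sudo service sdwdate restart || true
      #sudo service sdwdate status || true
   #fi
   sleep 60
}

while true; do
   # Human-readable timestamps; only visible in the "set -x" trace.
   old_time="$(date)"
   old_unixtime="$(date +%s)"
   sleep 10
   new_unixtime="$(date +%s)"
   new_time="$(date)"
   # Normally about 10 seconds. A difference of 60 seconds or more in
   # either direction is treated as a clock jump.
   result="$(( new_unixtime - old_unixtime ))"
   if [ "$result" -ge "60" ]; then
      true "ge"
      clock_jump_detected
   elif [ "$result" -le "-60" ]; then
      true "le"
      clock_jump_detected
   else
      true "No clock jump detected."
   fi
done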

Could you please test whether that script is able to detect suspend/resume?

For a production version, another feature would be required: not detecting clock jumps caused by sdwdate itself. So sdwdate would have to create the following status files:

  • /var/run/sdwdate/clock_jump_soon.status
  • /var/run/sdwdate/clock_jump_done.status

Then a systemd daemon, a clock_jump_detector_supervisor script, could use inotifywait.

  • When /var/run/sdwdate/clock_jump_soon.status is created, stop clock_jump_detector_monitor.
  • When /var/run/sdwdate/clock_jump_done.status is created, restart clock_jump_detector_monitor.

Example of using inotifywait:
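
A minimal sketch of how such a supervisor might use inotifywait. It assumes the monitor runs as a systemd unit called clock_jump_detector_monitor and that sdwdate removes the status files between runs, so that each new file shows up as a create event. Both points are assumptions, not something decided in this ticket.

```bash
#!/bin/bash

# Directory and file names as proposed for the sdwdate status files.
STATUS_DIR=/var/run/sdwdate
# Assumption: the monitor runs as a systemd unit with this name.
MONITOR_UNIT=clock_jump_detector_monitor

# Map a newly created status file name to the systemctl action for the monitor.
action_for_status_file() {
   case "$1" in
      clock_jump_soon.status) echo "stop" ;;
      clock_jump_done.status) echo "restart" ;;
      *) echo "ignore" ;;
   esac
}

supervise() {
   local file_name action
   # -m keeps watching, -q reduces verbosity, --format '%f' prints only the file name.
   inotifywait -m -q -e create --format '%f' "$STATUS_DIR" |
   while read -r file_name; do
      action="$(action_for_status_file "$file_name")"
      if [ "$action" != "ignore" ]; then
         systemctl "$action" "$MONITOR_UNIT"
      fi
   done
}

# Only run when explicitly requested, so the functions can be sourced on their own.
if [ "${1:-}" = "--run" ]; then
   supervise
fi
```

If sdwdate only rewrites existing files, watching for close_write instead of create would probably be the better fit.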

Test summary:

  • WS paused for two minutes then resumed: No clock jump detected
  • WS left running then host suspended for two minutes then resumed: No clock jump detected
  • GW left running then host suspended for two minutes then resumed: clock_jump_detected

I think the reason it can't work for the WS is because it has no notion of time outside it. It must reference a current time source to know that time has elapsed.

I would not know why hwclock needs to be involved here.

You are right. This seems to work for the GW.

For generating the knock packets (when clock jump detected) we can use scapy:
https://packages.debian.org/jessie/python-scapy

knockd listens on Layer 2 which means it's not dependent on the packets reaching the WS ports. The advantage of this is that multiple WSs sharing the internal net can respond to the same knock signal even if it's destined for one WS.

Right, clock_jump_detector_monitor also does not work in VirtualBox ws (or gw). Both system time (date) and hardware clock (hwclock) do not notice VirtualBox being paused.

Last hope, pm-utils. Did we discuss that yet? Perhaps /etc/pm/sleep.d/ could do? Does pm-utils function in VMs? I did not have any luck back then getting it to work in VirtualBox or Qubes.

Without (virtualizer specific) hooks to be notified about suspend/resume, implementing this is a total mess. Without those also implementing T551 is impossible.

I wonder if it was better to implement a host package that notifies VMs when they get suspend / resume. Then both this T550 and T551 would be easier to implement. (Such a package would likely and unfortunately require virtualizer specific parts.)

Right, clock_jump_detector_monitor also does not work in VirtualBox ws (or gw). Both system time (date) and hardware clock (hwclock) do not notice VirtualBox being paused.

Is that true on Linux too? I thought I saw a support thread about VBox 5+ using kvmclock device too: https://www.whonix.org/blog/virtualbox-acceleration-mode

Last hope, pm-utils. Did we discuss that yet? Perhaps /etc/pm/sleep.d/ could do? Does pm-utils function in VMs? I did not have any luck back then getting it to work in VirtualBox or Qubes.

I don't think we talked about this. How would this work?

Without (virtualizer specific) hooks to be notified about suspend/resume, implementing this is a total mess. Without those also implementing T551 is impossible.

I know that KVM guest agent can notify the gateway about a suspend resume. For other hypervisors where this isn't possible you might be able to pause the Gateway upon resume with a script on the host.

Concept:

  • host script pauses GW upon a suspend event.
  • A daemon in WS constantly checks whether Tor is connected. If not, it immediately sets iptables to fail closed and initiates a sync event. The daemon check intervals are much shorter to prevent a race condition when the host resumes.
  • host script unpauses GW after a longer interval to ensure fail closed is enforced.

*The clock_jump_detector_monitor on GW initiates sync when jump detected. (I'll think about workarounds if it still doesn't work with VBox in paravirtual mode)

NB This replaces the earlier knockd/scapy proposal.

I wonder if it was better to implement a host package that notifies VMs when they get suspend / resume. Then both this T550 and T551 would be easier to implement. (Such a package would likely and unfortunately require virtualizer specific parts.)

If you are open to implementing host side packages for timesync then I think one which sets a random offset for VMs would be another great thing. Out of scope of this ticket but I'll open a new one if you agree.

All in all the stronger protections will be available for a subset of whonix host configurations anyway - VBox and KVM on Debian and Qubes. That is an acceptable compromise.

Right, clock_jump_detector_monitor also does not work in VirtualBox ws (or gw). Both system time (date) and hardware clock (hwclock) do not notice VirtualBox being paused.

Is that true on Linux too? I thought I saw a support thread about VBox 5+ using kvmclock device too: https://www.whonix.org/blog/virtualbox-acceleration-mode

Only tested and noticed to fail on Linux, using VBox "legacy" as the paravirtualization method.

Last hope, pm-utils. Did we discuss that yet? Perhaps /etc/pm/sleep.d/ could do? Does pm-utils function in VMs? I did not have any luck back then getting it to work in VirtualBox or Qubes.

I don't think we talked about this. How would this work?

pm-utils notices suspend/resume and can dispatch scripts that are dropped in .d folders. Perfect in theory. However as said above, didn't get it to work in VMs.

Without (virtualizer specific) hooks to be notified about suspend/resume, implementing this is a total mess. Without those also implementing T551 is impossible.

I know that KVM guest agent can notify the gateway about a suspend resume. For other hypervisors where this isn't possible you might be able to pause the Gateway upon resume with a script on the host.

Concept:

  • host script pauses GW upon a suspend event.

That might be implemented with pm-utils that apparently work a lot better on hosts than in VMs.

  • A daemon in WS constantly checks whether Tor is connected. If not, it immediately sets iptables to fail closed and initiates a sync event. The daemon check intervals are much shorter to prevent a race condition when the host resumes.

Constant checking is not going to fly. It also wastes CPU. It needs to be event based: no rushing, and no fear of the suspend actually happening before the scripts have finished.

For VirtualBox this could be the way to go:

pm-utils -> dispatches scripts on suspend -> blocks suspend until all scripts are done [this probably is an already existing core feature of pm-utils] -> run vboxmanage guestcontrol /path/to/suspend-pre-script/within/vm -> stop Tor and sdwdate -> wait for that to finish -> actually suspend host

Similar for the suspend-pre-post script which would receive a randomized time from the host, restart Tor and sdwdate.

This is already implemented for Qubes, where this was simple due to their exemplary qrexec.

(That way TCP connections can be avoided. Involving TCP would make all of this a lot more difficult. Then we would have to add a host-only network interface in the VMs, figure out the VM internal IPs from the host, and somewhat reinvent what vboxmanage guestcontrol is doing.)
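
The pm-utils flow for VirtualBox described above could be sketched as a hook dropped into /etc/pm/sleep.d/ on the host. pm-utils calls such hooks with suspend or hibernate before sleeping and with resume or thaw afterwards. Everything else here is an assumption for illustration: the guest user, the password file, the script paths inside the VM, and the exact guestcontrol options, which differ between VirtualBox versions.

```bash
#!/bin/bash

# All values below are placeholders, not real Whonix paths or settings.
VM_NAME="Whonix-Workstation"
GUEST_USER="user"
GUEST_PASSWORD_FILE="/etc/example/guest-password"
PRE_SCRIPT="/usr/lib/example/suspend-pre"
POST_SCRIPT="/usr/lib/example/suspend-post"

# Map the pm-utils event name to the script that should run inside the VM.
guest_script_for_event() {
   case "$1" in
      suspend|hibernate) echo "$PRE_SCRIPT" ;;
      resume|thaw) echo "$POST_SCRIPT" ;;
   esac
}

# Run one script inside the guest and wait for it to finish.
run_in_guest() {
   VBoxManage guestcontrol "$VM_NAME" run --exe "$1" \
      --username "$GUEST_USER" --passwordfile "$GUEST_PASSWORD_FILE"
}

# pm-utils passes the event name as the first argument; unknown events are ignored.
guest_script="$(guest_script_for_event "${1:-}")"
if [ -n "$guest_script" ]; then
   run_in_guest "$guest_script"
fi
```

One open point in this sketch: pm-utils hooks run as root, while VBoxManage has to run as the user who owns the VM, so a real hook would need to switch users.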

*The clock_jump_detector_monitor on GW initiates sync when jump detected. (I'll think about workarounds if it still doesn't work with VBox in paravirtual mode)

Once we can dispatch hooks, a script before suspend / after resume, no clock jump detector will be required. Clock jump detector is just a weird workaround for that missing feature in virtualizers. Without such hooks, T551 cannot be effectively implemented.

If you are open to implementing host side packages for timesync then I think one which sets a random offset for VMs would be another great thing. Out of scope of this ticket but I'll open a new one if you agree.

It's a good idea, but I wouldn't know when I find time to implement it. The best we can do is have clear tickets that are actionable, that just need code, by prospective volunteers.


The more I think about it, host packages are really busting it.

  • virtualizer specific
  • host operating system specific (Windows, Linux, macOS)
    • Linux distribution specific (Debian vs Fedora vs etc.)
  • increasing difficulty to set up Whonix, more difficult to document "you don't just need to install our VMs but also need to install these operating system specific host packages"

You're right. My idea is needlessly complicated and I admit I learned a lot from your plan.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-Managing_guest_virtual_machines_with_virsh-Starting_suspending_resuming_saving_and_restoring_a_guest_virtual_machine.html

IIRC VBox disables power states for guests - maybe they can be enabled. KVM supports receiving pm events from the host.

The best we can do is have clear tickets that are actionable, that just need code, by prospective volunteers.

Yes.

The more I think about it, host packages are really busting it.

I'd say there's no rush here. These are "nice to have" features but make no serious differences to Whonix's basic security guarantees. If some of the Whonix features can be usable in Debian then why not?

EDIT:

One thing I forgot is I removed the RTC timer so offsets are no longer possible. I consider the benefits of low resolution timers to outweigh the unavailability of the offset features.

Tested enabling pm settings in KVM and I don't see suspend/hibernate in the VM power options in the menu. VBox threads on SE agree that guest suspend isn't available.

At this point I think this idea is impossible for most virtualizers anyway.

HulaHoop (HulaHoop):

Tested enabling pm settings in KVM and I don't see suspend/hibernate in the VM power options in the menu.

These KDE menus are disabled by Whonix. In plain Debian VMs these should be visible.

These KDE menus are disabled by Whonix. In plain Debian VMs these should be visible.

OK. I used systemctl suspend in the guest and it worked. However there is a blocker.

Sending a suspend event from the host via virsh gives:

sudo virsh dompmsuspend Whonix-Workstation --target mem
error: Domain Whonix-Workstation could not be suspended
error: argument unsupported: QEMU guest agent is not configured

Since there are security warnings about installing the guest agent in untrusted VMs, it's not possible.

Didn't rehash. What's next here? Looks like we learned a lot, but then things stalled. Could you please rehash, and then create a follow-up ticket with the way forward? @HulaHoop

@Patrick I wrote a rehash. If you think it is too complicated, let me know. It was the simplest and most reliable way I could think of:

In absence of a safe way to directly tell the WS about clock drift from the host's hardware clock, I propose a signalling mechanism from the GW to WS to trigger an iptables lockdown combined with activating timesync before re-allowing connections.

Outline:

  • GW has access to host's hwclock since it's part of the TCB.
  • Once a clock drift is detected after a suspend/hibernate, a GW service would send a simple packet pattern to a knockd instance on the WS.
  • knockd has the ability to trigger arbitrary commands based on the signal. This makes it flexible enough to use for iptables lockdown, timesync activation followed by iptables release.

EDIT:

How will packets be generated? There are a couple of good ways:

pktgen

is an in-kernel module that does just that. It's part of the Linux networking subsystem and has been around a long time. I believe that the Linux net stack has really been well examined over time.

https://wiki.linuxfoundation.org/networking/pktgen


trafgen

If you still don't like the idea of using something in-kernel then there is this tool that comes as part of the netsniff-ng fuzzing suite by Daniel Borkmann. Borkmann is a really talented security professional who has made some impressive contributions to Linux over time. N.B. that it needs to run as root too. The netsniff-ng toolkit does not depend on the libpcap library, which is important since pcap has had a very bad history of security vulnerabilities. Moreover, no special operating system patches are needed to run the toolkit.

https://en.wikipedia.org/wiki/Netsniff-ng
https://www.mankier.com/8/trafgen

It's included in Debian.


How can clock drift be measured?

https://unix.stackexchange.com/a/118636

Using ntpdate -d in the example script.

ntpdate in this case is running dormantly with no connections to the outside world configured.

It's a very good rehash!

ntpdate -d, I am afraid, might not work for us. Probably does not work over Tor (if you try that inside Whonix-Workstation)?

iptables lockdown, timesync activation followed by iptables release.

firewall lockdown after boot mostly done:
https://forums.whonix.org/t/firewall-lockdown-until-timesync-is-done/4820

firewall lockdown after clock drift detection: could be regarded a separate feature. Clock fix (sdwdate restart) after clock drift detection is the main feature here. Mostly a technicality.

firewall lockdown after clock drift detection:

  • It would activate too slowly for already open connections. By the time the gateway has detected the clock drift, notified the workstation, and the firewall has been reloaded, already running applications will almost certainly have emitted traffic.
    • A working solution needs some kind of hook that locks the network before suspend, i.e. what suspend-pre scripts are for, which are not available to us in most virtualizers.
  • It may still be worthwhile to prevent new connections, but then we may muddy the waters, implying that we provide security that we do not.

Oops didn't realize ntpdate requires query of remote servers. ntpdate is obsolete anyhow but the newer clockdiff still talks to online servers instead of comparing local values. hwclock can give us that:

https://stackoverflow.com/a/40656375

hwclock --compare

will do what you want. This compares the system time against the HW clock time. It does this continuously at 5 second intervals allowing it to compute the PPM drift between the two clocks.


Without a host side package to run qemu-guest agent commands upon changing power events, this becomes an intractable problem. Without immediately shutting off the network such an implementation doesn't deliver the guarantees claimed and so I think this is dead in the water.

If there is a theoretical host package it would pass suspend/resume commands to guests in sync with host events. The guest's suspend-pre scripts would take appropriate action of disabling the network.


Available qemu-ga host commands

http://wiki.stoney-cloud.org/wiki/Qemu_Guest_Agent_Integration#Getting_list_of_available_qemu-ga_commands

With qemu-ga code the hwclock drift detection code becomes redundant. If a suspend event is triggered the GW should assume clocks are out of sync and trigger lockdown.

HulaHoop (HulaHoop):

With qemu-ga code the whole clock drift detection code becomes redundant. If a
suspend event is triggered the GW should assume clocks are out of sync and
trigger lockdown.

That would work. I was on wrong path: of course, if the gateway does the
lockdown the workstation will lose connectivity. No reason for the
workstation to enter lockdown mode for the reason of leak prevention.

The workstation entering lockdown mode is still useful, but more for
usability reasons as well as more code reuse.

Yes, there are fewer moving parts, especially when multiple WSs share a GW. Some way to exempt timesync traffic from the WS would be needed though.

https://www.redhat.com/archives/libvirt-users/2018-February/msg00083.html
[libvirt-users] QEMU guest-agent safety in hostile VM?

I forgot to use reply all and so most of the correspondence was not public. Forwarded the responses to whonix mail list:
https://www.whonix.org/pipermail/whonix-devel/2018-March/001125.html


Now that I discovered qemu-ga is safe to use in untrusted guests I have asked about the supported commands libvirt can pass through to the guest.

https://www.redhat.com/archives/libvirt-users/2018-February/msg00085.html
[libvirt-users] Libvirt supported qemu-ga commands

It turns out the QEMU guest agent warning was not relevant to those who use libvirt, because libvirt uses a safe parser. Breakouts can only happen if a process on the host is designed to parse guest input, because there is no way to control what the guest sends. Otherwise it should be safe for our uses. This potentially simplifies the design in many respects but a host package will still be needed. I will update the task list.

The YAJL parser used in libvirt is tiny, modern (written in 2007) and has no CVEs. It is a SAX type event-driven parser unlike the vulnerable, top-down recursive descent type that was used in QEMU.

https://github.com/lloyd/yajl
http://www.craftinginterpreters.com/parsing-expressions.html

The proper and direct way to use virsh to communicate with guest agent:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-using_the_qemu_guest_virtual_machine_agent_protocol_cli-libvirt_commands

To suspend/resume one can use: (NB. Will confirm it first then give feedback)

virsh suspend --mode=agent
virsh resume --mode=agent

Some guest relevant docs:
https://access.redhat.com/solutions/732773

Service needs to be enabled in guest after install:

systemctl enable qemu-guest-agent


The following command can be used to execute any of the API's supported commands:

virsh qemu-agent-command ${VMNAME} '{"execute":"guest-info"}'

http://wiki.stoney-cloud.org/wiki/Qemu_Guest_Agent_Integration#Getting_list_of_available_qemu-ga_commands
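
For repeated use, the virsh invocation above could be wrapped in a small helper. This is only a sketch for agent commands that take no arguments. The function names are made up, and guest-ping is used as a second example command alongside guest-info.

```bash
#!/bin/bash

# Build the JSON request for a qemu-ga command that takes no arguments.
agent_request() {
   printf '{"execute":"%s"}\n' "$1"
}

# Send one argument-less command to the guest agent of the given VM.
agent_command() {
   local vm_name="$1" command_name="$2"
   virsh qemu-agent-command "$vm_name" "$(agent_request "$command_name")"
}

# Only run when explicitly requested: ./script --run <vm-name>
if [ "${1:-}" = "--run" ]; then
   vm_name="${2:?VM name required}"
   agent_command "$vm_name" guest-ping
   agent_command "$vm_name" guest-info
fi
```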


In case any AppArmor modification is needed. This was a long time ago and was probably fixed.
https://serverfault.com/questions/672253/how-to-configure-and-use-qemu-guest-agent-in-ubuntu-12-04-my-main-aim-is-to-get


@Patrick

PS. qemu-ga can also query/set the clock from the host but this is not a good idea since we want timesync to do its thing?

Actually we don't have to suspend the guest. Execution of any command on the host after resume is enough to create a unique event in the qemu-ga's log file.

The next step is to use a program like monit to trigger custom commands in response to a specified keyword in the log. The second link documents how. The custom commands are obviously iptables lockdown and timesync init.

https://stackoverflow.com/a/17639196
https://mmonit.com/monit/documentation/monit.html#FILE-CONTENT-TEST

monit available in Debian:

https://packages.debian.org/stretch/monit
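
For comparison, the same log-watching idea can be sketched in plain shell instead of monit. This is not the monit approach itself, only an illustration of the mechanism. The log path and the marker string are placeholders, since neither was settled in this ticket, and the reaction is left as a stub.

```bash
#!/bin/bash

# Assumptions: both the log location and the marker string are placeholders.
AGENT_LOG=/var/log/qemu-ga.log
MARKER="guest-ping"

# Succeed if a log line contains the marker string.
line_has_marker() {
   case "$1" in
      *"$MARKER"*) return 0 ;;
      *) return 1 ;;
   esac
}

# Reaction to a marker line.
on_marker() {
   echo "resume marker seen"
   # Placeholder: iptables lockdown and timesync start would go here.
}

watch_log() {
   local line
   # -n 0 skips existing content, -F keeps following across log rotation.
   tail -n 0 -F "$AGENT_LOG" |
   while read -r line; do
      if line_has_marker "$line"; then
         on_marker
      fi
   done
}

# Only run when explicitly requested, so the functions can be sourced on their own.
if [ "${1:-}" = "--run" ]; then
   watch_log
fi
```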


I will have to summarize all this some place else because the thread has become unmanageable.

NB for the record: with qemu-ga a guest can still shut itself off via crafted input to the agent. So besides removing timer access to the guest, there was no other advantage to removing ACPI.