
speed up libvirt tarball creation time
Open, Wishlist, Public

Description

The current tar.xz compression code is a burden since it literally takes hours. As a result, building the VBox and KVM images through to a finished upload takes more than a day.



Using tar --xz and --mtime="2014-05-06 00:00:00" so the archives are deterministic.

Using --sparse...

-S, --sparse
    handle sparse files efficiently
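For reference, a minimal sketch of the invocation style described above (file names here are placeholders; the real command used by the build script appears later in this task):

  tar --create --sparse --xz --mtime="2014-05-06 00:00:00" \
      --file output.tar.xz input.qcow2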

The replacement requirements:

  • faster than current one
  • deterministic
  • handle sparse files efficiently
    • currently compression reduces the workstation qcow2 file, a sparse file with a real size of ~4.5 GB (and an apparent size of 100 GB), to a ~1.5 GB tar.xz (a way to check real vs. apparent size is sketched after this list)
    • the new file size should be similarly small
    • (not 100 GB reduced to ~30 GB)
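As referenced in the list above, the real versus apparent size of a sparse file can be checked like this (a sketch; the file name is hypothetical):

  truncate -s 100G disk.img        # create a file with a 100G apparent size
  du -h --apparent-size disk.img   # reports ~100G
  du -h disk.img                   # reports actual disk usage, near 0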

The priority is high, since this reduces my motivation to create Non-Qubes-Whonix images.

Details

Impact
Low

Event Timeline

Patrick created this task. Jan 14 2017, 11:57 PM

quote @anonymous1

I'm not experienced as to how to improve xz compression, assuming you don't want to experiment with less-known compressors.

What I do know is that freearc (with its many unique compression filters and technologies such as srep) and nanozip are the best compressors around in terms of both speed and compression ratio. I don't know the details of the ticket you mentioned, but I think srep might be the best tool to speed it up again. But then it is not deterministic; I'm not sure if there is a way to make it so.


quote @anonymous1

Could it help to try the tar implementation of other programs, like 7zip?


quote @anonymous1

bsdtar or star didn't help?

http://unix.stackexchange.com/questions/120091/how-can-i-speed-up-operations-on-sparse-files-with-tar-gzip-rsync

Patrick updated the task description. Jan 15 2017, 6:28 AM

OK, so it's reproducibility > speed.


Most compression algorithms are deterministic. Being "adaptive" in no way contradicts being "deterministic": it only means varying behavior based on input, so if the input is the same, so will be the output.
You can easily verify this by compressing the same file several times using an algorithm of your choice (zip, gzip, bzip2, 7z, etc.) and comparing the outputs. For example, on Linux, you can run this command several times to compress the file /etc/fstab and compare whether its checksum is the same each time: gzip < /etc/fstab | md5sum -

Though the algorithm itself is indeed deterministic, the implementation will sometimes store additional information (file permissions, timestamps, etc.) which can make it look like the output is not deterministic. Adding a touch on the file between the compress and decompress can generate a different zip even though the file's content did not change. That being said, it's still deterministic once all parameters are factored in.

https://softwareengineering.stackexchange.com/questions/293941/any-deterministic-compression-algorithms-out-there
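To make the quoted caveat concrete: GNU gzip's -n (--no-name) flag omits the stored file name and timestamp, which is what keeps the output stable across a touch. A quick check (a sketch; the temporary path is arbitrary):

  cp /etc/fstab /tmp/f
  gzip -c /tmp/f | md5sum     # header embeds the file's name and mtime
  touch /tmp/f                # change only the timestamp
  gzip -c /tmp/f | md5sum     # different checksum despite identical content
  gzip -nc /tmp/f | md5sum    # -n drops name/mtime: stable across both runs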


The next-best algorithm in speed is gzip, and it explicitly supports disabling timestamps with

gzip:!timestamp

https://www.freebsd.org/cgi/man.cgi?tar%281%29
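A sketch of how that option might be passed, going by the bsdtar man page linked above (archive and directory names are placeholders; untested here):

  bsdtar --create --gzip --options 'gzip:!timestamp' \
      --file output.tar.gz some_directory/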


With lz4 and tar this may be possible as it is with gzip:

Preserve timestamp when compressing files with lz4 on linux

https://stackoverflow.com/questions/33775634/preserve-timestamp-when-compressing-files-with-lz4-on-linux
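For what it's worth, lz4 could presumably be hooked into tar like any external compressor; the lz4 frame format stores no file name or timestamp itself, so determinism would rest on tar's own flags (a sketch, assuming lz4 is installed):

  tar --create --use-compress-program=lz4 \
      --file output.tar.lz4 some_directory/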

@Patrick try asking this on encode.ru. You may get the best answers there

anonymous1 added a comment. Edited Jan 19 2017, 9:33 AM

http://www.systutorials.com/qa/4/how-to-efficiently-archive-a-very-large-sparse-file
https://stackoverflow.com/questions/13252682/copying-a-1tb-sparse-file

According to these, there is a minimum kernel version and bsdtar version required for efficient handling of sparse files. And if there is still a problem, it may be related to the filesystem. One person said it didn't work on the reiserfs filesystem.

anonymous1 added a comment. Edited Jan 19 2017, 9:36 AM

Perhaps Patrick hasn't tried bsdtar yet, or perhaps it didn't work at the time he asked this question because the required version of bsdtar or other requirements were not in the stable repositories.

http://unix.stackexchange.com/questions/120091/how-can-i-speed-up-operations-on-sparse-files-with-tar-gzip-rsync

@Patrick good news

tar has finally added support for SEEK_DATA/SEEK_HOLE for sparse file detection in the latest version, 1.29

https://savannah.gnu.org/forum/forum.php?forum_id=8545

Upgrading to this version should speed up your compression without changing any command. Please let me know how it goes

I see. That's great news indeed. New Whonix builds for Whonix 14 must be created on Debian stretch anyhow, so we have tar version 1.29 there. Will try soonish.

Still takes 70 minutes to compress both images (inside a VM, on an SSD already).

anonymous1 added a comment. Edited Jan 21 2017, 1:35 AM

Is there any improvement? How long did it take before?

Have you tried any other tool, to compare and find out whether the tar code is the culprit? Could you give srep a try just for comparison?

Have you tried lowering the xz compression level? Even the lowest level is said to compress better and faster than the highest gzip compression.

https://stackoverflow.com/questions/34464534/use-xz-instead-of-gz-very-slow

Levels 0, 1, and 2 should be much faster than the default 6; starting from 3 it seems to slow down considerably. I think 2 would be optimal, but try 0 too.

https://stephane.lesimple.fr/blog/2010-07-20/lzop-vs-compress-vs-gzip-vs-bzip2-vs-lzma-vs-lzma2xz-benchmark-reloaded.html
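A quick way to compare levels on the actual data (a sketch; the input path is hypothetical):

  for level in 0 2 6; do
    time xz -$level -c image.tar > "image.tar.$level.xz"
    ls -lh "image.tar.$level.xz"
  done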

Maybe because the xz version (5.1.1) in Jessie is single-threaded? xz 5.2 gained this feature.

Any improvement?

The xz command uses only about 10% of the CPU and just about 90 MB of RAM. iotop -a shows below 1%.

Building on Debian stretch with xz-utils 5.2.2-1.2 / tar 1.29b-1.1. Using ext4 as file system.

Any idea why system load is so low? I'd like a much higher load so it goes faster.

The libvirt_compress function at the time of writing:
https://github.com/Whonix/whonix-developer-meta-files/blob/bb1907e319acda314a1c57df200ff1696f979971/release/prepare_release#L35-L98

The compression command, from bash xtrace:

  tar --create --verbose --owner=0 --group=0 --numeric-owner \
      --mode=go=rX,u+rw,a-s --sort=name --sparse \
      '--mtime=2015-10-21 00:00Z' --xz \
      --directory=/home/user/whonix_binary \
      --file Whonix-Gateway-14.0.0.4.0.libvirt.xz \
      Whonix-Gateway-14.0.0.4.0.qcow2 \
      Whonix-Gateway-14.0.0.4.0.xml \
      Whonix_external_network-14.0.0.4.0.xml \
      Whonix_internal_network-14.0.0.4.0.xml

The whole prepare_release script now took 21 minutes for Whonix-Gateway. libvirt archive creation is still the part that takes the longest. Archive size: 1.2 GB.

By adding environment variable XZ_OPT="-0", time is down to 4:45 min, size is up to 1.4 GB.

By adding environment variable XZ_OPT="-2", time is down to 7:45 min, size is up to 1.3 GB.

(XZ_OPT="-0 --fail" makes it fail as expected. Did that as a test to see if environment variable XZ_OPT is honored.)


@Patrick good news

tar has finally added support for SEEK_DATA/SEEK_HOLE for sparse file detection in the latest version, 1.29

https://savannah.gnu.org/forum/forum.php?forum_id=8545

Upgrading to this version should speed up your compression without changing any command. Please let me know how it goes

As per https://savannah.gnu.org/forum/forum.php?forum_id=8545 it should automatically be using seek hole detection on systems that support it. How do I find out whether my system supports it, or how to enable it?


If you have other ideas to speed it up / shrink size while keeping it reproducible, could you suggest changes to the prepare_release script please? Perhaps by making a github pull request?

Other than lowering the compression level: maybe tar doesn't support multi-threaded compression, whereas with 7zip's xz or 7z compression I can utilize all of my CPU cores.

Please try p7zip:
https://packages.debian.org/stretch/p7zip-full
https://sourceforge.net/projects/p7zip/files/p7zip/16.02/

anonymous1 added a comment. Edited Mar 10 2017, 6:03 AM

xz doesn't seem to store file names or timestamps, so it should be reproducible. You could still use tar with reproducible options to tar the files and maybe combine that with p7zip's xz compression.

anonymous1 added a comment. Edited Mar 10 2017, 6:20 AM

There is some related information here:
http://stackoverflow.com/questions/12313242/utilizing-multi-core-for-targzip-bzip-compression-decompression

You could also try XZ Utils; it has had multi-threaded compression support for some time. Perhaps you could do this:

tar --use-compress-program=xz

But it may not be multi-threaded then. The documentation for xz states that:

  • Multi-threaded compression can be enabled with the --threads (-T) option.

I'm not sure whether you can use this option from inside tar, though.
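Two plausible ways to wire this up (a sketch; variant 1 assumes a tar version whose --use-compress-program accepts a command with arguments, and variant 2 relies on tar's built-in --xz honoring the XZ_OPT environment variable):

  # variant 1: pass the flag in the compressor command itself
  tar --create --use-compress-program='xz --threads=0' \
      --file output.tar.xz some_directory/

  # variant 2: pass it through the environment
  XZ_OPT='--threads=0' tar --create --xz \
      --file output.tar.xz some_directory/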

You could also try lowering or increasing the compression dictionary size to see how it affects size and speed; however, I don't know the commands.
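For the record, xz exposes the dictionary size through its --lzma2 filter options, so such an experiment might look like this (a sketch; the preset and dictionary values are arbitrary examples):

  XZ_OPT='--lzma2=preset=6,dict=64MiB' tar --create --xz \
      --file output.tar.xz some_directory/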

anonymous1 added a comment. Edited Mar 10 2017, 12:41 PM

There is another tool called pxz (parallel xz): https://packages.debian.org/stretch/main/pxz

Check this page:

https://www.peterdavehello.org/2015/02/use-multi-threads-to-compress-files-when-taring-something/

Once you get multi-threading working, you could try the highest compression levels to see how it goes.

And make sure that multi-threading doesn't somehow break reproducibility

There is a good list here: http://askubuntu.com/questions/258202/multi-core-compression-tools

anonymous1 added a comment. Edited Mar 10 2017, 12:57 PM

It seems you could use something like this with XZ Utils:

export XZ_OPT="--threads=0"

  -T threads, --threads=threads
      Specify the number of worker threads to use. Setting threads to a special value 0 makes xz use as many threads as there are CPU cores on the system. The actual number of threads can be less than threads if the input file is not big enough for threading with the given settings or if using more threads would exceed the memory usage limit.

      Currently the only threading method is to split the input into blocks and compress them independently from each other. The default block size depends on the compression level and can be overridden with the --block-size=size option.

https://manpages.debian.org/testing/xz-utils/xz.1.en.html


"--threads=0" results in 100% CPU usage, yay!


XZ_OPT="-0 --threads=0"

  • Time down to 1:35.
  • Size up to 1.4 GB.

XZ_OPT="-9 --extreme --threads=0"

  • 6:56
  • 1.2 GB

XZ_OPT="-6 --threads=0"

  • 4:28
  • 1.2 GB
  • reproducible: yes

XZ_OPT="-6 --threads=0"
installed pxz
replaced --xz with --use-compress-program=pxz
(really uses pxz and not xz as per ps aux)

  • 4:54
  • 1.2 GB


Looks like using tar with --use-compress-program=pxz really is not worth it.

Perhaps it's worth trying pxz directly, without tar? But then we might be back to non-reproducibility. Needs testing. Does pxz support auto-detecting the maximum number of threads that can be used?

set and export XZ_OPT="--threads=0" makes sense either way. Therefore added.

set and export XZ_OPT="--threads=0" to speed up libvirt archive creation

This might also speed up other operations where xz is used internally by
other packages.

Thanks to @anonmos1 for the suggestion!

https://phabricator.whonix.org/T605

https://github.com/Whonix/Whonix/commit/5d125180051fa55b5ec1ce50e16cdc8db6d6906d

anonymous1 added a comment. Edited Mar 10 2017, 6:52 PM

But I have a feeling it would produce different archives with different numbers of threads: single core vs. dual core vs. quad core vs. custom VM cores.

You could test this by setting the threads to 1, 2, 3, 8, and so on.

anonymous1 (anonymous1):

anonymous1 added a comment.

But I have a feeling it would produce different archives with different numbers of threads: single core vs. dual core vs. quad core vs. custom VM cores.

Good point.

Just now tested --threads=1 vs --threads=8. Different checksum.

--threads=8 vs --threads=8 however results in the same checksum.

Perhaps I should change --threads=0 to --threads=8? (Assuming quad core with 2 threads per core?)

May not be a problem on slower machines. --threads=30 (for testing purposes, exceeding available threads) also worked for me.
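Spelled out, the check being run here looks roughly like this (a sketch; the input file is a placeholder):

  for t in 1 8 30; do
    XZ_OPT="--threads=$t" tar --create --xz --sparse \
        '--mtime=2015-10-21 00:00Z' \
        --file "test-$t.tar.xz" image.qcow2
    sha256sum "test-$t.tar.xz"
  done
  # observed above: t=1 differs from t=8, while equal t values
  # (even beyond the core count, e.g. 30) match each other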

anonymous1 (anonymous1):

anonymous1 added a comment.

You could also try lowering or increasing the compression dictionary size to see how it affects size and speed; however, I don't know the commands.

Is this different from the -0 to -9 (--extreme) compression settings?

In other words: do you think it is worth playing with various "--dict=" settings, independent of the compression level setting, for better speeds or smaller file sizes?

I think the default settings are optimal

--threads=8 should still work on slower machines; however, it would work like --threads=4 or --threads=2, I guess. In that case choosing the default threads is up to you. Could you try with 4? It may not be too different from 8.

anonymous1 added a comment. Edited Mar 10 2017, 8:20 PM

If you have 8 threads, and if using more than 8 produces the same checksum as 8, then what I said would be true.

I would recommend 4 max, but it's your choice

It's also a good idea to test with the same threads on different machines, to see whether there is any variation or not.

anonymous1 added a comment. Edited Mar 10 2017, 8:38 PM

I think in the worst case you could worry less about a perfectly reproducible end archive (tar.xz) and instead focus on the extracted (tar) file being reproducible. For example, Linux kernel releases are compressed as either xz or gz, but only the tar itself is signed.

I think the default settings are optimal

Okay.

--threads=8 should still work on slower machines; however, it would work like --threads=4 or --threads=2, I guess. In that case choosing the default threads is up to you. Could you try with 4? It may not be too different from 8.

4 uses only 50% of CPU.

Done, made that 8:

https://github.com/Whonix/Whonix/commit/17581ebbd05cc04f5ed52637e675481ddecc0845

If you have 8 threads, and if using more than 8 produces the same checksum as 8, then what I said would be true.

I would recommend 4 max, but it's your choice

It's also a good idea to test with the same threads on different machines, to see whether there is any variation or not.

Theoretically, let's say, a single-core machine might produce a different checksum than a quad core due to threads. But I doubt that. It's probably not using physical CPU threads but virtual CPU threads. top -H easily shows more than 500 virtual threads on a usual Linux system.

A few more threads than physical threads will probably incur only a negligible performance penalty. 8 vs. 4 should not matter on a slow system. (However, I speculate 10,000 threads would cause significant overhead.)

I think in the worst case you could worry less about a perfectly reproducible end archive (tar.xz) and instead focus on the extracted (tar) file being reproducible

Having the final file reproducible makes verification instructions and automation a lot easier.

  • Then it's just "rebuild the libvirt.xz, and compare the hashes".
  • Otherwise it's "rebuild the libvirt.qcow2, download the libvirt.xz, extract the qcow2, and compare the hashes of the qcow2 files, not the libvirt.xz files".

Hypothetically, the compressed libvirt.xz could contain an exploit against xz that compromises the system during decompression. By having a reproducible libvirt.xz we can exclude that.

For now it does not really matter whether libvirt.xz is reproducible. It's very far forward thinking, since reproducible Whonix images are unfortunately still far away; see:

https://forums.whonix.org/t/is-whonix-reproducible-yet-backdoor-protection

anonymous1 added a comment. Edited Mar 10 2017, 9:26 PM

Could you please check how long it takes with 4 threads? Using 50% of the CPU is expected; it does not necessarily mean it will take twice as long.

What I expect is that a PC with 4 threads may not reproduce the same archive even if --threads=8 doesn't give any error; it may produce an archive as if you had used --threads=4.

If 4 threads doesn't change the build time much, that would be a safer default

This quoted part indicates physical CPU threads:

Setting threads to a special value 0 makes xz use as many threads as there are CPU cores on the system

anonymous1 (anonymous1):

anonymous1 added a comment.

Could you please check how long it takes with 4 threads? Using 50% of the CPU is expected; it does not necessarily mean it will take twice as long.

Takes 1 minute longer.

What I expect is that a PC with 4 threads may not reproduce the same archive even if --threads=8 doesn't give any error; it may produce an archive as if you had used --threads=4.

It doesn't. I already tried a number that exceeds my physical cores more than twice (30) and had the same checksum each time I used the same number of threads (30). I am pretty sure it's virtual, not physical threads. At that level of abstraction, it would make little sense to see "wanted 30 threads, but only got 4 physical cores, will silently reduce threads to 8".

Did you compare your --threads=30 archive with your --threads=8 archive?

They may turn out to be the same, or you can try any number bigger than 8.

If that's the case, 4 is a safer default for reproducibility, with little impact on speed

I mean reproducibility "between" computers, not on the same one

I may be wrong; the best way to test this is maybe to create the same archive with half of the available cores in a VM. However, I can't do this: I don't have Debian stretch.

But at least one thing is clear: the memory requirement is directly proportional to the number of threads, and if the machine at hand does not meet those requirements, it will lower the number of threads.

I will try this with the XZ Utils binaries, first on a Windows host and then with half of the available cores in a Windows VM.

anonymous1 added a comment. Edited Mar 10 2017, 11:51 PM

Sorry for all this confusion. I think it is only a difference of whether the program "tries" to operate in single-threaded or multi-threaded mode. When we use --threads=1 or don't specify it (the default is 1), xz compresses the whole file in a single block; however, setting --threads to 0 or anything bigger than 1 triggers the multi-threaded mode, and the file is split into blocks (depending on the compression level) and then compressed, resulting in a difference in the archive file. How many threads are actually used is irrelevant. Changing the compression level or manually specifying the block sizes will change the outcome.

By setting --threads to 8 instead of 0 we actually enforce the multi-threaded mode (splitting the file into blocks) and prevent at least one cause of non-determinism: when --threads is set to 0 on a single-core machine, xz operates in "single-threaded mode" and compresses the file in a single block, whereas setting this to anything higher than 1 enforces "multi-threaded mode", which still splits and compresses the file in blocks even without being multi-threaded at all. You can verify this with a VM.

So it should be safe to set --threads to 8 or 16 or higher; this option means "use at most NUM threads".
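Following that explanation, the block layout could presumably also be pinned explicitly instead of being derived from the threads value (a sketch; the block size is an arbitrary example, untested here):

  # force block splitting with a fixed block size, so single-core and
  # multi-core machines should produce the same bytes
  XZ_OPT='--threads=8 --block-size=32MiB' tar --create --xz \
      --file output.tar.xz some_directory/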

If you have time, could you check how long it takes with 5 or 6 threads? I think it will be nearly equal to 8; not for reproducibility reasons, just for efficient use of system resources. There is probably no reason to use 16 cores on a machine that supports it; that would be overkill.

Did you compare your --threads=30 archive with your --threads=8 archive?

Doesn't make a difference in speed.

If you have time, could you check how long it takes with 5 or 6 threads? I think it will be nearly equal to 8; not for reproducibility reasons, just for efficient use of system resources. There is probably no reason to use 16 cores on a machine that supports it; that would be overkill.

6 causes 75% CPU, takes 4:59 minutes.

Actually, I am for maximum system resource usage as the default value. Capping stuff should be user opt-in, via custom settings through the operating system by the user. (Run in a VM or use other tools to add caps to the build script.)

Patrick lowered the priority of this task from High to Wishlist. Sun, Dec 9, 6:52 AM