Improving bulk performance in debhelper

Since debhelper/10.3, there has been a number of performance related changes.  The vast majority primarily improves bulk performance or only have visible effects at larger “input” sizes.

Most visible cases are:

  • dh + dh_* now scales a lot better for large number of binary packages.  Even more so with parallel builds.
  • Most dh_* tools are now a lot faster when creating many directories or installing files.
  • dh_prep and dh_clean now bulk removals.
  • dh_install can now bulk some installations.  For a concrete corner-case, libssl-doc went from approximately 11 seconds to less than a second.  This optimization is implicitly disabled with –exclude (among other).
  • dh_installman now scales a lot better with many manpages.  Even more so with parallel builds.
  • dh_installman has restored its performance under fakeroot (regression since 10.2.2)

 

For debhelper, this mostly involved:

  • avoiding fork+exec of commands for things doable natively in perl.  Especially, when each fork+exec only process one file or dir.
  • bulking as many files/dirs into the call as possible, where fork+exec is still used.
  • caching / memorizing slow calls (e.g. in parts of pkgfile inside Dh_Lib)
  • adding an internal API for dh to do bulk check for pkgfiles. This is useful for dh when checking if it should optimize out a helper.
  • and, of course, doing things in parallel where trivially possible.

 

How to take advantage of these improvements in tools that use Dh_Lib:

  • If you use install_{file,prog,lib,dir}, then it will come out of the box.  These functions are available in Debian/stable.  On a related note, if you use “doit” to call “install” (or “mkdir”), then please consider migrating to these functions instead.
  • If you need to reset owner+mode (chown 0:0 FILE + chmod MODE FILE), consider using reset_perm_and_owner.  This is also available in Debian/stable.
    • CAVEAT: It is not recursive and YMMV if you do not need the chown call (due to fakeroot).
  • If you have a lot of items to be processed by a external tool, consider using xargs().  Since 10.5.1, it is now possible to insert the items anywhere in the command rather than just in the end.
  • If you need to remove files, consider using the new rm_files function.  It removes files and silently ignores if a file does not exist. It is also available since 10.5.1.
  • If you need to create symlinks, please consider using make_symlink (available in Debian/stable) or make_symlink_raw_target (since 10.5.1).  The former creates policy compliant symlinks (e.g. fixup absolute symlinks that should have been relative).  The latter is closer to a “ln -s” call.
  • If you need to rename a file, please consider using rename_path (since 10.5).  It behaves mostly like “mv -f” but requires dest to be a (non-existing) file.
  • Have a look at whether on_pkgs_in_parallel() / on_items_in_parallel() would be suitable for enabling parallelization in your tool.
    • The emphasis for these functions is on making parallelization easy to add with minimal code changes.  It pre-distributes the items which can lead to unbalanced workloads, where some processes are idle while a few keeps working.

Credits:

I would like to thank the following for reporting performance issues, regressions or/and providing patches.  The list is in no particular order:

  • Helmut Grohne
  • Kurt Roeckx
  • Gianfranco Costamagna
  • Iain Lane
  • Sven Joachim
  • Adrian Bunk
  • Michael Stapelberg

Should I have missed your contribution, please do not hesitate to let me know.

 

This entry was posted in Debhelper, Debian. Bookmark the permalink.

1 Response to Improving bulk performance in debhelper

  1. Pingback: “debhelper-compat (= 12)” is now released | nthykier

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.