Numbers and lintian

Lintian uses some small scripts to “collect” data from packages. In daily talk, they are usually referred to as “collection” scripts. Lintian uses files to track the status of “collection” scripts between runs. Consider the following directory listing on lintian.d.o:

$ ls -al laboratory/binary/eclipse/.*-*
-rw-r--r-- 1 user group  45 Jul 18 01:21 laboratory/binary/eclipse/.ar-info-1
-rw-r--r-- 1 user group  45 Jul 18 01:21 laboratory/binary/eclipse/.bin-pkg-control-1
-rw-r--r-- 1 user group  45 Jul 18 01:21 laboratory/binary/eclipse/.changelog-file-1
-rw-r--r-- 1 user group  45 Jul 18 01:21 laboratory/binary/eclipse/.copyright-file-1
-rw-r--r-- 1 user group  45 Jul 18 01:21 laboratory/binary/eclipse/.debian-readme-1
-rw-r--r-- 1 user group  45 Jul 18 01:21 laboratory/binary/eclipse/.doc-base-files-1
-rw-r--r-- 1 user group  45 Jul 18 01:21 laboratory/binary/eclipse/.fields-1

This example shows that a bunch of “collection” scripts have been run for the eclipse binary package. Each of these files contains the version of Lintian that created it (or last wrote it) and a timestamp.

Why the interest in these files? Let us go a bit back in time to the 15th of June 2011. That was the day Lintian 2.5.1 was uploaded to unstable. That version of Lintian had 17 collection scripts[1] for binary packages. So every binary package would have 17 files and there are… over 35 000 binary packages in the Debian archive.

Ouch, so that makes about 595 000 files if we use this on all binary packages in the archive. The size of each of those files is about 45 bytes, so that is a total of about 25.5 MB for all of these files[2]. So other than the “inode abuse”, this is not too bad. A little du -h should confirm this…

$ du -h laboratory/binary/eclipse/.*-*
4.0K    laboratory/binary/eclipse/.ar-info-1
4.0K    laboratory/binary/eclipse/.bin-pkg-control-1
4.0K    laboratory/binary/eclipse/.changelog-file-1
4.0K    laboratory/binary/eclipse/.copyright-file-1
4.0K    laboratory/binary/eclipse/.debian-readme-1
4.0K    laboratory/binary/eclipse/.doc-base-files-1
4.0K    laboratory/binary/eclipse/.fields-1

Whaaa… oh – the file system uses a block-size of 4K bytes, so I guess we have to pay a full block for these files. Let’s see what that gives, 595 000 times 4 kB is … 2.27 GB…
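As a rough check of the arithmetic above, here is a small shell sketch. The 17 scripts, 35 000 packages and 45 bytes are the figures from this post, and the 4096-byte block size is what du reported; the variable names are made up for the example.

```shell
#!/bin/sh
# Figures from the post; variable names are illustrative only.
scripts_per_package=17
binary_packages=35000
bytes_per_file=45
block_size=4096   # 4 KiB, as reported by du on this file system

total_files=$((scripts_per_package * binary_packages))
content_bytes=$((total_files * bytes_per_file))    # what the files contain
allocated_bytes=$((total_files * block_size))      # what the file system charges

echo "files:     $total_files"        # 595000
echo "content:   $content_bytes"      # 26775000
echo "allocated: $allocated_bytes"    # 2437120000

# Convert to binary megabytes and gigabytes for readability.
awk -v c="$content_bytes" -v a="$allocated_bytes" 'BEGIN {
    printf "content:   %.1f MB\n", c / (1024 * 1024)
    printf "allocated: %.2f GB\n", a / (1024 * 1024 * 1024)
}'
```

The gap between the two totals comes purely from rounding each tiny file up to a full block.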


That was in the (not too distant) past. About a month later (12th of July), the code creating these files was refactored into:

sub _mark_coll_finished {
    my ($self, $collname, $collver) = @_;
    # In the "old days" we would also write the Lintian version and the time
    # stamp in these files, but since we never read them it seems like overkill.
    #  - for the timestamp we could use the mtime of the file anyway
    return touch_file "$self->{base_dir}/.$collname-$collver";
}

This turns out to be a very space-saving change if we ask du -h:

$ du -h laboratory/binary/lintian/.*-*
0       laboratory/binary/lintian/.ar-info-1
0       laboratory/binary/lintian/.bin-pkg-control-1
0       laboratory/binary/lintian/.changelog-file-1
0       laboratory/binary/lintian/.copyright-file-1
0       laboratory/binary/lintian/.debian-readme-1
0       laboratory/binary/lintian/.doc-base-files-1

Existing files are not emptied, so we still have some old non-empty files left on lintian.d.o. Nevertheless that was the "nice story" about "the side effect of being lazy sometimes reduces space waste"[3].
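The difference between the old and the new marker files can be reproduced with a small shell sketch. The temporary directory and the file names are made up for this example, and the 45 bytes of content are just padding standing in for the version and timestamp.

```shell
#!/bin/sh
# Sketch only: directory and file names are hypothetical.
lab=$(mktemp -d)

# Old style: 45 bytes of content (placeholder text padded to 45 characters).
printf '%-45s' 'version and timestamp would go here' > "$lab/.old-marker-1"

# New style: create an empty file, much like touch_file in the Perl code.
: > "$lab/.new-marker-1"

old_size=$(wc -c < "$lab/.old-marker-1" | tr -d ' ')
new_size=$(wc -c < "$lab/.new-marker-1" | tr -d ' ')
echo "content: old=$old_size bytes, new=$new_size bytes"

# Allocated space is file-system dependent, so no output is promised here.
du -k "$lab/.old-marker-1" "$lab/.new-marker-1"

rm -rf "$lab"
```

On a file system with 4K blocks and no special handling of tiny files, du should charge a full block for the first file and nothing for the second, but other file systems may report different numbers.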

But we have another "number" problem. If you grep for "Too many links" in the lintian.log from lintian.d.o you should see:

$ grep -i "too many links" logs/lintian.log
mkdir: cannot create directory `/srv/': Too many links
mkdir: cannot create directory `/srv/': Too many links
mkdir: cannot create directory `/srv/': Too many links
mkdir: cannot create directory `/srv/': Too many links
mkdir: cannot create directory `/srv/': Too many links
mkdir: cannot create directory `/srv/': Too many links
mkdir: cannot create directory `/srv/': Too many links
mkdir: cannot create directory `/srv/': Too many links
$ grep -i "too many links" logs/lintian.log | wc -l

In the Lintian Laboratory every package is unpacked in a directory based on its type and name (as hinted in the output above). The problem is that ext3 has a limit on the number of sub-directories a directory can have[4]. This limit is just shy of 32 000 and (as a reminder) Debian has over 35 000 binary packages. So a lot of packages are currently not checked on lintian.d.o (#641468).
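To put a number on "just shy of 32 000", here is a back-of-the-envelope sketch in shell. It assumes the usual accounting on ext3: a directory already has two links of its own (its "." entry and its name in the parent), and every sub-directory adds one more through its ".." entry. The variable names are made up for the example.

```shell
#!/bin/sh
# Assumption: ext3 allows at most 32000 links per inode (see footnote 4).
ext3_link_max=32000
own_links=2   # "." plus the directory's entry in its parent

max_subdirs=$((ext3_link_max - own_links))
binary_packages=35000   # "over 35 000" in the post, so this is a lower bound

overflow=$((binary_packages - max_subdirs))
echo "max sub-directories: $max_subdirs"   # 31998
echo "packages left out:   $overflow"      # 3002
```

So with one directory per binary package, at least some 3 000 packages cannot get a directory at all.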

The current feature branch for #641468 already solves the directory limit issue (by using a "mirror-like" pool layout). It also removes the need for "1 file per completed collection" by writing this information in the existing "per entry" status file.
I hope we can get the branch merged into master within 2-3 weeks, though there are still a few issues that need to be worked out before then.
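To give an idea of what a "mirror-like" pool layout could look like, here is a minimal shell sketch. It mimics the convention used in the pool directory of Debian mirrors (first letter of the name, or "lib" plus one letter for library packages). The function name pool_dir and the exact path format are assumptions for illustration; the layout in the feature branch may well differ.

```shell
#!/bin/sh
# pool_dir NAME: print a hypothetical pool-style relative directory for NAME.
pool_dir() {
    name=$1
    case $name in
        # Library packages are spread out by "lib" plus the next letter.
        lib?*) prefix=$(printf '%s' "$name" | cut -c1-4) ;;
        # Everything else is grouped by its first letter.
        *)     prefix=$(printf '%s' "$name" | cut -c1) ;;
    esac
    printf 'pool/%s/%s\n' "$prefix" "$name"
}

pool_dir eclipse   # pool/e/eclipse
pool_dir lintian   # pool/l/lintian
pool_dir libc6     # pool/libc/libc6
```

The point of such a scheme is that no single directory has to hold one sub-directory per package, which keeps each level well below the ext3 limit.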


[1] Actually it had 18 collections, but one of them is "auto-removed" so it is better not to include it in this case.

[2] Content alone - metadata like permissions and such is ignored.

[3] I admit it was more luck than intention!

[4] The limit comes from the "32 000 hardlinks per inode" limitation.
