The other day, I wrote about our recent performance tuning in lintian. Among other things, we reduced the memory usage by ~33%. The effect was also reproducible on libreoffice (4.2.5-1 plus its 170-ish binaries, arch amd64), which started at ~515 MB and was reduced to ~342 MB. So this is pretty great in its own right…
But at this point, I have seen what was in “Pandora’s box”: the two magical numbers of 1.7kB per file and 2.2kB per directory in the package (plus 250-300 bytes per entry in binary packages). This is before even looking at data from file(1), readelf, etc. It is just the raw index of the package.
Depending on your point of view, 1.7-2.2kB might not sound like a lot. But for the lintian source with ~1 500 directories and ~3 300 non-directories, this sums up to about 6.57MB out of the (then) total usage of 12.53MB. With the recent changes, the per-entry cost dropped to about 1.05kB for files and 1.5kB for dirs. But even then, the index is still 4.92MB (out of 8.48MB).
This raises the question: what do you get for 1.05kB in Perl? The following is a dump of the fields and their sizes in Perl for a given entry:
    lintian/vendors/ubuntu/main/data/changes-file/known-dists: 1077.00 B
      _path_info:   24.00 B
      date:         44.00 B
      group:        42.00 B
      name:        123.00 B
      owner:        42.00 B
      parent_dir:   24.00 B
      size:         42.00 B
      time:         42.00 B
      (overhead):  694.00 B
The time, date, owner and group fields are fixed-size strings (at most 15 characters). The size and _path_info fields are integers, and parent_dir is a reference (nulled). Finally, name is a variable-length string. Summed, the values take less than half of the total object size. The remaining ~700 bytes are just “overhead”.
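To make the split between payload and overhead concrete, here is a small Python sketch that redoes the arithmetic. The byte counts are copied from the dump above; the variable names are mine and nothing here is part of Lintian.

```python
# Field sizes in bytes, copied from the per-entry dump above.
fields = {
    "_path_info": 24,
    "date": 44,
    "group": 42,
    "name": 123,
    "owner": 42,
    "parent_dir": 24,
    "size": 42,
    "time": 42,
}
total = 1077  # reported size of the whole entry, in bytes

payload = sum(fields.values())   # bytes spent on the actual values
overhead = total - payload       # everything else
payload_share = payload / total

print(f"payload:  {payload} B ({payload_share:.1%})")
print(f"overhead: {overhead} B ({1 - payload_share:.1%})")
```

The computed overhead matches the "(overhead)" line in the dump, so the listed fields account for everything else in the entry.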
Time for another clean up:
- The ownership fields are almost always “root/root” (0/0). So let’s just omit them when they satisfy said assumption. [f627ef8]
  - This is especially true for source packages, where lintian ignores the actual value and just uses “root/root”.
- The Lintian::Path API has always had a “cop-out” on the size field for non-files and it happens to be 0 for these. Let’s omit the field if the value was zero and save 0.17MB on lintian. [5cd2c2b]
  - Bonus: Turns out we can save 18 bytes per non-zero “size” by insisting on the value being an int.
- Unsurprisingly, the date and time fields can trivially be merged into one. In fact, that makes “time” redundant, as nothing outside Lintian::Path used its value. So say goodbye to “time” and hello to another 0.36MB of saved memory. [f1a7826]
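The clean-ups above all follow the same pattern: do not store a field when its value can be implied. Below is a minimal Python sketch of that idea. It is not the Lintian::Path implementation (which is Perl); the class name `PathEntry`, its accessors, and the sample paths and timestamps are all hypothetical, and Python's memory model differs from Perl's, so this only illustrates the shape of the change, not the byte savings.

```python
class PathEntry:
    """Hypothetical index entry that stores only fields differing from a default."""

    # Assumed defaults, based on the clean-up list above.
    DEFAULTS = {"owner": "root", "group": "root", "size": 0}

    def __init__(self, name, date_time, owner="root", group="root", size=0):
        # "date" and "time" are kept as one merged string.
        self._data = {"name": name, "date_time": date_time}
        # Coerce size to an int so it is never stored as a string.
        optional = {"owner": owner, "group": group, "size": int(size)}
        for key, value in optional.items():
            if value != self.DEFAULTS[key]:
                self._data[key] = value  # store only non-default values

    def _get(self, key):
        # Fall back to the implied default when the field was omitted.
        return self._data.get(key, self.DEFAULTS[key])

    @property
    def owner(self):
        return self._get("owner")

    @property
    def group(self):
        return self._get("group")

    @property
    def size(self):
        return self._get("size")

    def stored_fields(self):
        return sorted(self._data)


# Made-up sample entries for illustration only.
regular = PathEntry("usr/bin/tool", "2014-06-22 10:00", size="1024")
directory = PathEntry("usr/bin/", "2014-06-22 10:00")
custom = PathEntry("var/mail/", "2014-06-22 10:00", group="mail")

for entry in (regular, directory, custom):
    print(entry.stored_fields(), entry.owner, entry.group, entry.size)
```

Callers still see an owner, group and size on every entry; only the storage changes, which is why this kind of change can stay behind the existing accessor API.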
Which leaves us now with:
    lintian/vendors/ubuntu/main/data/changes-file/known-dists: 698.00 B
      _path_info:   24.00 B
      date_time:    56.00 B
      name:        123.00 B
      parent_dir:   24.00 B
      size:         24.00 B
      (overhead):  447.00 B
Still a ~64% overhead, but at least we reduced the total size by 380 bytes (585 bytes for entries in binary packages). With these changes, the memory used for the lintian source index is now down to 3.62MB. This brings the total usage down to 7.01MB, which is a reduction to 56% of the original usage (a.k.a. “the-almost-but-not-quite-50%-reduction”).
The results also carried over to libreoffice, which is now down to 284.83 MB (55% of original). The chromium-browser (source-only, version 32.0.1700.123-2) is down to 111.22MB from 179.44MB (62% of original; better results are expected if processed with binaries).
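For anyone who wants to redo the percentages, here is a short Python sketch using only the before/after figures quoted in this post. The libreoffice starting value is the approximate "~515 MB" from the first paragraph, so its ratio is approximate as well; the variable names are mine.

```python
# (before, after) memory usage in MB, as quoted in this post.
measurements = {
    "lintian": (12.53, 7.01),
    "libreoffice": (515.0, 284.83),  # "before" is the approximate ~515 MB
    "chromium-browser": (179.44, 111.22),
}

# Fraction of the original usage that remains after the changes.
ratios = {name: after / before for name, (before, after) in measurements.items()}

for name, ratio in ratios.items():
    before, after = measurements[name]
    print(f"{name}: {after:.2f} MB of {before:.2f} MB = {ratio:.1%} of original")
```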
In closing, Lintian 2.5.34 will use slightly less memory than 2.5.33.