Marc Espie talks about OpenBSD ports internals

Marc Espie is an interesting sort of chap living in Paris, France. When tired, he watches a lot of DVDs. He enjoys juggling and riding through Paris on rollerblades.

Sometimes he hacks on the OpenBSD operating system. We tricked him into talking about time machines - and the internals of OpenBSD ports.

How and when did you get involved with OpenBSD?

About six years ago. It was the only OS where I could figure out WHAT to grab before installing it on my then-Amiga. Then I found minor faults in some manpages, and the rest is history...

What areas of OpenBSD do you mostly work within?

Ports infrastructure, make, m4, gcc, documentation... too many things, in fact.

For people unfamiliar with ports, what are the basic concepts?

Transparent infrastructure, compared to other packaging systems. Oriented towards the production of binary packages, and robustness.

Most people should only see the binary packages and pkg_add.

For the rare people building packages from source, the idea is that you can do make package, and the system will grab everything and build a package, but in a transparent way: figuring out a port's Makefile is simple. Decomposing what happens into distinct steps (fetch, extract, patch, configure, build, fake, package) is trivial.
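
The stage decomposition can be sketched in plain shell — a toy illustration, NOT the real bsd.port.mk logic (the stage names are the real ones; the "work" is a placeholder, and the marker files mimic the cookie files the real ports tree leaves behind so a stage runs only once):

```shell
# Toy sketch of the ports stage decomposition (not the real bsd.port.mk).
workdir=$(mktemp -d)

stage() {
    # Run a stage only once: a marker file records that it already ran,
    # much like the cookies the real ports infrastructure uses.
    [ -e "$workdir/.$1" ] && return 0
    echo "===> $1"
    touch "$workdir/.$1"
}

for s in fetch extract patch configure build fake package; do
    stage "$s"
done

# Each stage left its marker; running the loop again would be a no-op.
count=$(ls -A "$workdir" | wc -l)
rm -r "$workdir"
```

Running it prints one `===>` banner per stage, and a second run over the same work directory would do nothing — which is exactly what makes the real steps individually inspectable and restartable.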

Compared to other BSDs, there are carefully chosen knobs you can tweak (FLAVORS) and the package system is consistent: each knob will change the package name, and package dependencies will take those changes into account if needed.
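
As a small illustration of that naming rule (the package and flavor names here are made up): a selected flavor gets appended to the package name, so a flavored package can never be confused with the default one.

```shell
# Sketch: how a selected FLAVOR shows up in the package name.
# Package and flavor names are illustrative only.
PKGNAME="mutt-1.5.9"
FLAVOR="no_x11"                         # empty string = default package

# Append "-$FLAVOR" only when a flavor was actually selected.
FULLPKGNAME="${PKGNAME}${FLAVOR:+-$FLAVOR}"
echo "$FULLPKGNAME"                     # mutt-1.5.9-no_x11
```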

There are also lots of automatic/semi-automatic checks now that make it really hard to have packages you can't install later, or that won't work, because of shared library issues, for instance.

We also try to shrink our infrastructure instead of growing it. The fewer special cases we have to handle, the fewer bugs we get. So we try really hard to unify things, and to provide tools to porters that actually get used.

About a year ago, you started rewriting the package tools and making changes to the ports framework. How much of the added feature set was planned from the beginning? How much did things change as you went along? Was this a one-man effort?

Well, actually it started earlier than that. There was a careful planning phase the general public didn't see. Fiddling with perl, mostly. And a big part of the clean-up was done to simplify the move.

Apart from that, most things you see today were actually planned from the start. The initial change took a long time because it was mostly setting up a simple, safe, robust infrastructure that I could take where I wanted. There were a few unplanned changes and bug-fixes... I fully expected that there would be some unforeseen issues, and the infrastructure was designed to be powerful enough to deal with changes. So far, make update-plist is the most complicated part of the new infrastructure, and the one that triggered a few interesting changes.

This would give the impression this is a one-man effort, but actually, I discussed a lot of issues with my friends, and I've had a lot of help testing things, and fixing stuff. Naddy, Peter, Nikolai, JMC, Theo and Todd have been very helpful.

Most of the code is mine, but testing and making comments is a BIG part of the process.

Are the changes you're doing supposed to mimic/inspired by any other particular package tools?

Yes and no. I've read through rpm and apt specs and code, for instance, and I'm familiar with some aspects of emerge, or of the perl CPAN system.

I've mostly read that code to make certain this was NOT what I wanted to do. The design specs of the new pkg tools were geared towards pkg update from the start. It was obvious from the beginning the packing-list stuff had to be object-oriented, and getting that right was the most difficult effort.

One other fun design spec was to see how far I could get without adding a lot of new data structures (or cached data) to /var/db/pkg. So far, there's been very little change to it, and the `flat' text files cope very well; pkg_add and pkg_delete are still very fast, and I'm quite proud of that. ;-)

Compared to rpm, I didn't want to redesign yet another script language, and so far that has proven successful. There's been some independent reinvention. Shared libraries are central to update issues and need explicit first-class support. That was an independent re-discovery, but we end up having mechanisms very similar to what rpm does in that area. Also, using checksums to figure out when to update configuration files is a feature we share with rpm. To be fair, I don't know whether it's a natural way to do things, or some trick I just remembered from past experience with rpms a long time ago.
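
The checksum trick for configuration files can be sketched like this — a toy illustration with made-up file contents, not the actual pkg_add code. The rule: at update time, only overwrite a config file if it still matches the checksum it shipped with, i.e. the user never edited it.

```shell
# Toy sketch of checksum-driven config updates (not real pkg_add code).
mkdir -p demo/etc
printf 'option = default\n' > demo/etc/app.conf
shipped=$(cksum < demo/etc/app.conf)    # checksum recorded at install time

# The user customizes the file...
printf 'option = custom\n' > demo/etc/app.conf

# ...so at update time the checksums differ and the file is preserved.
if [ "$(cksum < demo/etc/app.conf)" = "$shipped" ]; then
    printf 'option = new-default\n' > demo/etc/app.conf
    action="replaced"
else
    action="kept"
fi
echo "config $action"
rm -r demo
```

Here the user edited the file, so the update leaves it alone and prints `config kept`; had the file been untouched, the new default would have been installed silently.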

The dpkg/apt stuff has had less visible influence, though we may get some of our interactive choice features from them later.

The newcomer, pkg_merge, comes directly from ideas I first saw in NeXTstep.

In general, other package systems are mostly programs I read for design ideas, but the compromises are so vastly different there's little influence I can trace.

In what ways, if any, is ports functionality limited by being implemented in the make language? Would you prefer another language, or is make well suited for the task?

make is very well suited for basic ports work. In the past, some stuff was not very cohesive, and we had tens of little scripts that all did the same things over and over. Having the package tools in perl means that now, those scripts all refer to the same library pieces, and are much more cohesive.

Can you tell us something about what the FAKE framework is used for and how it works?

FAKE is a direct consequence of wanting binary packages that work. To do that, one must force *all* package installations to go through pkg_add.

Quite simply, FAKE is a magic trick. You tell the software `okay, let's install stuff, but use this special funny area that's not really where you're going to end up'. So, you install stuff under a fake/ directory hierarchy, then prepare the package using pkg_create, then do the actual install with pkg_add.

You just need to tweak the install process so that stuff goes into fake/usr/local/bin instead of /usr/local/bin, for instance. With much modern software, you just need to set DESTDIR=fake. In other cases, this is a bit more complicated. A lot of Makefiles use variables like bindir=/usr/local/bin to specify where a binary should go. Well, if you execute make bindir=fake/usr/local/bin, the value you set on the command line will override the Makefile contents.
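
The staging trick can be demonstrated with a plain-sh sketch, no ports tree involved (the file installed here is a stand-in): installing into the fake area just means prefixing every destination path with the staging root.

```shell
# Plain-sh sketch of the fake/DESTDIR idea.
FAKE="$PWD/fake"                 # the staging area
bindir="/usr/local/bin"          # where the binary should finally live

# "Install" into $FAKE$bindir instead of $bindir.
mkdir -p "$FAKE$bindir"
printf '#!/bin/sh\necho hello\n' > "$FAKE$bindir/hello"
chmod +x "$FAKE$bindir/hello"

# pkg_create would now archive the fake/ tree, and pkg_add would later
# do the real install into /usr/local/bin.
staged=$(ls "$FAKE$bindir")
rm -r "$FAKE"
echo "$staged"
```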

Together, these solve 99% of all package installs. The rest needs patches and special cases...

One of the new features is that most but not all packages can now be updated. What prevents some packages from being updateable?

First, you can force updates in many cases. It's just not as safe as the packages that are updateable without questions.

Mostly, there are dependency issues. Complex systems can usually not be updated piecewise. In the next release of OpenBSD, there will probably be some newfangled additions to pkg_add that should allow you to update several inter-dependent packages at once, and to figure out (more or less automatically) small units of two or three packages that need to be updated together.
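
Figuring out which packages must move together is essentially a dependency-ordering problem. As an illustration only (package names made up, and this is not how pkg_add computes it), the standard tsort(1) utility can already turn dependency pairs into an update order:

```shell
# Illustration: ordering an update set is a topological sort of the
# dependency graph. Each input pair "B A" means "update B before A".
# Package names are made up.
order=$(printf '%s\n' \
    'glib pango' \
    'glib gtk'  \
    'pango gtk' \
    'gtk gimp'  | tsort)
echo "$order"
```

With this graph, the library everything depends on (glib) comes out first and the leaf application (gimp) last, which is exactly the order a grouped update has to follow.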

There's also the complex issue of data migration. Database formats, for instance. This needs very careful testing. We won't do this until we're certain how to do it.

I often say `you can't predict the future'. Some packages just have issues in their current incarnations, and the `next' version can't cope with all issues present in today's packages. This will get much smoother as we figure out what annotations we have to put into packages to make every update simple.

And finally, we don't know yet how to update running programs, but that's actually a `non-safe' aspect of package updates.

If you look at existing package systems that support updates, you either have an `anything goes' approach (mainly, updates often go wrong and leave you with a half-baked system), or a VERY complicated system that deals with a lot of issues and needs a HUGE amount of specific annotations to take every update scenario into account. Look at debian, for instance. There's a huge amount of work there that makes most update scenarios workable.

We don't have the manpower to go that way, and I don't believe it's a good idea anyways. Lots of new update scenarios will turn out to be special cases. Which means a lot of scripts. A lot of tests. And invariably, some tests won't be done, and some cases will fail. So: fewer cases, fewer tests, better robustness.

Part of making it possible to update ports required adding a WANTLIB marker to the Makefile of close to all ports. However, the ports framework can assist the port maintainer in making this list. If the ports framework can figure this out, why can this not just be done dynamically then?

Because we want reproducible binary packages. WANTLIB determination is directly dependent on configure results. And configure results may change when you install more shit on your machine. So, this acts as a fail-safe that reminds the port maintainer he HAS to do some work. Likewise, packing-lists are not generated automatically, even though this is possible in 99% of the cases.

Those are very important quality factors. We don't want ports done by people who do not understand the issues involved. I make no apology for that: the port system should get as simple as it can, but not simpler. You must have some technical experience to be able to prepare a port. The `writing' part of ports is not designed for morons.

Do you consider the ports framework to be fairly feature-complete, or are you planning any big changes?

No real big changes, but an interesting number of small changes. If you take a time-machine and look at the ports framework in OpenBSD 3.9, you'll find that a lot of things have become way easier. ;-)

Specifically, streamlining updates. And more seamless integration of ports and packages.

Is there any possibility of support for (possibly optional) signed packages in the future? Are there any major problems in doing this?

Minor problems. I mostly need to write the missing code.

What are the most computationally expensive operations going on in the framework? Are any features not going in because they require expensive computation?

Simulating pkg additions/deletions is very intensive, since the tools need to keep track of a lot of stuff internally. It was the main optimization I did for 3.7.

Handling shared stuff can be rather intensive too (shared directory deletions, for instance) when you have 500-600 installed packages.

The current way shared libraries are handled consumes quite a bit of processing power, since it now verifies that the dependency chain leads to the right library.

It's mostly reading and processing big lists. This got optimized as far as it could.

One feature which isn't going in so far is anything that depends on processing lots of uninstalled packages. Say, when you look for a dependency, you always go for the default dependency. There's also no provide/require mechanism like in other package systems, because it would need databases of packages.

I'm resisting incorporating this kind of stuff; I know how quickly caches and databases can get out of synch. I believe this approach to be more user-friendly. Each time you hide some complex mechanism in the tools, this makes for one more notion the user has to master before they can use the tools. I won't quote names, but I remember at least two or three gory stories related to package systems that cache/duplicate important information and that yield VERY hard to understand problems when the information gets out of synch.

We'll see how far we can go without this, and we'll add it if we need...

As far as I'm concerned, I'd rather streamline the existing mechanisms some more, and fix the few unintuitive aspects of our current package system (being more explicit about what's going on when you end up with a partial package, or explaining issues with `old' shared libraries when needed).

There's some obvious hubris involved: your average developer naturally cares a lot about what he does, and so writing all sorts of cool features and new documentation for your tools gives you a warm fuzzy feeling. Well, the cold reality is something entirely different. The less people have to remember about how pkg_add works, the happier they will be.

Ideally, you would just enter pkg_add, and have the system do what you want. Right now, you still need a fairly large number of switches to do stuff... This *will* get simpler.

For instance, my ideal way of updating packages would be to enter:

# pkg_add -u

and have the system automatically figure out *everything*.

Automated installs are nice if you need to set up a large number of similar boxes or reinstall boxes on a regular basis. Are there any problems in making the install script in siteXY.tgz install packages?

Exactly one single problem: it won't work on machines without enough RAM. pkg_add needs a running perl interpreter. This consumes a few megabytes of memory. All other problems related to that have been solved. pkg_add now transparently works in all kinds of chroot'ed environments.

Another question related to automated installs. Is it possible to make a list or configuration file which specifies that certain ports should always be built in a certain flavour? It might be nice for a setup with many similar boxes during upgrades if such a specification can just be distributed as part of siteXY.tgz.

People who ask for that haven't grasped some fine points in the documentation. ;-)

Ouch. If we had been paying more attention, what would we have learned?

You can just create a list of PKGPATH and build those ports. Have a look at /usr/ports/infrastructure/plist sometime. That's exactly the feature you're asking for.

And have a look at package contents, notice the line that says:

@comment subdir=archivers/zip cdrom=yes ftp=yes

Yes, you can use this to go from a package to the corresponding port specification.
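
As a sketch of how you might script that mapping (using the packing-list line quoted above; this is an illustration, not an official tool), pulling the PKGPATH out is a one-liner:

```shell
# Sketch: recover the PKGPATH from a packing-list @comment line.
line='@comment subdir=archivers/zip cdrom=yes ftp=yes'
pkgpath=$(printf '%s\n' "$line" | sed -n 's/.*subdir=\([^ ]*\).*/\1/p')
echo "$pkgpath"        # archivers/zip
```

Run over every installed package's packing-list, that gives you exactly the kind of PKGPATH list the previous answer describes.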

When making ports that run under apache, special care must often be taken to ensure that they run properly when chrooted, as is the default mode of operation. Often this requires copying lots of stuff into the chroot, and the port maintainer must write instructions on how to populate it properly. Do you think it would be hard, or even possible, to automate this task by using e.g. systrace with automatic policy generation to generate a list of needed files?

Installing packages in a chroot environment is now possible. It will tell you what libraries are missing. I don't believe it's a good idea to go to some finer-grained approach. You mostly either want a port under chroot, or not. There could be a few scripts to simplify this. It would have to be done by people actually interested in running complex things under apache's chroot.

What would you say is the most overlooked feature of pkg_*?

The perl modules are there so that someone could write some cool tools directly. So far, all the stuff I've seen is stupid shell scripts that invoke the pkg tools. Using perl directly would be so much more powerful.

Everything is integrated. It's more a unified language for package handling rather than separate tools.

Thank you for taking the time to share these thoughts with us, Marc!