the simplest thing that could possibly teach me more about postfix

Earlier today, I tweeted:

If I wrote a book, I might call it, “Everything I Learned About Programming, I Learned By Adding More Print Statements”

This is a little funny, but mostly serious. Anybody who has had the misfortunate to watch me try to fix anything has seen my great love affair with printing more crap. When I am deep in bug fixing, my code is brimming with noise like this:

say "about to do stuff"; # debug
do_stuff;
say "just did stuff"; # debug

Later, I’ll tell Vim to :g/# debug/d and everything will go back to normal.

This isn’t a weird practice. Doctors on TV (who, I am assured, are just like real life doctors) are always injecting radioactive dye. The Mythbusters put food color or colored smoke in some moving medium practically every other week. The issue is simple: most of what software is doing is invisible, and if you make it visible, you can see why it is totally failing to do anything remotely like it’s supposed to. There are a lot of great tools for this – debuggers, DTrace, and even (if you’re very lucky) built-in logging facilities. I like them all. It’s just that adding print statements is easier for some very large fraction of problems.

So, what does this have to do with Postfix?

At work, we move a lot of mail through Postfix. By “a lot” I just mean that it takes more than one server and I have to pay attention to the performance of our custom code. We have a lot of customizations to our mailflow to deal with virtual users, spam filtering, rate limiting, and so on. These are implemented with different parts of Postfix’s customization facilities, of which I’ll be talking about two: table lookups and policy services.

Often, all you want to do is say “mail for this domain goes to X, mail for this other domain goes to Y,” or “reject all mail from jerk@example.com.” These kinds of rules are easy to do with a table. Postfix has all kinds of places to add table lookups. For the two examples, we might have postfix configuration like this:

# in main.cf:
transport_maps = hash:/etc/postfix/domain-transports
smtpd_sender_restrictions = check_sender_access hash:/etc/postfix/sender-access

# in domain-transports:
X                   smtp:x-transport.mydomain.com
Y                   smtp:y-transport.mydomain.com

# in sender-access
jerk@example.com    551 you are a jerk

The “hash” in the configuration file says what kind of table it is. To simplify, it’s saying that we’re going to look up fixed strings and get fixed results based on them. During the SMTP conversation, Postfix looks up “jerk@example.com” in the sender access table, finds the 551 result, and rejects inbound mail. There are other kinds of tables, like PCRE for regex lookups and TCP tables to allow arbitrary logic behind the scenes. They’re easy to use:

smtpd_sender_restrictions = check_sender_access tcp:localhost:9000

Postfix will connect to localhost:9000 and this transaction can happen:

Postfix:  GET jerk@example.com
Server :  200 551 you are a jerk

This lets you, for example, store all these blacklists in some centrally-managed database and not try to build text files on all your servers.

The other facility I mentioned was the policy server. This is quite similar to a TCP table, but the transaction is more complicated:

# in your main.cf
smtp_sender_restrictions = check_policy_server inet:localhost:9001

# over port 9001...
Postfix: request=smtpd_access_policy
         protocol_state=RCPT
         protocol_name=SMTP
         helo_name=some.domain.tld
         queue_id=8045F2AB23
         sender=jerk@example.com
         recipient=bar@foo.tld
         recipient_count=0
         client_address=1.2.3.4
         client_name=another.domain.tld
         reverse_client_name=another.domain.tld
         instance=123.456.7
         sasl_username=friend@example.com
         (blank line)
Server:  action=551 you are a jerk
         (blank line)

This is obviously useful for doing all kinds of more complex policies, like IP in conjunction with sender, or HELO name verification, or applying different policies to different users based on their SASL authentication.

The event that set all my extra print statements, and thus this journal entry, into motion was related to making some bugfixes to one of our policy servers. It had a race condition, and when clients’ accounts were owned, spambots were much more likely to exploit the race than humans. The race had to go – but early attempts to fix it were miserably unsuccessful. I added some logging, and messages were being logged many times. It was time to finally admit my ignorance: most of our Postfix-interfacing code was written by a long-departed employee, and when he left, I never really stepped up to learn everything I should have.

I read a lot of documentation, and I felt only somewhat more educated. The Postfix documentation is pretty good, but it’s much more a reference than a tutorial, and I needed a way to see what the heck the various settings would do. The solution was simple. I would do what I always did: add more print statements.

I wanted to be able to see everything happening at each check, so I gave every check both a TCP table lookup and a policy service lookup by adding the following to the main.cf of a testing server:

smtpd_client_restrictions = check_policy_service inet:localhost:9001

smtpd_helo_restrictions = check_policy_service inet:localhost:9001,
                                 check_helo_access tcp:localhost:9002

smtpd_sender_restrictions = check_policy_service inet:localhost:9001

smtpd_recipient_restrictions = check_policy_service inet:localhost:9001,
                                 check_recipient_access tcp:localhost:9002,
                                 reject

smtpd_data_restrictions = check_policy_service inet:localhost:9001
smtpd_end_of_data_restrictions = check_policy_service inet:localhost:9001

Then all I had to do was set up servers on those ports. This was easy! I just used Net::Server and a few lines of <STDIN>, and print. Now every restriction was being used (to get a more or less no-op answer) and was consulting both a table and a policy server. The servers turned the queries they got into JSON and printed it. I could send a message (with Net::SMTP) to my test server and see all the radioactive dye swirl around.

The problem was immediately incredibly obvious: all of the non-DATA restrictions were being consulted only at RCPT time, and once per recipient! Even though I could take action as soon as I saw the sender, it was checking later, and many times. By default, Postfix allows 1,000 SMTP recipients, meaning we could consult 999 times more than needed! Armed with this knowledge, the documentation became very fruitful: the smtpd_delay_reject setting defaulted to true, and turned all the pre-RCPT checks into RCPT checks. With that off, I could reject things as early as connection time.

I also realized that relying on certain data in the DATA was no good. You can’t get a list of all the recipients, for example, only their count. This makes sense, and is documented, but seeing it happen is a much more effective way for me to realize this. I learn in two ways: learning by futzing around and learning by staring at moving parts. Here, I put the two together, with stellar results. Even when adding the print statements required building a virtual machine and writing a few TCP servers, it was a pretty easy and effective way to figure out what to do next.

I put my little TCP servers and test-mail-sending script and a few other odds and ends in my postfixplex GitHub repository.

Written on June 14, 2011