benzedrine.ch - OpenBSD Packet Filter - How it started (June 2001)

How it started (June 2001)

This is my first humble attempt at hacking on the OpenBSD kernel. Since as of late we lack a packet filter, this looks like a good start. Even if in all probability this won't end up as a usable piece of code, it might prove to be instructive nonetheless.

Where to get the packets

The right place to interface with the kernel is netinet/ip_input.c and ip_output.c. I placed the hooks between the checksum calculation and the call to the lower level (ethernet) functions. Possibly they can/should be moved slightly, for instance to use the kernel's fragment reassembly.

Adding a pseudo device

A character type pseudo device /dev/pf looks like a good interface between the kernel code and userland. Using ioctl(), a userland tool can load filter rules into the kernel, query stats, etc.
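As a rough illustration of the userland side, here is a minimal sketch of a tool querying statistics through ioctl() on the device. The struct pf_stats_sketch layout and the PFIOC_GETSTATS_SKETCH command are hypothetical names made up for this example; they are not the actual interface of the code described here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical statistics record, invented for this sketch. */
struct pf_stats_sketch {
    unsigned long packets_in;
    unsigned long packets_out;
};

/* Hypothetical ioctl command: read a stats record from the kernel. */
#define PFIOC_GETSTATS_SKETCH _IOR('P', 1, struct pf_stats_sketch)

/*
 * Open the pseudo device, ask the kernel to fill in *st, and close it.
 * Returns 0 on success, -1 if the device cannot be opened or the
 * ioctl fails (errno is set by open() or ioctl()).
 */
int query_stats(const char *dev, struct pf_stats_sketch *st)
{
    int fd, rc;

    assert(st != NULL);
    fd = open(dev, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = ioctl(fd, PFIOC_GETSTATS_SKETCH, st);
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

Loading rules would presumably follow the same pattern with a different command and a struct carrying one rule per call, but that is an assumption as well.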

Stateful filter and NAT

Currently, stateful filtering and NAT are implemented. The rule syntax is identical to IPF's, minus some options that are not yet implemented (keep frags, return-icmp, etc.).

A state basically consists of three address/port pairs: one for the external host, one for the gateway, and one for the internal machine that is being NATed. If the connection is not NATed, two of the pairs are identical.

Each stateful connection creates a state, and its address is inserted into two trees and one linked list. One tree is sorted on external and gateway address/port, the other on external and internal address/port. The entries in each tree are unique (no duplicate keys). When a packet comes in on the (external) interface, the external-gateway tree is searched. When a packet goes out, the external-internal tree is searched. NATed connections always have state, and the three address/port pairs are used to translate addresses. There is no other NAT mapping table of any kind. Tree searches (the primary operation which occurs for every packet) are O(log n), inserts and removals as well (they occur only once for each connection).
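The two sort orders can be sketched as a pair of comparison functions over the same state structure. This is only an illustration: the struct layout and all names are assumptions (IPv4 addresses in host byte order, gateway and internal pairs assumed to be the two that coincide for non-NATed connections), not the original source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One address/port pair (IPv4, host byte order for simplicity). */
struct addr_port {
    uint32_t addr;
    uint16_t port;
};

/* Hypothetical state layout: the three pairs described above. */
struct state {
    struct addr_port ext;   /* external host */
    struct addr_port gwy;   /* gateway */
    struct addr_port lan;   /* internal machine; same as gwy if not NATed */
};

/* Order two pairs by address, then by port. */
static int cmp_pair(const struct addr_port *a, const struct addr_port *b)
{
    if (a->addr != b->addr)
        return a->addr < b->addr ? -1 : 1;
    if (a->port != b->port)
        return a->port < b->port ? -1 : 1;
    return 0;
}

/* Key of the tree searched for incoming packets: external + gateway. */
int cmp_ext_gwy(const struct state *a, const struct state *b)
{
    int c;

    assert(a != NULL && b != NULL);
    c = cmp_pair(&a->ext, &b->ext);
    return c != 0 ? c : cmp_pair(&a->gwy, &b->gwy);
}

/* Key of the tree searched for outgoing packets: external + internal. */
int cmp_ext_lan(const struct state *a, const struct state *b)
{
    int c;

    assert(a != NULL && b != NULL);
    c = cmp_pair(&a->ext, &b->ext);
    return c != 0 ? c : cmp_pair(&a->lan, &b->lan);
}
```

With a state found through the external-gateway key, the internal pair stored in the same state gives the address to rewrite an incoming packet to, and the gateway pair does the same for outgoing packets. That is why no separate mapping table is needed.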

The linked list is used to traverse all states when expired states are purged. This is an O(n) operation, but occurs at most once every 10 seconds (not for every packet). A sorted container doesn't make sense: the expire time of a state (the key) usually changes with each packet.
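A minimal sketch of such a rate-limited purge follows. The node layout and function names are hypothetical. A real implementation would also have to remove each expired state from both trees before freeing it, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define PURGE_INTERVAL 10   /* seconds between two purge passes */

/* Hypothetical list node; only the fields needed for purging. */
struct state_node {
    time_t expire;              /* absolute expiry time */
    struct state_node *next;
};

/* Prepend a new node; returns the new head (old head if malloc fails). */
struct state_node *push_state(struct state_node *head, time_t expire)
{
    struct state_node *n = malloc(sizeof(*n));

    if (n == NULL)
        return head;
    n->expire = expire;
    n->next = head;
    return n;
}

/* Walk the whole list once (O(n)) and free every expired node. */
size_t purge_expired(struct state_node **head, time_t now)
{
    struct state_node **pp = head;
    size_t removed = 0;

    assert(head != NULL);
    while (*pp != NULL) {
        struct state_node *n = *pp;

        if (n->expire <= now) {
            *pp = n->next;      /* unlink, then free */
            free(n);
            removed++;
        } else {
            pp = &n->next;
        }
    }
    return removed;
}

/* Cheap per-packet check: purge only if the interval has elapsed. */
size_t maybe_purge(struct state_node **head, time_t now, time_t *last_purge)
{
    assert(last_purge != NULL);
    if (now - *last_purge < PURGE_INTERVAL)
        return 0;
    *last_purge = now;
    return purge_expired(head, now);
}
```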

The design handles large numbers of states (concurrent connections) decently. Right now, the memory usage for the states is the limit. There is much to be optimized (whole bytes are wasted to hold bits, etc.).

Here's a sample filter rule file and NAT rule file.

Note that while protocols and ports can be given as names (translated through /etc/protocols and /etc/services), icmp-type and code must be numeric. The parser doesn't do much error checking yet, so watch the parser output to verify you get the rules that you meant.

The filter rules are now applied after NAT takes place (both in and out), which is the same behaviour as IPF.

Missing parts, open questions

keep frags

Currently, fragments are handled cheaply. The first fragment, if it contains the complete IP and TCP/UDP/ICMP headers, is filtered according to the rules. Any fragments with higher offsets (that don't contain header data) are passed without checks. If the headers themselves are fragmented (this almost certainly is an attack, I don't know of any real-life MTU less than 64), those packets are dropped.

This, of course, is not "keep frags". But it's cheap, memory-wise, for the filter. And with the first fragment being filtered, I can't see how an attacker can gain anything from the higher fragments being passed without being filtered. The destination host will, missing the first fragment, be unable to reassemble the fragments. If someone can show me how this is dangerous and worth the time and memory of a fragment cache on the filter, I'm happy to listen. Consider that if legitimate connections get fragmented, the cache would require large amounts of memory, especially for high-speed connections (remember that the fragments can arrive in any order). I can easily imagine several dozen megabytes.
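The three cases can be sketched as one small decision function. The function and enum names are made up for this example, and treating a later fragment that starts inside the header region as "fragmented headers" is my reading of the description, not necessarily what the original code checks. The minimum header sizes (TCP 20, UDP 8, ICMP 8 bytes) are the standard ones.

```c
#include <assert.h>

enum frag_action {
    FRAG_FILTER,    /* run the packet through the filter rules */
    FRAG_PASS,      /* pass without checks */
    FRAG_DROP       /* headers are fragmented: drop */
};

/* Minimum transport header sizes in bytes. */
#define TCP_HDR_MIN  20
#define UDP_HDR_MIN   8
#define ICMP_HDR_MIN  8

/*
 * frag_off:    offset of this fragment's data in the original
 *              datagram, in bytes (0 for the first fragment)
 * payload_len: bytes following the IP header in this fragment
 * hdr_len:     transport header bytes the filter needs to see
 */
enum frag_action classify_fragment(unsigned frag_off, unsigned payload_len,
    unsigned hdr_len)
{
    assert(hdr_len > 0);
    if (frag_off == 0)
        return payload_len >= hdr_len ? FRAG_FILTER : FRAG_DROP;
    if (frag_off < hdr_len)
        return FRAG_DROP;   /* would carry part of the header */
    return FRAG_PASS;
}
```

For example, a first TCP fragment carrying only 8 bytes after the IP header is dropped, while a fragment at offset 1480 is passed unchecked.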

Please let me know if you have some ideas or know of a good reference for these tasks.

Installation

Here's the code so far, based on 2.9-current, relative to /usr/src/sys/. It's possible to use the code on -stable (which still has IPF installed), but care must be taken to disable IPF correctly. If you want this, ask me. It's a pain to keep even one set of patches relatively up to date, so I stopped doing it for -stable.

Note: This is a historical snapshot from June 26th 2001, the last version prior to the import into the OpenBSD source tree. pfm was later rewritten using yacc and renamed to pfctl. The entire source, including kernel (with an AVL tree implementation) and userland, was only 3118 lines of text. It has grown a bit since.

All further changes to the source code are documented in the OpenBSD CVS tree, which you can browse using cvsweb:

Latest changes (newest at the top)

Sync with -current, updated patches
Converted to KNF according to style(9).
Fixed return-rst for outgoing packets, call looutput(lo0ifp, ...). Now works for NATed packets as well.
Learned about m_pullup() today and fixed the gross bugs (overflowing mbufs) that were previously present. Odd that it would even work at all before.
Changes to the ip_input() and ip_output() patches: pullup requires changing the mbuf pointer. Now I see what IPF did there and why.
Handle fragments (see above).
Handle mbuf chains that contain split IP and TCP/UDP/ICMP headers.
Added pfm log command, which starts the data gathering for packets going through that interface.
Working on a (simple) visualization tool for the statistics gathered by the filter. Here is the output from a 512/128 kbps ADSL line.
The 100 Mbps LAN is filtered by a humble P-200, and its CPU is still >60% idle, even when the interface is saturated (100 Mbps in both directions). I have yet to test with a large number of states.
Disabled return-rst for outgoing packets; calling ip_output() like I did isn't good (can crash).
Use AVL tree for states instead of simple linked list. Search is O(log n) instead of O(n), at the cost of O(log n) for insert/removal. Since a state is looked up an order of magnitude more often than it's inserted and removed, this is a good deal. Use two trees, for two different keys.
Purge expired states not with every passing packet, but only once in 10 seconds. This has O(n) cost.
Rewrote the ICMP error message handler; it's clearer with the new trees.
Added some statistics gathering. Eventually, a cron job can create nice graphical views, like the link below shows.
Added state code for ICMP. ICMP error messages (which refer to a TCP/UDP packet) are handled using the TCP/UDP states. ICMP queries create their own states. Now NATed machines can happily ping and traceroute.
NATed/RDRed packets always create state when they're not blocked.
Adjusted TCP timeouts. Now half-closed connections (from stupid web browsers) time out soon.
Supply separate patch tarballs for 2.9-current and -stable. Be sure to use the right ones. Applying the wrong ones will not fail (the original files are very similar), but the kernel may crash when the packet filter is started. When in doubt, patch the files manually. Note that -current uses major number 71 and -stable 70.
Allow packets with seq == seqlo-1, this was the single most often reported mismatch in my tests. Mismatches are now very rare, mostly late acks that can be safely dropped. Please report other mismatches.
Added state code for UDP, timeouts are 20s/60s, depending on whether at least three packets go back and forth.
Added '!' (not) for address/mask syntax. Useful in some cases. Note that a space is required between ! and the address. Also valid for NAT and RDR rules.
Added (optional) protocol specifier to NAT and RDR rule syntax.
Reduced log messages, now only state mismatches and errors are logged, plus rule matches with the 'log' option.
NAT and RDR now apply before the filter checks again.
Prepare code for UDP states.
Extended states and show age and expire time from pfm. Sorted by age.
NAT and RDR redesigned, now they use the state list, which is kind of nice. Note that the order is wrong again (filter rules apply before translation, I'll fix that soon).
Extended pfm, can load NAT/RDR files and show/clear all lists.
State tracking is pretty accurate now, I get stray packets every couple of hours only. Could be further relaxed, but the packets that get dropped now deserve it. I'm running this now 24h/day on two busy NAT gateways to test. ;)
State code debugged and enhanced. There are still some cases of stray packets that don't match the sequence ranges, mostly delayed retransmissions and 'noise' at the end of the connection. But 90% of all Internet TCP connections are tracked perfectly here.
Added 'return-rst' option. No 'return-icmp' yet.
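
The UDP entry in the list above (timeouts of 20s/60s, depending on whether at least three packets go back and forth) can be sketched like this. It assumes the short timeout applies until three packets have been seen and the long one afterwards; the entry does not spell out which is which, and the struct and names are hypothetical. It also shows why the expire time changes with each packet.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define UDP_TIMEOUT_SHORT 20    /* seconds, fewer than three packets seen */
#define UDP_TIMEOUT_LONG  60    /* seconds, three or more packets seen */
#define UDP_MIN_PACKETS    3

/* Hypothetical per-state bookkeeping for a UDP "connection". */
struct udp_state {
    unsigned packets;   /* packets matched so far, both directions */
    time_t expire;      /* absolute expiry time */
};

/* Called for every packet matching the state: count it and re-arm. */
void udp_state_update(struct udp_state *s, time_t now)
{
    assert(s != NULL);
    s->packets++;
    s->expire = now + (s->packets >= UDP_MIN_PACKETS ?
        UDP_TIMEOUT_LONG : UDP_TIMEOUT_SHORT);
}
```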