How it started (June 2001)
This is my first humble attempt at hacking with the OpenBSD kernel. Since
as of late we lack a packet filter, this looks like a good start. Even if in
all probability this won't end up as a usable piece of code, it might prove
to be instructive nonetheless.
Where to get the packets
The right place to interface with the kernel is netinet/ip_input.c and
ip_output.c. I placed the hooks between the checksum calculation and the
call to the lower level (ethernet) functions. Possibly they can/should
be moved slightly, for instance to use the kernel's fragment reassembly.
Adding a pseudo device
A character type pseudo device /dev/pf looks like a good interface between
the kernel code and userland. Using ioctl(), a userland tool can load filter
rules into the kernel, query stats, etc.
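As a rough illustration of the userland side, here is a minimal sketch of a
tool querying statistics through the pseudo device. The structure
pf_stats_example, the command PFEX_GETSTATS and the function pf_query_stats()
are made up for this example; the real ioctl commands and data layouts are
not described here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical statistics record; the real layout is not given in the text. */
struct pf_stats_example {
	uint64_t packets_passed;
	uint64_t packets_blocked;
};

/* Hypothetical ioctl command number, invented for this sketch. */
#define PFEX_GETSTATS	_IOR('P', 1, struct pf_stats_example)

/*
 * Open the pseudo device and ask the kernel to fill in the counters.
 * Returns 0 on success, -1 if the device cannot be opened or the
 * ioctl fails.
 */
int
pf_query_stats(const char *dev, struct pf_stats_example *st)
{
	int fd, rv;

	fd = open(dev, O_RDWR);
	if (fd == -1)
		return (-1);
	rv = ioctl(fd, PFEX_GETSTATS, st);
	close(fd);
	return (rv == -1 ? -1 : 0);
}
```

A tool would call pf_query_stats("/dev/pf", &st) and print the counters on
success. Loading rules would work the same way, with a different command and
a structure describing one rule.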
Stateful filter and NAT
Currently, stateful filtering and NAT are implemented.
The ruleset syntax is identical to IPF's, minus some options that are not yet
implemented (keep frags, return-icmp, etc.).
A state basically consists of three address/port pairs. One for the external
host, one for the gateway and one for the internal machine that is being NATed
(if the connection is NATed; otherwise two of the pairs are identical).
Each stateful connection creates a state, and its address is inserted into
two trees and one linked list. One tree is sorted on external and gateway
address/port, the other on external and internal address/port. The entries
in each tree are unique (no duplicate keys). When a packet comes in on the
(external) interface, the external-gateway tree is searched. When a packet
goes out, the external-internal tree is searched. NATed connections always
have state, and the three address/port pairs are used to translate addresses.
There is no other NAT mapping table of any kind. Tree searches (the primary
operation which occurs for every packet) are O(log n), inserts and removals
as well (they occur only once for each connection).
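The two-key lookup described above can be sketched as follows. This is not
the kernel code: sorted pointer arrays with qsort() and bsearch() stand in
for the two balanced trees, and all names (struct state, find_state(), and
so on) are chosen for this example only. It also assumes that for
connections without NAT the gateway and internal pairs are the two that
coincide.

```c
#include <stdint.h>
#include <stdlib.h>

struct addr_port {
	uint32_t addr;	/* IPv4 address */
	uint16_t port;
};

struct state {
	struct addr_port ext;	/* external host */
	struct addr_port gwy;	/* gateway */
	struct addr_port lan;	/* internal host (assumed == gwy without NAT) */
};

static int
cmp_ap(const struct addr_port *a, const struct addr_port *b)
{
	if (a->addr != b->addr)
		return (a->addr < b->addr ? -1 : 1);
	if (a->port != b->port)
		return (a->port < b->port ? -1 : 1);
	return (0);
}

/* Key used for incoming packets: external and gateway pairs. */
static int
cmp_ext_gwy(const void *a, const void *b)
{
	const struct state *x = *(struct state * const *)a;
	const struct state *y = *(struct state * const *)b;
	int d = cmp_ap(&x->ext, &y->ext);

	return (d != 0 ? d : cmp_ap(&x->gwy, &y->gwy));
}

/* Key used for outgoing packets: external and internal pairs. */
static int
cmp_ext_lan(const void *a, const void *b)
{
	const struct state *x = *(struct state * const *)a;
	const struct state *y = *(struct state * const *)b;
	int d = cmp_ap(&x->ext, &y->ext);

	return (d != 0 ? d : cmp_ap(&x->lan, &y->lan));
}

/* Sort both indexes; this stands in for inserting into the two trees. */
void
index_states(struct state **by_gwy, struct state **by_lan, size_t n)
{
	qsort(by_gwy, n, sizeof(*by_gwy), cmp_ext_gwy);
	qsort(by_lan, n, sizeof(*by_lan), cmp_ext_lan);
}

/* O(log n) lookup; pass the index that matches the packet direction. */
struct state *
find_state(struct state **idx, size_t n, struct state *key, int incoming)
{
	struct state **hit;

	hit = bsearch(&key, idx, n, sizeof(*idx),
	    incoming ? cmp_ext_gwy : cmp_ext_lan);
	return (hit != NULL ? *hit : NULL);
}
```

Once a state is found, the pair that was not part of the search key gives
the translation: for an incoming packet the destination is rewritten from
gwy to lan, for an outgoing packet the source is rewritten from lan to gwy.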
The linked list is used to traverse all states when expired states are
purged. This is an O(n) operation, but occurs at most once every 10 seconds
(not for every packet). A sorted container doesn't make sense: the expire
time of a state (the key) usually changes with each packet.
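The periodic purge might look roughly like this. It is a simplified sketch
with invented names (struct state_entry, purge_expired()); a real
implementation would also remove each expired state from both trees and free
its memory, which is left to the caller here.

```c
#include <stddef.h>
#include <time.h>

struct state_entry {
	struct state_entry *next;
	time_t expire;		/* updated whenever a packet matches */
};

#define PURGE_INTERVAL	10	/* seconds between two purge runs */

/*
 * Unlink all entries whose expire time has passed and return how many
 * were removed. Does nothing if the last run was less than
 * PURGE_INTERVAL seconds ago, so the O(n) walk is not paid per packet.
 */
size_t
purge_expired(struct state_entry **head, time_t now, time_t *last_purge)
{
	struct state_entry **pp = head;
	size_t purged = 0;

	if (now - *last_purge < PURGE_INTERVAL)
		return (0);
	*last_purge = now;
	while (*pp != NULL) {
		struct state_entry *s = *pp;

		if (s->expire <= now) {
			*pp = s->next;	/* unlink; caller releases it */
			s->next = NULL;
			purged++;
		} else
			pp = &s->next;
	}
	return (purged);
}
```

Since the list is unsorted, updating a state's expire time on each packet is
a plain assignment and never requires moving the entry.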
The design handles large numbers of states (concurrent connections) decently.
Right now, the memory usage for the states is the limit. There is much to
be optimized (whole bytes are wasted to hold bits, etc.).
Here's a sample filter rule file and NAT rule file.
Note that while protocols and ports can be given as names (translated through
/etc/protocols and /etc/services), icmp-type and code must be numeric. The
parser doesn't do much error checking yet, so watch the parser output to
verify you get the rules that you meant.
The filter rules are now applied after NAT takes place (both in and out), which
is the same behaviour as IPF.
Missing parts, open questions
keep frags
Currently, fragments are handled cheaply. The first fragment, if it contains
the complete IP and TCP/UDP/ICMP headers, is filtered according to the rules.
Any fragments with higher offsets (that don't contain header data) are passed
without checks. If the headers themselves are fragmented (this almost certainly
is an attack; I don't know of any real-life MTU less than 64), those packets are
dropped. This, of course, is not "keep frags". But it's cheap, memory-wise,
for the filter. And with the first fragment being filtered, I can't see how
an attacker can gain anything because of the higher fragments being passed
without being filtered. The destination host will, missing the first fragment,
be unable to reassemble the fragments. If someone can show me how this is
dangerous and worth the time and memory of a fragment cache on the filter,
I'm happy to listen. Consider that if legitimate connections get fragmented,
the cache would require large amounts of memory, especially for high-speed
connections (remember that the fragments can arrive in any order). I can
easily imagine several dozen megabytes.
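The fragment policy described above amounts to a three-way decision per
packet, which could be sketched like this. The function and enum names are
invented for the example, and treating a later fragment that overlaps the
transport header area as "fragmented headers" is my reading of the rule, not
something spelled out above.

```c
#include <stddef.h>
#include <stdint.h>

enum frag_action { FRAG_FILTER, FRAG_PASS, FRAG_DROP };

/*
 * frag_off:   fragment offset in 8-byte units, 0 for the first fragment
 *             (or an unfragmented packet)
 * ip_hlen:    IP header length in bytes
 * total_len:  length of this packet in bytes, IP header included
 * proto_hlen: transport header bytes the filter needs to see, for
 *             instance 20 for TCP or 8 for UDP and ICMP
 */
enum frag_action
classify_fragment(uint16_t frag_off, size_t ip_hlen, size_t total_len,
    size_t proto_hlen)
{
	if (frag_off == 0) {
		/* First fragment: headers must be complete to be filtered. */
		if (total_len < ip_hlen + proto_hlen)
			return (FRAG_DROP);
		return (FRAG_FILTER);
	}
	/* Later fragment that still carries header bytes: assumed hostile. */
	if ((size_t)frag_off * 8 < proto_hlen)
		return (FRAG_DROP);
	/* Pure payload fragment: passed without looking at the rules. */
	return (FRAG_PASS);
}
```

Nothing is stored between packets, which is what keeps this approach cheap
compared to a fragment cache.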
Please let me know if you
have some ideas or know of a good reference for these tasks.
Installation
Here's the code so far, based on 2.9-current, relative to /usr/src/sys/. It's
possible to use the code on -stable (which still has IPF installed), but care
must be taken to disable IPF correctly. If you want this, ask me. It's a pain
to keep even one set of patches relatively up to date, so I stopped doing it
for -stable.
Note: This is a historical snapshot from June 26th 2001, the last
version prior to the import into the OpenBSD source tree. pfm was later
rewritten using yacc and renamed to pfctl. The entire source, including
kernel (with an AVL tree implementation) and userland, was only 3118
lines of text. It has grown a bit since.
All further changes to the source code are documented in the OpenBSD
CVS tree, which you can browse using cvsweb:
Latest changes (newest at the top)
- Sync with -current, updated patches
- Converted to KNF according to style(9).
- Fixed return-rst for outgoing packets, call looutput(lo0ifp, ...). Now
works for NATed packets as well.
- Learned about m_pullup() today and fixed the gross bugs (overflowing
mbufs) that were previously present. Odd that it would even work at all
before.
- Changes to the ip_input() and ip_output() patches, pullup requires
changing the mbuf pointer. Now I see what and why IPF did there.
- Handle fragments (see above).
- Handle mbuf chains that contain split IP and TCP/UDP/ICMP headers.
- Added pfm log command, which starts the data gathering for packets going
through that interface.
- Working on a (simple) visualization tool for the statistics gathered by
the filter. Here is the output from a 512/128 kbps ADSL line.
The 100 Mbps LAN is filtered by a humble P-200, and its CPU is still
>60% idle, even when the interface is saturated (100 Mbps in both
directions). I have yet to test with a large number of states.
- Disabled return-rst for outgoing packets, calling ip_output() like I did
isn't good (can crash).
- Use AVL tree for states instead of simple linked list. Search is O(log n)
instead of O(n), at the cost of O(log n) for insert/removal. Since a
state is looked up an order of magnitude more often than it's inserted
and removed, this is a good deal. Use two trees, for two different keys.
- Purge expired states not with every passing packet, but only once in
10 seconds. This has O(n) cost.
- Rewrote the ICMP error message handler, it's clearer with the new trees.
- Added some statistics gathering. Eventually, a cron job can create nice
graphical views, like the link below shows.
- Added state code for ICMP. ICMP error messages (which refer to a
TCP/UDP packet) are handled using the TCP/UDP states. ICMP queries
create their own states. Now NATed machines can happily ping and
traceroute.
- NATed/RDRed packets always create state when they're not blocked.
- Adjusted TCP timeouts. Now half-closed connections (from stupid
web browsers) time out soon.
- Supply separate patch tarballs for 2.9-current and -stable. Be sure
to use the right ones. Applying the wrong ones will not fail (the
original files are very similar), but the kernel may crash when
the packet filter is started. When in doubt, patch the files manually.
Note that -current uses major number 71 and -stable 70.
- Allow packets with seq == seqlo-1, this was the single most often
reported mismatch in my tests. Mismatches are now very rare, mostly
late acks that can be safely dropped. Please report other mismatches.
- Added state code for UDP, timeouts are 20s/60s, depending on whether
at least three packets go back and forth.
- Added '!' (not) for address/mask syntax. Useful in some cases. Note
that a space is required between ! and the address. Also valid for
NAT and RDR rules.
- Added (optional) protocol specifier to NAT and RDR rule syntax.
- Reduced log messages, now only state mismatches and errors are logged,
plus rule matches with the 'log' option.
- NAT and RDR now apply before the filter checks again.
- Prepare code for UDP states.
- Extended states and show age and expire time from pfm. Sorted by age.
- NAT and RDR redesigned, now they use the state list, which is kind of
nice. Note that the order is wrong again (filter rules apply before
translation, I'll fix that soon).
- Extended pfm, can load NAT/RDR files and show/clear all lists.
- State tracking is pretty accurate now, I get stray packets every
couple of hours only. Could be further relaxed, but the packets that
get dropped now deserve it. I'm running this now 24h/day on two busy
NAT gateways to test. ;)
- State code debugged and enhanced. There are still some cases of stray
packets that don't match the sequence ranges, mostly delayed
retransmissions and 'noise' at the end of the connection. But 90% of all
Internet TCP connections are tracked perfectly here.
- Added 'return-rst' option. No 'return-icmp' yet.