GitHub as an identity provider

As I write these words, I have just finally disabled our Samba-based LDAP/Active Directory server, which is incredibly hard to update, to back up and to re-create in case of an emergency.

We’ve traditionally used Gmail for Email and Calendar access and LDAP/Active Directory for everything else: The office WLAN, VPN access, internal support tool access, SSH access, you name it.

However, that solution came with operational drawbacks for us:

  • Unix support for Active Directory is flaky at best, and the UI that Windows once provided for setting up the LDAP attributes required for Unix users stopped working with Windows 8
  • Samba is hard to install and keep up to date. Setting up an Active Directory domain creates a ton of local state that’s difficult to back up and impossible to put into a configuration management system.
  • The LDAP server being reachable is a prerequisite for authentication on machines to work. This meant that tunnels needed to be created for various machines to have access to that.
  • Each Debian update came with new best practice for LDAP authentication of users, so each time new tools and configuration needed to be learned, tested and applied.
  • As we’re working with contractors, giving temporary access to our support tools is difficult because we need to create temporary LDAP accounts for them.
  • Any benefits that could have been had by having workstations use a centralized user account database have evaporated over time as our most-used client OS (macOS) lost more and more central directory support.

On the other hand, everybody at our company has a GitHub account and so do contractors, so we’re already controlling user access via GitHub.

GitHub provides excellent support for account security too, especially once we could force 2FA to be enabled for all users. Also, over time, our reliance on locally installed tools grew smaller and smaller while, on the other hand, most cloud services started to provide OAuth Sign-in-with-GitHub functionality.

It became clear that stuff would work ever so much better if only we could get rid of that LDAP/Active Directory thing.

So I started a multi-year endeavor to get rid of LDAP. It wouldn’t have needed to be multi-year, but as this happened mostly as side-projects, I tackled them whenever I had the opportunity.

I’m fully aware that this puts us into the position where we’re dependent on our Github subscription to be active. But so does a lot of our daily development work. We use Issues, Pull Requests, GitHub Actions, you name it. While git itself is, of course, decentralized, all the other services provided by GitHub are not.

Then again, while we would be in a really bad spot with regards to our development processes, unhooking our glue code from GitHub and changing it to a traditional username/password solution would be very feasible even in a relatively short time-frame (much shorter than the disruption to the rest of our processes).

Which means that I’m ready to increase the dependency on GitHub even more and use them as our identity provider.

The first thing to change was the internal support tool our support team uses to get authenticated access to our sites for support purposes. The interface between that tool and the sites has always been using signed access tickets (think JWT, but slightly different, mostly because it pre-dates JWT by about 5 years) to give users access. The target site itself did not need access to LDAP; only the support tool needed it to authenticate our support team members.

So unhooking that tool from LDAP and hooking it up to Github was the first step. Github has well-documented support for writing OAuth client apps to authenticate users.
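Stripped of all the application glue, the web application flow is just a redirect plus two HTTP calls; here’s a rough sketch with curl (the client id, secret and org/team names are placeholders, not what our tool actually uses):

# 1. Send the browser to GitHub for authorization:
#    https://github.com/login/oauth/authorize?client_id=YOUR_CLIENT_ID&scope=read:org
# 2. GitHub redirects back with ?code=...; exchange the code for a token
curl -s -H "Accept: application/json" \
     -d "client_id=YOUR_CLIENT_ID" \
     -d "client_secret=YOUR_CLIENT_SECRET" \
     -d "code=CODE_FROM_THE_REDIRECT" \
     https://github.com/login/oauth/access_token
# 3. Use the token to identify the user (and, thanks to the read:org scope,
#    check team membership via /orgs/ORG/teams/TEAM/memberships/USERNAME)
curl -s -H "Authorization: token THE_ACCESS_TOKEN" https://api.github.com/user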

Next was authenticating users to give SSH access to production machines. This was solved by teaching our support tool how to sign SSH public keys and by telling the production machines to trust that CA.

I wrote a small utility in Swift to send an SSH public key to the support tool to have it signed and to install the certificate in the SSH agent. The certificates have a short lifetime ranging from one day to at most one week (depending on user) and using the GitHub API, the central tool knows about team memberships which allows us to confer different permissions on different servers based on team membership.
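The underlying mechanism is plain OpenSSH certificate signing; roughly like this (key and file names are made up for illustration):

# on the signing side: certify the user's public key with the CA key
ssh-keygen -s /etc/ssh/user_ca \
    -I alice@example.com \
    -n alice \
    -V +1d \
    id_ed25519.pub          # writes id_ed25519-cert.pub

# on every production machine: trust certificates issued by that CA
# (and, optionally, map principals to permissions per server)
#   /etc/ssh/sshd_config:
#     TrustedUserCAKeys /etc/ssh/user_ca.pub
#     AuthorizedPrincipalsFile /etc/ssh/principals/%u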

Now all of the SSH servers can do the whole certificate validation entirely locally (due to the short lifetime we can live without a CRL), independent of network access (which some of the machines don’t have at all).

Which means that SSH access is now possible independently of LDAP and even of network availability. And it’s using a mechanism that’s very simple and comes with zero dependencies aside from openssh itself.

Then came the VPN. I’ve run IPSec with IKEv2 to provide authenticated access to parts of the production network. It (or rather the RADIUS server it used) needed access to LDAP and even though it was using stock PFSense IPSec support, it was unreliable and needed restarts with some regularity.

This was entirely replaced by an SSH bastion host and ProxyJump in conjunction with the SSH certificates described above. No more LDAP; production access is now based on GitHub team membership of GitHub accounts. While it never happened and I would be very wary of allowing it, this would even allow us to give selective access to machines to contractors based on nothing but their GitHub account (and who doesn’t have one of those these days).
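On the client side, this amounts to very little configuration; a sketch of what such an ~/.ssh/config could look like (host names are examples):

Host bastion.example.com
    User alice
    IdentityFile ~/.ssh/id_ed25519
    CertificateFile ~/.ssh/id_ed25519-cert.pub

Host *.prod.example.com
    User alice
    ProxyJump bastion.example.com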

Behind the production network, there’s another, darker part of the infrastructure. That’s the network that all the remote management interfaces and the virtual machine hosts are connected to. This one is absolutely critical and access to it is naturally very restricted.

The bastion host described above does not have access to that network.

In comes the next hat our support tool/github integration is wearing: It can synchronize Tailscale ACLs with Github and it can dynamically alter the ACL to give temporary access to specific users.

Tailscale itself uses Github as an identity provider too (and also supports custom identity providers, so, again, losing GitHub for this would not be the end of the world) and our support tool uses the GitHub and Tailscale APIs to make sure that only users in a specific GitHub group get access to Tailscale at all.

So everybody who needs network access that’s not doable or not convenient via the SSH bastion host has a Tailscale account (very few people), and of those, even fewer are in a GitHub Team that causes our support tool to let them request temporary (30 min max) access to the super secret backstage network.
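The ACL juggling itself boils down to fetching and re-posting a JSON policy file via the Tailscale API; roughly like this (tailnet name and API key are placeholders – check the current Tailscale API docs for the details):

# fetch the current ACL policy
curl -s -u "tskey-api-XXXXX:" \
     https://api.tailscale.com/api/v2/tailnet/example.com/acl

# push back a modified policy containing the temporary rule
curl -s -u "tskey-api-XXXXX:" \
     -H "Content-Type: application/json" \
     --data-binary @acl.json \
     https://api.tailscale.com/api/v2/tailnet/example.com/acl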

Which completely removes the last vestiges of the VPN from the picture and leaves us with just one single dependency: The office wifi.

Even though the office network really isn’t in a privileged position (any more), I want access to that network to be authenticated and I want to be able to revoke access to individual users.

Which is why we have always used Enterprise WPA over RADIUS against Active Directory/SAMBA to authenticate WiFi access to the office network.

This has now been replaced by, you guessed it, our support tool, which creates and stores a unique, completely random password for each user in a specific GitHub Team and offers an API endpoint to be used by the freeRADIUS rlm_rest module to authenticate those users. In order to still have WiFi even when our office internet access is unavailable (though I can’t really see why we’d need that given our reliance on cloud-based services), I added a local proxy in front of that API endpoint which serves a stale response in case of errors (for some hours – long enough for us to fix the internet outage but short enough to not let no-longer-authenticated users access the network).
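One way to get such stale-on-error behaviour is nginx’s proxy cache; a minimal sketch (host names, paths and timings are made up, and if rlm_rest POSTs the credentials you would additionally need proxy_cache_methods plus a suitable proxy_cache_key):

proxy_cache_path /var/cache/nginx/radius keys_zone=radius:1m;

server {
    listen 127.0.0.1:8443;

    location /radius-auth {
        proxy_pass https://support-tool.example.com/radius-auth;
        proxy_cache radius;
        proxy_cache_valid 200 10m;
        # keep serving the last known answer while the upstream is unreachable
        proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
    }
}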

With this last step, our final dependency on LDAP was dropped and all of our identity management is now out-sourced to GitHub, so I could finally issue that one last command:

shutdown -h now

Tailscale on PFSense

For a bit more than a year, I have been a user of Tailscale, a service that builds an overlay network on top of WireGuard while relying on OAuth with third-party services for authentication.

It’s incredibly easy to get going with Tailscale and the free tier they provide is more than good enough for the common personal use cases (in my case: tech support for my family).

Most of the things that are incredibly hard to set up with traditional VPN services just work out of the box or require a minimal amount of configuration. Heck, even more complicated things like tunnel splitting and DNS resolution in different private subnets just work. It’s magic.

While I have some gripes that prevent me from switching all our company VPN connectivity over to them, those are a topic for a future blog post.

The reason I’m writing here right now is that a few weeks ago, Netgate and Tailscale announced a Tailscale package for PFSense. As a user of both PFSense and Tailscale, this allowed me to get rid of a VM that does nothing but be a Tailscale exit node and subnet router and instead use the Tailscale package to do this on PFSense.

However, doing this for a week or so has revealed some very important things to keep in mind which I’m posting about here because other people (and that includes my future self) will run into these issues and some are quite devastating:

When using the Tailscale package on PFSense, you may encounter two issues directly caused by Tailscale, both of which also show up in unrelated reports when you search for them on the internet, so you might be led astray when debugging.

Connection loss

The first one is the bad one: After some hours of usage, an interface on your PFSense box will become unreachable, dropping all traffic through it. A reboot will fix it and when you then look at the system log, you will find many lines like

arpresolve: can't allocate llinfo for <IP-Address> on <interface>
I’m in so much pain right now

This will happen if one of your configured gateways in “System > Routing” is reachable both by a local connection and through Tailscale by subnet router (even if your PFSense host itself is told to advertise that route).

I might have overdone the fixing, but here are all the steps I have taken:

  • Tell Tailscale on PFSense to never use any advertised routes (“VPN > Tailscale > Settings”, uncheck “Accept subnet routes that other nodes advertise.”)
  • Disable gateway monitoring under “System > Routing > Gateways” by clicking the pencil next to the gateway in question.

I think what happens is that PFSense will accidentally believe that the subnet advertised via Tailscale is not local and will then refuse to add the address of that gateway to its local ARP table.

IMHO, this is a bug in Tailscale. It should never mess with interfaces it’s exposing to the overlay network as a subnet router.

Log Spam

The second issue is not as bad, but as the effect is so far removed from the cause, it’s still worth talking about.

When looking at the system log (which you will do for above issue), you will see a ton of entries like

sshguard: Exiting on signal
sshguard: Now monitoring attacks.
this can’t be good. Can it?

What happens is that PFSense moved, a few releases ago, from a binary ring-buffer for logging to a more naïve approach: once a minute it checks whether a log file is too big and, if so, rotates it and restarts the daemons logging to that file.

If a daemon doesn’t have a built-in means for re-opening log files, PFSense will kill and restart the daemon, which happens to be the case for sshguard.

So the question is: Why is the log file being rotated every minute? This is caused by the firewall blocking incoming Tailscale traffic (UDP port 41641) on the WAN interface by default and, also by default, logging every dropped packet.

In order to fix this and assuming you trust Tailscale and their security update policies (which you probably should given that you just installed their package on a gateway), you need to create a rule to allow UDP port 41641 on the WAN interface.

much better now

This, too, IMHO is a bug in the Tailscale package: If your package opens port 41641 on an interface of a machine whose main purpose is being a firewall, you should probably also make sure that traffic to that port is not blocked.

With these two configuration changes in place, the network is stable and the log spam has gone away.

What’s particularly annoying about these two issues is that Googling for either of the two error messages will yield pages and pages of results, none of which apply because they will have many more possible causes and because Tailscale is a very recent addition to PFSense.

This is why I decided to post this article in order to provide one more result in Google and this time combining the two keywords: Tailscale and PFSense, in the hope of helping fellow admins who run into the same issues after installing Tailscale on their routers.

Joining Debian to ActiveDirectory

This blog post is a small list of magic incantations to be issued and animals to be sacrificed in order to join a Unix machine (Debian in this case) to a (Samba-powered) ActiveDirectory domain.

All of these things have to be set up correctly or you will suffer eternal damnation in non-related-error-message hell:

  • Make absolutely sure that DNS works correctly
    • the new member server’s hostname must be in the DNS domain of the AD Domain
    • This absolutely includes reverse lookups.
    • Same goes for the domain controller. Again: Absolutely make sure that you set up a correct PTR record for your domain controller or you will suffer the curse of GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Server not found in Kerberos database)
  • Disable IPv6 everywhere. I normally always advocate against disabling IPv6 in order to solve problems and instead just solve the problem, but bugs exist. Failing to disable IPv6 on either the server or the client will also cause you to suffer in Server not found in Kerberos database hell.
  • If you made previous attempts to join your member server, even when you later left the domain again, there’s probably a lingering host-name added by a previous dns update attempt. If that exists, your member server will be in ERROR_DNS_UPDATE_FAILED hell even if DNS is configured correctly.
    • In order to check, use samba-tool on the domain controller samba-tool dns query your.dc.ip.address your.domain.name memberservername ALL
    • If there’s a hostname, get rid of it using samba-tool dns delete your.dc.ip.address your.domain.name memberservername A ip.returned.above
  • make sure that the TLS certificate served by your AD server is trusted, either directly or chained to a trusted root. If you’re using a self-signed root (you probably are), add the root as a PEM file (but with a .crt extension!) to /usr/local/share/ca-certificates/ and run /usr/sbin/update-ca-certificates. If you fail to do this correctly, you will suffer in ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1) hell (no, nothing will inform you of a certificate error – all you get is “can’t connect”)
  • In order to check that everything is set up correctly, before even trying realmd or sssd, use ldapsearch: ldapsearch -H ldap://your.dc.host/ -Y GSSAPI -N -b "dc=your,dc=base,dc=dn" "(objectClass=user)"
  • Aside from all that, you can follow this guide, but also make sure that you manually install the krb5-user package. The Debian package database has a missing dependency, so the package doesn’t get pulled in even though it is required (a rough sketch of the join itself follows after this list).
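For reference, once all of the above prerequisites are in place, the join itself is short; a rough sketch (realm and user names are examples):

apt-get install realmd sssd sssd-tools adcli krb5-user   # krb5-user: see above
realm discover YOUR.DOMAIN.NAME
realm join --user=Administrator YOUR.DOMAIN.NAME
# if the join worked, sssd should now resolve domain users
id someuser@your.domain.name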

All in all, this was a really bad case of XKCD 979 and in case you ask yourself whether I’m bitter, then let me tell you, that yes. I am bitter.

I can totally see that there are a ton of moving parts involved in this and I’m willing to nudge some of these parts in order to get the engine up and running. But it would definitely help if the various tools involved would give me meaningful log output. samba on the domain controller doesn’t log, tcpdump is pointless thanks to SSL everywhere, realmd fails silently while still saying that everything is ok (also, it’s unconditionally removing the config files it feeds into the various moving parts involved, so good luck trying to debug this), sssd emits cryptic error messages (see above) and so on.

Anyways. I’m just happy I got this working and now it’s on to reproducing it one more time, but this time recording everything in Ansible.

Geek heaven

If I had to make a list of attributes I would like the ISP of my dreams to
have, then I could write quite the list:

  • I would really like to have native IPv6 support. Yes. IPv4 will be sufficient for a very long time, but unless people start having access to IPv6, it’ll never see the wide deployment it needs if we want the internet to continue to grow. An internet where addresses are only available to people with a lot of money is not an internet we all want to be subjected to (see my post «asking for permission»)
  • I would want my ISP to accept or even support network neutrality. For this to be possible, the ISP of my dreams would need to be nothing but an ISP so their motivations (provide better service) align with mine (getting better service). ISPs who also sell content have all the motivation to provide crappy Internet service in order to better sell their (higher-margin) content.
  • If I have technical issues, I want to be treated as somebody who obviously has a certain level of technical knowledge. I’m by no means an expert in networking technology, but I do know about powering it off and on again. If I have to say «shibboleet» to get to a real technician, so be it, but if that’s not needed, that’s even better.
  • The networking technology involved in getting me the connectivity I want should be widely available and thus easily replaceable if something breaks.
  • The networking technology involved should be as simple as possible: The more complex the hardware involved, the more stuff can break, especially when you combine cost-pressure for end-users with the need for high complexity.
  • The network equipment I’m installing at my home, and which thus has access to my LAN, needs to be equipment I own and control fully. I do not accept leased equipment to which I do not have full access.
  • And last but not least, I would really like to have as much bandwidth as possible

I’m sure I’m not alone with these wishes, even though, for «normal people» they might seem strange.

But honestly: They just don’t know it, but they too have the same interests. Nobody wants an internet that works like TV where you pay for access to a curated small list of “approved” sites (see network neutrality and IPv6 support).

Nobody wants to get up and reboot their modem every now and then because it crashed. Nobody wants to be charged with downloading illegal content because their Wifi equipment was suddenly repurposed as an open access point for other customers of an ISP.

Most of the wishes I list above are the basis needed for these horror scenarios never coming to pass, however unlikely they might seem now (though getting up and rebooting the modem/router is something we already have to deal with today).

So yes. While it’s getting rarer and rarer to get all the points of my list fulfilled, to the point where I thought it impossible to get all of them, I’m happy to say that here in Switzerland, there is at least one ISP that does all of this and more.

I’m talking about Init7 and especially their awesome FTTH offering Fiber7 which very recently became available in my area.

Let’s deal with the technology aspect first as this really isn’t the important point of this post: What you get from them is pure 1Gbit/s Ethernet. Yes, they do sell you a router box if you want one, but you can just as well just get a simple media converter, or just an SFP module to plug into any (managed) switch (with SFP port).

If you have your own routing equipment, be it a linux router like my shion or be it any Wifi Router, there’s no need to add any kind of additional complexity to your setup.

No additional component that can crash, no software running in your home that you don’t have the password to and certainly no sneakily opened public WLANs (I’m looking at you, cablecom).

Of course you get native IPv6 (a /48 which incidentally is room for 281474976710656 whole internets in your apartment) too.

But what’s really remarkable about Init7 isn’t the technical aspect (though, again, it’s bloody amazing), but everything else:

  • Init7 was one of the first ISPs in Switzerland to offer IPv6 to end users.
  • Init7 doesn’t just support network neutrality.
    They actively fight for it
  • They explicitly state
    that they are not selling content and they don’t intend to start doing so. They are just an ISP and as such their motivations totally align with mine.

There are a lot of geeky soft factors too:

  • Their press releases are written in Open Office (check the PDF properties
    of this one for example)
  • I got an email from a technical person on their end that was written using
    f’ing Claws Mail on Linux
  • Judging from the Received headers of their Email, they are using IPv6 in their internal LAN – down to the desktop workstations. And related to that:
  • The machines in their LAN respond to ICMPv6 pings which is utterly crazy cool. Yes. They are firewalled (cough I had to try. Sorry.), but they let ICMP through. For the not as technical readers here: This is as good an internet citizen as you will ever see and it’s extremely unexpected these days.

If you are a geek like me and if your ideals align with the ones I listed above, there is no question: You have to support them. If you can have their Fiber offering in your area, this is a no-brainer. You can’t get synchronous 1GBit/s for CHF 64ish per month anywhere else and even if you did, it wouldn’t be plain Ethernet either.

If you can’t have their fiber offering, it’s still worth considering their other offers. They do have some DSL based plans which of course are technically inferior to plain ethernet over fiber, but you would still support one of the few remaining pure ISPs.

It doesn’t have to be Init7 either. For all I know there are many others, maybe even here in Switzerland. Init7 is what I decided to go with initially because of the Gbit, but the more I learned about their philosophy, the less important the bandwidth got.

We need to support companies like these because companies like these are what ensures that the internet of the future will be as awesome as the internet is today.

Thoughts on IPv6

A few months ago, the awesome provider Init7 has released their
awesome FTTH offering Fiber7 which provides
synchronous 1GBit/s access for a very fair price. Actually, they are by
far the cheapest provider for this kind of bandwidth.

Only cablecom comes close to matching them bandwidth-wise with their 250Mbit/s
package, but that’s 4 times less bandwidth for nearly double the price. Init7
also is one of the only providers who officially states that
their triple-play strategy is that they don’t do it. Huge-ass kudos for
that.

Also, their technical support is using Claws Mail on GNU/Linux – to give you
some indication of the geek-heaven you get when signing up with them.

But what’s really exciting about Init7 is their support for IPv6. In-fact,
Init7 was one of the first (if not the first) providers to offer IPv6 for
end users. Also, we’re talking about a real, non-tunneled, no strings attached
plain /48.

In case that doesn’t ring a bell, a /48 will allow for 2^16 networks
consisting of 2^64 hosts each. Yes. That’s that many hosts.

In eager anticipation of getting this at home natively (of course I ordered
Fiber7 the moment I could at my place), I decided to play with IPv6 as far as
I could with my current provider, which apparently lives in the stone-age and
still doesn’t provide native v6 support.

After getting abysmal pings using 6to4 about a year ago, this time I decided
to go with tunnelbroker which these days also
provides a nice dyndns-alike API for updating the public tunnel endpoint.

Let me tell you: Setting this up is trivial.

Tunnelbroker provides you with all the information you need for your tunnel
plus the prefix of the /64 you get from them, and setting up your own
network is trivial using radvd.

The only thing that’s different from your old v4 config: All your hosts will
immediately be accessible from the public internet, so you might want to
configure a firewall from the get-go – but see below for some thoughts on
that matter.

But this isn’t any different from the NAT solutions we have currently. Instead
of configuring port forwarding, you just open ports on your router, but the
process is more or less the same.

If you need direct connectivity however, you can now have it. No strings attached.

So far, I’ve used devices running iOS 7 and 8, Mac OS X 10.9 and 10.10,
Windows XP, 7 and 8 and none of them had any trouble reaching the v6 internet.
Also, I would argue that configuring radvd is easier than configuring DHCP.
There’s less thought involved for assigning addresses because
autoconfiguration will just deal with that.

For me, I had to adjust how I’m thinking about my network for a bit and I’m
posting here in order to explain what change you’ll get with v6 and how some
paradigms change. Once you’ve accepted these changes, using v6 is trivial and
totally something you can get used to.

  • Multi-homing (multiple addresses per interface) was something you’ve rarely
    done in v4. Now in v6, you do that all the time. Your OSes go as far as to
    grab a new random one every few connections in order to provide a means of
    privacy.
  • The addresses are so long and hex-y – you probably will never remember them.
    But that’s ok. In general, there are much fewer cases where you worry about
    the address.

    • Because of multi-homing every machine has a guaranteed static address
      (built from the MAC address of the interface) by default, so there’s no
      need to statically assign addresses in many cases.
    • If you want to assign static addresses, just pick any in your /64.
      Unless you manually hand out the same address to two machines,
      autoconfiguration will make sure no two machines pick the same address.
      In order to remember them, feel free to use cute names – finally you got
      some letters and leetspeak to play with.
    • To assign a static address, just do it on the host in question. Again,
      autoconfig will make sure no other machine gets the same address.
  • And with Zeroconf (avahi / bonjour), you have fewer and fewer opportunities
    to deal with anything that’s not a host-name anyways.
  • You will need a firewall because suddenly all your machines will be
    accessible from the whole internet. You might get away with just the local
    personal firewall, but you probably should have one on your gateway (a
    minimal sketch of such a gateway filter follows after this list).
  • While that sounds like higher complexity, I would argue that the complexity
    is lower because if you were a responsible sysadmin, you were dealing with
    both NAT and a firewall whereas with v6, a firewall is all you need.
  • Tools like nat-pmp or upnp don’t support v6 yet as far as I can see, so
    applications in the trusted network can’t yet punch holes in the firewall
    (which is the equivalent of forwarding ports in the v4 days).
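For completeness, here’s a minimal sketch of such a stateful filter on a
Linux gateway using ip6tables (interface names are examples; a real setup
needs more care, especially around ICMPv6):

# default: don't forward unsolicited traffic into the LAN
ip6tables -P FORWARD DROP
# allow replies to connections initiated from the inside
ip6tables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
# never block ICMPv6 wholesale - path MTU discovery depends on it
ip6tables -A FORWARD -p ipv6-icmp -j ACCEPT
# allow everything going from the LAN towards the tunnel
ip6tables -A FORWARD -i eth0 -o he-ipv6 -j ACCEPT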

Overall, getting v6 running is really simple and once you adjust your mindset
a bit, while stuff is unusual and takes some getting used to, I really don’t
see v6 as being more complicated. Quite the contrary, actually.

Thinking further about firewalls and opening ports: as hosts get wiser about
v6, you really might get away without a strict firewall, because hosts could
grab a new random v6 address for every service they want to expose and then
just bind their servers to that address.

Services binding to all addresses would never bind to these temporary addresses.

That way none of the services brought up by default (you know – all those
ports open on your machine when it runs) would be reachable from the outside.
What would be reachable is the temporary addresses grabbed by specific
services running on your machine.

Yes. An attacker could port-scan your /64 and try to find the non-temporary
address, but keep in mind that finding that one address out of 2^64
addresses would mean that you have to port-scan 4 billion traditional v4
internets per attack target (good luck) or randomly guess with an average
chance of 1:2^63 (also good luck).

Even then a personal firewall could block all unsolicited packets from
non-local prefixes to provide even more security.

As such, we really might get away without actually needing a firewall at the
gateway to begin with which will actually go great lengths at providing the
ubiquitous configuration-free p2p connectivity that would be ever-so-cool and
which we have lost over the last few decades.

Me personally, I’m really happy to see how simple v6 actually is to get
implemented and I’m really looking forward to my very own native /48 which I’m
probably going to get somewhere in September/October-ish.

Until then, I’ll gladly play with my tunneled /64 (for now still firewalled,
but I’ll investigate into how OS X and Windows deal with the temporary
addresses they use which might allow me to actually turn the firewall off).

when in doubt – SSL

Since 2006, as part of our product, we have been offering barcode scanners
with GSM support to either send orders directly to the vendor or to
transmit products into the web frontend where you can further edit them.

Even though the devices (Windows Mobile. Crappy. In the process of being
updated) do support WiFi, we really only support GSM because that means we don’t have to share the end user’s infrastructure.

This is a huge plus because it means that no matter how locked-down the
customer’s infrastructure, no matter how crappy the proxy, no matter the IDS in use, we’ll always be able to communicate with our server.

Until, of course, the mobile carrier most used by our customers decides
to add a “transparent” (do note the quotes) proxy to the mix.

We were quite stumped last week when we got reports of an HTTP error 408 being reported by the mobile devices, especially because we were not seeing error 408 in our logs.

Worse, using tcpdump has clearly shown how we were getting a RST
packet from the client, sometimes before sending data, sometimes while
sending data.

Strange: The client is showing a 408, the server is seeing a RST from the
client. Doesn’t make sense.

Tethering my Mac using the iPhone’s personal hotspot feature and a SIM
card of the mobile provider in question made it clear: No longer are we
talking directly to our server. No. What the client receives is a 408
HTML formatted error message by a proxy server.

Do note the “DELETE THIS LINE” and “your organization here” comments.
What a nice touch. Somebody was really spending a lot of time getting
this up and running.

Now granted, taking 20 seconds before being able to produce a response
is a bit on the longer side, but unfortunately, some versions of the
scanner software require gzip compression and gzip compression needs to
know the full size of the body to compress, so we have to prepare the
full response (40 megs uncompressed) before being able to send anything
– that just takes a while.

But consider long-polling or server sent events – receiving a 408 after
just 20 seconds? That’s annoying, wasting resources and probably not
something you’re prepared for.

Worse, nobody was notified of this change. For 7 years, the clients
were able to connect directly to our server. Then one day that changed and
now they can’t. No communication, no time to prepare and limits certainly too
strict not to affect anything (not just us – see my remark about long
polling).

The solution in the end is, like so often, to use SSL. SSL connections
are opaque to any intermediate proxy. A proxy can’t decrypt the data without
the client noticing. An SSL connection can’t be inspected and an SSL
connection can’t be messed with.

Sure enough: The exact same request that fails with that 408 over HTTP
goes through nicely using HTTPS.
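Incidentally, comparing certificate fingerprints is a quick way to check
whether you’re really talking to your own server and not to some middlebox;
a sketch (host and file names are examples):

# what the client actually sees
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -fingerprint -sha256

# what you deployed on the server
openssl x509 -noout -fingerprint -sha256 -in /etc/ssl/certs/example.com.pem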

This trick works every time when somebody is messing with your
connection. Something f’ing up your WebSocket connection? Use SSL!
Something messing with your long-polling? Use SSL. Something
decompressing your response but not stripping off the Content-Encoding
header (yes. that happened to me once)? Use SSL. Something replacing
arbitrary numbers in your response with asterisks (yepp. happened too)?
You guessed it: Use SSL.

Of course, there are three things to keep in mind:

  1. Due to the lack of SNI in the world’s most used OS and Browser
    combination (any IE under Windows XP), every SSL site you host requires
    one dedicated IP address. Which is bad considering that we are running
    out of addresses.

  2. All of the bigger mobile carriers have their CA in the browsers’
    trusted list. Aside from ethics, there is no reason whatsoever for them
    to not start doing all the crap I described and just re-encrypting the
    connection, faking a certificate using their trusted ones.

  3. Failing that, they still might just block SSL at some point, but as
    more and more sites are going SSL only (partially for above reasons no
    doubt), outright blocking SSL is going to be more and more unlikely to
    happen.

So. Yes. When in doubt: Use SSL. Not only does that help your users’
privacy, it also fixes a ton of technical issues created by practically
non-authorized third parties messing with you.

how to accept SSL client certificates

Yesterday I was asked on twitter how you would use client certificates
on a web server in order to do user authentication.

Client certificates are very handy in a controlled environment and they
work really well to authenticate API requests. They are, however,
completely unusable for normal people.

Getting meaningful information from client side certificates is
something that’s happening as part of the SSL connection setup, so it
must be happening on whatever piece of your stack that terminates the
client’s SSL connection.

In this article I’m going to look into doing this with nginx and Apache
(both traditional frontend web servers) and in node.js which you might
be using in a setup where clients talk directly to your application.

In all cases, what you will need is a means for signing certificates in
order to ensure that only client certificates you signed get access to
your server.

In my use cases, I’m usually using openssl which comes with some
subcommands and a helper script to run a certificate authority. On the
Mac if you prefer a GUI, you can use Keychain Access which has all you
need in the “Certificate Assistant” submenu of the application menu.

Next, you will need the public key of your users. You can have them
send in a traditional CSR and sign that on the command line (use
openssl req to create the CSR, use openssl ca to sign it), or you
can have them submit an HTML form using the <keygen> tag (yes. that
exists. Read up on it on MDN
for example).

You absolutely never ever in your lifetime want the private key of
the user. Do not generate a keypair for the user. Have them generate a
key and a CSR, but never ever have them send the key to you. You only
need their CSR (which contains their public key, signed by their
private key) in order to sign their public key.
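In its simplest form (skipping the openssl ca machinery and its index and
serial files), this boils down to something like the following; file names
and the subject are examples:

# on the user's machine: key plus CSR; the key never leaves that machine
openssl req -new -newkey rsa:2048 -nodes \
    -keyout client.key -out client.csr -subj "/CN=alice@example.com"

# on your side: sign the CSR with the CA key
openssl x509 -req -in client.csr \
    -CA ca.crt -CAkey ca.key -CAcreateserial \
    -out client.crt -days 365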

Ok. So let’s assume you got that out of the way. What you have now is
your CA’s certificate (usually self-signed) and a few users who now
own certificates you have signed for them.

Now let’s make use of this (I’m assuming you know reasonably well how
to configure these web servers in general. I’m only going into the
client certificate details).

nginx

For nginx, make sure you have enabled SSL using the usual steps. In
addition to these, set ssl_client_certificate
(docs)
to the path of your CA’s certificate. nginx will only accept client
certificates that have been signed by whatever ssl_client_certificate
you have configured.

Furthermore, set ssl_verify_client
(docs)
to on. Now only requests that provide a client certificate signed by
above CA will be allowed to access your server.

When doing so, nginx will set a few additional variables for you to
use, notably $ssl_client_cert (the full certificate),
$ssl_client_s_dn (the subject name of the client certificate),
$ssl_client_serial (the serial number your CA has issued for their
certificate) and, most importantly, $ssl_client_verify which you should
check for SUCCESS.

Use fastcgi_param or proxy_set_header to pass these variables through to
your application (in the case of a proxied request header, make sure that
it was really nginx who set it and not a client faking it).
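Put together, the relevant part of an nginx configuration could look
roughly like this (paths and the FastCGI backend are placeholders):

server {
    listen 443 ssl;
    ssl_certificate        /etc/nginx/ssl/server.crt;
    ssl_certificate_key    /etc/nginx/ssl/server.key;

    # the CA that signed your clients' certificates
    ssl_client_certificate /etc/nginx/ssl/client-ca.crt;
    ssl_verify_client on;

    location / {
        include fastcgi_params;
        fastcgi_param SSL_CLIENT_VERIFY $ssl_client_verify;
        fastcgi_param SSL_CLIENT_S_DN   $ssl_client_s_dn;
        fastcgi_param SSL_CLIENT_SERIAL $ssl_client_serial;
        fastcgi_pass  127.0.0.1:9000;
    }
}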

I’ll talk about what you do with these variables a bit later on.

Apache

As with nginx, ensure that SSL is enabled. Then set
SSLCACertificateFile to the path to your CA’s certificate. Then set
SSLVerifyClient to require
(docs).

Apache will also set many variables for you to use in your application.
Most notably SSL_CLIENT_S_DN (the subject of the client
certificate) and SSL_CLIENT_M_SERIAL (the serial number your CA has
issued). The full certificate is in SSL_CLIENT_CERT.
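Roughly, inside the SSL virtual host (paths are placeholders):

SSLEngine on
SSLCertificateFile    /etc/apache2/ssl/server.crt
SSLCertificateKeyFile /etc/apache2/ssl/server.key

# the CA that signed your clients' certificates
SSLCACertificateFile  /etc/apache2/ssl/client-ca.crt
SSLVerifyClient require
SSLVerifyDepth 1

# export the SSL_CLIENT_* variables to the application
<Location />
    SSLOptions +StdEnvVars
</Location>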

node.js

If you want to handle the whole SSL stuff on your own, here’s an
example in node.js. When you call https.createServer
(docs),
pass in some options. One is requestCert which you would set to true.
The other is ca which you should set to an array of strings in PEM
format containing your CA’s certificate.

Then you can check whether the certificate check was successful by
looking at the client.authorized property of your request object.

If you want to get more info about the certificate, use
request.connection.getPeerCertificate().
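A minimal, self-contained sketch (file names are placeholders):

// client-certificate protected HTTPS server - a sketch
var https = require('https');
var fs = require('fs');

var options = {
  key:  fs.readFileSync('server.key'),
  cert: fs.readFileSync('server.crt'),
  ca:   [fs.readFileSync('client-ca.crt')], // your CA's certificate
  requestCert: true,          // ask the client for a certificate
  rejectUnauthorized: false   // let the application decide what to do
};

https.createServer(options, function (req, res) {
  if (!req.client.authorized) {
    res.writeHead(401);
    return res.end('client certificate required\n');
  }
  var cert = req.connection.getPeerCertificate();
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('hello ' + cert.subject.CN + '\n');
}).listen(8443);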

what now?

Once you have the information about the client certificate (via
fastcgi, reverse proxy headers or apache variables in your module),
then the question is what you are going to do with that information.

Generally, you’d probably couple the certificate’s subject and its
serial number with some user account and then use the subject and
serial as a key to look up the user data.

As people get new certificates issued (because they might expire), the
subject name will stay the same, but the serial number will change, so
depending on your use-case use one or both.

There are a couple of things to keep in mind though:

  • Due to a flaw in the SSL protocol which was discovered in 2009,
    you cannot safely have only parts of your site require a certificate.
    With most client libraries, this is an all-or-nothing deal. There is
    a secure renegotiation, but I don’t think it’s widely supported at
    the moment.
  • There is no notion of signing out. The clients have to present their
    certificate, so your clients will always be signed on (which might
    be a good thing for your use-case)
  • The UI in traditional browsers to handle this kind of thing is
    absolutely horrendous.
    I would recommend using this only for APIs or with managed devices
    where the client certificate can be preinstalled silently.

You do however gain a very good method for uniquely identifying
connecting clients without a lot of additional protocol overhead. The
SSL negotiation isn’t much different whether the client is presenting a
certificate or not. There’s no additional application level code
needed. Your web server can do everything that’s needed.

Also, there’s no need for you to store any sensitive information. No
more leaked passwords, no more fear of leaking passwords. You just
store whatever information you need from the certificate and make sure
they are properly signed by your CA.

As long as you don’t lose your CA’s private key, you can absolutely
trust your clients, and no matter how much data an attacker gets when
they break into your web server, they won’t get passwords, nor the
ability to log in as any user.

Conversely though, make sure that you keep your CA private key
absolutely safe. Once you lose it, you will have to invalidate all
client certificates and your users will have to go through the process
of generating new CSRs, sending them to you and so on. Terribly
inconvenient.

In the same vein: Don’t have your CA certificate expire too soon. If it
does expire, you’ll have the same issue at hand as if you lost your
private key. Very annoying. I learned that the hard way back in
2001ish and that was only for internal use.

If you need to revoke a user’s access, either blacklist their serial
number in your application or, much better, set up a proper CRL for
your certificate authority and have your web server check that.

So. Client certificates can be a useful tool in some situations. It’s
your job to know when, but at least now you have some hints to get you
going.

Me personally, I was using this once around 2009ish for a REST
API, but I have since replaced that with OAuth because that’s what most
of the users knew best (read: “at all”). Depending on the audience,
client certificates might be totally foreign to them.

But if it works for you, perfect.

tempalias.com – sysadmin work

This is yet another episode in the development diary behind the creation of a new web service. Read the previous installment here.

Now that I made the SMTP proxy do its thing and that I’m able to serve out static files, I thought it was time to actually set up the future production environment so that I can give it some more real-world testing and check the general stability of the solution when exposed to the internet.

So I went ahead and set up a new VM using Ubuntu Lucid beta, running the latest (HEAD) redis and node, and finally made it run the tempalias daemons (which I consolidated into one process that opens the SMTP and HTTP ports at the same time for easier handling).

I always knew that deployment would be something of a problem to tackle. SMTP needs to run on port 25 (if you intend to be running on the machine listed as MX) and HTTP should run on port 80.

Both ports being below 1024, they consequently require root privileges to listen on, and I definitely didn’t want the first ever node.js code I’ve written to run with root privileges (even though it’s a VM – I don’t like to hand out free root on a machine that’s connected to the internet).

So additional infrastructure was needed and here’s what I came up with:

The tempalias web server listens only on localhost on port 8080. A reverse nginx proxy listens on public port 80 and forwards the requests (all of them – node is easily fast enough to serve the static content). This solves another issue I had which is HTTP content compression: Providing compression (Content-Encoding: gzip) is imperative these days and yet not something I want to implement myself in my web application server.

Having the reverse proxy is a tremendous help as it can handle the more advanced webserver tasks – like compression.
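The relevant part of that nginx setup looks roughly like this (a sketch – the gzip_types quirk discussed below comes on top of it):

server {
    listen 80;
    server_name tempalias.com;

    gzip on;

    location / {
        # the node process listening on localhost only
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}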

I quickly noticed though that the stable nginx release provided with Ubuntu Lucid didn’t seem to be willing to actually do the compression despite it being turned on. A bit of experimentation revealed that stable nginx, when comparing content-types against gzip_types, checks the full response Content-Type including the charset part.

As node-paperboy adds the “;charset: UTF-8” to all responses it serves, the default setting didn’t compress. Thankfully though, nginx could live with

gzip_types "text/javascript; charset: UTF-8" "text/html; charset: UTF-8"

so that settled the compression issue.

Update: of course it should be “charset=UTF-8” instead of “charset: UTF-8” – with the equal sign, nginx actually compresses correctly. My patch to paperboy has since been accepted upstream, so you won’t have to deal with this hassle.

Next was SMTP. As we are already an SMTP proxy and there are no further advantages of having incoming connections proxied further (no compression or anything), I wanted clients to somehow directly connect to the node daemon.

I quickly learned that even the most awesome iptables setup won’t make the Linux kernel accept on the lo interface anything that didn’t originate from lo, so no amount of NATing allows you to redirect a packet from a public interface to the local interface.

Hence I went and reconfigured the SMTP server component of tempalias to listen on all interfaces on port 2525 and then redirected packets arriving on the public port 25 to 2525.

This of course left the port 2525 open on the public interface which I don’t like.

A quickly created iptables rule rejecting (as opposed to dropping – I don’t want a casual port scanner to know that iptables magic is going on) any traffic going to 2525 also dropped the redirected traffic which of course wasn’t much help.

In comes the MARK extension. Here’s what I’ve done:

# mark packets going to port 25
iptables -t mangle -A PREROUTING -i eth0 -p tcp --dport 25 -j MARK --set-mark 99

# redirect packets going to port 25 to 2525
iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 25 -j REDIRECT --to-ports 2525

# drop all incoming packets to 2525 which are not marked
iptables -A INPUT -i eth0 -p tcp --dport 2525 -m mark ! --mark 99 -j REJECT

So. Now the host responds on public port 25 (but not on public port 2525).

Next step was to configure DNS and tell Richard to create himself an alias using

curl --no-keepalive -H "Content-Type: application/json" \
     --data-binary '{"target":"t@example.com","days": 3,"max-usage": 5}' \
     -qsS http://tempalias.com/aliases

(yes. you too can do that right now – it’s live baby!)

Of course it blew up the moment the redis connection timed out, taking the whole node server with it.

Which was the topic of yesterday’s coding session: The redis-node-client library is very brittle where connection tracking and keeping are concerned. I needed something quick, so I hacked the library to provide an additional, very explicit connection management method.

Then I began discussing the issues I was having with redis-node-client’s author. He’s such a nice guy and we had one hell of a nice discussion which is still ongoing, so I will probably have to rewrite the backend code once more once we have figured out how to do this the right way.

Between all that sysadmin and library-fixing time, unfortunately, I didn’t yet have time to do all too much on the public facing website: http://tempalias.com at this point contains nothing but a gradient. But it’s a really nice gradient. One of the best.

Today: More redis-node-client hacking (provided I get another answer from fictorial) or finally some real HTML/CSS work (which I’m not looking forward to).

This is taking shape.

Dependent on working infrastructure

If you create and later deploy and run a web application, then you are dependent on a working infrastructure: You need a working web server, you need a working application server and in most cases, you’ll need a working database server.

Also, you’d want a solution that always and consistently works.

We’ve been using lighttpd/FastCGI/PHP for our deployment needs lately. I’ve preferred this to apache due to the easier configuration possible with lighty (out of the box automated virtual hosting for example), the potentially higher performance (due to long-running FastCGI processes) and the smaller amount of memory consumed by lighttpd.

But last week, I had to learn the price of walking off the beaten path (Apache, mod_php).

In one particular constellation – the lighty, fastcgi, PHP combination running on a Gentoo box – a certain script sometimes (read: 50% of the time) didn’t output all the data it should have. Instead, lighty randomly sent out RST packets. This without any indication of what could be wrong in any of the involved log files.

Naturally, I looked everywhere.

I read the source code of PHP. I’ve created reduced test cases. I’ve tried workarounds.

The problem didn’t go away until I tested the same script with Apache.

This is where I’m getting pragmatic: I depend on a working infrastructure. I need it to work. Our customers need it to work. I don’t care who is to blame. Is it PHP? Is it lighty? Is it Gentoo? Is it the ISP (though it would have to be on the sender’s end as I’ve seen the described failure with different ISPs)?

I don’t care.

My interest is in developing a web application. Not in deploying one. Not really, anyways.

I’m willing (and able) to fix bugs in my development environment. I may even be able to fix bugs in my deployment platform. But I’m certainly not willing to. Not if there is a competing platform that works.

So after quite some time with lighty and fastcgi, it’s back to Apache. The prospect of having a consistently working backend largely outweighs the theoretical benefits of memory savings, I’m afraid.