GitHub as an identity provider

As I write these words, I have just finally disabled our Samba-based LDAP/Active Directory server, which was incredibly hard to update, to back up and to re-create in case of an emergency.

We’ve traditionally used Gmail for Email and Calendar access and LDAP/Active Directory for everything else: The office WLAN, VPN access, internal support tool access, SSH access, you name it.

However, that solution came with operational drawbacks for us:

  • Unix support for Active Directory is flaky at best and the UI that Windows once provided for setting up LDAP attributes required for Unix users has stopped working with Windows 8
  • Samba is hard to install and keep up to date. Setting up an Active Directory domain creates a ton of local state that’s difficult to back up and impossible to put into a configuration management system.
  • The LDAP server being reachable is a prerequisite for authentication on machines to work. This meant that tunnels needed to be created for various machines to have access to that.
  • Each Debian update came with new best practice for LDAP authentication of users, so each time new tools and configuration needed to be learned, tested and applied.
  • As we’re working with contractors, giving temporary access to our support tools is difficult because we need to create temporary LDAP accounts for them.
  • Any benefits that could have been had by having workstations use a centralized user account database have evaporated over time as our most-used client OS (macOS) lost more and more central directory support.

On the other hand, everybody at our company has a GitHub account and so do contractors, so we’re already controlling user access via GitHub.

GitHub provides excellent support for account security too, especially once we could force 2FA to be enabled for all users. Also, over time, our reliance on locally installed tools grew smaller and smaller while, on the other hand, most cloud services started to provide OAuth Sign-in-with-GitHub functionality.

It became clear that stuff would work ever so much better if only we could get rid of that LDAP/Active Directory thing.

So I started a multi-year endeavor to get rid of LDAP. It wouldn’t have needed to be multi-year, but as this happened mostly as side-projects, I tackled them whenever I had the opportunity.

I’m fully aware that this puts us into the position where we’re dependent on our Github subscription to be active. But so does a lot of our daily development work. We use Issues, Pull Requests, GitHub Actions, you name it. While git itself is, of course, decentralized, all the other services provided by GitHub are not.

Then again, while we would be in a really bad spot with regards to our development processes, unhooking our glue code from GitHub and changing it to a traditional username/password solution would be very feasible even in a relatively short time-frame (much shorter than the disruption to the rest of our processes).

Which means that I’m ready to increase the dependency on GitHub even more and use them as our identity provider.

The first thing to change was the internal support tool our support team uses to get authenticated access to our sites for support purposes. The interface between that tool and the sites has always used signed access tickets (think JWT, but slightly different, mostly because they pre-date JWT by about 5 years) to give users access. The target site itself did not need access to LDAP; only the support tool needed it to authenticate our support team members.

So unhooking that tool from LDAP and hooking it up to Github was the first step. Github has well-documented support for writing OAuth client apps to authenticate users.
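
Plugging a web app into that flow needs surprisingly little code. The following is a minimal sketch of the GitHub OAuth web flow in Python using requests; the client ID/secret, the org and team names and the callback handling around it are placeholder assumptions, not our actual tool:

import requests

# Hypothetical values – register your own OAuth app on GitHub to get these.
CLIENT_ID = "your-oauth-app-client-id"
CLIENT_SECRET = "your-oauth-app-client-secret"
ORG = "your-org"
TEAM = "support"

def exchange_code_for_token(code):
    """Turn the ?code=... from the OAuth callback into an access token."""
    r = requests.post(
        "https://github.com/login/oauth/access_token",
        data={"client_id": CLIENT_ID, "client_secret": CLIENT_SECRET, "code": code},
        headers={"Accept": "application/json"},
    )
    r.raise_for_status()
    return r.json()["access_token"]

def authenticated_user(token):
    """Ask the GitHub API which account the token belongs to."""
    r = requests.get("https://api.github.com/user",
                     headers={"Authorization": f"token {token}"})
    r.raise_for_status()
    return r.json()["login"]

def is_team_member(token, username):
    """Check team membership (needs the read:org scope); a 404 means 'no'."""
    r = requests.get(
        f"https://api.github.com/orgs/{ORG}/teams/{TEAM}/memberships/{username}",
        headers={"Authorization": f"token {token}"},
    )
    return r.status_code == 200 and r.json().get("state") == "active"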

Next was authenticating users to give SSH access to production machines. This was solved by teaching our support tool how to sign SSH public keys and by telling the production machines to trust that CA.

I wrote a small utility in Swift to send an SSH public key to the support tool to have it signed and to install the certificate in the SSH agent. The certificates have a short lifetime ranging from one day to at most one week (depending on user) and using the GitHub API, the central tool knows about team memberships which allows us to confer different permissions on different servers based on team membership.
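
On the server side, the signing step boils down to shelling out to ssh-keygen with the CA key. Here is a rough sketch of what that could look like; the CA path, principals and lifetime are made up for illustration, and the mapping from GitHub teams to principals is left out:

import os
import subprocess
import tempfile

CA_KEY = "/etc/ssh-ca/ca_ed25519"  # hypothetical path to the CA private key

def sign_public_key(public_key, username, principals, validity="+1d", serial=0):
    """Sign an OpenSSH public key and return the resulting certificate."""
    with tempfile.TemporaryDirectory() as tmp:
        pub_path = os.path.join(tmp, "user.pub")
        with open(pub_path, "w") as f:
            f.write(public_key)
        subprocess.run(
            ["ssh-keygen",
             "-s", CA_KEY,                # sign with our CA key
             "-I", username,              # key identity, shows up in sshd logs
             "-n", ",".join(principals),  # principals the cert is valid for
             "-V", validity,              # short lifetime, e.g. +1d or +1w
             "-z", str(serial),
             pub_path],
            check=True,
        )
        # ssh-keygen writes the certificate next to the input key
        with open(os.path.join(tmp, "user-cert.pub")) as f:
            return f.read()

On the servers themselves, all that’s needed is a TrustedUserCAKeys line in sshd_config pointing at the CA’s public key (plus, optionally, an AuthorizedPrincipalsFile to map principals to local accounts).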

All of the SSH servers can do the certificate validation entirely locally (due to the short lifetime, we can live without a CRL), independent of network access (of which there is none for some machines).

This means that SSH access is now possible independently of LDAP and even of network availability. And it’s using a mechanism that’s very simple and comes with zero dependencies aside from openssh itself.

Then came the VPN. I used to run IPSec with IKEv2 to provide authenticated access to parts of the production network. It (or rather the RADIUS server it used) needed access to LDAP, and even though it was using stock PFSense IPSec support, it was unreliable and needed restarts with some regularity.

This was entirely replaced by an SSH bastion host and ProxyJump in conjunction with the above SSH certificates. No more LDAP: production access is now based on GitHub team membership of GitHub accounts. While it never happened and I would be very wary of allowing it, this would allow us to give selective access to machines to contractors based on nothing but their GitHub account (and who doesn’t have one of those these days).

Behind the production network, there’s another, darker part of the infrastructure. That’s the network where all the remote management interfaces and the virtual machine hosts are connected to. This one is absolutely critical and access to it is naturally very restricted.

The bastion host described above does not have access to that network.

In comes the next hat our support tool/github integration is wearing: It can synchronize Tailscale ACLs with Github and it can dynamically alter the ACL to give temporary access to specific users.

Tailscale itself uses Github as an identity provider too (and also supports custom identity providers, so, again, losing GitHub for this would not be the end of the world) and our support tool uses the GitHub and Tailscale APIs to make sure that only users in a specific GitHub group get access to Tailscale at all.

So everybody who needs network access that’s not doable or not convenient via the SSH bastion host has a Tailscale account (very few people), and of those, even fewer are in a GitHub Team that causes our support tool to allow such users to request temporary (30 min max) access to the super secret backstage network.
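
Conceptually, the temporary-access feature is just “fetch the ACL, splice in a time-limited rule, write it back”. Here is a rough sketch against the Tailscale v2 ACL API; the tailnet name, API key and the shape of the rule are placeholders, and the real tool additionally deals with HuJSON, concurrent edits and the actual expiry:

import time
import requests

TAILNET = "example.com"        # hypothetical tailnet name
API_KEY = "tskey-api-..."      # hypothetical API key
ACL_URL = f"https://api.tailscale.com/api/v2/tailnet/{TAILNET}/acl"

def grant_temporary_access(user_login, target, minutes=30):
    """Add an ACL rule for one user; a separate cleanup job removes it again."""
    # Assumes the ACL can be fetched as plain JSON; the native format is HuJSON.
    policy = requests.get(ACL_URL, auth=(API_KEY, ""),
                          headers={"Accept": "application/json"}).json()
    policy.setdefault("acls", []).append({
        "action": "accept",
        "src": [f"{user_login}@github"],   # placeholder user spec
        "dst": [f"{target}:*"],
    })
    requests.post(ACL_URL, auth=(API_KEY, ""), json=policy).raise_for_status()
    return time.time() + minutes * 60      # when the cleanup job should revoke it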

Which completely removes the last vestiges of the VPN from the picture and leaves us with just one single dependency: The office wifi.

Even though the office network really isn’t in a privileged position (any more), I want access to that network to be authenticated and I want to be able to revoke access to individual users.

Which is why we have always used Enterprise WPA over RADIUS against Active Directory/SAMBA to authenticate WiFi access to the office network.

This has now been replaced by, you guessed it, our support tool, which creates and stores a unique, completely random password for each user in a specific GitHub Team and offers an API endpoint to be used by the freeRADIUS rlm_rest module to authenticate those users. In order to still have WiFi even when our office internet access is unavailable (though I can’t really see why we’d need that given our reliance on cloud-based services), I added a local proxy in front of that API endpoint which serves a stale response in case of errors (for some hours – long enough for us to fix the internet outage but short enough to not let no-longer-authenticated users access the network).
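
The authentication endpoint itself is tiny. What exactly rlm_rest sends is entirely a matter of how you configure it, so the request format below (a JSON body with username and password fields), the URL and the in-memory password store are assumptions for the sake of illustration:

import hmac
from flask import Flask, request, abort

app = Flask(__name__)

# In reality: the per-user random passwords the support tool generated for
# everybody in the relevant GitHub Team.
WIFI_PASSWORDS = {
    "alice": "generated-random-password-1",
    "bob": "generated-random-password-2",
}

@app.route("/radius/auth", methods=["POST"])
def radius_auth():
    """Endpoint for the freeRADIUS rlm_rest module: 204 = accept, 401 = reject."""
    data = request.get_json(force=True)
    expected = WIFI_PASSWORDS.get(data.get("username", ""))
    if expected and hmac.compare_digest(data.get("password", ""), expected):
        return "", 204
    abort(401)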

With this last step, our final dependency on LDAP was dropped and all of our identity management is now outsourced to GitHub, so I could finally issue that one last command:

shutdown -h now

Tailscale on PFSense

For a bit more than a year, I have been a user of Tailscale, a service that builds an overlay network on top of WireGuard while relying on OAuth with third-party services for authentication.

It’s incredibly easy to get going with Tailscale and the free tier they provide is more than good enough for the common personal use cases (in my case: tech support for my family).

Most of the things that are incredibly hard to set up with traditional VPN services just work out of the box or require a minimal amount of configuration. Heck, even more complicated things like tunnel splitting and DNS resolution in different private subnets just work. It’s magic.

While I have some gripes that prevent me from switching all our company VPN connectivity over to them, those are a topic for a future blog post.

The reason I’m writing here right now is that a few weeks ago, Netgate and Tailscale announced a Tailscale package for PFSense. As a user of both PFSense and Tailscale, this allowed me to get rid of a VM that does nothing but be a Tailscale exit node and subnet router and instead use the Tailscale package to do this on PFSense.

However, doing this for a week or so has revealed some very important things to keep in mind which I’m posting about here because other people (and that includes my future self) will run into these issues and some are quite devastating:

When using the Tailscale package on PFSense, you will encounter two issues directly caused by Tailscale, both of which also show up in other, unrelated reports when you search for them on the internet, so you might be led astray when debugging.

Connection loss

The first one is the bad one: After some hours of usage, an interface on your PFSense box will become unreachable, dropping all traffic through it. A reboot will fix it and when you then look at the system log, you will find many lines like

arpresolve: can't allocate llinfo for <IP-Address> on <interface>
I’m in so much pain right now

This will happen if one of your configured gateways in “System > Routing” is reachable both by a local connection and through Tailscale by subnet router (even if your PFSense host itself is told to advertise that route).

I might have overdone the fixing, but here are all the steps I have taken:

  • Tell Tailscale on PFSense to never use any advertised routes (“VPN > Tailscale > Settings”, uncheck “Accept subnet routes that other nodes advertise.”)
  • Disable gateway monitoring under “System > Routing > Gateways” by clicking the pencil next to the gateway in question.

I think what happens is that PFSense will accidentally believe that the subnet advertised via Tailscale is not local and will then refuse to add the address of that gateway to its local ARP table.

IMHO, this is a bug in Tailscale. It should never mess with interfaces it’s exposing as a subnet router to the overlay network.

Log Spam

The second issue is not as bad, but as the effect is so far removed from the cause, it’s still worth talking about.

When looking at the system log (which you will do for the above issue), you will see a ton of entries like

sshguard: Exiting on signal
sshguard: Now monitoring attacks.
this can’t be good. Can it?

What happens is that PFSense moved, a few releases ago, from a binary ring-buffer for logging to a more naïve approach: once a minute it checks whether a log file is too big and, if so, rotates it and restarts the daemons logging to that file.

If a daemon doesn’t have a built-in means for re-opening log files, PFSense will kill and restart the daemon, which happens to be the case for sshguard.

So the question is: Why is the log file being rotated every minute? This is caused by the Tailscale overlay network: the firewall by default blocks Tailscale traffic (UDP port 41641) to the WAN interface and, also by default, logs every dropped packet.

In order to fix this and assuming you trust Tailscale and their security update policies (which you probably should given that you just installed their package on a gateway), you need to create a rule to allow UDP port 41641 on the WAN interface.

much better now

This, too, IMHO is a bug in the Tailscale package: If your package opens port 41641 on an interface of a machine whose main purpose is being a firewall, you should probably also make sure that traffic to that port is not blocked.

With these two configuration changes in place, the network is stable and the log spam has gone away.

What’s particularly annoying about these two issues is that Googling for either of the two error messages will yield pages and pages of results, none of which apply because they will have many more possible causes and because Tailscale is a very recent addition to PFSense.

This is why I decided to post this article in order to provide one more result in Google and this time combining the two keywords: Tailscale and PFSense, in the hope of helping fellow admins who run into the same issues after installing Tailscale on their routers.

Joining Debian to ActiveDirectory

This blog post is a small list of magic incantations to be issued and animals to be sacrificed in order to join a Unix machine (Debian in this case) to a (Samba-powered) ActiveDirectory domain.

All of these things have to be set up correctly or you will suffer eternal damnation in non-related-error-message hell:

  • Make absolutely sure that DNS works correctly
    • the new member server’s hostname must be in the DNS domain of the AD Domain
    • This absolutely includes reverse lookups (a quick sanity check for both directions is sketched after this list).
    • Same goes for the domain controller. Again: Absolutely make sure that you set up a correct PTR record for your domain controller or you will suffer the curse of GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Server not found in Kerberos database)
  • Disable IPv6 everywhere. I normally always advocate against disabling IPv6 in order to solve problems and instead just solve the problem, but bugs exist. Failing to disable IPv6 on either the server or the client will also cause you to suffer in Server not found in Kerberos database hell.
  • If you made previous attempts to join your member server, even when you later left the domain again, there’s probably a lingering host-name added by a previous dns update attempt. If that exists, your member server will be in ERROR_DNS_UPDATE_FAILED hell even if DNS is configured correctly.
    • In order to check, use samba-tool on the domain controller samba-tool dns query your.dc.ip.address your.domain.name memberservername ALL
    • If there’s a hostname, get rid of it using samba-tool dns delete your.dc.ip.address your.domain.name memberservername A ip.returned.above
  • Make sure that the TLS certificate served by your AD server is trusted, either directly or chained to a trusted root. If you’re using a self-signed root (you’re probably doing that), add the root as a PEM file (but with a .crt extension!) to /usr/local/share/ca-certificates/ and run /usr/sbin/update-ca-certificates. If you fail to do this correctly, you will suffer in ldap_sasl_interactive_bind_s: Can't contact LDAP server (-1) hell (no, nothing will inform you of a certificate error – all you get is “can’t connect”)
  • In order to check that everything is set up correctly, before even trying realmd or sssd, use ldapsearch: ldapsearch -H ldap://your.dc.host/ -Y GSSAPI -N -b "dc=your,dc=base,dc=dn" "(objectClass=user)"
  • Aside from all that, you can follow this guide, but also make sure that you manually install the krb5-user package. The Debian package database has a missing dependency, so the package doesn’t get pulled in even though it is required.
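
Since broken forward or reverse DNS is the most common way to end up in the Kerberos hell mentioned above, here is the kind of quick sanity check I mean; the hostnames are placeholders:

import socket

def check_dns(hostname):
    """Verify that forward and reverse DNS for a host agree with each other."""
    addresses = socket.gethostbyname_ex(hostname)[2]   # forward: name -> addresses
    print(f"{hostname} resolves to {addresses}")
    for address in addresses:
        ptr = socket.gethostbyaddr(address)[0]         # reverse: address -> name
        status = "OK" if ptr.lower() == hostname.lower() else "MISMATCH"
        print(f"  {address} reverses to {ptr} [{status}]")

# Run this for both the domain controller and the member server, e.g.:
check_dns("dc1.ad.example.com")
check_dns("member1.ad.example.com")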

All in all, this was a really bad case of XKCD 979 and in case you ask yourself whether I’m bitter, then let me tell you, that yes. I am bitter.

I can totally see that there are a ton of moving parts involved in this and I’m willing to nudge some of these parts in order to get the engine up and running. But it would definitely help if the various tools involved would give me meaningful log output. samba on the domain controller doesn’t log, tcpdump is pointless thanks to SSL everywhere, realmd fails silently while still saying that everything is ok (also, it’s unconditionally removing the config files it feeds into the various moving parts involved, so good luck trying to debug this), sssd emits cryptic error messages (see above) and so on.

Anyways. I’m just happy I got this working, and now for reproducing it one more time, but this time recording everything in Ansible.

Geek heaven

If I had to make a list of attributes I would like the ISP of my dreams to
have, then I could write quite the list:

  • I would really like to have native IPv6 support. Yes. IPv4 will be sufficient for a very long time, but unless people start having access to IPv6, it’ll never see the wide deployment it needs if we want the internet to continue to grow. An internet where addresses are only available to people with a lot of money is not an internet we all want to be subjected to (see my post «asking for permission»)
  • I would want my ISP to accept or even support network neutrality. For this to be possible, the ISP of my dreams would need to be nothing but an ISP so their motivations (provide better service) align with mine (getting better service). ISPs who also sell content have all the motivation to provide crappy Internet service in order to better sell their (higher-margin) content.
  • If I have technical issues, I want to be treated as somebody who obviously has a certain level of technical knowledge. I’m by no means an expert in networking technology, but I do know about powering it off and on again. If I have to say «shibboleet» to get to a real technician, so be it, but if that’s not needed, that’s even better.
  • The networking technology involved in getting me the connectivity I want should be widely available and thus easily replaceable if something breaks.
  • The networking technology involved should be as simple as possible: The more complex the hardware involved, the more stuff can break, especially when you combine cost-pressure for end-users with the need for high complexity.
  • The network equipment I’m installing at my home, and which thus has access to my LAN, needs to be equipment I own and control fully. I do not accept leased equipment to which I do not have full access.
  • And last but not least, I would really like to have as much bandwidth as possible

I’m sure I’m not alone with these wishes, even though, for «normal people» they might seem strange.

But honestly: They just don’t know it, but they too have the same interests. Nobody wants an internet that works like TV where you pay for access to a curated small list of “approved” sites (see network neutrality and IPv6 support).

Nobody wants to get up and reboot their modem every now and then because it crashed. Nobody wants to be charged with downloading illegal content because their Wifi equipment was suddenly repurposed as an open access point for other customers of an ISP.

Most of the wishes I list above are the basis needed for these horror scenarios never coming to pass, however unlikely they might seem now (though getting up and rebooting the modem/router is something we already have to deal with today).

So yes. While it’s getting rarer and rarer to get all the points of my list fulfilled, to the point where I thought it impossible to get all of them, I’m happy to say that here in Switzerland, there is at least one ISP that does all of this and more.

I’m talking about Init7 and especially their awesome FTTH offering Fiber7 which very recently became available in my area.

Let’s deal with the technology aspect first as this really isn’t the important point of this post: What you get from them is pure 1Gbit/s Ethernet. Yes, they do sell you a router box if you want one, but you can just as well get a simple media converter, or just an SFP module to plug into any (managed) switch (with an SFP port).

If you have your own routing equipment, be it a linux router like my shion or be it any Wifi Router, there’s no need to add any kind of additional complexity to your setup.

No additional component that can crash, no software running in your home to which you don’t have the password, and certainly no sneakily opened public WLANs (I’m looking at you, cablecom).

Of course you get native IPv6 (a /48 which incidentally is room for 281474976710656 whole internets in your apartment) too.

But what’s really remarkable about Init7 isn’t the technical aspect (though, again, it’s bloody amazing), but everything else:

  • Init7 was one of the first ISPs in Switzerland to offer IPv6 to end users.
  • Init7 doesn’t just support network neutrality.
    They actively fight for it
  • They explicitly state
    that they are not selling content and they don’t intend to start doing so. They are just an ISP and as such their motivations totally align with mine.

There are a lot of geeky soft factors too:

  • Their press releases are written in Open Office (check the PDF properties
    of this one for example)
  • I got an email from a technical person on their end that was written using
    f’ing Claws Mail on Linux
  • Judging from the Received headers of their email, they are using IPv6 in their internal LAN – down to the desktop workstations. And related to that:
  • The machines in their LAN respond to ICMPv6 pings which is utterly crazy cool. Yes. They are firewalled (cough I had to try. Sorry.), but they let ICMP through. For the not as technical readers here: This is as good an internet citizen as you will ever see and it’s extremely unexpected these days.

If you are a geek like me and if your ideals align with the ones I listed above, there is no question: You have to support them. If you can have their Fiber offering in your area, this is a no-brainer. You can’t get synchronous 1GBit/s for CHF 64ish per month anywhere else and even if you could, it wouldn’t be plain Ethernet either.

If you can’t have their fiber offering, it’s still worth considering their other offers. They do have some DSL based plans which of course are technically inferior to plain ethernet over fiber, but you would still support one of the few remaining pure ISPs.

It doesn’t have to be Init7 either. For all I know there are many others, maybe even here in Switzerland. Init7 is what I decided to go with initially because of the Gbit, but the more I learned about their philosophy, the less important the bandwidth got.

We need to support companies like these because companies like these are what ensures that the internet of the future will be as awesome as the internet is today.

Thoughts on IPv6

A few months ago, the awesome provider Init7 released their
awesome FTTH offering Fiber7 which provides
synchronous 1GBit/s access for a very fair price. Actually, they are by
far the cheapest provider for this kind of bandwidth.

Only cablecom comes close to matching them bandwidth-wise with their 250Mbit/s
package, but that’s a quarter of the bandwidth for nearly double the price. Init7
also is one of the only providers who officially states that
their triple-play strategy is that they don’t do it. Huge-ass kudos for
that.

Also, their technical support is using Claws Mail on GNU/Linux – to give you
some indication of the geek-heaven you get when signing up with them.

But what’s really exciting about Init7 is their support for IPv6. In fact,
Init7 was one of the first (if not the first) providers to offer IPv6 to
end users. Also, we’re talking about a real, non-tunneled, no-strings-attached
plain /48.

In case that doesn’t ring a bell, a /48 will allow for 2^16 networks
consisting of 2^64 hosts each. Yes. That’s that many hosts.

In eager anticipation of getting this at home natively (of course I ordered
Fiber7 the moment I could at my place), I decided to play with IPv6 as far as
I could with my current provider, which apparently lives in the stone-age and
still doesn’t provide native v6 support.

After getting abysmal pings using 6to4 about a year ago, this time I decided
to go with tunnelbroker which these days also
provides a nice dyndns-alike API for updating the public tunnel endpoint.

Let me tell you: Setting this up is trivial.

Tunnelbroker provides you with all the information you need for your tunnel
as well as the prefix of the /64 you get from them, and setting up your own
network is trivial using radvd.

The only thing that’s different from your old v4 config: All your hosts will
immediately be accessible from the public internet, so you might want to
configure a firewall from the get-go – but see later for some thoughts on that
matter.

But this isn’t any different from the NAT solutions we have currently. Instead
of configuring port forwarding, you just open ports on your router, but the
process is more or less the same.

If you need direct connectivity however, you can now have it. No strings attached.

So far, I’ve used devices running iOS 7 and 8, Mac OS X 10.9 and 10.10,
Windows XP, 7 and 8 and none of them had any trouble reaching the v6 internet.
Also, I would argue that configuring radvd is easier than configuring DHCP.
There’s less thought involved for assigning addresses because
autoconfiguration will just deal with that.

For me, I had to adjust how I think about my network a bit and I’m
posting here in order to explain what changes you’ll get with v6 and how some
paradigms change. Once you’ve accepted these changes, using v6 is trivial and
totally something you can get used to.

  • Multi-homing (multiple addresses per interface) was something you’ve rarely
    done in v4. Now in v6, you do that all the time. Your OSes go as far as to
    grab a new random one every few connections in order to provide a means of
    privacy.
  • The addresses are so long and hex-y – you probably will never remember them.
    But that’s ok. In general, there are much fewer cases where you worry about
    the address.

    • Because of multi-homing, every machine has a guaranteed static address
      by default (built from the MAC address of the interface – see the sketch
      after this list), so there’s no need to statically assign addresses in
      many cases.
    • If you want to assign static addresses, just pick any in your /64.
      Unless you manually hand out the same address to two machines,
      autoconfiguration will make sure no two machines pick the same address.
      In order to remember them, feel free to use cute names – finally you got
      some letters and leetspeak to play with.
    • To assign a static address, just do it on the host in question. Again,
      autoconfig will make sure no other machine gets the same address.
  • And with Zeroconf (avahi / bonjour), you have fewer and fewer opportunities
    to deal with anything that’s not a host-name anyways.
  • You will need a firewall because suddenly all your machines will be
    accessible for the whole internet. You might get away with just the local
    personal firewall, but you probably should have one on your gateway.
  • While that sounds like higher complexity, I would argue that the complexity
    is lower because if you were a responsible sysadmin, you were dealing with
    both NAT and a firewall whereas with v6, a firewall is all you need.
  • Tools like nat-pmp or upnp don’t support v6 yet as far as I can see, so
    applications in the trusted network can’t yet punch holes in the firewall
    (which is the equivalent of forwarding ports in the v4 days).
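
As an aside, the “guaranteed static address built from the MAC address”
mentioned in the list is the EUI-64 form of SLAAC. If you ever want to predict
such an address, the transformation is simple enough to do in a few lines of
Python (the prefix and MAC below are made up):

import ipaddress

def eui64_address(prefix, mac):
    """Build the SLAAC/EUI-64 address for a MAC inside a /64 prefix."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02                               # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    interface_id = int.from_bytes(bytes(eui64), "big")
    return ipaddress.IPv6Network(prefix)[interface_id]

# e.g. prefix 2001:db8:1234:5678::/64 and MAC 52:54:00:12:34:56
print(eui64_address("2001:db8:1234:5678::/64", "52:54:00:12:34:56"))
# -> 2001:db8:1234:5678:5054:ff:fe12:3456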

Overall, getting v6 running is really simple, and once you adjust your mindset
a bit, while stuff is unusual and takes some getting used to, I really don’t
see v6 as being more complicated. Quite the contrary, actually.

Speaking of firewalls and opening ports: as hosts get wiser about v6, you
really might get away without a strict firewall, as hosts could grab a new
random v6 address for every service they want to offer and then just bind
their servers to that address.

Services binding to all addresses would never bind to these temporary addresses.

That way none of the services brought up by default (you know – all those
ports open on your machine when it runs) would be reachable from the outside.
What would be reachable is the temporary addresses grabbed by specific
services running on your machine.
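
To make that concrete: binding a service to one specific address rather than
the wildcard is all it takes for it to be unreachable on every other address
the machine has. A minimal illustration (the address is made up):

import socket

# A service bound to one specific (temporary) global address is only
# reachable on that address...
specific = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
specific.bind(("2001:db8:1234:5678:d1ce:a5e5:1b2c:3d4e", 8080))
specific.listen(5)

# ...whereas the usual daemons bind to the wildcard address and therefore
# answer on every address the host has, including the stable EUI-64 one.
wildcard = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
wildcard.bind(("::", 2222))
wildcard.listen(5)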

Yes. An attacker could port-scan your /64 and try to find the non-temporary
address, but keep in mind that finding that one address out of 2^64
addresses would mean that you have to port-scan 4 billion traditional v4
internets per attack target (good luck) or randomly guess with an average
chance of 1 in 2^63 (also good luck).

Even then a personal firewall could block all unsolicited packets from
non-local prefixes to provide even more security.

As such, we really might get away without needing a firewall at the
gateway to begin with, which would go a long way toward providing the
ubiquitous configuration-free p2p connectivity that would be ever-so-cool and
which we have lost over the last few decades.

Me personally, I’m really happy to see how simple v6 actually is to get
implemented and I’m really looking forward to my very own native /48 which I’m
probably going to get somewhere in September/October-ish.

Until then, I’ll gladly play with my tunneled /64 (for now still firewalled,
but I’ll investigate into how OS X and Windows deal with the temporary
addresses they use which might allow me to actually turn the firewall off).

when in doubt – SSL

Since 2006, as part of our product, we have been offering barcode scanners
with GSM support to either send orders directly to the vendor or to
transmit products into the web frontend where you can further edit them.

Even though the devices (Windows Mobile. Crappy. In the process of being
updated) do support WiFi, we really only support GSM because that means we don’t have to share the end user’s infrastructure.

This is a huge plus because it means that no matter how locked-down the
customer’s infrastructure, no matter how crappy the proxy, no matter the IDS in use, we’ll always be able to communicate with our server.

Until, of course, the mobile carrier most used by our customers decides
to add a “transparent” (do note the quotes) proxy to the mix.

We were quite stumped last week when we got reports of an HTTP error 408 being reported by the mobile devices, especially because we were not seeing error 408 in our logs.

Worse, using tcpdump has clearly shown how we were getting a RST
packet from the client, sometimes before sending data, sometimes while
sending data.

Strange: The client is showing 408, the server is seeing a RST from the client.
Doesn’t make sense.

Tethering my Mac using the iPhone’s personal hotspot feature and a SIM
card of the mobile provider in question made it clear: No longer are we
talking directly to our server. No. What the client receives is a 408
HTML-formatted error message from a proxy server.

Do note the “DELETE THIS LINE” and “your organization here” comments.
What a nice touch. Somebody really spent a lot of time getting
this up and running.

Now granted, taking 20 seconds before being able to produce a response
is a bit on the longer side, but unfortunately, some versions of the
scanner software require gzip compression and our gzip compression needs
to know the full size of the body it compresses, so we have to prepare the
full response (40 megs uncompressed) before being able to send anything
– that just takes a while.

But consider long-polling or server sent events – receiving a 408 after
just 20 seconds? That’s annoying, wasting resources and probably not
something you’re prepared for.

Worse, nobody was notified of this change. For 7 years, the clients
were able to connect directly to our server. Then one day that changed
and now they can’t. No communication, no time to prepare and
limits that are certainly too strict not to affect anything (not
just us – see my remark about long-polling).

The solution in the end is, like so often, to use SSL. SSL connections
are opaque to any intermediate proxy. A proxy can’t decrypt the data without
the client noticing. An SSL connection can’t be inspected and an SSL
connection can’t be messed with.

Sure enough: The exact same request that fails with that 408 over HTTP
goes through nicely using HTTPS.

This trick works every time when somebody is messing with your
connection. Something f’ing up your WebSocket connection? Use SSL!
Something messing with your long-polling? Use SSL. Something
decompressing your response but not stripping off the Content-Encoding
header (yes. that happened to me once)? Use SSL. Something replacing
arbitrary numbers in your response with asterisks (yepp. happened too)?
You guessed it: Use SSL.

Of course, there are three things to keep in mind:

  1. Due to the lack of SNI in the world’s most used OS and Browser
    combination (any IE under Windows XP), every SSL site you host requires
    one dedicated IP address. Which is bad considering that we are running
    out of addresses.

  2. All of the bigger mobile carriers have their CA in the browsers’
    trusted list. Aside from ethics, there is no reason whatsoever for them
    not to start doing all the crap I described and just re-encrypting the
    connection, faking a certificate using their trusted ones.

  3. Failing that, they still might just block SSL at some point, but as
    more and more sites are going SSL-only (partially for the above reasons,
    no doubt), outright blocking SSL is going to become more and more
    unlikely.

So. Yes. When in doubt: Use SSL. Not only does that help your users’
privacy, it also fixes a ton of technical issues created by practically
non-authorized third parties messing with you.

how to accept SSL client certificates

Yesterday I was asked on twitter how you would use client certificates
on a web server in order to do user authentication.

Client certificates are very handy in a controlled environment and they
work really well to authenticate API requests. They are, however,
completely unusable for normal people.

Getting meaningful information from client side certificates is
something that’s happening as part of the SSL connection setup, so it
must be happening on whatever piece of your stack that terminates the
client’s SSL connection.

In this article I’m going to look into doing this with nginx and Apache
(both traditional frontend web servers) and in node.js which you might
be using in a setup where clients talk directly to your application.

In all cases, what you will need is a means for signing certificates in
order to ensure that only client certificates you signed get access to
your server.

In my use cases, I’m usually using openssl, which comes with some
subcommands and a helper script to run a certificate authority. On the
Mac, if you prefer a GUI, you can use Keychain Access, which has all you
need in the “Certificate Assistant” submenu of the application menu.

Next, you will need the public key of your users. You can have them
send in a traditional CSR and sign that on the command line (use
openssl req to create the CSR, use openssl ca to sign it), or you
can have them submit an HTML form using the <keygen> tag (yes. that
exists. Read up on it on MDN
for example).

You absolutely never ever in your lifetime want the private key of
the user. Do not generate a keypair for the user. Have them generate a
key and a CSR, but never ever have them send the key to you. You only
need their CSR (which contains their public key, signed by their
private key) in order to sign their public key.

Ok. So let’s assume you got that out of the way. What you have now is
your CA’s certificate (usually self-signed) and a few users who now
own certificates you have signed for them.

Now let’s make use of this (I’m assuming you know reasonably well how
to configure these web servers in general. I’m only going into the
client certificate details).

nginx

For nginx, make sure you have enabled SSL using the usual steps. In
addition to these, set ssl_client_certificate
(docs)
to the path of your CA’s certificate. nginx will only accept client
certificates that have been signed by whatever ssl_client_certificate
you have configured.

Furthermore, set ssl_verify_client
(docs)
to on. Now only requests that provide a client certificate signed by
above CA will be allowed to access your server.

When doing so, nginx will set a few additional variables for you to
use, most notably $ssl_client_cert (the full certificate),
$ssl_client_s_dn (the subject name of the client certificate),
$ssl_client_serial (the serial number your CA has issued for their
certificate) and, most importantly, $ssl_client_verify which you should
check for SUCCESS.

Use fastcgi_param or proxy_set_header to pass these variables through to
your application (in the case of proxy_set_header make sure that it was
really nginx that set the header and not a client faking it).

I’ll talk about what you do with these variables a bit later on.

Apache

As with nginx, ensure that SSL is enabled. Then set
SSLCACertificateFile to the path to your CA’s certificate. Then set
SSLVerifyClient to require
(docs).

Apache will also set many variables for you to use in your application.
Most notably SSL_CLIENT_S_DN (the subject of the client
certificate) and SSL_CLIENT_M_SERIAL (the serial number your CA has
issued). The full certificate is in SSL_CLIENT_CERT.

node.js

If you want to handle the whole SSL stuff on your own, here’s an
example in node.js. When you call https.createServer
(docs),
pass in some options. One is requestCert which you would set to true.
The other is ca which you should set to an array of strings in PEM
format containing your CA’s certificate.

Then you can check whether the certificate check was successful by
looking at the client.authorized property of your request object.

If you want to get more info about the certificate, use
request.connection.getPeerCertificate().

what now?

Once you have the information about the client certificate (via
fastcgi, reverse proxy headers or apache variables in your module),
then the question is what you are going to do with that information.

Generally, you’d probably couple the certificate’s subject and its
serial number with some user account and then use the subject and
serial as a key to look up the user data.

As people get new certificates issued (because they might expire), the
subject name will stay the same, but the serial number will change, so
depending on your use-case use one or both.
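
In practice this boils down to a single lookup keyed on whatever your web
server handed you. A small sketch; the header names, the exact DN format
(nginx can emit either the old “/C=…/CN=…” form or RFC 2253 style) and the
user table are all assumptions:

# Hypothetical view code; assumes nginx passes the SSL variables along as
# request headers (e.g. proxy_set_header X-SSL-Verify $ssl_client_verify;).

USERS = {
    # (subject DN, serial) as sent by the web server -> your user record.
    # Drop the serial from the key if re-issued certificates should keep working.
    ("CN=alice,O=Example Corp", "1000"): {"id": 1, "name": "Alice"},
}

def authenticate(headers):
    """Return the user record for a verified client certificate, else None."""
    if headers.get("X-SSL-Verify") != "SUCCESS":
        return None               # nginx did not verify a client certificate
    key = (headers.get("X-SSL-Subject-DN", ""), headers.get("X-SSL-Serial", ""))
    return USERS.get(key)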

There are a couple of things to keep in mind though:

  • Due to a flaw in the SSL protocol which was discovered in 2009,
    you cannot safely have only parts of your site require a certificate.
    With most client libraries, this is an all-or-nothing deal. There is
    a secure renegotiation, but I don’t think it’s widely supported at
    the moment.
  • There is no notion of signing out. The clients have to present their
    certificate, so your clients will always be signed on (which might
    be a good thing for your use-case)
  • The UI in traditional browsers to handle this kind of thing is
    absolutely horrendous.
    I would recommend using this only for APIs or with managed devices
    where the client certificate can be preinstalled silently.

You do however gain a very good method for uniquely identifying
connecting clients without a lot of additional protocol overhead. The
SSL negotiation isn’t much different whether the client is presenting a
certificate or not. There’s no additional application level code
needed. Your web server can do everything that’s needed.

Also, there’s no need for you to store any sensitive information. No
more leaked passwords, no more fear of leaking passwords. You just
store whatever information you need from the certificate and make sure
they are properly signed by your CA.

As long as you don’t lose your CA’s private key, you can absolutely
trust your clients, and no matter how much data attackers get when they
break into your web server, they won’t get passwords, nor the ability
to log in as any user.

Conversely though, make sure that you keep your CA private key
absolutely safe. Once you lose it, you will have to invalidate all
client certificates and your users will have to go through the process
of generating new CSRs, sending them to you and so on. Terribly
inconvenient.

In the same vein: Don’t have your CA certificate expire too soon. If it
does expire, you’ll have the same issue at hand as if you lost your
private key. Very annoying. I learned that the hard way back in
2001ish and that was only for internal use.

If you need to revoke a user’s access, either blacklist their serial
number in your application or, much better, set up a proper CRL for
your certificate authority and have your web server check that.

So. Client certificates can be a useful tool in some situations. It’s
your job to know when, but at least now you have some hints to get you
going.

Me personally, I was using this once around 2009ish for a REST
API, but I have since replaced that with OAuth because that’s what most
of the users knew best (read: “at all”). Depending on the audience,
client certificates might be totally foreign to them.

But if it works for you, perfect.

How I back up gmail

There was a discussion on HackerNews about Gmail having lost the email in some accounts. One sentiment in the comments was clear:

It’s totally the user’s problem if they don’t back up their cloud-based email.

Personally, I think I would have to agree:

Google is a provider like every other ISP or basically any other service, too. There’s no reason to believe that your data is safer on Google than it is anywhere else. Now granted, they are not exactly known for losing data, but there are other things that can happen.

Like your account being closed because whatever automated system believed your usage patterns were consistent with those of a spammer.

So the question is: What would happen if your Google account wasn’t reachable at some point in the future?

For my company (using commercial Google Apps accounts), I would start up that IMAP server which serves all mail ever sent to and from Gmail. People would use the already existing webmail client or their traditional IMAP clients. They would lose some productivity, but no single byte of data.

This was my condition for migrating email over to Google: I needed to have a backup copy of that data. Otherwise, I would not have agreed to switch to a cloud-based provider.

The process is completely automated too. There’s not even a backup script running somewhere. Heck, not even the Google Account passwords have to be stored anywhere for this to work.

So. How does it work then?

Before you read on, here are the drawbacks of the solution:

  • I’m a die-hard Exim fan (long story. It served me very well once – up to saving-my-ass level of well), so the configuration I’m outlining here is for Exim as the mail relay.
  • Also, this only works with paid Google accounts. You can get somewhere using the free ones, but you don’t get the full solution (i.e. having a backup of all sent email)
  • This requires you to have full control over the MX machine(s) of your domain.

If you can live with this, here’s how you do it:

First, you set up your Google domain as normal. Add all the users you want and do everything else just as you would do it in a traditional set up.

Next, we’ll have to configure Google Mail for two-legged OAuth access to our accounts. I’ve written about this before. We are doing this so we don’t need to know our users passwords. Also, we need to enable the provisioning API to get access to the list of users and groups.

Next, our mail relay will have to know about what users (and groups) are listed in our Google account. Here’s what I quickly hacked together in Python (my first Python script ever – be polite while flaming) using the GData library:

import gdata.apps.service

consumer_key = 'yourdomain.com'
consumer_secret = '2-legged-consumer-secret' #see above
sig_method = gdata.auth.OAuthSignatureMethod.HMAC_SHA1

service = gdata.apps.service.AppsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key,
  consumer_secret=consumer_secret, two_legged_oauth=True)

res = service.RetrieveAllUsers()
for entry in res.entry:
    print entry.login.user_name

import gdata.apps.groups.service

service = gdata.apps.groups.service.GroupsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key,
  consumer_secret=consumer_secret, two_legged_oauth=True)
res = service.RetrieveAllGroups()
for entry in res:
    print entry['groupName']

Place this script somewhere on your mail relay and run it in a cron job. In my case, I’m having its output redirected to /etc/exim4/gmail_accounts. The script will emit one user (and group) name per line.

Next, we’ll deal with incoming email:

In the Exim configuration of your mail relay, add the following routers:

yourdomain_gmail_users:
  driver = accept
  domains = yourdomain.com
  local_parts = lsearch;/etc/exim4/gmail_accounts
  transport_home_directory = /var/mail/yourdomain/${lc:$local_part}
  router_home_directory = /var/mail/yourdomain/${lc:$local_part}
  transport = gmail_local_delivery
  unseen

yourdomain_gmail_remote:
  driver = accept
  domains = yourdomain.com
  local_parts = lsearch;/etc/exim4/gmail_accounts
  transport = gmail_t

yourdomain_gmail_users is what creates the local copy. It accepts all mail sent to yourdomain.com, if the local part (the stuff in front of the @) is listed in that gmail_accounts file. Then it sets up some paths for the local transport (see below) and marks the mail as unseen so the next router gets a chance too.

Which is yourdomain_gmail_remote. This one is again checking domain and the local part and if they match, it’s just delegating to the gmail_t remote transport (which will then send the email to Google).

The transports look like this:

gmail_t:
  driver = smtp
  hosts = aspmx.l.google.com:alt1.aspmx.l.google.com:
    alt2.aspmx.l.google.com:aspmx5.googlemail.com:
    aspmx2.googlemail.com:aspmx3.googlemail.com:
    aspmx4.googlemail.com
  gethostbyname

gmail_local_delivery:
  driver = appendfile
  check_string =
  delivery_date_add
  envelope_to_add
  group=mail
  maildir_format
  directory = MAILDIR/yourdomain/${lc:$local_part}
  maildir_tag = ,S=$message_size
  message_prefix =
  message_suffix =
  return_path_add
  user = Debian-exim
  create_file = anywhere
  create_directory

The gmail_t transport is simple. For the local one, you might have to adjust the user and group plus the location where you want to write the mail to.

Now we are ready to reconfigure Google as this is all that’s needed to get a copy of every inbound mail into a local maildir on the mail relay.

Here’s what you do:

  • You change the MX of your domain to point to this relay of yours

The next two steps are the reason you need a paid account: These controls are not available for the free accounts:

  • In your Google Administration panel, you visit the Email settings and configure the outbound gateway. Set it to your relay.
  • Then you configure your inbound gateway and set it to your relay too (and to your backup MX if you have one).

This screenshot will help you:

gmail config

All email sent to your MX will now be relayed (over the gmail_t transport we have configured above) and accepted by Gmail.

Also, Gmail will now send all outgoing Email to your relay which needs to be configured to accept (and relay) email from Google. This pretty much depends on your otherwise existing Exim configuration, but here’s what I added (which will work with the default ACL):

hostlist   google_relays = 216.239.32.0/19:64.233.160.0/19:66.249.80.0/20:
    72.14.192.0/18:209.85.128.0/17:66.102.0.0/20:
    74.125.0.0/16:64.18.0.0/20:207.126.144.0/20
hostlist   relay_from_hosts = 127.0.0.1:+google_relays

And lastly, the tricky part: Storing a copy of all mail that is being sent through Gmail (we are already correctly sending the mail. What we want is a copy):

Here is the exim router we need:

gmail_outgoing:
  driver = accept
  condition = "${if and{
    { eq{$sender_address_domain}{yourdomain.com} }
    {=={${lookup{$sender_address_local_part}lsearch{/etc/exim4/gmail_accounts}{1}}}{1}}} {1}{0}}"
  transport = store_outgoing_copy
  unseen

(did I mention that I severely dislike RPN?)

and here’s the transport:

store_outgoing_copy:
  driver = appendfile
  check_string =
  delivery_date_add
  envelope_to_add
  group=mail
  maildir_format
  directory = MAILDIR/yourdomain/${lc:$sender_address_local_part}/.Sent/
  maildir_tag = ,S=$message_size
  message_prefix =
  message_suffix =
  return_path_add
  user = Debian-exim
  create_file = anywhere
  create_directory

The maildir I’ve chosen is the correct one if the IMAP-server you want to use is Courier IMAPd. Other servers use different methods.

One little thing: When you CC or BCC other people in your domain, Google will send out multiple copies of the same message. This will yield some message duplication in the sent directory (one per recipient), but as they say: Better to back up too much than too little.

Now if something happens to your google account, just start up an IMAP server and have it serve mail from these maildir directories.

And remember to back them up too, but you can just use rsync or rsnapshot or whatever other technology you might have in use. They are just directories containing one file per email.

Find relation sizes in PostgreSQL

Like so many times before, today I was yet again in the situation where I wanted to know which tables/indexes take the most disk space in a particular PostgreSQL database.

My usual procedure in this case was to run \dt+ in psql and scan the sizes by eye (this being on my development machine, trying to find out the biggest tables I could clean out to make room).

But once you’ve done that a few times, and considering that \dt+ does nothing but query some PostgreSQL internal tables, I thought that I wanted this solved in an easier way that is also less error-prone. In the end, I just wanted the output of \dt+ sorted by size.

That led to some digging in the source code of psql itself (src/bin/psql) where I quickly found the function that builds the query (listTables in describe.c), so from now on, this is what I’m using when I need an overview of all relation sizes ordered by size in descending order:

select
  n.nspname as "Schema",
  c.relname as "Name",
  case c.relkind
     when 'r' then 'table'
     when 'v' then 'view'
     when 'i' then 'index'
     when 'S' then 'sequence'
     when 's' then 'special'
  end as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
  pg_catalog.pg_size_pretty(pg_catalog.pg_relation_size(c.oid)) as "Size"
from pg_catalog.pg_class c
 left join pg_catalog.pg_namespace n on n.oid = c.relnamespace
where c.relkind IN ('r', 'v', 'i')
order by pg_catalog.pg_relation_size(c.oid) desc;

Of course I could have come up with this without source code digging, but honestly, I didn’t know about the relkind values, about pg_size_pretty or pg_relation_size (I would have thought that one to be stored in some system view), so figuring all of this out would have taken much more time than just reading the source code.

Now it’s here so I remember it next time I need it.

Google Apps – Provisioning – Two-Legged OAuth

Our company uses Google Apps premium for Email and shared documents, but in order to have more freedom in email aliases, in order to have more control over email routing and finally, because there are a couple of local parts we use to direct mail to some applications, all our mail, even though it’s created in Google Apps and finally ends up in Google Apps, goes via a central mail relay we are running ourselves (well. I’m running it).

Google Apps premium allows you to do that and it’s a really cool feature.

One additional thing I’m doing on that central relay is to keep a backup of all mail that comes from Google or goes to Google. The reason: While I trust them not to lose my data, there are stories around of people losing their accounts to Google’s anti-spam automatisms. This is especially bad as there usually is nobody to appeal to.

So I deemed it imperative that we store a backup of every message so we can move away from google if the need to do so arises.

Of course that means, though, that our relay needs to know which local parts are valid for the Google Apps domain – after all, I don’t want to store mail that would later be bounced by Google. And I’d love to bounce directly without relaying the mail unconditionally, so that’s another reason why I’d want to know the list of users.

Google provides their provisioning API to do that and using the GData python packages, you can easily access that data. In theory.

Up until very recently, the big problem was that the provisioning API didn’t support OAuth. That meant that my little script that retrieves the local parts had to have the password of an administrator, which is something that really bugged me as it meant that either I store my password in the script or I can’t run the script from cron.

With the Google Apps Marketplace, they fixed that somewhat, but it still requires a strange dance:

When you visit the OAuth client configuration (https://www.google.com/a/cpanel/YOURDOMAIN/ManageOauthClients), it lists your domain with the note “This client has access to all APIs.”

This is totally not true though as Google’s definition of “all” apparently doesn’t include “Provisioning” :-)

To make two-legged OAuth work for the provisioning API, you have to explicitly list the feeds. In my case, this was Users and Groups:

Under “Client Name”, add your domain again (“example.com”) and under “One or More API Scopes”, add the two feeds like this: “https://apps-apis.google.com/a/feeds/group/#readonly,https://apps-apis.google.com/a/feeds/user/#readonly”

This will enable two-legged OAuth access to the user and group lists which is what I need in my little script:

import gdata.apps.service
import gdata.apps.groups.service

consumer_key = 'YOUR.DOMAIN'
consumer_secret = 'secret' #check advanced / OAuth in your control panel
sig_method = gdata.auth.OAuthSignatureMethod.HMAC_SHA1

service = gdata.apps.service.AppsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key, consumer_secret=consumer_secret, two_legged_oauth=True)

res = service.RetrieveAllUsers()
for entry in res.entry:
    print entry.login.user_name

service = gdata.apps.groups.service.GroupsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key, consumer_secret=consumer_secret, two_legged_oauth=True)
res = service.RetrieveAllGroups()
for entry in res:
    print entry['groupName']