How I back up Gmail

There was a discussion on Hacker News about Gmail having lost the email of some accounts. One sentiment in the comments was clear:

It’s totally the user’s problem if they don’t back up their cloud-based email.

Personally, I think I would have to agree:

Google is a provider like any other ISP or, really, any other service. There’s no reason to believe that your data is safer on Google than it is anywhere else. Granted, they are not exactly known for losing data, but there are other things that can happen.

Like your account being closed because whatever automated system believed your usage patterns were consistent with those of a spammer.

So the question is: What would happen if your Google account wasn’t reachable at some point in the future?

For my company (using commercial Google Apps accounts), I would start up that IMAP server which serves all mail ever sent to and received by our Gmail accounts. People would use the already existing webmail client or their traditional IMAP clients. They would lose some productivity, but not a single byte of data.

This was my condition for migrating email over to Google: I needed to have a backup copy of that data. Otherwise, I would not have agreed to switch to a cloud-based provider.

The process is completely automated too. There’s not even a backup script running somewhere. Heck, not even the Google Account passwords have to be stored anywhere for this to work.

So. How does it work then?

Before you read on, here are the drawbacks of the solution:

  • I’m a die-hard Exim fan (long story. It served me very well once – up to saving-my-ass level of well), so the configuration I’m outlining here is for Exim as the mail relay.
  • Also, this only works with paid Google accounts. You can get somewhere using the free ones, but you don’t get the full solution (i.e. having a backup of all sent email).
  • This requires you to have full control over the MX machine(s) of your domain.

If you can live with this, here’s how you do it:

First, you set up your Google domain as normal. Add all the users you want and do everything else just as you would in a traditional setup.

Next, we’ll have to configure Google Mail for two-legged OAuth access to our accounts. I’ve written about this before. We are doing this so we don’t need to know our users’ passwords. Also, we need to enable the provisioning API to get access to the list of users and groups.

Next, our mail relay has to know which users (and groups) exist in our Google account. Here’s what I quickly hacked together in Python (my first Python script ever – be polite while flaming) using the GData library:

import gdata.apps.service
import gdata.apps.groups.service
import gdata.auth

consumer_key = 'yourdomain.com'
consumer_secret = '2-legged-consumer-secret' # see above
sig_method = gdata.auth.OAuthSignatureMethod.HMAC_SHA1

# print one user login name per line
service = gdata.apps.service.AppsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key,
  consumer_secret=consumer_secret, two_legged_oauth=True)

res = service.RetrieveAllUsers()
for entry in res.entry:
    print entry.login.user_name

# and one group name per line
service = gdata.apps.groups.service.GroupsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key,
  consumer_secret=consumer_secret, two_legged_oauth=True)

res = service.RetrieveAllGroups()
for entry in res:
    print entry['groupName']

Place this script somewhere on your mail relay and run it from a cron job. In my case, I redirect its output to /etc/exim4/gmail_accounts. The script emits one user (and group) name per line.
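
A cron entry for this could look something like the following (just a sketch: the script path and the hourly schedule are assumptions, and writing to a temporary file first avoids handing Exim a half-written list):

# /etc/cron.d/gmail-accounts (hypothetical path and schedule)
# refresh the account list once per hour
0 * * * * root /usr/local/bin/gmail_accounts.py > /etc/exim4/gmail_accounts.tmp && mv /etc/exim4/gmail_accounts.tmp /etc/exim4/gmail_accounts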

Next, we’ll deal with incoming email:

In the Exim configuration of your mail relay, add the following routers:

yourdomain_gmail_users:
  driver = accept
  domains = yourdomain.com
  local_parts = lsearch;/etc/exim4/gmail_accounts
  transport_home_directory = /var/mail/yourdomain/${lc:$local_part}
  router_home_directory = /var/mail/yourdomain/${lc:$local_part}
  transport = gmail_local_delivery
  unseen

yourdomain_gmail_remote:
  driver = accept
  domains = yourdomain.com
  local_parts = lsearch;/etc/exim4/gmail_accounts
  transport = gmail_t

yourdomain_gmail_users is what creates the local copy. It accepts all mail sent to yourdomain.com if the local part (the stuff in front of the @) is listed in the gmail_accounts file, sets up some paths for the local transport (see below), and marks the mail as unseen so that the next router gets a chance too.

That next router is yourdomain_gmail_remote. It checks the domain and the local part again and, if they match, simply hands the message to the gmail_t remote transport (which then sends the email on to Google).

The transports look like this:

gmail_t:
  driver = smtp
  hosts = aspmx.l.google.com:alt1.aspmx.l.google.com:
    alt2.aspmx.l.google.com:aspmx5.googlemail.com:
    aspmx2.googlemail.com:aspmx3.googlemail.com:
    aspmx4.googlemail.com
  gethostbyname

gmail_local_delivery:
  driver = appendfile
  check_string =
  delivery_date_add
  envelope_to_add
  group=mail
  maildir_format
  directory = MAILDIR/yourdomain/${lc:$local_part}
  maildir_tag = ,S=$message_size
  message_prefix =
  message_suffix =
  return_path_add
  user = Debian-exim
  create_file = anywhere
  create_directory

The gmail_t transport is simple. For the local one, you might have to adjust the user and group settings, plus the location where you want the mail to be written.

This is all that’s needed to get a copy of every inbound mail into a local maildir on the mail relay, so now we are ready to reconfigure Google.

Here’s what you do:

  • You change the MX of your domain to point to this relay of yours

The next two steps are the reason you need a paid account: These controls are not available for the free accounts:

  • In your Google Administration panel, you visit the Email settings and configure the outbound gateway. Set it to your relay.
  • Then you configure your inbound gateway and set it to your relay too (and to your backup MX if you have one).

This screenshot will help you:

[Screenshot: gmail config]

All email sent to your MX (and forwarded to Google over the gmail_t transport we configured above) will now be accepted by Gmail.

Also, Gmail will now send all outgoing email to your relay, which needs to be configured to accept (and relay) email from Google. This largely depends on your existing Exim configuration, but here’s what I added (which will work with the default ACL):

hostlist   google_relays = 216.239.32.0/19:64.233.160.0/19:66.249.80.0/20:
    72.14.192.0/18:209.85.128.0/17:66.102.0.0/20:
    74.125.0.0/16:64.18.0.0/20:207.126.144.0/20
hostlist   relay_from_hosts = 127.0.0.1:+google_relays
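
For reference, the stock Debian ACL then permits relaying for anything in relay_from_hosts with a statement roughly like this (paraphrased from the default acl_check_rcpt; your configuration may differ):

accept
  hosts = +relay_from_hosts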

And lastly, the tricky part: storing a copy of all mail that is sent out through Gmail. (We are already sending the mail correctly; what we want is an additional local copy.)

Here is the Exim router we need:

gmail_outgoing:
  driver = accept
  condition = "${if and{
    { eq{$sender_address_domain}{yourdomain.com} }
    {=={${lookup{$sender_address_local_part}lsearch{/etc/exim4/gmail_accounts}{1}}}{1}}} {1}{0}}"
  transport = store_outgoing_copy
  unseen

(did I mention that I severely dislike this kind of prefix notation?)

and here’s the transport:

store_outgoing_copy:
  driver = appendfile
  check_string =
  delivery_date_add
  envelope_to_add
  group=mail
  maildir_format
  directory = MAILDIR/yourdomain/${lc:$sender_address_local_part}/.Sent/
  maildir_tag = ,S=$message_size
  message_prefix =
  message_suffix =
  return_path_add
  user = Debian-exim
  create_file = anywhere
  create_directory

The maildir layout I’ve chosen (with the copy going into a .Sent subfolder) is the correct one if the IMAP server you want to use is Courier IMAPd. Other servers use different conventions.

One little thing: when you CC or BCC other people in your domain, Google will send out multiple copies of the same message. This yields some duplication in the Sent directory (one copy per recipient), but, as they say: better to back up too much than too little.

Now, if something happens to your Google account, just start up an IMAP server and have it serve mail from these maildir directories.
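
If Dovecot happens to be what you have at hand rather than Courier, a minimal sketch of the one relevant setting would look like this (authentication setup omitted; %n is the user name part of the login):

# /etc/dovecot/dovecot.conf (sketch)
# serve the maildirs written by the Exim transports above
mail_location = maildir:/var/mail/yourdomain/%n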

And remember to back up those maildirs too; rsync, rsnapshot or whatever other tool you already have in place will do, since they are just directories containing one file per email.
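
A sketch of what that could look like (host name and target path are placeholders):

# mirror the maildirs to a backup host; -a preserves attributes,
# --delete removes files on the target that no longer exist locally
rsync -a --delete /var/mail/yourdomain/ backuphost:/srv/backup/gmail-maildirs/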

Find relation sizes in PostgreSQL

Like so many times before, today I was yet again in the situation where I wanted to know which tables/indexes take the most disk space in a particular PostgreSQL database.

My usual procedure in this case was to run \dt+ in psql and scan the sizes by eye (this being my development machine, I was trying to find the biggest tables I could clean out to make room).

But once you’ve done that a few times, and considering that \dt+ does nothing but query some PostgreSQL internal tables, I wanted this solved in a way that is easier and less error-prone. In the end, I just wanted the output of \dt+ sorted by size.

This led to some digging in the source code of psql itself (src/bin/psql), where I quickly found the function that builds the query (listTables in describe.c). So from now on, this is what I’m using when I need an overview of all relation sizes, in descending order:

select
  n.nspname as "Schema",
  c.relname as "Name",
  case c.relkind
     when 'r' then 'table'
     when 'v' then 'view'
     when 'i' then 'index'
     when 'S' then 'sequence'
     when 's' then 'special'
  end as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
  pg_catalog.pg_size_pretty(pg_catalog.pg_relation_size(c.oid)) as "Size"
from pg_catalog.pg_class c
 left join pg_catalog.pg_namespace n on n.oid = c.relnamespace
where c.relkind IN ('r', 'v', 'i')
order by pg_catalog.pg_relation_size(c.oid) desc;

Of course I could have come up with this without digging through the source, but honestly, I didn’t know about relkind, about pg_size_pretty or about pg_relation_size (I would have thought the size to be stored in some system view), so figuring all of this out would have taken much more time than just reading the source code.
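
And when I only care about a single relation, the same functions can be called directly (the table name is a placeholder):

select pg_catalog.pg_size_pretty(pg_catalog.pg_relation_size('your_table'));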

Now it’s here so I remember it next time I need it.

Windows 2008 / NAT / Direct connections

Yesterday I ran into an interesting problem with Windows 2008’s implementation of NAT (don’t ask – this was the best solution – I certainly don’t recommend using Windows for this purpose).

Whenever I enabled the NAT service, I was unable to reliably connect to the machine via remote desktop or, indeed, any other service the machine was offering. Packets sent to the machine were dropped as if a firewall were in between, but there wasn’t one, and the Windows firewall was configured to allow remote desktop connections.

Strangely, sometimes and from some hosts I was able to make a connection, but not consistently.

After some digging, this turned out to be a problem with the interface metrics: the server tried to respond over the interface with the private address, which wasn’t routed.

So if you are in the same boat, configure the interface metrics of both interfaces manually: set the metric of the private interface to a high value and the metric of the public (routed) one to a low value.
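
If you prefer the command line over the adapter’s Advanced TCP/IP settings dialog, something along these lines should do it (a sketch – the interface names are placeholders for whatever your connections are called):

rem private interface gets a high (= bad) metric, public one a low (= good) metric
netsh interface ipv4 set interface "Private LAN" metric=100
netsh interface ipv4 set interface "Public WAN" metric=5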

At least for me, this instantly fixed the problem.

digg bar controversy

Update: I actually wrote this post yesterday and scheduled it for publication today. In the meantime, digg has found an even better solution and only shows the bar to logged-in users. Still, a solution like the one described here would let the link point to the right location regardless of the state of the digg bar settings.

Recently, digg.com added a controversial feature, the digg bar, which basically frames every posted link in a little IFRAME.

Webmasters were rightfully concerned about this, and quite quickly we had the usual religious war going on between the people who find the bar quite useful and the webmasters who hate it for the lost PageRank, the even weaker recognition of their site and the presumed affiliation with digg.

Ideas cropped up over the weekend, but they turned out not to be so terribly good.

Basically it all boils down to digg.com screwing up on this, IMHO.

I know that they let you turn off that dreaded digg bar, but all the links on their page still point to their own short URL. Only then is the decision made whether to show the bar or not.

This means that all links on digg currently just point to digg itself, giving the linked page nothing but traffic it doesn’t necessarily want. Digg traffic isn’t worth much in terms of returning users: you get dugg, you melt your servers, and then you go back to being unknown.

So you would probably appreciate the higher PageRank you get from being linked to by digg, as that leads to increased search engine traffic, which is generally worth much more.

The solution on digg’s part could be simple: keep the original site URL in the href of their links, but use some JS magic to still open the digg bar. That way they get to keep their foot in the user’s path away from the site, but search engines will do the right thing and follow the links to their actual target, giving the webmasters their PageRank back.

How to do this?

Here are a few lines of jQuery that do just that. The assumption is that the link’s href points at the original article, while its id attribute holds the digg short URL, which is used to open the digg bar when the link is clicked. Search engines still see (and follow) the real target:

$(function(){
  $('div#link_container a').click(function(){
    // swap the href for the digg short URL right before the browser
    // follows the link; the markup itself keeps pointing at the article
    $(this).attr('href', 'http://digg.com/' + this.id);
  });
});
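
The markup this assumes would look something like the following (the id and URLs are made up for illustration):

<div id="link_container">
  <!-- href points at the real article, id holds the digg short URL part -->
  <a id="d1pxVr" href="http://example.com/some-article">Some article</a>
</div>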

Piece of cake.

No further changes needed, and all the webmasters will be so much happier, while digg gets to keep all the advantages (it may actually help digg increase its own PageRank, as I could imagine that a site with a lot of links pointing to different places ranks higher than one without any external links).

Webmasters could then still do their usual parent.location.href trickery to break out of the digg bar if they want to, but they would retain their PageRank either way.

No need to add further complexity to the web’s standards just because one site decides not to play well.

Playing Worms Armageddon on a Mac

Last weekend, I had a real blast with the Xbox 360 Arcade version of Worms. Even after so many years, this game still rules them all, especially (if not only) in multiplayer mode.

The only drawback of the 360 version is the lack of weapons.

While the provided set is all well and good, the game is just not the same without the Super Banana Bomb or the Super Sheep.

[Screenshot: Worms Armageddon]

So this is why I dug out my old Worms Armageddon CD and tried to get it to work on today’s hardware.

Making it work under plain Vista was easy enough (get the latest beta patch for Armageddon, by the way):

Right-click the icon, select the Compatibility tab, choose Windows XP, disable themes and desktop composition, and run the game with administrative privileges.

You may get away with skipping one option or the other, but this combination worked consistently.

To be really useful though, I wanted to make the game run under OS X, as this is my main environment and I really dislike going through the lengthy boot process that is Boot Camp.

I tried the various virtualization solutions around – something that should work, seeing that the game doesn’t really need much in terms of hardware support.

But unfortunately, this was way harder than anticipated:

  • The initial try was done using VMware Fusion, which looked very good at first but failed miserably later on: while I was able to launch (and actually use) the game’s frontend, the actual game was a flickery mess with no known workaround.
  • Parallels failed by displaying a black menu. It was still clickable, but there was nothing on the screen except blackness and a white square border. Googling around a bit led to the idea of setting SlowFrontendWorkaround in the registry to 0, which actually made the launcher work, but the game itself crashed consistently without an error message.

In the end, I achieved success using VirtualBox. The SlowFrontendWorkaround is still needed to make the launcher work, and the mouse integration of the VirtualBox guest tools needs to be disabled via the Machine menu (the game still runs with it enabled, but you won’t be able to control the mouse pointer consistently). After that, the game runs flawlessly.

Flicker-free and with a decent frame rate. And with sound, of course.

To enable the workaround I talked about, use this .reg file.

Now the slaughter of worms can begin :-)

New MacMini (early 09) and Linux

The new Mac Minis that were announced this week come with a FireWire 800 port, which was reason enough for me to upgrade shion yet again (keeping the host name, of course).

All the media she serves to my various systems is stored on a second-generation Drobo, which is currently connected via USB 2 but also has a so far unused FW800 port.

Of course the upgrade to FW800 will not double the transfer rate to and from the Drobo, but it should increase it significantly, so I went ahead and got one of the new Minis.

As usual, I inserted the Ubuntu (Intrepid) CD, held C while turning the machine on, and completed the installation.

This left the Mini in an unbootable state.

It seems that this newest generation of Mac hardware isn’t capable of booting from an MBR-partitioned hard drive. Earlier Macs complained a bit if the drive wasn’t partitioned the way they liked, but then went ahead and booted the other OS anyway.

Not so with the new boxes, it seems.

To finally achieve what I wanted, I had to go through the following, somewhat complicated procedure:

  1. Install rEFIt (just download the package and install the .mpkg file)
  2. Use the Boot Camp Assistant to repartition the drive.
  3. Reboot with the Ubuntu Desktop CD and run parted (the partitioning could probably be done with the console installer, but I didn’t manage to get it right).
  4. Resize the FAT32 partition created by the Boot Camp partitioner to make room at the end for the swap partition.
  5. Create the swap partition.
  6. Format the FAT32 partition with something useful (ext3) – see the sketch after this list.
  7. Restart and enter the rEFIt partitioner tool (it’s in the boot menu)
  8. Allow it to resync the MBR
  9. Insert the Ubuntu Server CD, reboot holding the C key
  10. Install Ubuntu normally, but don’t change the partition layout – just use the existing partitions.
  11. Reboot and repeat steps 7 and 8
  12. Start Linux.
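
The sketch mentioned in step 6, run from the Ubuntu Desktop CD (the device and partition numbers are assumptions – check parted’s output first):

sudo parted /dev/sda print   # see which numbers the partitions ended up with
sudo mkswap /dev/sda4        # initialise the swap partition created in step 5
sudo mkfs.ext3 /dev/sda3     # step 6: reformat the former FAT32 partition as ext3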

Additionally, you will have to keep using rEFIt, as the Startup Disk preference pane does not recognize the Linux partitions and so can’t boot from them.

Now to find out whether that stupid resistor is still needed to make the new mini boot headless.

Tunnel munin nodes over HTTP

Last time, I talked about Munin, the one system monitoring tool that I feel works well enough for me to actually bother with. Harsh words, I know, but the key to every solution is simplicity. And simple Munin is. Simple, but still powerful enough to do everything I would want it to do.

The one problem I had with it is that querying remote nodes happens over a custom TCP port (4949), which doesn’t get through firewalls.

There are some SSH tunneling solutions around, but what do you do if even SSH is not an option, because the remote access method provided to you relies on some kind of VPN technology or access token?

Even if you could keep a long-running VPN connection open, it would be a very resource-intensive solution, as it ties up resources on the VPN gateway. But the point is moot anyway, because nearly all VPNs terminate long-running connections, and if re-establishing the connection requires physical interaction, you are basically out of luck.

This is why I have created a neat little solution that tunnels the Munin traffic over HTTP. It works with a local proxy server that your Munin monitoring process connects to, and a little CGI script on the remote end.
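
On the monitoring host, the node definition in munin.conf then simply points at that local proxy instead of the remote machine (a sketch – the host name and proxy port are assumptions):

# /etc/munin/munin.conf on the monitoring host
[remote-server.example.com]
    address 127.0.0.1
    port 4950           # the local HTTP tunnel proxy listens here
    use_node_name no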

This causes multiple HTTP connections per query interval (the proxy uses keep-alive, though, so it’s not TCP connections we are talking about – just hits in the access.log that you’ll have to filter out somehow), because it’s impossible for a CGI script to keep the connection open and send data both ways – at least not if your server side is running plain PHP, which is the case in the setup I was designing this for.

Anyway – the solution works flawlessly and helps me monitor a server behind one hell of a firewall and behind a reverse proxy.

You’ll find the code here (on GitHub, as usual), and some explanation of how to use it is here.

Licensed under the MIT license as usual.