How I back up gmail

There was a discussion on HackerNews about Gmail having lost the email in some accounts. One sentiment in the comments was clear:

It’s totally the users problem if they don’t back up their cloud based email.

Personally, I think I would have to agree:

Google is a provider like every other ISP or basically any other service too. There’s no reason to believe that your data is more save on Google than it is any where else. Now granted, they are not exactly known for losing data, but there’s other things that can happen.

Like your account being closed because whatever automated system believed your usage patterns were consistent with those of a spammer.

So the question is: What would happen if your Google account wasn’t reachable at some point in the future?

For my company (using commercial Google Apps accounts), I would start up that IMAP server which serves all mail ever sent to and from Gmail. People would use the already existing webmail client or their traditional IMAP clients. They would lose some productivity, but no single byte of data.

This was my condition for migrating email over to Google. I needed to have a back up copy of that data. Otherwise, I would not have agreed to switch to a cloud based provider.

The process is completely automated too. There’s not even a backup script running somewhere. Heck, not even the Google Account passwords have to be stored anywhere for this to work.

So. How does it work then?

Before you read on, here are the drawbacks of the solution:

  • I’m a die-hard Exim fan (long story. It served me very well once – up to saving-my-ass level of well), so the configuration I’m outlining here is for Exim as the mail relay.
  • Also, this only works with paid Google accounts. You can get somewhere using the free ones, but you don’t get the full solution (i.e. having a backup of all sent email)
  • This requires you to have full control over the MX machine(s) of your domain.

If you can live with this, here’s how you do it:

First, you set up your Google domain as normal. Add all the users you want and do everything else just as you would do it in a traditional set up.

Next, we’ll have to configure Google Mail for two-legged OAuth access to our accounts. I’ve written about this before. We are doing this so we don’t need to know our users passwords. Also, we need to enable the provisioning API to get access to the list of users and groups.

Next, our mail relay will have to know about what users (and groups) are listed in our Google account. Here’s what I quickly hacked together in Python (my first Python script ever – be polite while flaming) using the GData library:

import gdata.apps.service

consumer_key = 'yourdomain.com'
consumer_secret = '2-legged-consumer-secret' #see above
sig_method = gdata.auth.OAuthSignatureMethod.HMAC_SHA1

service = gdata.apps.service.AppsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key,
  consumer_secret=consumer_secret, two_legged_oauth=True)

res = service.RetrieveAllUsers()
for entry in res.entry:
    print entry.login.user_name

import gdata.apps.groups.service

service = gdata.apps.groups.service.GroupsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key,
  consumer_secret=consumer_secret, two_legged_oauth=True)
res = service.RetrieveAllGroups()
for entry in res:
    print entry['groupName']

Place this script somewhere on your mail relay and run it in a cron job. In my case, I’m having its output redirected to /etc/exim4/gmail_accounts. The script will emit one user (and group) name per line.

Next, we’ll deal with incoming email:

In the Exim configuration of your mail relay, add the following routers:

yourdomain_gmail_users:
  driver = accept
  domains = yourdomain.com
  local_parts = lsearch;/etc/exim4/gmail_accounts
  transport_home_directory = /var/mail/yourdomain/${lc:$local_part}
  router_home_directory = /var/mail/yourdomain/${lc:$local_part}
  transport = gmail_local_delivery
  unseen

yourdomain_gmail_remote:
  driver = accept
  domains = yourdomain.com
  local_parts = lsearch;/etc/exim4/gmail_accounts
  transport = gmail_t

yourdomain_gmail_users is what creates the local copy. It accepts all mail sent to yourdomain.com, if the local part (the stuff in front of the @) is listed in that gmail_accounts file. Then it sets up some paths for the local transport (see below) and marks the mail as unseen so the next router gets a chance too.

Which is yourdomain_gmail_remote. This one is again checking domain and the local part and if they match, it’s just delegating to the gmail_t remote transport (which will then send the email to Google).

The transports look like this:

gmail_t:
  driver = smtp
  hosts = aspmx.l.google.com:alt1.aspmx.l.google.com:
    alt2.aspmx.l.google.com:aspmx5.googlemail.com:
    aspmx2.googlemail.com:aspmx3.googlemail.com:
    aspmx4.googlemail.com
  gethostbyname

gmail_local_delivery:
  driver = appendfile
  check_string =
  delivery_date_add
  envelope_to_add
  group=mail
  maildir_format
  directory = MAILDIR/yourdomain/${lc:$local_part}
  maildir_tag = ,S=$message_size
  message_prefix =
  message_suffix =
  return_path_add
  user = Debian-exim
  create_file = anywhere
  create_directory

the gmail_t transport is simple. The local one you might have to patch up users and groups plus the location where you what to write the mail to.

Now we are ready to reconfigure Google as this is all that’s needed to get a copy of every inbound mail into a local maildir on the mail relay.

Here’s what you do:

  • You change the MX of your domain to point to this relay of yours

The next two steps are the reason you need a paid account: These controls are not available for the free accounts:

  • In your Google Administration panel, you visit the Email settings and configure the outbound gateway. Set it to your relay.
  • Then you configure your inbound gateway and set it to your relay too (and to your backup MX if you have one).

This screenshot will help you:

gmail config

All email sent to your MX (over the gmail_t transport we have configured above) will now be accepted by gmail.

Also, Gmail will now send all outgoing Email to your relay which needs to be configured to accept (and relay) email from Google. This pretty much depends on your otherwise existing Exim configuration, but here’s what I added (which will work with the default ACL):

hostlist   google_relays = 216.239.32.0/19:64.233.160.0/19:66.249.80.0/20:
    72.14.192.0/18:209.85.128.0/17:66.102.0.0/20:
    74.125.0.0/16:64.18.0.0/20:207.126.144.0/20
hostlist   relay_from_hosts = 127.0.0.1:+google_relays

And lastly, the tricky part: Storing a copy of all mail that is being sent through Gmail (we are already correctly sending the mail. What we want is a copy):

Here is the exim router we need:

gmail_outgoing:
  driver = accept
  condition = "${if and{
    { eq{$sender_address_domain}{yourdomain.com} }
    {=={${lookup{$sender_address_local_part}lsearch{/etc/exim4/gmail_accounts}{1}}}{1}}} {1}{0}}"
  transport = store_outgoing_copy
  unseen

(did I mention that I severely dislike RPN?)

and here’s the transport:

store_outgoing_copy:
  driver = appendfile
  check_string =
  delivery_date_add
  envelope_to_add
  group=mail
  maildir_format
  directory = MAILDIR/yourdomain/${lc:$sender_address_local_part}/.Sent/
  maildir_tag = ,S=$message_size
  message_prefix =
  message_suffix =
  return_path_add
  user = Debian-exim
  create_file = anywhere
  create_directory

The maildir I’ve chosen is the correct one if the IMAP-server you want to use is Courier IMAPd. Other servers use different methods.

One little thing: When you CC or BCC other people in your domain, Google will send out multiple copies of the same message. This will yield some message duplication in the sent directory (one per recipient), but as they say: Better backup too much than too little.

Now if something happens to your google account, just start up an IMAP server and have it serve mail from these maildir directories.

And remember to back them up too, but you can just use rsync or rsnapshot or whatever other technology you might have in use. They are just directories containing one file per email.

Do spammers find pleasure in destroying fun stuff?

Recently, while reading through the log file of the mail relay used by tempalias, I noticed a disturbing trend: Apparently, SPAM was being sent through tempalias.

I’ve seen various behaviours. One was to strangely create an alias per second to the same target and then delivering email there.

While I completely fail to understand this scheme, the other one was even more disturbing: Bots were registering {max-usage: 1, days: null} aliases and then sending one mail to them – probably to get around RBL checks they’d hit when sending SPAM directly.

Aside of the fact that I do not want to be helping spammers, this also posed a technical issue: node.js head which I was running back when I developed the service tended to leak memory at times forcing me to restart the service here and then.

Now the additional huge load created by the bots forced me to do that way more often than I wanted to. Of course, the old code didn’t run on current node any more.

Hence I had to take tempalias down for maintenance.

A quick look at my commits on GitHub will show you what I have done:

  • the tempalias SMTP daemon now does RBL checks and immediately disconnects if the connected host is listed.
  • the tempalias HTTP daemon also does RBL checks on alias creation, but it doesn’t check the various DUL lists as the most likely alias creators are most certainly listed in a DUL
  • Per IP, aliases can only be generated every 30 seconds.

This should be some help. In addition, right now, the mail relay is configured to skip sender-checks and sa-exim scans (Spam Assassin on SMTP time as to reject spam before even accepting it into the system) for hosts where relaying is allowed. I intend to change that so that sa-exim and sender verify is done regardless if the connecting host is the tempalias proxy.

Looking at the mail log, I’ve seen the spam count drop to near-zero, so I’m happy, but I know that this is just a temporary victory. Spammers will find ways around the current protection and I’ll have to think of something else (I do have some options, but I don’t want to pre-announce them here for obvious reasons).

On a more happy note: During maintenance I also fixed a few issues with the Bookmarklet which should now do a better job at not coloring all text fields green eventually and at using the target site’s jQuery if available.

tempalias.com – config file, SMTP cleanup, beginnings of a server

Welcome to the next installment of a series of blog posts about the creation of a new web service in node.js. The posts serve as a diary of how the development of the service proceeds and should give you some insight in how working with node.js feels right now. You can read the previous episode here.

Yesterday, I unfortunately didn’t have a lot of time to commit to the project, so I chose a really small task to complete: create a configuration file, a configuration file parser and use both to actually configure how the application should behave.

The task in general was made a lot easier by the fact that (current) node contains a really simple parser for INI style configuration files. For the simple type of configuration data I have to handle, the INI format felt perfect and as I got a free parser with node itself, that’s what I went with. So as of monday it’s possible to configure listening addresses and ports for both HTTP and SMTP daemons and additional settings for the SMTP part of the service.

Today I had more time.

The idea was to seriously look into the SMTP transmission. The general idea is that email sent to the tempalias.com domain will have to end up on the node server where the alias expansion is done and the email is prepared for final delivery.

While I strive to keep the service as self-contained as possible, I opted into forcing a smarthost to be present to do the actual mail delivery.

You see, mail delivery is a complicated task in general as you must deliver the mail and if you can’t, you have to notify the sender. Reasons for a failing delivery can be permanent (that’s easy – you just tell the sending server that there was a problem and you are done) or temporary. In case of many temporary errors you end up with the responsibility of needing to handle them.

Handling in case of temporary errors usually means: Keep the original email around in a queue and retry after the initial client has long disconnected. If you don’t succeed for a reasonably large amount of delivery attempts or if a permanent problem creeps up, then you  have to bounce the message back to the initial sender.

If you want to do the final email delivery, so that your app runs without any other dependencies, then you will end up not only writing an SMTP server but also a queueing system, something that’s way beyond the scope of simple alias resolution.

Even if I wanted to go through that hassle, it still wouldn’t help much as aside of the purely technical hurdles, there are also others on a more meta level:

If you intend to do the final delivery nowadays, you practically need to have a valid PTR record, you need to be in good standing with the various RBL’s, you need to handle SSL – the list goes on and on. Much of this is administrative in nature and might even create additional cost and is completely pointless considering the fact that you do usually have a dedicated smarthost around that takes your mail and does the final delivery. And even if you don’t: Installing a local MTA for the queue handling is easily done and whatever you install, it’ll be way more mature than what I could write in any reasonable amount of time.

So it’s decided: The tempalias codebase will require a smarthost to be configured. As mine doesn’t require authentication from a certain IP range, I can even get away without writing any SMTP authentication support.

Once that was clear, the next design decision was clear too: the tempalias smtp daemon should be a really, really thin layer around the smarthost. When a client connects to tempalias, we will connect to the smarthost (500ing (or maybe 400ing) out if we can’t – remember: immediate and permanent errors are easy to handle). When a client sends MAIL FROM, just relay it to the smarthost, returning back to the client whatever we got – you get the idea: the tempalias mail daemon is an SMTP proxy.

This keeps the complexity low while still providing all the functionality we need (i.e. rewriting RCPT TO).

Once all of this was clear, I sat down and had a look at the node-smtp servers and clients and it was immediately clear that both need a lot of work to even to the simple thing I had in mind.

This means that most of todays work went into my fork of node-smtp:

  • made the hostname in the banner configurable
  • made the smtp client library work with node trunk
  • fire additional events (on close, on mail from, on rcpt to)
  • fixed various smaller bugs

Of course I have notified upstream of my changes – we’ll see what they think about.

On the other hand, the SMTP server part of tempalias (incidentally the first SMTP server I’m writing. ever) also took shape somewhat. It now correctly handles proxying from initial connection up until DATA. It doesn’t do real alias expansion yet, but that’s just a matter of hooking it into the backend model class I already have – for now I’m happy with it rewriting all passed recipients to my own email address for testing.

I already had a look at how node-smtp’s daemon handles the DATA command and I have to say that gobbling up data into memory until the client stops sending data or we run out of heap isn’t quite what I need, so tomorrow I will have to change node-smtp even more in that it fires events for every bit of data that was received. That way a consumer of the API can do some validation on the various chunks  (mostly size validation) and I can pass the data directly to the smarthost as it arrives.

This keeps memory usage of the node server small.

So that’s what I’m going to do tomorrow.

On a different note, I had some thought going into actual deployment, which probably will end up with me setting up a reverse proxy after all, but this is a topic for another discussion.

… and back to Thunderbird

It has been a while since I’ve last posted about email – still a topic very close to my heart, be it on the server side or on the client side (though the server side generally works very well here, which is why I don’t post about it).

Waybackwhen, I’ve written about Becky! which is also where I’ve declared the points I deemed important in a mail client. A bit later, I’ve talked about The Bat!, but in the end, I’ve settled with Thunderbird, just to switch to Mac Mail when I’ve switched to the Mac.

After that came my excursion to Gmail, but now I’m back to Thunderbird again.

Why? After all, my Gmail review sounded very nice, didn’t it?

Well…

  • Gmail is blazingly fast once it’s loaded, but starting the browser and then gmail (it loads so slow that “starting (the) gmail (application)” is a valid term to use) is always slower than just keeping a mail client open on the desktop.
  • Google Calendar Sync sucks and we’re using Exchange/Outlook here (and are actually quite happy with it – for calendaring and address books – it sucks for mail, but it provides decent IMAP support), so there was no way for the other folks here to have a look at my calendar.
  • Gmail always posts a “Sender:”-Header when using a custom sender domain which technically is the right thing to do, but Outlook on the receiving end screws up by showing the mail as being “From xxx@gmail.com on behalf of xxx@domain.com” which isn’t really what I’d want.
  • Google’s contact management offering is sub par compared even to Exchange.
  • iPhone works better with Exchange than it does with Google (yes. iPhone, but that’s another story).
  • The cool Gmail MIDP client doesn’t work/isn’t needed on the iPhone, but originally was one of the main reasons for me to switch to Gmail.

The one thing I really loved about Gmail though was the option of having a clean inbox by providing means for archiving messages with just a single keyboard shortcut. Using a desktop mail client without that funcationality wouldn’t have been possible for me any more.

This is why I’ve installed Nostalgy, a Thunderbird extension allowing me to assign a “Move to Folder xxx” action to a single keystroke (y in my case – just like gmail).

Using Thunderbird over Mac Mail has its reasons in the performance and in the crazy idea of Mac Mail to always download all the messages. Thunderbird is no race horse, but Mail.app isn’t even a slug.

Lately, more and more interesting posts regarding the development of Thunderbird have appeared on Planet Mozilla, so I’m looking forward to see Thunderbird 3 taking shape in its revitalized form.

I’m all but conservative in my choice of applications and gadgets, but Mail  – also because of its importance for me – must work exactly as I want it. None of the solutions out there are doing that to the full extent, but TB certainly comes closest. Even after years of testing and trying out different solutions, TB is the thing that solves most of my requirements without adding new issues.

Gmail is splendid too, but it presents some shortcomings TB doesn’t come with.

SPAM insanity

<p>I don’t see much point in complaining about SPAM, but it’s slowly but surely reaching complete insanity…</p>

What you see here is the recent history view of my DSPAM – our second line of defense against SPAM.

Red means SPAM. (the latest of the messages was a quite clever phishing attempt which I had to manually reclassify)

To give even more perspective to this: The last genuine Email I received was this morning at 7:54 (it’s now 10 hours later) and even that was just an automatically generated mail from Skype.

To put it into even more perspective: My DSPAM reports that since december 22th, I got 897 SPAM messages and – brace yourself – 170 non-spam messages of which 100 were subversion commit emails and 60 other emails sent from automated cron-jobs.

What I’m asking myself now is: Do these spammers still get anything out of their work? The signal-to-noise ratio has gone down the drain in a manner which can only mean that no person on earth would actually still read through all this spam and even be stupid enough to actually fall for it.

How bad does it have to get before it gets better?

Oh and don’t think that DSPAM is all I’m doing… No… these 897 mails were the messages that passed through both the ix DNSBL and SpamAssassin.

Oh and: Kudos to the DSPAM team. A recognition rate of 99.957% is really, really good

Gmail – The review

It has been quite a while since I began routing my mail to Gmail with the intention of checking that often-praised mail service out thoroughly.

The idea was to find out if it’s true what everyone keeps saying: That gmail has a great user interface, that it provides all the features one needs and that it’s a plain pleasure to work with it.

Personally, I’m blown away.

Despite the obviously longer load time to be able to access the mailbox (Mac Mail launches quicker than it takes gmail to load here – even with a 10 MBit/s connection), the gmail interface is much faster to use – especially with the nice keyboard shortcuts – but I’m getting ahead of myself.

When I began to use the interface for some real email work, I immediately noticed the shift of paradigm: There are no folders and – the real new thing for me – you are encouraged to move your mail out of the inbox as you take notice of them and/or complete the task associated with the message.

When you archive a message, it moves out of the inbox and is – unless you tag it with a label for quick retrieval – only accessible via the (quick) full text search engine built into the application.

The searching part of this usage philosophy is known to me. When I was using desktop clients, I usually kept arriving email in my inbox until it contained somewhere around 1500 messages or so. Then I grabbed all the messages and put them to my “Old Mail” folder where I accessed them strictly via the search functionality built into the mail client (or the server in case of a good IMAP client).

What’s new for me is the notion of moving mail out of your inbox as you stop being interested in the message – either because you plain read it or because the associated task is completed.

This allows you for a quick overview over the tasks still pending and it keeps your inbox nice and clean.

If you want quick access to certain messages, you can tag them with any label you want (multiple labels per message are possible of course) in which case you can access the messages with one click, saving you the searching.

Also, it’s possible to define filters allowing you to automatically apply labels to messages and – if you want, move them out of the inbox automatically – a perfect setup for the SVN commit messages I’m getting, allowing me to quickly access them at the end of the day and looking over the commits.

But the real killer feature of gmail is the keyboard interface.

Gmail is nearly completely accessible without requiring you to move your hands off the keyboard. Additionally, you don’t even need to press modifier keys as the interface is very much aware of state and mode, so it’s completely usable with some very intuitive shortcuts which all work by pressing just any letter button.

So usually, my workflow is like this: Open gmail, press o to open the new message, read it, press y to archive it, close the browser (or press j to move to the next message and press o again to open it).

This is as fast as using, say, mutt on the console, but with the benefit of staying usable even when you don’t know which key to press (in that case, you just take the mouse).

Gmail is perfectly integrated into google calendar, and it’s – contrary to mac mail – even able to detect outlook meeting invitations (and send back correct responses).

Additionally, there’s a MIDP applet available for your mobile phone that’s incredibly fast and does a perfect job of giving you access to all your email messages when you are on the road. As it’s a Java application, it runs on pretty much every conceivable mobile phone and because it’s a local application, it’s fast as hell and can continue to provide the nice, keyboard shortcut driven interface which we are used to from the AJAXy web application.

Overall, the experiment of switching to gmail proofed to be a real success and I will not switch back anytime soon (all my mail is still archived in our Exchange IMAP box). The only downside I’ve seen so far is that if you use different email-aliases with your gmail-account, gmail will set the Sender:-Header to your gmail-address (which is a perfectly valid – and even mandated – thing to do), and the stupid outlook on the receiving end will display the email as being sent from your gmail adress “in behalf of” your real address, exposing your gmail-address at the receiving. Meh. So for sending non-private email, I’m still forced to use Mac Mail – unfortunately.

Trying out Gmail

Everyone and their friends seems to be using Gmail lately and I agree: The application has a clean interface, a very powerful search feature and is easily accessible from anywhere.

I have my Gmail address from back in the days when invites were scarce and the term AJAX wasn’t even a term yet, but I never go around to really take advantage of the services as I just don’t see myself checking various email accounts at various places – at least not for serious business.

But now I found a way to put gmail to the test as my main email application – at least for a week or two.

My main mail storage is and will be our Exchange server. I have multiple reasons for that

  1. I have all my email I ever sent or received in that IMAP account. That’s WAY more than the 2.8 GB you get in Gmail and even if I had enough space there, I would not want to upload all my messages there.
  2. I don’t trust gmail to be as diligent with the messages I store there as I would want it to. I managed to keep every single email message from 1998 till now and I’d hate to lose all that to a “glitch in the system”.
  3. I need IMAP access to my messages for various purposes.
  4. I need the ability of a strong server-side filtering to remove messages I’m more or less only receiving for logging purposes. I don’t want to see these – not until I need them. No reason to even have them around usually.

So for now I have added yet another filter to my collection of server-side filters: This time I’m redirecting a copy of all mail that didn’t get filtered away due to various reasons to my Gmail address. This way I get to keep all mail of my various aliases all at the central location where they always were and I can still use Gmail to access the newly arrived messages.

Which leaves the problem with the sent messages which I ALSO want to archive at my own location – at least the important ones.

I fixed this by BCCing all Mail I’m writing in gmail to a new alias I created. Mail to that alias with my Gmail address as sender will be filtered into my sent-box by Exchange so it’ll look as though I sent the message via Thunderbird and then uploaded the copy via IMAP.

I’m happy with this solution, so testing Gmail can begin.

I’m asking myself: Is a tag based storage system better than a purely search based (the mail I don’t filter away is kept in one big INBOX which I access purely via search queries if I need something)? Is a web based application as powerful as a mail client like Thunderbird or Apple Mail? Do I unconsciously use features I’m going to miss when using Gmail instead of Apple Mail or Thunderbird? Will I be able to get used to the very quick keyboard-interface to gmail?

Interesting questions I intend to answer.