stabilizing tempalias

While the maintenance last weekend brought quite a bit of stabilization to the tempalias service, I quickly noticed that it was still dying sooner or later and while before updating node, it died due to not being able to allocate more memory, this time, it died by just not answering any requests any more.

A look at the error log quickly revealed quite many exceptions complaining about a certain request type not being allowed to have a body and finally one complaining about not being able to open a file due to having run out of file handles.

So I quickly improved error logging and restarted the daemon in order to get a stacktrace leading to these tons of exceptions.

This quickly pointed to paperboy which was sending the file even if the request was a HEAD request. http.js in node checks for this and throws whenever you send a body when you should not. That exception lead then to paperboy never closing the file (have I already complained how incredibly difficult it is to do proper exception handling the moment continuations get involved? I think not and I also think it’s a good topic for another diary entry). With the help of lsof I’ve quickly seen that my suspicions were true: the node process serving tempalias had tons of open handles to public/index.html.

I sent a patch for this behavior to @felixge which was quickly applied, so that’s fixed now. I hope it’s of some use for other people too.

Now knowing that having a look at lsof here and then might be a good idea, quickly revealed another problem: While the file handles were gone, I’ve noticed tons and tons of SMTP sockets staying open in CLOSE_WAIT state. Not good as that too will lead to handle starvation sooner or later.

On a hunch, I found out that connecting to the SMTP daemon and then disconnecting, not sending QUIT to let the server disconnect was what was causing the lingering sockets. Clients disconnecting like that is very common in case the sender sends a 5xx response which is what the tempalias daemon was designed for.

So I had to fix that in my fork of the node smtp daemon (the original upstream isn’t interested in daemon functionality and the owner I forked the daemon for doesn’t respond to my pull requests. Hence I’m maintaining my own fork for now).

Futher looks at lsof prove that now we are quite stable in resource consumption: No lingering connections, no unclosed file handles.

But the error log was still filling up. This time something about removeListener needing a function. Thanks to the callstack I now had in my error log, I quickly hunted that one down and fixed it – that was a very stupid mistake. Thankfully, because the mails I usually deliver are small enough so that socket draining usually wasn’t required.

Onwards to the next issue filling the error log: «This deferred has already been resolved».

This comes from the Promise.js library if you emit*() multiple times on the same promise. This time, of course, the callstack was useless (… at <anonymous> – why, thank you), but I was very lucky again in that I tested from home and my mail relay didn’t trust my home IP address and thus denied relaying with a 500 which immediately led to the exception.

Now, this one is crazy: When you call .addErrback() on a Promise before calling addCallback(), your callback will be executed no matter if the errback was executed first.

Promise.js does some really interesting things to simulate polymorphism in JavaScript and I really didn’t want to fix up that library as lately, node.js itself seems go to a simpler continuation style using a callback parameter, so sooner or later, I’ll have to patch up the smtp server library anyways to remove Promise.js if I want to adhere to current node style.

So I took the workaround route by just using addCallback() before addErrback() even though the other order feels more natural to me. In addition, I reported an issue with the author as this is clearly unexpected behavior.

Now the error log is pretty much silent (minus some ECONNRESET exceptions due to clients sending RST packets in mid-transfer, but I think they are uncritical to resource consumption), so I hope the overall stability of the site has improved a bunch – I’d love not having to restart the daemon for more than a day :-)

Bugs, Bugs and more Bugs

I love my job. Ever loved it, always will love it.

But if you ask me what the most annoying aspect of it is, then I would answer you that it’s stuff always breaking all around me.

Whatever I do, there is no guarantee that any defined thing will work like it’s expected to, it will break from one moment to another or it will never work. There are hardware failures, OS failures, software failures – each and every day I lose at least one or two hours due to stuff not working or suddenly stopping to work.

Let me give you an account of what happened since the beginning of 2009:

  • When installing two previously configured servers at a collocation center, one didn’t start up at all (opening and reclosing the case fixed that) and the ESX server on the other machine refused to connect to the VMWare license server despite a working TCP/IP connection between them which turned out to be a missing host file entry despite connecting via IP-address.
  • One day later, Outlook on a computer of someone I’m looking after the PC a bit decided to trash the .PST-file and I had to remotely guide (on the phone) the person to restore it from the backup.
  • Yesterday, my Firebug suddenly stopped working. At least the console-object wasn’t any longer available in my scripts and the console itself didn’t work. Reinstalling the Addon helped (WTF?)
  • One of my two Vista Media Center PCs suddenly stopped to play any video file, despite me not doing updates on these machines to prevent stuff like this from happening. To this date I have no idea how to fix this.
  • My Delphi 2007 installation just now decided to stop displaying the online help. Trying to fix that by reinstalling it ended with an Error message containing title and content of “Error”, but not after first completely uninstalling Delphi with no way of getting it back (you know… “Error” again). This was fixed by removing D2009 and then reinstalling 2007 and 2009 – a process that took 2 hours of installation time and another three to figure out what’s going on.
  • When I was frustrated enough and wanted to vent (i.e. write this post), my WordPress just now decided to do something really strange to the layout of the “Add New Post” page which made it impossible to post anything. Disabling Google Gears and restarting the browser helped.

Our everyday technology is becoming more and more complex, thus causing more and more strange problems, requiring more and more knowledge and time to work around them. If we continue on that path, sooner or later it will be impossible to keep up with fixing problems popping up.

That will be the day when I’ll hopefully live on some island way off the net and all this stuff.