Tracking runs with the Apple Watch

When I started running this summer, I also wanted to make use of my Apple watch to keep track of my routes and my speed over time, so I looked into the various apps and services around that.

Generally, there are two parts to tracking a run: One part is the actual data gathering that happens while you’re running and the other part is the analysis and comparison other other runs afterwards.

Unfortunately, of all the applications I looked at, none excelled at both, so in the end what I’ve ended up with is writing custom code to give me the best of both worlds.

Here’s what I’ve looked at.

Apple Workouts.app

The built-in Workout app of the watch, being a watch-native app made by Apple, is more equal than other apps: It’s the only app that allows you to trigger the screen lock while in the app and with WatchOS 4 it’s also the only app that gives you very easy access to the media controls. And finally, it’s the only app that can log its tracked workout and movement Activitiy in the Activitiy app in their actual colors instead of just gray (yes. very important this).

It offers very readable data on-screen while it’s running, it can send timely notifications as you pass another kilometre and it never crashes.

Looking at the Map of the run in the Activity app, it also collects very accurate location data.

As good as it is for collecting data, as bad it is for analysis of the data though: The best you can do is have a look at a single workout. There’s no way to compare two – unless you take screenshots and do that manually.

There is also no way to export the data: In the workout details there is a share button, but that just exports a corny text and a useless picture. No detail is included.

Screen Shot 2017-09-04 at 08.51.13.png
Yes. This is all you get from exporting a run. I love the picture. So useful.

So for any analysis you want to do based on runs recorded with the Workouts app, you have to first manually transfer data from screenshots to some other machine readable form and even then: The screenshots alone don’t provide nearly enough useful data.

Strava

This is the other extreme in the list of apps I looked at: It provides excellent analysis and it has an extremely motivating high-score list for user-provided segments of a run. You don’t have to match routes exactly – the moment you run through an existing previously created segment, you’ll be able to compare your effort to others.

Screen Shot 2017-09-04 at 08.47.09.png
Segment rankings (this segment is uphill)

It’s also great at automatically matching previous runs over the same route, so you can compare your runs over time.

Screen Shot 2017-09-04 at 08.44.32.png
Getting faster over time (this mostly uphill too)

The other social features it offers don’t interest me, so I can’t really talk about them.

However: As good as the analysis is, as bad its recording feature is: Of all the apps I looked at it provides the least amount of detail during the run and, what’s worse, its GPS tracking is extremely inaccurate and unreliable.

I’m always running having my phone with me – mainly for easy access to all my media and to Overcast and also because most of my runs I do on my way home from the office where I need the phone anyways. Strava doesn’t make use of this but instead solely relies on the watches GPS which is much less accurate than the phones.

I can understand this: The device is smaller, so it’s harder to put in powerful antennas, it has way less battery and a much weaker CPU than the phone, so it just can’t be as good. It’s totally ok for when you only have the watch with you, but when you have the phone with you, it’s a shame if the app can’t use it.

Runkeeper

Runkeeper uses both the watch and the phone for location tracking and it provides a great UI while the workout is ongoing.

Its analysis features aren’t as good as the ones from Strava though. It doesn’t do the automated segment high-scoring and it’s not as good at comparing runs over the same route with each other.

And finally, the UI of the site doesn’t look as polished as does Strava’s – but that’s just a matter of taste I guess.

… master of none

For all of July and August, my mode of operation was to use Runkeeper to acquire the data during the run and then to export a .gpx file from their site and to import it into Strava.

This gave me the best of both worlds: Very good data gathering and very good data analysis.

However, I wasn’t entirely happy with this either as the process was somewhat cumbersome and, lately, unstable.

Probably caused by iOS 11 Beta, I’ve seen various failure modes related to Runkeeper, all of wich are very annoying:

  1. The workout might start on the Watch but it will not manage to also start it on the phone. This way, the workout will be tracked, but no route data will be saved.
  2. Runkeeper on the phone will crash after about 10 minutes. There’s no indication of this happening, but the result will be that a 10 minutes run is logged instead of the real data on the watch. If this happens, there is no way to even just get to the data without the route.

Issue 1) I could work around often by launching Runkeeper manually on the phone, then starting the workout on the watch and then making sure that the workout would also start on the phone.

If that happened, then route data was tracked correctly.

Unfortunately, sometimes, this stopped working all-together and the only way for the watch to talk to the phone again was to completely uninstall and reinstall Runkeeper on both the Phone and the Watch. This is annoying when you want to start running, but you can’t because the Software-gods have put 20 minutes of fiddling with the App Store in front of you (also, Runkeeper is bigger than the App Store’s 3G download limit, so you better have wifi  available).

Issue 2) is much worse though: There’s no indication of it happening. You’d think that the blue bar “Runkeeper is actively using your location” on the phone would be a good indicator, but it isn’t: When the crash happens, the bar stays there until you unlock your phone. Then it goes away.

So there’s no way to be sure unless you periodically unlock your phone which is very annoying and distracting during the run – especially as you’re sweaty and TouchID won’t work most of the time (I use a strong 25 character password).

I know – even if it isn’t tracked, a run is a run. But it certainly doesn’t feel that way and how it feels is very important to keep motivated to doing this – especially under bad weather conditions.

let’s just hack it

Now, admittedly, these are very likely beta-woes that will eventually solve themselves. We’re pretty far into the beta cycle though (Beta 9 at the time of this writing), so I’m suspicious that these issues won’t be fixed come release but will have to wait for a future update to either Runkeeper or the OS.

Losing about a third of my runs to software issues felt really unacceptable to me, especially considering that Runkeeper still wasn’t offering some features that the workout app was (like enabling the screen lock – which is important when running in the rain).

However, when I looked again at the WWDC sessions this year, I found out that IOS11 will finally offer an API to read and write route data for workouts. This means that the data you track using Apple’s built-in app will finally be available to other apps to read.

This would give me the best of all worlds: Use the best data-gathering app and export it to the best analysis app; side-stepping the stability issues.

IMG_1324.png
Such Wow. Much UI.

So, this weekend, I hacked together a quick solution (MIT licensed) that does exactly this. It lists you all your workouts and if you tap one, it will eventually show you a share-sheet, allowing you to select a location to store a .gpx file to.

That file contains all the information required for Strava to do its analysis.

In a perfect world, this app would of course upload directly to Strava. And it would not block the UI thread while it’s exporting the gpx file. And it would actually have some UI to speak of.

But this was a quick-hack that solved an issue for me – and who knows – maybe it will fix it for you.

If you need a real solution for this, Twitter user @dwlz is apparently working on a real app that will be usable for normal people and I’ll definitely switch to that when it’s ready. But until then, I can finally track my runs with the peace of mind of having a crash-free solution that still provides the best analysis possible.

Screen Shot 2017-09-04 at 08.54.14.png
A run tracked with the Workout app and uploaded to Strava using my hack

another fun project: digipass

As a customer of digitec, I often deal with their collection notices which I get via email and which invite me to go to their store and fetch my order (yes. I could have the goods delivered, but I’m impatient and not willing to pay the credit card surcharge).

Ever since Passbook happened on iOS 6, I wished for these collection notices to be iOS Passes as they have a lot of usability benefits:

  • passes are location aware an pop up automatically when you get close to the location
  • Wallet automatically turns the screen brightness all the way up
  • passes could potentially be updated remotely
  • once added to the Wallet, passes don’t clutter your mailbox and you’ll never lose them in the noise of your inbox.

So my latest fun project is digipass.

Next time you get a digitec collection notice, just forward it to

digipass@pilif.me

After a few seconds, you will get the same collection notice again, but with the PDF replaced by an iOS Wallet pass that you can add to your Wallet.

I have slightly altered the logo and the name to make it clear that there’s no affiliation to digitec.

The pass will be geo-coded to the correct store, so it will automatically pop up as you get close to the store.

As I don’t want access to your digitec account and because digitec doesn’t have any kind of API, I unfortunately can’t automatically remove the pass when your fetch your order – that’s something only digitec can do.

The source code for the server is available under the MIT license.

Disclaimer:

  • I’m not affiliated with digitec aside of being a customer of theirs. If they want me to shut this down, I will.
  • I am not logging the collection notices you’re forwarding me. If you don’t trust me, you can self-host, or redact the notice to contain nothing but the URLs (I need these in order to build the pass).
  • This is a fun project. If it’s down, it’s down. If it doesn’t work, submit a pull request. Don’t expect any support
  • The LMTP daemon powering this is running in my home. I have a very good connection, but I also have not signed an SLA or anything. If it’s down, it’s down (the message will get queued though).
  • The moment I see this being abused, it will be shut down. Just like my previous email based fun project

AV Programs as a Security Risk

Imagine you were logged into your machine as an administrator. Imagine you’re going to double-click every single attachment in every single email you get. Imagine you’re going to launch every single file your browser downloads. Imagine you answer affirmative on every single prompt to install the latest whatever. Imagine you unpack every single archive sent to you and you launch ever single file in those archives.

This is the position that AV programs put themselves on your machine if they want to have any chance at being able to actually detect malware. Just checking whether a file contains a known byte signature has stopped being a reliable method for detecting viruses long ago.

It makes sense. If I’m going to re-distribute some well-known piece of malware, all I have to do is to obfuscate it a little bit or encrypt it with a static key and my piece of malware will naturally not match any signature of any existing malware.

The loader-stub might, but if I’m using any of the existing installer packagers, then I don’t look any different than any other setup utility for any other piece of software. No AV vendor can yet affort to black-list all installers.

So the only reliable way to know whether a piece of software is malware or not, is to start running it in order to at least get it to extract/decrypt itself.

So here we are in a position where a anti malware program either is useless placebo or it has to put itself into the position I have started this article with.

Personally, I think it is impossible to safely run a piece of software in a way that it cannot do any harm to the host machine.

AV vendors could certainly try to make it as hard as possible for malware to take over a host machine, but here we are in 2016 where most of the existing AV programs are based on projects started in the 90ies where software quality and correctness was even less of a focus than it is today.

We see AV programs disabling OS security features, installing and starting VNC servers and providing any malicious web site with full shell access to the local machine. Or allow malware to completely take over a machine if a few bytes are read no matter where from.

This doesn’t cover the privacy issues yet which are caused by more and more price-pressure the various AV vendors are subject to. If you have to sell the software too cheap to pay for its development (or even give it away for free), then you need to open other revenue streams.

Being placed in such a privileged position like AV tools are, it’s no wonder what kinds of revenue streams are now in process of being tapped…

AV programs by definition put themselves into an extremely dangerous spot on your machine: In order to read every file your OS wants to access, it has to run with adminitrative rights and in order to actually protect you it has to understand many, many more file formats than what you have applications for on your machine.

AV software has to support every existing archive format, even long obsolete ones because who knows – you might have some application somewhere that can unpack it; it has to support every possibly existing container format and it has to support all kinds of malformed files.

If you try to open a malformed file with some application, then the application has the freedom to crash. An AV program must keep going and try even harder to see into the file to make sure it’s just corrupt and not somehow malicious.

And as stated above: Once it finally got to some executable payload, it often has no chance but to actually execute it, at least partially.

This must be some of the most difficult thing to get right in all of engineering: Being placed at a highly privileged spot and being tasked to then deal with content that’s malicious per definitionem is an incredibly difficult task and when combined with obviously bad security practices (see above), I come to the conclusion that installing AV programs is actually lowering the overall security of your machines.

Given a fully patched OS, installing an AV tool will greatly widen the attack surface as now you’re putting a piece of software on your machine that will try and make sense of every single byte going in and out of your machine, something your normal OS will not do.

AV tools have the choice of doing nothing against any but the most common threats if they decide to do pure signature matching, or of potentially putting your machine at risk.

AV these days might provide a very small bit of additional security against well-known threats (though against those you’re also protected if you apply the latest OS patches and don’t work as an admin) but they open your installation wide for all kinds of targeted attacks or really nasty 0-day exploits that can bring down your network without any user-interaction what so ever.

If asked what to do these days, I would give the strong recommendation to not install AV tools. Keep all the software you’re running up to date and white-list the applications you want your users to run. Make use of white-listing by code-signatures to, say, allow everything by a specific vendor. Or all OS components.

If your users are more tech-savy (like developers or sys admins), don’t whitelist but also don’t install AV on their machines. They are normally good enough to not accidentally run malware and the risk of them screwing up is much lower than the risk of somebody exploiting the latest flaw in your runs-as-admin-and-launches-every-binary piece of security software.

AJAX, Architecture, Frameworks and Hacks

Today I was talking with @brainlock about JavaScript, AJAX and Frameworks and about two paradigms that are in use today:

The first is the “traditional” paradigm where your JS code is just glorified view code. This is how AJAX worked in the early days and how people are still using it. Your JS-code intercepts a click somewhere, sends an AJAX request to the server and gets back either more JS code which just gets evaulated (thus giving the server kind of indirect access to the client DOM) or a HTML fragment which gets inserted at the appropriate spot.

This means that your JS code will be ugly (especially the code coming from the server), but it has the advantage that all your view code is right there where all your controllers and your models are: on the server. You see this pattern in use on the 37signals pages or in the github file browser for example.

Keep the file browser in mind as I’m going to use that for an example later on.

The other paradigm is to go the other way around an promote JS to a first-class language. Now you build a framework on the client end and transmit only data (XML or JSON, but mostly JSON these days) from the server to the client. The server just provides a REST API for the data plus serves static HTML files. All the view logic lives only on the client side.

The advantages are that you can organize your client side code much better, for example using backbone, that there’s no expensive view rendering on the server side and that you basically get your third party API for free because the API is the only thing the server provides.

This paradigm is used for the new twitter webpage or in my very own tempalias.com.

Now @brainlock is a heavy proponent of the second paradigm. After being enlightened by the great Crockford, we both love JS and we both worked on huge messes of client-side JS code which has grown over the years and lacks structure and feels like copy pasta sometimes. In our defense: Tons of that code was written in the pre-enlightened age (2004).

I on the other hand see some justification for the first pattern aswell and I wouldn’t throw it away so quickly.

The main reason: It’s more pragmatic, it’s more DRY once you need graceful degradation and arguably, you can reach your goal a bit faster.

Let me explain by looking at the github file browser:

If you have a browser that supoports the HTML5 history API, then a click on a directory will reload the file list via AJAX and at the same time the URL will be updated using push state (so that the current view keeps its absolute URL which is valid even after you open it in a new browser).

If a browser doesn’t support pushState, it will gracefully degrade by just using the traditional link (and reloading the full page).

Let’s map this functionality to the two paradigms.

First the hacky one:

  1. You render the full page with the file list using a server-side template
  2. You intercept clicks to the file list. If it’s a folder:
  3. you request the new file list
  4. the server now renders the file list partial (in rails terms – basically just the file list part) without the rest of the site
  5. the client gets that HTML code and inserts it in place of the current file list
  6. You patch up the url using push state

done. The view code is only on the server. Whether the file list is requested using the AJAX call or the traditional full page load doesn’t matter. The code path is exactly the same. The only difference is that the rest of the page isn’t rendered in case of an AJAX call. You get graceful degradation and no additional work.

Now assuming you want to keep graceful degradation possible and you want to go the JS framework route:

  1. You render the full page with the file list using a server-side template
  2. You intercept the click to the folder in the file list
  3. You request the JSON representation of the target folder
  4. You use that JSON representation to fill a client-side template which is a copy of the server side partial
  5. You insert that HTML at the place where the file list is
  6. You patch up the URL using push state

The amount of steps is the same, but the amount of work isn’t: If you want graceful degradation, then you write the file list template twice: Once as a server-side template, once as a client-side template. Both are quite similar but usually you’ll be forced to use slightly different syntax. If you update one, you have to update the other or the experience will be different whether you click on a link or you open the URL directly.

Also you are duplicating the code which fills that template: On the server side, you use ActiveRecord or whatever other ORM. On the client side, you’d probably use Backbone to do the same thing but now your backend isn’t the database but the JSON response. Now, Backbone is really cool and a huge timesaver, but it’s still more work than not doing it at all.

OK. Then let’s skip graceful degradation and make this a JS only client app (good luck trying to get away with that). Now the view code on the server goes away and you are just left with the model on the server to retrieve the data, with the model on the client (Backbone helps a lot here, but there’s still a substatial amount of code that needs to be written that otherwise wouldn’t) and with the view code on the client.

Now don’t ge me wrong.

I love the idea of promoting JS to a first class language. I love JS frameworks for big JS only applications. I love having a “free”, dogfooded-by-design REST API. I love building cool architectures.

I’m just thinking that at this point it’s so much work doing it right, that the old ways do have their advantages and that we should not condemn them for being hacky. True. They are. But they are also pragmatic.

Find relation sizes in PostgreSQL

Like so many times before, today I was yet again in the situation where I wanted to know which tables/indexes take the most disk space in a particular PostgreSQL database.

My usual procedure in this case was to dt+ in psql and scan the sizes by eye (this being on my development machine, trying to find out the biggest tables I could clean out to make room).

But once you’ve done that a few times and considering that dt+ does nothing but query some PostgreSQL internal tables, I thought that I want this solved in an easier way that also is less error prone. In the end I just wanted the output of dt+ sorted by size.

The lead to some digging in the source code of psql itself (src/bin/psql) where I quickly found the function that builds the query (listTables in describe.c), so from now on, this is what I’m using when I need to get an overview over all relation sizes ordered by size in descending order:

select
  n.nspname as "Schema",
  c.relname as "Name",
  case c.relkind
     when 'r' then 'table'
     when 'v' then 'view'
     when 'i' then 'index'
     when 'S' then 'sequence'
     when 's' then 'special'
  end as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
  pg_catalog.pg_size_pretty(pg_catalog.pg_relation_size(c.oid)) as "Size"
from pg_catalog.pg_class c
 left join pg_catalog.pg_namespace n on n.oid = c.relnamespace
where c.relkind IN ('r', 'v', 'i')
order by pg_catalog.pg_relation_size(c.oid) desc;

Of course I could have come up with this without source code digging, but honestly, I didn’t know about relkind s, about pg_size_pretty and pg_relation_size (I would have thought that one to be stored in some system view), so figuring all of this out would have taken much more time than just reading the source code.

Now it’s here so I remember it next time I need it.

How to kill IE performance

While working on my day job, we are often dealing with huge data tables in HTML augmented with some JavaScript to do calculations with that data.

Think huge shopping cart: You change the quantity of a line item and the line total as well as the order total will change.

This leads to the same data (line items) having three representations:

  1. The model on the server
  2. The HTML UI that is shown to the user
  3. The model that’s seen by JavaScript to do the calculations on the client side (and then updating the UI)

You might think that the JavaScript running in the browser would somehow be able to work with the data from 2) so that the third model wouldn’t be needed, but due to various localization issues (think number formatting) and data that’s not displayed but affects the calculations, that’s not possible.

So the question is: Considering we have some HTML templating language to build 2), how do we get to 3).

Back in 2004 when I initially designed that system (using AJAX before it was widely called AJAX even), I hadn’t seen Crockford’s lectures yet, so I still lived in the “JS sucks” world, where I’ve done something like this

<!-- lots of TRs -->
<tr>
    <td>Column 1 addSet(1234 /*prodid*/, 1 /*quantity*/, 10 /*price*/, /* and, later, more, stuff, so, really, ugly */)</td>
    <td>Column 2</td>
    <td>Column 3</td>
</tr>
<!-- lots of TRs -->

(Yeah – as I said: 2004. No object literals, global functions. We had a lot to learn back then, but so did you, so don’t be too angry at me – we improved)

Obviously, this doesn’t scale: As the line items got more complicated, that parameter list grew and grew. The HTML code got uglier and uglier and of course, cluttering the window object is a big no-no too. So we went ahead and built a beautiful design:

<!-- lots of TRs -->
<tr class="lineitem" data-ps-lineitem='{"prodid": 1234, "quantity": 1, "price": 10, "foo": "bar", "blah": "blah"}'>
    <td>Column 1</td>
    <td>Column 2</td>
    <td>Column 3</td>
</tr>
<!-- lots of TRs -->

The first iteration was then parsing that JSON every time we needed to access any of the associated data (and serializing again whenever it changed). Of course this didn’t go that well performance-wise, so we began caching and did something like this (using jQuery):

$(function(){
    $('.lineitem').each(function(){
        this.ps_data = $.parseJSON($(this).attr('data-ps-lineitem'));
    });
});

Now each DOM element representing one of these <tr>’s had a ps_data member which allowed for quick access. The JSON had to be parsed only once and then the data was available. If it changed, writing it back didn’t require a re-serialization either – you just changed that property directly.

This design is reasonably clean (still not as DRY as the initial attempt which had the data only in that JSON string) while still providing enough performance.

Until you begin to amass datasets. That is.

Well. Until you do so and expect this to work in IE.

800 rows like this made IE lock up its UI thread for 40 seconds.

So more optimization was in order.

First,

$('.lineitem')

will kill IE. Remember: IE (still) doesn’t have getElementsByClassName, so in IE, jQuery has to iterate the whole DOM and check whether each elements class attribute contains “lineitem”. Considering that IE’s DOM isn’t really fast to start with, this is a HUGE no-no.

So.

$('tr.lineitem')

Nope. Nearly as bad considering there are still at least 800 tr’s to iterate over.

$('#whatever tr.lineitem')

Would help if it weren’t 800 tr’s that match. Using dynaTrace AJAX (highly recommended tool, by the way) we found out that just selecting the elements alone (without the iteration) took more than 10 seconds.

So the general take-away is: Selecting lots of elements in IE is painfully slow. Don’t do that.

But back to our little problem here. Unserializing that JSON at DOM ready time is not feasible in IE, because no matter what we do to that selector, once there are enough elements to handle, it’s just going to be slow.

Now by chunking up the amount of work to do and using setTimeout() to launch various deserialization jobs we could fix the locking up, but the total run time before all data is deserialized will still be the same (or slightly worse).

So what we have done in 2004, even though it was ugly, was way more feasible in IE.

Which is why we went back to the initial design with some improvements:

<!-- lots of TRs -->
<tr class="lineitem">
    <td>Column 1 PopScan.LineItems.add({"prodid": 1234, "quantity": 1, "price": 10, "foo": "bar", "blah": "blah"});</td>
    <td>Column 2</td>
    <td>Column 3</td>
</tr>
<!-- lots of TRs -->

phew crisis averted.

Loading time went back to where it was in the 2004 design. It was still bad though. With those 800 rows, IE was still taking more than 10 seconds for the rendering task. dynaTrace revealed that this time, the time was apparently spent rendering.

The initial feeling was that there’s not much to do at that point.

Until we began suspecting the script tags.

Doing this:

<!-- lots of TRs -->
<tr class="lineitem">
    <td>Column 1</td>
    <td>Column 2</td>
    <td>Column 3</td>
</tr>
<!-- lots of TRs -->

The page loaded instantly.

Doing this

<!-- lots of TRs -->
<tr class="lineitem">
    <td>Column 1 1===1;</td>
    <td>Column 2</td>
    <td>Column 3</td>
</tr>
<!-- lots of TRs -->

it took 10 seconds again.

Considering that IE’s JavaScript engine runs as a COM component, this isn’t actually that surprising: Whenever IE hits a script tag, it stops whatever it’s doing, sends that script over to the COM component (first doing all the marshaling of the data), waits for that to execute, marshals the result back (depending on where the DOM lives and whether the script accesses it, possibly crossing that COM boundary many, many times in between) and then finally resumes page loading.

It has to wait for each script because, potentially, that JavaScript could call document.open() / document.write() at which point the document could completely change.

So the final solution was to loop through the server-side model twice and do something like this:

<!-- lots of TRs -->
<tr class="lineitem">
    <td>Column 1 </td>
    <td>Column 2</td>
    <td>Column 3</td>
</tr>
<!-- lots of TRs -->
</table>

PopScan.LineItems.add({prodid: 1234, quantity: 1, price: 10, foo: "bar", blah: "blah"});
// 800 more of these

Problem solved. Not too ugly design. Certainly no 2004 design any more.

And in closing, let me give you a couple of things you can do if you want to bring the performance of IE down to its knees:

  • Use broad jQuery selectors. $('.someclass') will cause jQuery to loop through all elements on the page.
  • Even if you try not to be broad, you can still kill performance: $('div.someclass'). The most help jQuery can expect from IE is getElementsByTagName, so while it’s better than iterating all elements, it’s still going over all div’s on your page. Once it’s more than 200, the performance extremely quickly falls down (probably doing some O(n^2) thing somehwere).
  • Use a lot of <script>-tags. Every one of these will force IE to marshal data to the scripting engine COM component and to wait for the result.

Next time, we’ll have a look at how to use jQuery’s delegate() to handle common cases with huge selectors.

stabilizing tempalias

While the maintenance last weekend brought quite a bit of stabilization to the tempalias service, I quickly noticed that it was still dying sooner or later and while before updating node, it died due to not being able to allocate more memory, this time, it died by just not answering any requests any more.

A look at the error log quickly revealed quite many exceptions complaining about a certain request type not being allowed to have a body and finally one complaining about not being able to open a file due to having run out of file handles.

So I quickly improved error logging and restarted the daemon in order to get a stacktrace leading to these tons of exceptions.

This quickly pointed to paperboy which was sending the file even if the request was a HEAD request. http.js in node checks for this and throws whenever you send a body when you should not. That exception lead then to paperboy never closing the file (have I already complained how incredibly difficult it is to do proper exception handling the moment continuations get involved? I think not and I also think it’s a good topic for another diary entry). With the help of lsof I’ve quickly seen that my suspicions were true: the node process serving tempalias had tons of open handles to public/index.html.

I sent a patch for this behavior to @felixge which was quickly applied, so that’s fixed now. I hope it’s of some use for other people too.

Now knowing that having a look at lsof here and then might be a good idea, quickly revealed another problem: While the file handles were gone, I’ve noticed tons and tons of SMTP sockets staying open in CLOSE_WAIT state. Not good as that too will lead to handle starvation sooner or later.

On a hunch, I found out that connecting to the SMTP daemon and then disconnecting, not sending QUIT to let the server disconnect was what was causing the lingering sockets. Clients disconnecting like that is very common in case the sender sends a 5xx response which is what the tempalias daemon was designed for.

So I had to fix that in my fork of the node smtp daemon (the original upstream isn’t interested in daemon functionality and the owner I forked the daemon for doesn’t respond to my pull requests. Hence I’m maintaining my own fork for now).

Futher looks at lsof prove that now we are quite stable in resource consumption: No lingering connections, no unclosed file handles.

But the error log was still filling up. This time something about removeListener needing a function. Thanks to the callstack I now had in my error log, I quickly hunted that one down and fixed it – that was a very stupid mistake. Thankfully, because the mails I usually deliver are small enough so that socket draining usually wasn’t required.

Onwards to the next issue filling the error log: «This deferred has already been resolved».

This comes from the Promise.js library if you emit*() multiple times on the same promise. This time, of course, the callstack was useless (… at <anonymous> – why, thank you), but I was very lucky again in that I tested from home and my mail relay didn’t trust my home IP address and thus denied relaying with a 500 which immediately led to the exception.

Now, this one is crazy: When you call .addErrback() on a Promise before calling addCallback(), your callback will be executed no matter if the errback was executed first.

Promise.js does some really interesting things to simulate polymorphism in JavaScript and I really didn’t want to fix up that library as lately, node.js itself seems go to a simpler continuation style using a callback parameter, so sooner or later, I’ll have to patch up the smtp server library anyways to remove Promise.js if I want to adhere to current node style.

So I took the workaround route by just using addCallback() before addErrback() even though the other order feels more natural to me. In addition, I reported an issue with the author as this is clearly unexpected behavior.

Now the error log is pretty much silent (minus some ECONNRESET exceptions due to clients sending RST packets in mid-transfer, but I think they are uncritical to resource consumption), so I hope the overall stability of the site has improved a bunch – I’d love not having to restart the daemon for more than a day :-)