PHP, stream filters, bzip2.compress

Maybe you remember that, more than a year ago, I had an interesting problem with stream filters.

The general idea is that I want to send bzip2-compressed data to the client while the output is still being assembled. More to the point: the PopScan Windows client supports receiving bzip2-encoded data, which gets really interesting as the amount of data to be transferred increases.

Even more so: The transmitted data is in XML format which is very easily compressed – especially with bzip2.

Once you begin to transmit multiple megabytes of uncompressed XML-data, you begin to see the sense in jumping through a hoop or two to decrease the time needed to transmit the data.

On the receiving end, I have an elaborate construct capable of downloading, decompressing, parsing and storing data as it arrives over the network.

On the sending end though, I have been less lucky: Because of that problem I had, I was unable to stream out bzip2 compressed data as it was generated – the end of the file was sometimes missing. This is why I’m using ob_start() to gather all the output and then compress it with bzcompress() to send it out.

Of course this means that all the data must be assembled before it can be compressed and then sent to the client.

As we have more and more data to transmit, the client must wait longer and longer before the data begins to reach it.

And then comes the moment when the client times out.

So I finally really had to fix the problem. I could not believe that I was unable to compress and stream out data on the fly.

In the end, I found the smallest possible amount of code that illustrates the problem in a non-hacky way.

So: This fails under PHP up until 5.2.3:

<?php
// Minimal reproduction: write 10000 formatted blocks through the
// bzip2.compress stream filter into the file named on the command line.
$str = "BEGIN (%d)\n
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.
\nEND (%d)\n";

$h = fopen($_SERVER['argv'][1], 'w');
// Everything written to $h from here on is compressed on the fly.
$f = stream_filter_append($h, "bzip2.compress", STREAM_FILTER_WRITE);
for ($x = 0; $x < 10000; $x++) {
    fprintf($h, $str, $x, $x);
}
// Closing the stream should flush the filter and finish the bzip2 stream.
fclose($h);
echo "Written\n";
?>

Even worse though: It doesn’t fail with a message, but it writes out a corrupt bzip2 file.

And it gets worse: With a small amount of data it works, but as the amount of data increases, it begins to fail, and it does so at different places depending on how you shuffle the data around.

The script above will write a bzip2 file which, when uncompressed, will end around iteration 9600.
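To see where the decompressed output actually stops, something along these lines can help. It is only a sketch: the function name last_complete_iteration() is made up for this example, it assumes the bz2 extension is loaded, and it relies on the "END (n)" lines the test script writes.

```php
<?php
// Hypothetical helper: return the number in the last complete "END (n)"
// line that can be read back from a bzip2 file, or -1 if there is none.
function last_complete_iteration(string $path): int
{
    $bz = bzopen($path, 'r');
    if ($bz === false) {
        throw new RuntimeException("cannot open $path");
    }

    // Read until bzread() stops returning data. A truncated or corrupt
    // file simply ends the loop early.
    $data = '';
    while (true) {
        $chunk = bzread($bz, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $data .= $chunk;
    }
    bzclose($bz);

    // Collect every "END (n)" line and keep the last one.
    preg_match_all('/^END \((\d+)\)$/m', $data, $m);
    return $m[1] ? (int) end($m[1]) : -1;
}
```

For a file written correctly by the test script, this should return 9999. A lower number means the tail of the stream went missing.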

So now that I had a small reproducible testcase, I could report a bug in PHP: Bug 47117.

After spending so many hours on a problem which in the end boiled down to a bug in PHP (I’ve looked everywhere, believe me. I also tried workarounds, but all to no avail), I just could not let the story end there.

Some investigation quickly turned up a wrong check for a return value in bz2_filter.c which I was able to patch up very, very quickly, so if you visit that bug above, you will find a patch correcting the problem.

Then, when I finished patching PHP itself, hacking up the PHP code needed to stream out the compressed data as it was generated was easy. If you want, you can have a look at bzcomp.phps, which demonstrates how to plug the compression either into the output buffer handling or into a quick-and-dirty, easier alternative.

Oh, and if you are tempted to do this:

function ob($buf){
        return bzcompress($buf);
}

ob_start('ob');

… it won’t do any good because you will still gobble up all the data before compressing. And this:

function ob($buf){
        return bzcompress($buf);
}

ob_start('ob', 32768);

will encode in chunks (good), but it will write a bzip2-end-of-stream marker after every chunk (bad), so neither will work.
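What does work is keeping a single bzip2.compress filter alive for the whole response. The following is a minimal sketch of that idea and not the contents of bzcomp.phps: the helper name stream_bz2() and its parameters are made up for this example, and it assumes PHP 7.1 or later, the bz2 extension, and a PHP build in which the filter bug described above is fixed.

```php
<?php
// Hypothetical helper: compress $chunks on the fly and write them to
// $target (by default straight to the client via php://output).
function stream_bz2(iterable $chunks, string $target = 'php://output'): void
{
    $out = fopen($target, 'w');
    if ($out === false) {
        throw new RuntimeException("cannot open $target");
    }

    // One filter instance for the whole response means one bzip2 stream
    // with a single end-of-stream marker, written when the stream closes.
    stream_filter_append($out, 'bzip2.compress', STREAM_FILTER_WRITE);

    foreach ($chunks as $chunk) {
        // Compressed blocks leave as soon as the filter has produced them.
        fwrite($out, $chunk);
    }

    fclose($out);
}
```

Called as stream_bz2($generator), where $generator yields pieces of XML as they are produced, nothing has to be collected with ob_start() first. Note that bzip2 works on fairly large blocks, so the first compressed bytes only appear once enough input has accumulated.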

Nothing more satisfying than to fix a bug in someone else’s code. Now let’s hope this gets applied to PHP itself so I don’t have to manually patch my installations.

Trying out Gmail

Everyone and their friends seem to be using Gmail lately and I agree: The application has a clean interface, a very powerful search feature and is easily accessible from anywhere.

I have my Gmail address from back in the days when invites were scarce and the term AJAX wasn’t even a term yet, but I never got around to really taking advantage of the service as I just don’t see myself checking various email accounts at various places – at least not for serious business.

But now I found a way to put gmail to the test as my main email application – at least for a week or two.

My main mail storage is and will be our Exchange server. I have multiple reasons for that:

  1. I have all the email I ever sent or received in that IMAP account. That’s WAY more than the 2.8 GB you get in Gmail and even if I had enough space there, I would not want to upload all my messages there.
  2. I don’t trust gmail to be as diligent with the messages I store there as I would want it to. I managed to keep every single email message from 1998 till now and I’d hate to lose all that to a “glitch in the system”.
  3. I need IMAP access to my messages for various purposes.
  4. I need strong server-side filtering to remove messages I’m more or less only receiving for logging purposes. I don’t want to see these – not until I need them. No reason to even have them around usually.

So for now I have added yet another filter to my collection of server-side filters: This time I’m redirecting a copy of all mail that didn’t get filtered away due to various reasons to my Gmail address. This way I get to keep all mail of my various aliases all at the central location where they always were and I can still use Gmail to access the newly arrived messages.

Which leaves the problem with the sent messages which I ALSO want to archive at my own location – at least the important ones.

I fixed this by BCCing all mail I’m writing in Gmail to a new alias I created. Mail to that alias with my Gmail address as sender will be filtered into my sent-box by Exchange so it’ll look as though I sent the message via Thunderbird and then uploaded the copy via IMAP.

I’m happy with this solution, so testing Gmail can begin.

I’m asking myself: Is a tag-based storage system better than a purely search-based one (the mail I don’t filter away is kept in one big INBOX which I access purely via search queries if I need something)? Is a web-based application as powerful as a mail client like Thunderbird or Apple Mail? Do I unconsciously use features I’m going to miss when using Gmail instead of Apple Mail or Thunderbird? Will I be able to get used to Gmail’s very quick keyboard interface?

Interesting questions I intend to answer.

Mail filtering belongs on the server

Several people who got their iPhone are complaining about SPAM reaching their inbox and want Junk Mail controls on their new gadget, failing to realize the big problem with that approach:

Even if the iPhone is updated with a SPAM filter, the messages will get transmitted and filtered there, which means that you pay for receiving the junk just to throw it away afterwards.

Additionally, Bayesian filters still seem to be the way to go with junk mail filtering. The Bayes rules can get pretty large, so this means that you either have to retrain your phone or that the seed data must be synchronized with the phone, which will take both a lot of time and space better used for something else.

No. SPAM filtering is a task for the mail server.

I’m using SpamAssassin and DSPAM to check the incoming mail for junk and then I’m using the server side filtering capabilities of our Exchange server to filter mail recognized as SPAM into the “Junk E-Mail” box.

If the filters are simple enough (checking header values and moving messages into folders), the server can process them even though they are defined in Outlook, regardless of which client connects to fetch the mail (Apple Mail, Thunderbird and the IMAP client on my W880i in my case). This means that all my junk is sorted away into the “Junk E-Mail” folder as soon as it arrives. It never reaches the INBOX and I never see it.

I don’t have an iPhone and I don’t want to have one (I depend on bluetooth modem functionality and a real keypad), but the same thing applies to any mobile emailing solution. You don’t want SPAM on your Blackberry and especially not on your even simpler non-smartphone.

Speaking of transferring data: The other thing I really don’t like about the iPhone is the browser. Sure: It’s standards-compliant, it renders nicely, it supports AJAX and supports small-screen rendering, but it transmits the websites uncompressed.

Let me give an example: The digg.com frontpage in Opera Mini causes 10 KB of data to be transferred. It looks perfectly fine on my SonyEricsson W880 and works as such (minus some javascript functionality). Digg.com when accessed via Firefox causes 319 KB to be transmitted.

One MB costs CHF 7 here (though you can have some inclusive MBs depending on contract), which is around EUR 4.50, so for that money I could load digg.com three times with the iPhone or 100 times with Opera Mini. The end-user experience is largely the same on both platforms – at least close enough not to warrant the 33 times more expensive access via a browser that works without a special proxy.
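The arithmetic behind those numbers can be spelled out in a few lines. This is just a sketch using the figures quoted above (CHF 7 per MB, 319 KB and 10 KB per page load); the function names are made up, and counting 1 MB as 1024 KB is an assumption about how the carrier bills.

```php
<?php
// Hypothetical helper: how many full page loads fit into one MB of traffic.
function loads_per_mb(int $pageKb, int $kbPerMb = 1024): int
{
    return intdiv($kbPerMb, $pageKb);
}

// Hypothetical helper: cost of a single page load in CHF.
function cost_per_load(int $pageKb, float $pricePerMb, int $kbPerMb = 1024): float
{
    return $pricePerMb * $pageKb / $kbPerMb;
}

$pages = ['Firefox' => 319, 'Opera Mini' => 10]; // KB per digg.com load
foreach ($pages as $browser => $kb) {
    printf(
        "%-10s %3d loads per MB, CHF %.2f per load\n",
        $browser,
        loads_per_mb($kb),
        cost_per_load($kb, 7.0)
    );
}
```

With these inputs the full-size page fits about three times into one MB and the Opera Mini version about a hundred times. The raw size ratio is 319 / 10, so roughly 32, and the factor of 33 above comes from dividing the two rounded load counts.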

As long as GPRS data traffic is prohibitively expensive, junk mail filtering on the server and a prerendering-proxy based browser are a must. Even more so than the other stuff missing in the iPhone.