PHP, stream filters, bzip2.compress

Maybe you remember that, more than a year ago, I had an interesting problem with stream filters.

The general idea is that I want to output bz2-compressed data to the client as the output is being assembled – or, more to the point: The PopScan Windows-Client supports the transmission of bzip2 encoded data which gets really interesting as the amount of data to be transferred increases.

Even more so: The transmitted data is in XML format which is very easily compressed – especially with bzip2.

Once you begin to transmit multiple megabytes of uncompressed XML-data, you begin to see the sense in jumping through a hoop or two to decrease the time needed to transmit the data.

On the receiving end, I have an elaborate construct capable of downloading, decompressing, parsing and storing data as it arrives over the network.

On the sending end though, I have been less lucky: Because of that problem I had, I was unable to stream out bzip2 compressed data as it was generated – the end of the file was sometimes missing. This is why I’m using ob_start() to gather all the output and then compress it with bzcompress() to send it out.

Of course this means that all the data must be assembled before it can be compressed and then sent to the client.
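A minimal sketch of that buffering workaround, to make the cost visible. The helper name buffer_and_compress() and the callback parameter are made up for this example; only ob_start(), ob_get_clean() and bzcompress() are real PHP functions.

```php
<?php
// Hypothetical helper: run $produce, capture everything it echoes,
// and return the whole output as one bzip2 stream.
function buffer_and_compress(callable $produce): string
{
    ob_start();               // start gathering all output in memory
    $produce();               // the code that echoes the XML
    $xml = ob_get_clean();    // nothing has reached the client yet

    $compressed = bzcompress($xml, 9);
    if (!is_string($compressed)) {
        // bzcompress() returns an integer error code on failure
        throw new RuntimeException("bzcompress failed: $compressed");
    }
    return $compressed;
}

// Usage: the first byte is only sent once the last byte has been generated.
echo buffer_and_compress(function () {
    for ($x = 0; $x < 3; $x++) {
        echo "<item id=\"$x\"/>\n";
    }
});
```

The whole document sits in memory twice (raw and compressed), and the client sees nothing until the loop has finished.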

As we have more and more data to transmit, the client must wait longer and longer before the data begins to reach it.

And then comes the moment when the client times out.

So I finally really had to fix the problem. I could not believe that I was unable to compress and stream out data on the fly.

It turns out that I finally found the smallest possible amount of code to illustrate the problem in a non-hacky way:

So: This fails under PHP up until 5.2.3:

<?php
// Test data: one record per iteration, tagged with the iteration number.
$str = "BEGIN (%d)\n
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.
\nEND (%d)\n";

// Write to the file named on the command line, through the bzip2 filter.
$h = fopen($_SERVER['argv'][1], 'w');
$f = stream_filter_append($h, "bzip2.compress", STREAM_FILTER_WRITE);
for ($x = 0; $x < 10000; $x++) {
    fprintf($h, $str, $x, $x);
}
fclose($h); // should flush the filter; on affected versions the tail is lost
echo "Written\n";
?>

Even worse though: It doesn’t fail with a message, but it writes out a corrupt bzip2 file.

And it gets worse: With a small amount of data it works, but as the amount of data increases, it begins to fail – at different places depending on how you shuffle the data around.

The script above will write a bzip2 file which – when uncompressed – will end around iteration 9600.
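One way to check where such a file stops is to decompress it again and look for the last complete record. This is only a sketch: the function name last_complete_iteration() is invented for this example, and it assumes the BEGIN/END record format of the test script above.

```php
<?php
// Hypothetical checker: returns the number in the last "END (n)" line
// that can be read back from a bzip2 file, or null if there is none.
function last_complete_iteration(string $file): ?int
{
    $bz = bzopen($file, 'r');
    if ($bz === false) {
        throw new RuntimeException("cannot open $file");
    }

    $text = '';
    while (!feof($bz)) {
        $chunk = bzread($bz, 8192);
        if ($chunk === false || $chunk === '') {
            break;            // read error or no more data: stop here
        }
        $text .= $chunk;
    }
    bzclose($bz);

    // Collect every END marker that stands on a line of its own.
    if (preg_match_all('/^END \((\d+)\)$/m', $text, $m) === 0) {
        return null;
    }
    return (int) end($m[1]);
}

// Usage from the command line: php check.php out.bz2
if (PHP_SAPI === 'cli' && isset($_SERVER['argv'][1])) {
    var_dump(last_complete_iteration($_SERVER['argv'][1]));
}
```

For an intact file written by the test script, the result should be 9999, the last value of $x.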

So now that I had a small, reproducible test case, I could report a bug in PHP: Bug 47117.

After spending so many hours on a problem which in the end boiled down to a bug in PHP (I’ve looked everywhere, believe me. I also tried workarounds, but all to no avail), I just could not let the story end there.

Some investigation quickly turned up a wrong check for a return value in bz2_filter.c which I was able to patch up very, very quickly, so if you visit that bug above, you will find a patch correcting the problem.

Once PHP itself was patched, hacking up the PHP code needed to stream out the compressed data as it is generated was easy. If you want, you can have a look at bzcomp.phps, which demonstrates how to plug the compression either into the output buffer handling or into something quicker, dirtier and easier.
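For illustration only (this is not bzcomp.phps): a minimal sketch of the filter-based approach, assuming a PHP build in which the filter bug is fixed. The function name stream_bzip2() and its parameters are made up for this example; in a web request the target would be php://output.

```php
<?php
// Hypothetical helper: write $chunks to $target through the
// bzip2.compress stream filter, so data is compressed as it is written.
function stream_bzip2(string $target, iterable $chunks): void
{
    $out = fopen($target, 'w');
    if ($out === false) {
        throw new RuntimeException("cannot open $target");
    }

    // Everything written to $out from here on passes through bzip2.
    stream_filter_append($out, 'bzip2.compress', STREAM_FILTER_WRITE);

    foreach ($chunks as $chunk) {
        fwrite($out, $chunk);   // compressed blocks leave as they fill up
    }

    // Closing the stream flushes the filter and ends the bzip2 stream.
    fclose($out);
}

// Usage inside a web request (assumed setup, not run here):
// stream_bzip2('php://output', $xmlFragments);
```

Unlike the output-buffer variants, this produces one continuous bzip2 stream with a single end-of-stream marker, written when the stream is closed.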

Oh, and if you are tempted to do this:

function ob($buf){
        return bzcompress($buf);
}

ob_start('ob');

… it won’t do any good because you will still gobble up all the data before compressing. And this:

function ob($buf){
        return bzcompress($buf);
}

ob_start('ob', 32768);

will encode in chunks (good), but it will write a bzip2-end-of-stream marker after every chunk (bad), so neither will work.
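To see why the chunked variant breaks a client that expects one stream, here is a small sketch. The helper compress_in_chunks() is invented for this example; it does to an array of strings what the chunked output handler does to the output buffer.

```php
<?php
// Hypothetical helper: compress every chunk on its own, the way an
// output callback with a chunk size would.
function compress_in_chunks(array $chunks): array
{
    $pieces = [];
    foreach ($chunks as $chunk) {
        $pieces[] = bzcompress($chunk);   // each call yields a complete stream
    }
    return $pieces;
}

$pieces = compress_in_chunks(["first chunk\n", "second chunk\n"]);

foreach ($pieces as $i => $piece) {
    // Every piece starts with its own "BZh" stream header and can be
    // decompressed on its own, so the client receives several streams
    // glued together instead of a single one.
    printf("piece %d starts with %s\n", $i, substr($piece, 0, 3));
}
```

A decoder that stops at the first end-of-stream marker will only ever see the first chunk.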

Nothing more satisfying than to fix a bug in someone else’s code. Now let’s hope this gets applied to PHP itself so I don’t have to manually patch my installations.

The pain of email SPAM

Lately, the SPAM problem got a lot worse in my email INBOX. Spammers increasingly seem to check whether their mail gets flagged by SpamAssassin and tweak the messages until they get through.

Due to some tricky aliasing going on on the mail server, I’m unable to properly use the bayes filter of SpamAssassin on our main mail server. You see, I have a practically unlimited number of addresses which are in the end delivered to the same account, and all that aliasing can only be done after the message has passed SpamAssassin.

This means that even though mail may go to one and the same user in the end, it’s seen as mail for many different users by SpamAssassin.

This inability to use Bayes with SpamAssassin means that lately, SPAM has been getting through the filter.

So much SPAM that I began getting really, really annoyed.

I know that mail clients themselves also have bayes based SPAM filters, but I often check my email account with my mobile phone or on different computers, so I’m dependent on a solution that filters out the SPAM before it reaches my INBOX on the server.

The day before yesterday I had enough.

While all mail for all domains I’m managing is handled by a customized MySQL-Exim-Courier setup, mail to the @sensational.ch domain is relayed to another server and then delivered to our Exchange server.

Even better: That final delivery step is done after all the aliasing steps (the catch-all aliases being the difficult part here) have completed. This means that I can in fact have all mail to @sensational.ch pass through a bayes filter and the messages will all be filtered for the correct account.

This made me install DSPAM on the relay that transmits mail from our central server to the Exchange server.

Even after only one day of training, I’m getting impressive results: DSPAM only touches mail that isn’t flagged as spam by SpamAssassin, which means that it’s carefully crafted to look “real”.

After one day of training, DSPAM usually detects junk messages and I’m down to one false negative every 10 junk messages (and no false positives).

Even after running SpamAssassin and thus filtering out the obvious suspects, a whopping 40% of emails I’m receiving are SPAM. So nearly half of the messages not already filtered out by SA are still SPAM.

If I take a look at the big picture, even when counting the various mails sent by various cron daemons as genuine email, I’m getting much more junk email than genuine email per day!

Yesterday, Tuesday, for example, I got – including mails from cron jobs and backup copies of order confirmations for PopScan installations currently in public tests – 62 genuine emails and 252 junk mails, of which 187 were caught by SpamAssassin and the rest were detected by DSPAM (with the exception of two mails that got through).

This is insane. I’m getting four times more spam than genuine messages! What the hell are these people thinking? With that volume of junk filling up our inboxes how ever could one of these “advertisers” think that somebody is both stupid enough to fall for such a message and intelligent enough to pick the one to fall for from all the others?

Anyway. This isn’t supposed to be a rant. It’s supposed to be praise for DSPAM. Thanks guys! You rule!