Maybe you remember that, more than a year ago, I had an interesting problem with stream filters.
The general idea is that I want to output bz2-compressed data to the client as the output is being assembled – or, more to the point: The PopScan Windows-Client supports the transmission of bzip2 encoded data which gets really interesting as the amount of data to be transferred increases.
Even more so: The transmitted data is in XML format which is very easily compressed – especially with bzip2.
Once you begin to transmit multiple megabytes of uncompressed XML-data, you begin to see the sense in jumping through a hoop or two to decrease the time needed to transmit the data.
On the receiving end, I have an elaborate construct capable of downloading, decompressing, parsing and storing data as it arrives over the network.
On the sending end though, I have been less lucky: Because of that problem I had, I was unable to stream out bzip2 compressed data as it was generated – the end of the file was sometimes missing. This is why I’m using ob_start() to gather all the output and then compress it with bzcompress() to send it out.
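For illustration, the buffering workaround looks roughly like the sketch below. The `generate_xml()` function and its element names are hypothetical stand-ins for whatever assembles the real response; only `ob_start()`, `ob_get_clean()` and `bzcompress()` are the actual PHP calls involved.

```php
<?php
// Hypothetical stand-in for the code that assembles the XML response.
function generate_xml()
{
    echo "<items>\n";
    for ($i = 0; $i < 1000; $i++) {
        echo "  <item id=\"$i\">example</item>\n";
    }
    echo "</items>\n";
}

// Gathers the complete output in memory, then compresses it in one pass.
function buffered_bzip2_response()
{
    ob_start();              // start capturing everything that is echoed
    generate_xml();
    $xml = ob_get_clean();   // the whole document now sits in memory
    return bzcompress($xml); // nothing can be sent before this returns
}
```

The caller would echo the return value of `buffered_bzip2_response()`. The first compressed byte cannot leave the server before the last line of XML has been generated.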
Of course this means that all the data must be assembled before it can be compressed and then sent to the client.
As we have more and more data to transmit, the client must wait longer and longer before the data begins to reach it.
And then comes the moment when the client times out.
So I finally really had to fix the problem. I could not believe that I was unable to compress and stream out data on the fly.
In the end, I found the smallest possible amount of code that illustrates the problem in a non-hacky way.
So: This fails under PHP up until 5.2.3:
<?php
// One record of test data; the iteration number is filled in twice.
$str = "BEGIN (%d)\n Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. \nEND (%d)\n";

// The output file name is taken from the first command line argument.
$h = fopen($_SERVER['argv'][1], 'w');
// Everything written to $h from now on passes through the bzip2 compressor.
$f = stream_filter_append($h, "bzip2.compress", STREAM_FILTER_WRITE);
for ($x = 0; $x < 10000; $x++) {
    fprintf($h, $str, $x, $x);
}
fclose($h);
echo "Written\n";
?>
Even worse though: It doesn’t fail with a message, but silently writes out a corrupt bzip2 file.
And it gets worse: With a small amount of data it works, but as the amount of data increases, it begins to fail, at different places depending on how you shuffle the data around.
The script above will write a bzip2 file which, when uncompressed, will end around iteration 9600.
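One way to see where the output stops is to read the file back and look for the last END marker. The following is only a sketch: the function name `last_complete_iteration()` is made up for this example, and it assumes the bz2 extension's `bzopen()`/`bzread()` functions are available.

```php
<?php
// Returns the highest iteration number whose "END (n)" marker can be read
// back from the bzip2 file at $path, or -1 if no marker is found.
function last_complete_iteration($path)
{
    $bz = bzopen($path, 'r');
    if ($bz === false) {
        return -1;
    }
    $last = -1;
    $tail = '';
    while (!feof($bz)) {
        $chunk = bzread($bz, 8192);
        if ($chunk === false || $chunk === '') {
            break; // read error or no more data
        }
        // Prepend the end of the previous chunk so that a marker split
        // across two reads is still matched.
        $data = $tail . $chunk;
        if (preg_match_all('/END \((\d+)\)/', $data, $m)) {
            $last = (int) end($m[1]);
        }
        $tail = substr($data, -32);
    }
    bzclose($bz);
    return $last;
}
```

For an intact file written by the test script, this should return 9999; anything lower means the end of the data is missing.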
So now that I had a small reproducible test case, I could report a bug in PHP: Bug 47117.
After spending so many hours on a problem which in the end boiled down to a bug in PHP (I looked everywhere, believe me. I also tried workarounds, but all to no avail), I just could not let the story end there.
Some investigation quickly turned up a wrong check for a return value in bz2_filter.c which I was able to patch up very, very quickly, so if you visit that bug above, you will find a patch correcting the problem.
Then, once I had finished patching PHP itself, hacking up the PHP code needed to stream out the compressed data as it was generated was easy. If you want, you can have a look at bzcomp.phps which demonstrates how to plug the compression into either the output buffer handling or something else that is quick, dirty and easier.
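For readers who just want the general shape, here is a minimal sketch of the quick and dirty variant. It is not the content of bzcomp.phps: the function name `stream_bzip2()` and its parameters are made up for this example, and it only gives a complete file on a PHP build where the filter bug is fixed.

```php
<?php
// Writes each chunk through a bzip2 compression filter as soon as it is
// produced. $target is anything fopen() accepts, $chunks is any array or
// iterator of strings.
function stream_bzip2($target, $chunks)
{
    $out = fopen($target, 'w');
    if ($out === false) {
        return false;
    }
    $filter = stream_filter_append($out, 'bzip2.compress', STREAM_FILTER_WRITE);
    if ($filter === false) {
        fclose($out);
        return false;
    }
    foreach ($chunks as $chunk) {
        fwrite($out, $chunk); // compressed data is passed on as it becomes available
    }
    // Closing the stream flushes the filter, so the end-of-stream marker
    // is written exactly once, after the last chunk.
    fclose($out);
    return true;
}
```

Called as `stream_bzip2('php://output', $chunks)`, this sends the compressed data to the client while the chunks are still being generated, instead of collecting everything first.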
Oh, and if you are tempted to do this:
function ob($buf){ return bzcompress($buf); } ob_start('ob');
… it won’t do any good because you will still gobble up all the data before compressing. And this:
function ob($buf){ return bzcompress($buf); } ob_start('ob', 32768);
will encode in chunks (good), but it will write a bzip2-end-of-stream marker after every chunk (bad), so neither will work.
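The reason is that every call to bzcompress() produces a complete bzip2 stream of its own, with its own header and end marker. The following sketch makes that visible; `compress_in_chunks()` is a made-up helper that mimics what the chunked output handler does with each buffer.

```php
<?php
// Compresses every chunk separately, the way a chunked output buffer
// handler calling bzcompress() would.
function compress_in_chunks($chunks)
{
    $pieces = array();
    foreach ($chunks as $chunk) {
        $pieces[] = bzcompress($chunk);
    }
    return $pieces;
}

$pieces = compress_in_chunks(array("first chunk\n", "second chunk\n"));
foreach ($pieces as $i => $piece) {
    // Every piece starts with the bzip2 magic "BZh" and can be
    // decompressed without any of the other pieces.
    printf("piece %d: header %s, standalone: %s", $i, substr($piece, 0, 3), bzdecompress($piece));
}
```

A client that expects one continuous bzip2 stream therefore stops at the end marker of the first chunk, unless it is written to handle concatenated streams.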
Nothing more satisfying than to fix a bug in someone else’s code. Now let’s hope this gets applied to PHP itself so I don’t have to manually patch my installations.