PHP, stream filters, bzip2.compress

Maybe you remember that, more than a year ago, I had an interesting problem with stream filters.

The general idea is that I want to output bz2-compressed data to the client as the output is being assembled – or, more to the point: The PopScan Windows-Client supports the transmission of bzip2 encoded data which gets really interesting as the amount of data to be transferred increases.

Even more so: The transmitted data is in XML format which is very easily compressed – especially with bzip2.

Once you begin to transmit multiple megabytes of uncompressed XML-data, you begin to see the sense in jumping through a hoop or two to decrease the time needed to transmit the data.

On the receiving end, I have an elaborate construct capable of downloading, decompressing, parsing and storing data as it arrives over the network.

On the sending end though, I have been less lucky: Because of that problem I had, I was unable to stream out bzip2 compressed data as it was generated – the end of the file was sometimes missing. This is why I’m using ob_start() to gather all the output and then compress it with bzcompress() to send it out.
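
In code, that workaround boils down to something like this (a simplified sketch – generate_the_xml() is just a placeholder for the real output code, and the content type is only an example):

<?php
// Simplified sketch of the buffer-everything workaround: collect all output,
// then compress the finished document in one go before sending it.
ob_start();
generate_the_xml();           // placeholder for the code that echoes the XML
$xml = ob_get_clean();

header('Content-Type: application/x-bzip2');
echo bzcompress($xml, 9);     // compress the complete document at once
?>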

Of course this means that all the data must be assembled before it can be compressed and then sent to the client.

As we have more and more data to transmit, the client must wait longer and longer before the data begins to reach it.

And then comes the moment when the client times out.

So I finally really had to fix the problem. I could not believe that I was unable to compress and stream out data on the fly.

It turns out that I finally found the smallest possible amount of code to illustrate the problem in a non-hacky way:

So: This fails under PHP up until 5.2.3:

<?php
$str = "BEGIN (%d)\n
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.
\nEND (%d)\n";

// write 10000 formatted copies of $str through the compression filter
$h = fopen($_SERVER['argv'][1], 'w');
$f = stream_filter_append($h, "bzip2.compress", STREAM_FILTER_WRITE);
for ($x = 0; $x < 10000; $x++) {
    fprintf($h, $str, $x, $x);
}
fclose($h);
echo "Written\n";
?>

Even worse though: it doesn’t fail with a message, but it writes out a corrupt bzip2 file.

And it gets worse: with a small amount of data it works, but as the amount of data increases, it begins to fail – at different places depending on how you shuffle the data around.

The above script writes a bzip2 file which – when uncompressed – ends at around iteration 9600.
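
If you want to verify that yourself, reading the result back through the decompression filter does the trick – roughly like this (a quick sketch, not the exact script I used):

<?php
// Read the file back through the bzip2 decompression filter and remember
// the last iteration marker that made it through intact.
$h = fopen($_SERVER['argv'][1], 'r');
stream_filter_append($h, 'bzip2.decompress', STREAM_FILTER_READ);
$last = -1;
while (($line = fgets($h)) !== false) {
    if (preg_match('/^BEGIN \((\d+)\)/', $line, $m)) {
        $last = (int)$m[1];
    }
}
fclose($h);
echo "Last iteration seen: $last\n";   // ends at around 9600 instead of 9999
?>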

So now that I had a small reproducible testcase, I could report a bug in PHP: Bug 47117.

After spending so many hours on a problem which in the end boiled down to a bug in PHP (I’ve looked everywhere, believe me. I also tried workarounds, but all to no avail), I just could not let the story end there.

Some investigation quickly turned up an incorrect check of a return value in bz2_filter.c, which I was able to patch up very, very quickly, so if you visit the bug above, you will find a patch correcting the problem.

Then, once I had finished patching PHP itself, hacking up the PHP code needed to stream out the compressed data as it is generated was easy. If you want, you can have a look at bzcomp.phps, which demonstrates how to plug the compression into either the output buffer handling or into something quicker, dirtier and easier.
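
The core of the streaming approach is really just attaching the (now fixed) filter to php://output and writing to that handle as the data is generated. Condensed, it looks something like this (a sketch – get_xml_chunks() and the content type are placeholders, and the real bzcomp.phps does a bit more):

<?php
// Condensed sketch of streaming bzip2 output: attach the compression filter
// to php://output and write to the handle as the XML is being generated.
header('Content-Type: application/x-bzip2');
$out = fopen('php://output', 'w');
stream_filter_append($out, 'bzip2.compress', STREAM_FILTER_WRITE);

foreach (get_xml_chunks() as $chunk) {   // placeholder for the real generator
    fwrite($out, $chunk);
}
fclose($out);   // closing the handle finalizes the bzip2 stream
?>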

Oh, and if you are tempted to do this:

function ob($buf){
        return bzcompress($buf);
}

ob_start('ob');

… it won’t do any good because you will still gobble up all the data before compressing. And this:

function ob($buf){
        return bzcompress($buf);
}

ob_start('ob', 32768);

will encode in chunks (good), but it will write a bzip2-end-of-stream marker after every chunk (bad), so neither will work.
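
It’s easy to see why: each call to bzcompress() produces a complete, self-contained bzip2 stream, magic header and all – here’s a quick check:

<?php
// Each bzcompress() call yields an independent stream starting with the
// "BZh" magic bytes - chunked output buffering just concatenates them.
$a = bzcompress(str_repeat('foo', 10000));
$b = bzcompress(str_repeat('bar', 10000));
var_dump(substr($a, 0, 3));   // string(3) "BZh"
var_dump(substr($b, 0, 3));   // string(3) "BZh" - a second stream header
?>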

There’s nothing more satisfying than fixing a bug in someone else’s code. Now let’s hope the patch gets applied to PHP itself so I don’t have to patch my installations manually.

Trying out Gmail

Everyone and their friends seem to be using Gmail lately and I agree: the application has a clean interface, a very powerful search feature and is easily accessible from anywhere.

I have my Gmail address from back in the days when invites were scarce and AJAX wasn’t even a term yet, but I never got around to really taking advantage of the service as I just don’t see myself checking various email accounts in various places – at least not for serious business.

But now I have found a way to put Gmail to the test as my main email application – at least for a week or two.

My main mail storage is and will remain our Exchange server. I have multiple reasons for that:

  1. I have all the email I ever sent or received in that IMAP account. That’s WAY more than the 2.8 GB you get with Gmail, and even if I had enough space there, I would not want to upload all my messages.
  2. I don’t trust Gmail to be as diligent with the messages I store there as I would want it to be. I managed to keep every single email message from 1998 till now and I’d hate to lose all that to a “glitch in the system”.
  3. I need IMAP access to my messages for various purposes.
  4. I need strong server-side filtering to remove messages I’m more or less only receiving for logging purposes. I don’t want to see these – not until I need them. There’s no reason to even have them around usually.

So for now I have added yet another filter to my collection of server-side filters: This time I’m redirecting a copy of all mail that didn’t get filtered away due to various reasons to my Gmail address. This way I get to keep all mail of my various aliases all at the central location where they always were and I can still use Gmail to access the newly arrived messages.

Which leaves the problem of the sent messages, which I ALSO want to archive at my own location – at least the important ones.

I fixed this by BCCing all mail I write in Gmail to a new alias I created. Mail to that alias with my Gmail address as the sender gets filtered into my sent-box by Exchange, so it looks as though I had sent the message via Thunderbird and then uploaded the copy via IMAP.

I’m happy with this solution, so testing Gmail can begin.

I’m asking myself: is a tag-based storage system better than a purely search-based one (the mail I don’t filter away is kept in one big INBOX which I access purely via search queries if I need something)? Is a web-based application as powerful as a mail client like Thunderbird or Apple Mail? Do I unconsciously use features I’m going to miss when using Gmail instead of Apple Mail or Thunderbird? Will I be able to get used to Gmail’s very quick keyboard interface?

Interesting questions I intend to answer.

Mail filtering belongs on the server

Various people who got their iPhones are complaining about SPAM reaching their inboxes and want junk mail controls on their new gadget, failing to realize the big problem with that approach:

Even if the iPhone is updated with a SPAM filter, the messages will get transmitted and filtered there, which means that you pay for receiving the junk just to throw it away afterwards.

Additionally, Bayesian filters still seem to be the way to go for junk mail filtering. The Bayesian data can get pretty large, which means that you either have to retrain the filter on your phone or that the seed data must be synchronized to the phone – both of which take a lot of time and use up space better spent on something else.

No. SPAM filtering is a task for the mail server.

I’m using SpamAssassin and DSPAM to check the incoming mail for junk and then I’m using the server side filtering capabilities of our Exchange server to filter mail recognized as SPAM into the “Junk E-Mail” box.

If the filters are simple enough (checking header values and moving messages into folders), the server can process them – even though they are defined in Outlook – regardless of which client connects to fetch the mail (Apple Mail, Thunderbird and the IMAP client on my W880i in my case). This means that all my junk is sorted into the “Junk E-Mail” folder the moment it arrives. It never reaches the INBOX and I never see it.

I don’t have an iPhone and I don’t want to have one (I depend on bluetooth modem functionality and a real keypad), but the same thing applies to any mobile emailing solution. You don’t want SPAM on your Blackberry and especially not on your even simpler non-smartphone.

Speaking of transferring data: the other thing I really don’t like about the iPhone is the browser. Sure: it’s standards-compliant, it renders nicely, it supports AJAX and small-screen rendering, but it transmits websites uncompressed.

Let me give an example: the digg.com front page in Opera Mini causes 10 KB of data to be transferred. It looks perfectly fine on my SonyEricsson W880 and works as such (minus some JavaScript functionality). Digg.com accessed via Firefox causes 319 KB to be transmitted.

One MB costs CHF 7 here (though you can have some inclusive MBs depending on your contract), which is around EUR 4.50, so for that money I could view digg.com three times with the iPhone or 100 times with Opera Mini. The end-user experience is largely the same on both platforms – at least close enough not to warrant the roughly 33 times more expensive access via a browser that works without a special proxy.
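
For the curious, a trivial sketch of the arithmetic behind that comparison, using the page sizes measured above:

<?php
// Back-of-the-envelope arithmetic behind the comparison above.
$kb_opera   = 10;    // digg.com front page via Opera Mini
$kb_firefox = 319;   // the same page via a regular browser
printf("Opera Mini:   ~%d views per MB\n", round(1024 / $kb_opera));          // ~102
printf("Full browser: ~%d views per MB\n", round(1024 / $kb_firefox));        // ~3
printf("Factor:       ~%d times the data\n", round($kb_firefox / $kb_opera)); // ~32
?>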

As long as GPRS data traffic is prohibitively expensive, junk mail filtering on the server and a prerendering-proxy based browser are a must. Even more so than the other stuff missing in the iPhone.

Upscaling video

I have an awesome Full-HD projector and I have a lot of non-HD video material, ranging from DVD-rips to speedruns of older consoles and I’m using a Mac Mini running Windows (first Vista RC2, then XP and now Vista again) connected to said projector to access the material.

The question was: how do I get the best picture quality out of this setup?

The answer boils down to the question of what device should do the scaling of the picture:

Without any configuration work, the video is scaled by the graphics card, which usually does quite a bad job at it unless it provides some special upscaling support – which the Intel chip in my Mac Mini doesn’t seem to.

Then you could let the projector do the scaling, which would require the MCE application to change the screen resolution to the resolution of the file being played. It would also mean that the projector has to support the different resolutions the files are stored in, which is hardly the case as there are some very strange resolutions here and there (think the Game Boy’s native 160×144 resolution).

The last option is to let your CPU do the scaling – at least to some degree.

This is a very interesting option, especially as my Mac Mini comes with one of these nice dual core CPUs we can try and leverage for this task. Then, there are a lot of algorithms out there that are made exactly for the purpose of scaling video, some of which are very expensive to implement in specialized hardware like GPUs or the firmware of a projector.

So I looked around and finally found this post outlining the steps needed to configure ffdshow to do its thing.

I used the basic settings and modified them just a bit to keep the original aspect ratio of the source material and to only do the resizing up to a resolution of 1280×720. If the source is larger than this, there’s no need to shrink the video just to have the graphics chip upscale it again to the projector’s native 1920×1080 resolution (*sigh*).

Also, I didn’t want ffdshow to upscale 1280×720 to the full 1920×1080. I tried that at first, but failed to see a difference in picture quality and got the odd frame drop here and there, so I’m clearly running at the limits of my current setup.

Finally, I compared the picture quality of a Columbo (non-referral link to Amazon – the package arrived last week) DVD rip with and without the resizing enabled.

The difference in quality is immense. The software-enhanced picture looks nearly like a real 720p movie – sure, some details are washed out, but the overall quality is worlds better than what I got with plain ffdshow and no scaling.

Sure. The CPU usage is quite a bit higher than before, but that’s what the CPUs are for – to be used.

I highly recommend taking the 10 minutes needed to set up the ffdshow video decoder to do the scaling. Sure: the UI is awful and I didn’t completely understand many of the settings, but the increased quality more than made up for the work it took to configure the thing.

Heck! Even the 240×160 pixel sized Pokémon Sapphire run looked much better after going through ffdshow with software scaling enabled.

Highly recommended!

By the way: this only works in MCE for video files, as MCE refuses to use ffdshow for MPEG2 decoding, which is needed for DVD and TV playback. But 100% of the video I watch is video files anyway, so this doesn’t bother me at all.

*sigh*

 % php -a
Interactive shell

php > if (0 == null) echo "*sigh*\n";
*sigh*
php > quit

That bit me today. Even after so many years. I should really get used to using ===.
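
For the record, a few of the loose comparisons that keep tripping me up – the strict variant avoids all of them:

<?php
// PHP's loose comparison (==) converts types before comparing; the strict
// comparison (===) checks type and value.
var_dump(0 == null);     // bool(true)  - the surprise above
var_dump(0 === null);    // bool(false) - what I actually meant
var_dump('' == null);    // bool(true)
var_dump('0' == false);  // bool(true)
?>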

Newfound respect for JavaScript

Around the year 1999 I began writing my own JavaScript code as opposed to copying and pasting it from other sources and only marginally modifying it.

In 2004 I practically discovered AJAX (XmlHttpRequest in particular) just before the hype started and I have been doing more and more JavaScript since then.

I always regarded JavaScript as something you have to do, but which you dislike. My code was dirty, mainly because I was of the wrong opinion that JavaScript was a procedural language with just one namespace (the global one). Also, I wasn’t using JavaScript for a lot of the functionality of my sites, partly because of old browsers and partly because I had not yet seen what was possible in that language.

But for the last year or so, I have been writing very large quantities of JS for very AJAXy applications, which made me really angry about the seemingly limited means of structuring the code.

And then I found a link on reddit to a lecture by a Yahoo employee, Douglas Crockford, which really managed to open my eyes.

JavaScript isn’t procedural with some object-oriented stuff bolted on. JavaScript is a functional language with object-oriented and procedural concepts integrated where it makes sense, allowing us developers both to write code quickly and to understand existing code with only very little knowledge of how functional languages work.

The immensely powerful concepts of functions as first-class objects, of closures and of modifying object prototypes at will turn JS into a really interesting language which can be used to write “real” programs with a clean structure.

The day I saw those videos, I understood that I had completely wrong ideas about JavaScript, mainly because of my crappy learning experience so far, which initially consisted of copying and pasting crappy code from the web and later of reading library references, while always ignoring real introductions to the language («because I know that already»).

If you are interested in learning about a completely new, powerful side of JavaScript, I highly recommend you watch these videos.

A followup to MSI

My last post about MSI generated some nice responses, amongst them the lengthy blog post on Legalize Adulthood.

Judging from the two trackbacks on the MSI posting and especially after reading the linked post above, I have come to the conclusion that my posting was very easy to misunderstand.

I agree that the workarounds I listed are problems with the authoring. I DO think, however, that all these workarounds were put in place because the platform provided by Microsoft is lacking in some way.

My rant was not about the side effects of these workarounds. It was about their sole existence. Why are some of us forced to apply workarounds to an existing platform to achieve their goals? Why doesn’t the platform itself provide the essential features that would make the workarounds unneeded?

For my *real* problems with MSI from an end user’s perspective, feel free to read this rant or this one (but bear in mind that both are a bit old by now).

Let’s go once again through my points and try to understand what each workaround tries to accomplish:

  1. EXE-Stub to install MSI: MSI, despite being the platform of choice still isn’t as widely deployed as the installer authors want it to be. If Microsoft wants us to use MSI, it’s IMHO their responsibility to ensure that the platform is actually available.

    I do agree though that Microsoft is working on this, for example by requiring MSI 3.1 (the first release with acceptable patching functionality) for Windows Update. This is what makes the stubs useless over time.

    And personally, I think a machine that isn’t using Windows Update and thus doesn’t have 3.1 on it isn’t a machine I’d want to deploy my software on, because a machine not running Windows Update is probably badly compromised and in an unsupportable state.

  2. EXE-Stub to check prerequisites: Once more I don’t get why the underlying platform cannot provide functionality that is obviously needed by the community. Prerequisites are a fact of life and MSI does nothing to help with that. MSI packages can’t be used to install other MSI packages – only merge modules – but barely any of the libraries required by today’s applications actually come in MSM format (.NET Framework? Anyone?).

    In response to the excellent post on Legalize Adulthood, which gives an example about DirectX, I counter with: Why is there a DirectX Setup API? Why are there separate CAB files? Isn’t MSI supposed to handle that? Why do I have to create a setup stub calling a third-party API to get stuff installed that isn’t installed in the default MSI installation?

    A useful packaging solution would provide a way to specify dependencies or at least allow for the automated installation of dependencies from the original package.

    It’s ironic that an MSI package can – even though it’s dirty – use a CustomAction to install a traditionally packaged .EXE-Installer-Dependency, but can’t install a .MSI packaged dependency.

    So my problem isn’t with bootstrappers as such, but with the limitations in MSI itself requiring us developers to create bootstrappers to do work which IMHO MSI should be able to do.

  3. MSI-packaged .EXEs: I wasn’t saying that MSI is to blame for the authors who repacked their .EXEs into .MSI packages. I’m just saying that this is another type of workaround that could have been chosen for the purpose of getting the installation to work despite (maybe only perceived) limitations in MSI. An ideal packaging solution would be as accessible and flexible as your common .EXE installer and thus make such a workaround unneeded.

  4. Third party scripting: In retrospect I think the motivation for these third-party scripting solutions is mainly vendor lock-in. I’m still convinced though that with a more traditional structure and a bit more flexibility for installer authors, such third-party solutions would become less and less necessary until they finally die out.

  5. Extracting, then merging: Also just another workaround that has been chosen because a distinct problem wasn’t solvable using native MSI technology.

    I certainly don’t blame MSI for a developer screwing up. I’m blaming MSI for not providing the tools necessary for the installer community to use native MSI to solve the majority of problems. I ALSO blame MSI for messiness, for screwing up my system countless times and for screwing up my parent’s system which is plainly unforgivable.

    Because MSI is a complicated black box, I’m unable to fix problems with constantly appearing installation prompts, with unremovable entries in “Add/Remove programs” and with installations failing with such useful error messages as “Unknown Error 0x[whatever]. Installation terminated”.

    I’m blaming MSI for not stopping the developer community from authoring packages with the above problems. I’m blaming MSI for its inherent complexity causing developers to screw up.

    I’m disappointed with MSI because it works in a way that requires at least part of the community to create messy workarounds for quite common problems MSI can’t solve.

    What I posted was a list of workarounds of varying stupidity for problems that shouldn’t exist. Authoring errors that shouldn’t need to happen.

    I’m not picky here: A large majority of packages I had to work with do in fact employ one of these workarounds (the unneeded EXE-stub being the most common one), none of which should be needed.

    And don’t get me started about how other operating systems do their deployment. I think Windows could learn from some of them, but that’s for another day.

Altering the terminal title bar in Mac OS X

After one year of owning a MacBook Pro, I finally got around to fixing my precmd() ZSH hack to really make the current directory and stuff appear in the title bar of Terminal.app and iTerm.app.

This is the code to add to your .zshrc:

case $TERM in
    *xterm*|ansi)
        function settab { print -Pn "\e]1;%n@%m: %~\a" }
        function settitle { print -Pn "\e]2;%n@%m: %~\a" }
        function chpwd { settab;settitle }
        settab;settitle
        ;;
esac

settab sets the tab contents in iTerm and settitle does the same thing for the title bar both in Terminal.app and iTerm.

The sample also shows the variables ZSH replaces in the strings (the -P parameter to print lets ZSH do prompt expansion; see zshmisc(1) for a list of all escapes): %n is the currently logged-in user, %m the hostname up to the first dot and %~ the current directory (or ~ if you are in $HOME). You can certainly add any other escape or environment variable of your choice if you need more options, but this more or less does it for me.

Usually, the guides on the internet make you use precmd to set the title bar, but somehow Terminal wasn’t pleased with that method and constantly kept overwriting the title with the default string.

And this is how it looks in both iTerm (above) and Terminal (below):

Windows Installer – Worked around

I’ve talked about Windows Installer (the tool that parses those .MSI files) before and I’ve never really been convinced that this technology does its job. Just have a look at these previous articles: Why o why is my hard-drive so small?, A look at Windows Installer and The myth of XCOPY deployment.

Yesterday I had a look at the Delphi 2007 installation process and it dawned on me that I was going to have to write yet another blog entry.

It’s my gut feeling that 80% of all bigger software packages on Windows can’t live with MSI’s default feature set and have to work around inherent flaws in the design of that tool. Here’s what I found installers doing (in increasing order of stupidity):

  1. Use a .EXE-stub to install the MSI engine. These days this really doesn’t make sense any more as 99% of all Windows installations already have MSI installed, and the ones that don’t you don’t want to support anyway (Windows Update requires MSI).
  2. Use a .EXE-stub that checks for availability and thereafter installs a bunch of prerequisites – sometimes even other MSI packages. This isn’t caused by MSI files being unable to detect the presence of prerequisites – it’s because MSI files are unable to install other MSI files, and the workaround (using merge modules) doesn’t work because most of the third-party libraries to install don’t come as merge modules.
  3. Create an MSI-file which contains a traditional .EXE setup, unpack that to a temporary location and run it. This is what I call the “I want a Windows logo, but have no clue how to author MSI files” type of installation (and I completely understand the motivation behind it), which just defeats all the purposes MSI files ever had. Still: due to inherent limitations in the MSI engine, this is often the only way to go.
  4. Create MSI-files that extract a vendor specific DLL, a setup script and all files to deploy (or even just an archive) and then use that vendor specific DLL to run the install script. This is what InstallShield does at least some of the time. This is another version of the “I have no clue how to author a MSI file”-installation with the additional “benefit” of being totally vendor-locked.
  5. Create a custom installer that installs all files and registry keys and then launches the Windows Installer with a temporary .MSI-file to register the installation work with the MSI engine. This is what Delphi 2007 does. I feel this is another workaround for Microsoft’s policy that only MSI-driven software can get a Windows logo, but this time it’s vendor-locked and totally unnecessary, and I’m not even sure whether such behavior is consistent with any kind of specification.

Only a small minority of installations really use pure MSI, and these are usually small software packages – and as my previous articles show, the technology is far from fool-proof even then. While I see that Windows should provide a generalized means of driving software installations, MSI can’t be the solution, as evidenced by the majority of packages using workarounds to get around the inherent flaws of the technology.

*sigh*

Software patents

Like most programmers, I too hate software patents. But until now, I’ve never had a fine example of how bad they really are (though I’ve written about intellectual property in general before).

But now I just found another granted patent application linked on reddit.

The patent covers… linked lists.

Granted. It’s linked lists with pointers to objects further down the list than the immediate neighbors, but it’s still a linked list.
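
To illustrate what is being claimed (this is my reading of it, sketched in PHP rather than in the patent’s wording):

<?php
// A plain singly linked list whose nodes carry an additional pointer that
// references an element further down than the immediate neighbour - which
// is essentially what the patent describes.
class Node {
    public $value;
    public $next;   // the classic pointer to the next element
    public $skip;   // extra pointer a few elements further ahead
    public function __construct($value) { $this->value = $value; }
}

$nodes = array(new Node(1), new Node(2), new Node(3), new Node(4));
for ($i = 0; $i < 3; $i++) {
    $nodes[$i]->next = $nodes[$i + 1];
}
$nodes[0]->skip = $nodes[2];   // skip over the immediate neighbour

echo $nodes[0]->skip->value, "\n";   // 3 - hardly a new invention
?>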

I first read about linked lists when I was 13, in my first book about C. That was 13 years ago – way before that patent application was originally filed.

So, seeing a technology that has been in use for at least 13 years being patented as a «new invention», I’m asking myself two questions:

  1. How the hell could this patent application even be accepted seeing that it isn’t inventive at all?
  2. Why do companies file trivial patents for which prior art obviously exists and which are thus invalid to begin with?

And based on that I’m asking the world: Why don’t we stop the madness?

But let’s have a look at the above two points. Answering the first one is easy: the people checking these applications have neither the interest nor the obligation to really verify them. In fact, these «experts» may even be paid per granted patent and thus have every interest in letting as many patents pass as possible. Personally, I also doubt their technical knowledge in the fields they are reviewing patents in.

Even more so: most of these applications are formulated in legal-speak targeted at lawyers, who usually have no clue about IT, whereas the IT people usually don’t understand the text of the applications.

Patent law (like trademark law) basically allows you to submit anything, and it’s the submitter’s responsibility to make sure that prior art doesn’t exist. The patent offices can’t be held liable for wrongly issued patents.

And this leads us to question 2: Why submit an obviously invalid patent?

For one, patent applications make the scientific achievement of a company measurable for non-tech people.

Analysts compare the «inventiveness» of companies by comparing the sheer number of granted patents. A company with more granted patents has a better value in the market and it’s only about market-value these days. This is one big motivation for a company to try and have as many patents granted as possible.

The other issue is that once the patent is granted, you can use that (invalid) patent to sue as many competitors as possible. As you have the legally granted patent on your side, the sued party must prove that the patent is invalid. This means a long and very expensive trial with an uncertain outcome – you can never know if the jury/judge in question knows enough about technology to identify the patent as false or if they will just value the legally issued document higher than the possible doubts raised by the sued party.

This makes fighting an invalid patent a very risky adventure which many companies don’t want to invest money in.

So in many (if not most) cases, your invalid patent is as valuable as a valid one if you intend to use it to sue competitors to make them pay royalty fees or hinder them at ever selling a product competing to yours – even though your legal measure is invalid.

One more question to ask: Why does the Free Software community seem so incredibly concerned about software patents while vendors of commercial software usually keep quiet?

It’s all about the provability of infringing upon trivial patents.

Let’s take the above linked-list patent: it’s virtually impossible to prove that any piece of compiled software infringes on this (invalid) patent. In source form though, it’s trivially easy to prove the same thing.

So where this patent serves only one purpose in the closed source world (increased shareholder value due to a higher number of granted patents), it begins to serve the other purpose as well (a weapon against competitors) in an open source world.

And. Yes. I’m asserting that Free as well as Non-Free software infringes upon countless patents, either willingly or unwillingly (I guess the former is limited to the non-free community). Just look at the sheer number of software patents granted! I’m asserting that it’s plain impossible to write software today that doesn’t infringe upon any patent.

Please, stop that software patent nonsense. The current system criminalizes developers and serves no purpose that trademark and intellectual property laws couldn’t solve.