Last autumn, I was talking about how I would like to see pdo_pgsql for PHP improved.
Over the last few days I had time to seriously start looking into making sure I get my wish. Even though my C is very rusty and I have next to no experience with the PHP/Zend API, I made quite a bit of progress.
First, JSON support
If you have the json extension enabled in your PHP install (it’s enabled by default), then any column of data type json will be automatically parsed and returned to you as an array.
No need to constantly repeat yourself with json_decode(). This works, of course, with directly selected json columns or with any expression that returns json (like array_to_json or the direct typecast shown in the screenshot).
This is off by default and can be enabled on a per-connection or a per-statement level so as not to break backwards compatibility (I’ll need it off until I get a chance to clean up PopScan, for example).
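A minimal sketch of how this could look from userland, assuming the patched driver (the attribute name here is invented for illustration; the real constant in the branch may differ):

<?php
// PDO::PGSQL_ATTR_DECODE_JSON is a made-up name for illustration
$pdo = new PDO('pgsql:host=localhost;dbname=test');
$pdo->setAttribute(PDO::PGSQL_ATTR_DECODE_JSON, true);
$stmt = $pdo->query("select '{\"a\": [1, 2]}'::json as doc");
$row = $stmt->fetch(PDO::FETCH_ASSOC);
// with the feature on: array('a' => array(1, 2))
// with it off (the default): the raw string '{"a": [1, 2]}'
var_dump($row['doc']);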
Next, array support:
Just like with JSON, this will automatically turn any array expression (of the built-in array types) into an array to use from PHP.
As I’m writing this blog entry, this only works for text[] and it’s always enabled.
Once I have an elegant way to deal with the various types of arrays and convert them into the correct PHP types, I’ll work on making this turnoffable (technical term) too.
I’ll probably combine this and the automatic JSON parsing into just one setting which will include various extended data types both Postgres and PHP know about.
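To illustrate the text[] case (a sketch, assuming the patched driver):

<?php
$pdo = new PDO('pgsql:host=localhost;dbname=test');
$stmt = $pdo->query("select array['a','b','c']::text[] as tags");
$row = $stmt->fetch(PDO::FETCH_ASSOC);
// stock pdo_pgsql hands you the string '{a,b,c}';
// with the patch, $row['tags'] is array('a', 'b', 'c')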
Once I’ve done that, I’ll look into more points on my wishlist (better error reporting with 9.3 and later and a way to quote identifiers comes to mind) and then I’ll probably try to write a proper RFC and propose this for inclusion into PHP itself (though don’t get your hopes up – they are a conservative bunch).
If you want to follow along with my work, have a look at my pdo_pgsql-improvements branch on github (tracks PHP-5.5).
Today, PostgreSQL 9.3 was released.
September is always the month of PostgreSQL: every September, a new major release with awesome new features comes out, and every September I have to fight the urge to run and immediately update the production systems to the new version of my favorite toy.
As every year, I want to thank the awesome guys (and girls, I hope) that make PostgreSQL one of my favorite pieces of software overall and certainly my most favorite database system.
That said, there’s another aspect of PostgreSQL that needs some serious
love: While back in the day PHP was known for its robust database
client libraries, over time other language environments have caught up
and long since surpassed what’s possible in PHP.
To be honest, the PostgreSQL client libraries as they are currently
available in PHP are in serious need of some love.
If you want to connect to a PostgreSQL database, you have two options:
Either you use the thin wrapper over libpq, the pgsql extension,
or you go the PDO route, at which point you’d use pdo_pgsql.
Both are, unfortunately, quite inadequate and fail to expose most
of the awesomeness that is PostgreSQL to the user:
pgsql
On the positive side, being a small wrapper around libpq, the pgsql
extension knows quite a bit about Postgres’ internals: It has excellent
support for COPY, it knows about a result set’s data types (but doesn’t
use that knowledge, as you’ll see below), it has pg_escape_identifier
to correctly quote identifiers, it supports asynchronous queries and it
supports NOTIFY.
But, while pgsql knows a lot about Postgres’ specifics, to this day,
the pg_fetch_* functions convert all columns into strings. Numeric
types? String. Dates? String. Booleans? Yes. String too (‘t’ or ‘f’,
both trueish values to PHP).
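A quick demonstration of that stringly-typed behavior:

<?php
$conn = pg_connect('dbname=test');
$res = pg_query($conn, 'select 42 as num, true as flag');
var_dump(pg_fetch_assoc($res));
// array(2) { ["num"]=> string(2) "42" ["flag"]=> string(1) "t" }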
To this day, while the extension supports prepared statements, their
use is terribly inconvenient, forcing you to name your statements and
to manually free them.
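For comparison, this is what that looks like in practice:

<?php
$conn = pg_connect('dbname=test');
// every statement must be given an explicit name...
pg_prepare($conn, 'fetch_user', 'select * from users where id = $1');
$res = pg_execute($conn, 'fetch_user', array(42));
// ...and freeing it means issuing DEALLOCATE yourself
pg_query($conn, 'deallocate "fetch_user"');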
To this day, the pg_fetch_* functions load the whole result set into
an internal buffer, making it impossible to stream results out to the
client using an iterator pattern. Well. Of course it’s still possible,
but you waste the memory for that internal buffer, forcing you to
manually play with DECLARE CURSOR and friends.
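The manual workaround looks roughly like this:

<?php
$conn = pg_connect('dbname=test');
pg_query($conn, 'BEGIN');
pg_query($conn, 'DECLARE big_rows CURSOR FOR SELECT * FROM big_table');
while (($res = pg_query($conn, 'FETCH 1000 FROM big_rows')) && pg_num_rows($res) > 0) {
    while ($row = pg_fetch_assoc($res)) {
        // stream each row out without buffering the whole result set
    }
}
pg_query($conn, 'CLOSE big_rows');
pg_query($conn, 'COMMIT');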
There is zero support for advanced data types in Postgres and the
library doesn’t help at all with today’s best practices for accessing a
database (prepared statements).
There are other things that make the extension impractical for me, but
they are not the extension’s fault, so I won’t spend any time explaining
them here (like the lack of support by New Relic – but, as I said,
that’s not the extension’s fault).
pdo_pgsql
pdo_pgsql gets a lot of stuff right that the pgsql extension doesn’t:
It doesn’t read the whole result set into memory, it knows a bit about
data types, preserving numbers and booleans and, being a PDO driver, it
follows the generic PDO paradigms, giving a unified API with other PDO
modules.
It also has good support for prepared statements (not perfect, but
that’s PDO’s fault).
But it also has some warts:
There’s no way to safely quote an identifier. Yes. That’s a PDO
shortcoming, but still. It should be there.
While it knows about numbers and booleans, it doesn’t know about any of the other more advanced data types.
Getting metadata about a query result actually makes it query the
database – once per column, even though the information is right there
in libpq, available to use (look at the source
of PDOStatement::getColumnMeta). This makes it impossible to fix the above issue in userland.
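For illustration (each call below costs an extra round-trip to the server with the current driver):

<?php
$pdo = new PDO('pgsql:dbname=test');
$stmt = $pdo->query('select id, payload from items');
// pdo_pgsql queries the catalog here, once per column, even though
// libpq already knows the type OID of every result column
var_dump($stmt->getColumnMeta(0));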
It has zero support for COPY.
If only
Imagine the joy of having a pdo_pgsql that actually cares about
Postgres. Imagine how selecting a JSON column would give you its data
already decoded, ready to use in userland (or at least the option to get it that way).
Imagine how selecting dates would at least give you the option of
getting them as a DateTime (there’s a loss of precision though –
Postgres’ TIMESTAMP has more precision than DateTime).
Imagine how selecting an array type in postgres would actually give you
back an array in PHP. The string that you have to deal with now is
notoriously hard to parse. Yes, there now is array_to_json in
Postgres, but that shouldn’t be needed.
Imagine how selecting a HSTORE would give you an associative array.
Imagine using COPY with pdo_pgsql for very quickly moving bulk data.
Imagine the new features of PGResult being exposed to userland.
Giving applications the ability to detect what constraint was just
violated (very handy to detect whether it’s safe to retry).
Wouldn’t that be fun? Wouldn’t that save us from having to type so much
boilerplate all day?
Honestly, what I think should happen is somebody should create a pdo_pgsql2 that breaks backwards compatibility and adds all these
features.
Have getColumnMeta just return the OID instead of querying the
database. Have a quoteIdentifier method (yes. That should be in PDO
itself, but let’s fix it where we can).
Have fetch() return Arrays or Objects for JSON columns. Have it
return Arrays for arrays and HSTOREs. Have it optionally return DateTimes instead of strings.
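Put together, a fetch in that imaginary driver might look like this (pure wishful thinking – no such API exists):

<?php
$pdo = new PDO('pgsql:dbname=test');
$stmt = $pdo->query("select settings::json, tags::text[], created_at from users");
$row = $stmt->fetch(PDO::FETCH_ASSOC);
// $row['settings'] would be an array, $row['tags'] an array,
// $row['created_at'] optionally a DateTime (see the precision caveat above)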
Wouldn’t that be great?
Unfortunately, while I can write some C, I’m not nearly good enough
to produce something that I could live with other people using, so any
progress I can achieve will be slow.
I’m also unsure of whether this would ever have a chance to land in PHP
itself. Internals are very averse to adding new features to stuff that
already “works” and no matter how good the proposal, you need a very
thick skin if you want to ever get something merged, no matter whether
you can actually offer patches or not.
Would people be using an external pdo_pgsql2? Would it have a chance as
a pecl extension? Do other people see a need for this? Is somebody
willing to help me? I really think something needs to be done and I’m
willing to get my hands dirty – I just have my doubts about the quality
of the result I’m capable of producing. But I can certainly try.
I’ve just pushed version 0.4-beta1 of sacy
to its github repository. Aside from requiring PHP 5.3 now, it also has support for transforming the contents of inline tags.
If you ever wanted to write your inline scripts in CoffeeScript and have the transformation done on the server-side, then I have good news
for you: Now you can! Just wrap the script with {asset_compile}...{/asset_compile}.
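In template code, that looks roughly like this (a sketch – the exact type string sacy expects may differ):

{asset_compile}
<script type="text/coffeescript">
    alert "compiled on the server, not in the browser"
</script>
{/asset_compile}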
I’m not saying that having inline scripts (or even stylesheets) is a good idea,
but sometimes we have to pass data between our HTML templates and the JS
code, and now we can do it in CoffeeScript.
Development note
When you take a look at the commits leading to the release, you will notice
that I more or less hacked the support for inline tags into the existing
codebase (changing the terminology from files to work units in the process
though).
Believe me, I didn’t like this.
When I sat down to implement this, what I had in mind was a very nice
architecture where various components just register themselves and then
everything falls into place more or less automatically.
Unfortunately, whatever I did (I used git checkout . about three times to
start over), I never got a satisfactory solution:
sometimes, I was producing a ton of objects, dynamically looking up what
methods to call and what classes to instantiate.
This would of course be very clean and cool, but also terribly slow. Sacy
is an embeddable component, not an application in its own right.
sometimes, I had a simplified object model that kind of worked, right until I
thought of some edge-case, at which point we would either have ended up back in
hack-land or the edge-cases would have had to remain unfixed
sometimes I had something flexible enough to do what I need, but it still
had code in it that had to know whether it was dealing with instances of Class
A or Class B, which is as unacceptable as the current array-mess.
In the end, it hit me: Sacy is already incomplete in that it simplifies the
problem domain quite a lot. To cleanly get out of this, I would have to
actually parse and manipulate the DOM instead of dealing with regexes and I
would probably even have to go as far as to write a FactoryFactory in order
to correctly abstract away the issues.
Think of it: We have a really interesting problem domain here:
the same type of asset can use different tags (style and link for
stylesheets)
Different attributes are used to refer to external resources (href for
stylesheets, src for scripts)
File-backed assets can (and should) be combined
Content-backed assets should be transformed and immediately inlined
Depending on the backing (content or file), the assets use a different
method to determine cache-freshness (modification-time/size vs. content)
And last but not least, file based asset caching is done on the client side,
content based asset caching is done on the server-side.
Building a nice architecture that would work without the ifs I’ve learned to
hate lately would mean huge levels of indirection and abstraction.
No matter what I tried, I always ended up with a severe case of object-itis and
architecture-itis, both of which I deemed completely unacceptable for a
supposedly small and embeddable library.
Which is why I decided to throw away all my attempts and make one big
compromise and rely on CacheRenderer::renderWorkUnits to be called with
unified workunits (either all file or all content-based).
That made the backend code a lot easier.
And I could keep the lean array structure for describing a unit of work to do
for the backend.
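To give an idea, such a work unit might look roughly like this (field names invented for illustration – the real structure lives in the source):

<?php
$work_unit = array(
    'tag'     => 'link',           // or 'script', 'style', ...
    'type'    => 'text/css',       // the mime-type selects the transformation
    'file'    => '/css/site.css',  // set for file-backed units...
    'content' => null,             // ...or content for content-backed ones
);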
I would still, at some point, love to have a nice way for handlers to register
themselves, but that’s something I’ll handle another day. For now, I’m happy
that I could accomplish my goal in a very lean fashion at the cost of a public
interface of the backend that is really, really inconvenient to use, which leaves way too much code in the frontend.
At least I got away without an AssetFactoryFactory though :-)
I’ve just updated the sacy repository again and tagged a v0.3-beta1 release.
The main feature since yesterday is support for the official compilers and
tools if you can provide them on the target machine.
The drawback is that these things come with hefty dependencies at times (I
don’t think you’d find a shared hoster willing to install node.js or Ruby for
you), but if you can provide the tools, you can get some really nice
advantages over the PHP ports of the various compilers:
the PHP port of sass has an issue that prevents
@import from working. sacy’s build script does patch that, but the way they
were parsing the file names doesn’t inspire confidence in the library. You
might get a more robust solution by using the official tool.
uglifier-js is a bit faster than JSMin, produces significantly smaller
output and comes with a better license (JSMin isn’t strictly free software
as it has this “do no evil” clause)
CoffeeScript is under very heavy development, so I’d much rather use the
upstream source than some experimental fun project. So far I haven’t seen
issues with coffeescript-php, but then I haven’t been using it much yet.
Absent from the list you’ll find less and css minification:
the PHP native CSSMin is really good and
there’s no single official external tool out there that’s demonstrably better (maybe
the YUI compressor, but I’m not going to support something that requires me
to deal with Java)
lessphp is very lightweight and yet very full
featured and very actively developed. It also has a nice advantage over the
native solution in that the currently released native compiler does not
support reading its input from STDIN, so if you want to use the official
less, you have to go with the git HEAD.
Feel free to try this out (and/or send me a patch)!
Oh, and by the way: if you want to use uglifier or the original CoffeeScript
and you need node but can’t install it, have a look at the static binary I created.
As always, the support is seamless – this is all you have to do.
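Something along these lines (a sketch – the exact type string may differ):

{asset_compile}
<script type="text/coffeescript" src="/js/app.coffee"></script>
{/asset_compile}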
Again, in order to keep deployment simple, I decided to go with a pure PHP solution (coffeescript-php).
I do see some advantages in the native solutions though (performance, better output), so I’m actively looking into a solution to detect the availability of native converters that I could shell out to without having to hit the file system on every request.
Also, when adding the coffee support, I noticed that the architecture of sacy isn’t perfect for doing this transformation stuff. Too much code had to be duplicated between CSS and JavaScript, so I will do a bit of refactoring there.
Once both the support for external tools and the refactoring of the transformation is completed, I’m going to release v0.3, but if you want/need coffee support right now, go ahead and clone the repository.
I’m sure you’ve used this idiom a lot when writing JavaScript code:
options['a'] = options['a'] || 'foobar';
It’s short, it’s concise and it’s clear what it does. In Ruby, you can be even more concise:
params[:a] ||= 'foobar'
So you can imagine that I was happy with PHP 5.3’s new ?: operator:
<?php $options['a'] = $options['a'] ?: 'foobar';
In all three cases, the syntax is concise and readable. Arguably, the PHP one could read a bit better, but ?: still beats writing the full ternary expression and spelling out $options['a'] three times.
PopScan has, since forever (forever being 2004), run with E_NOTICE turned off. Back then, I felt it provided just baggage and I just wanted (had to) get things done quickly.
This, of course, led to people not taking enough care of the code and recently, I had one case too many of a bug caused by accessing a variable that was undefined in a specific code path.
I decided that I’m willing to spend the effort to clean all of this up and to make sure that there are no undeclared fields and variables in all of PopScan’s codebase.
Which turned out to be quite a bit of work as a lot of code is apparently happily relying on the default null that you can read out of undefined variables. Those instances might be ugly, but they are by no means bugs.
Cases where the null wouldn’t be expected are the ones I care about, but I don’t even want to go and discern the two – I’ll just fix all of the instances (embarrassingly many; most of them, thankfully, not mine).
Of course, if I put hours into a cleanup project like this, I want to be sure that nobody destroys my work again over time.
Which is why I was looking into running PHP with E_NOTICE in development mode at least.
Which brings us back to the introduction.
<?php $options['a'] = $options['a'] ?: 'foobar';
is wrong code. Accessing an undefined index of an array always raises a notice. It’s not like Python where you can choose (accessing a dictionary using [] will throw a KeyError, but there’s get() which just returns None). No. You don’t get to choose. You only get to add boilerplate:
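Namely, the classic isset() dance:

<?php $options['a'] = isset($options['a']) ? $options['a'] : 'foobar';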
Which looks ugly as hell. And no, you can’t write a wrapper to explode() which always returns an array big enough, because you don’t know what’s big enough. You would have to pass the number of nulls you want into the call too. That would look nicer than the above hack, but it still doesn’t even come close in conciseness to the solution which throws a notice.
So, in the end, I’m just complaining about syntax, you might think? I thought so too, and since I wanted to add the syntax I liked, I did a bit of experimenting.
Here’s a little something I’ve come up with:
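Judging from the benchmark labels below, the “wrapped array” variant was presumably something like an ArrayAccess wrapper that swallows the undefined-index notice (a sketch, not the exact benchmarked code):

<?php
class WrappedArray implements ArrayAccess
{
    private $data;
    public function __construct(array $data = array()) { $this->data = $data; }
    public function offsetExists($key) { return isset($this->data[$key]); }
    // returning null for missing keys avoids the notice,
    // at the cost of a method call per access
    public function offsetGet($key)
    {
        return isset($this->data[$key]) ? $this->data[$key] : null;
    }
    public function offsetSet($key, $value) { $this->data[$key] = $value; }
    public function offsetUnset($key) { unset($this->data[$key]); }
}

$options = new WrappedArray();
$options['a'] = $options['a'] ?: 'foobar'; // no notice raised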
The wrapped array solution looks really compelling syntax-wise and I could totally see myself using this and even forcing everybody else to go there. But of course, I didn’t trust PHP’s interpreter and thus benchmarked the thing.
pilif@tali ~ % php e_notice_stays_off.php
Notices off. Array 100000 iterations took 0.118751s
Notices off. Inline. Array 100000 iterations took 0.044247s
Notices off. Var. Array 100000 iterations took 0.118603s
Wrapped array. 100000 iterations took 0.962119s
Parameter call. 100000 iterations took 0.406003s
Undefined var. 100000 iterations took 0.194525s
So: using nice syntactic sugar costs 7 times the performance. The second best solution? Still 4 times. Out of the question. Yes, it could be seen as a micro-optimization, but 100’000 iterations, while a lot, is not that many. Waiting nearly a second instead of 0.1 seconds is crazy, especially for a common operation like this.
Interestingly, the most bloated code (that checks with isset()) is twice as fast as the most readable (just assign). Likely, the notice gets fired regardless of error_reporting() and then just ignored later on.
What really pisses me off about this is the fact that everywhere else PHP doesn’t give a damn. ‘0’ is equal to 0. Heck, even ‘abc’ is equal to 0. It even fails silently many times.
But in a case like this, where there is even newly added nice and concise syntax, it has to be overly conservative. And there’s no way to get to the needed solution but to either write too expensive wrappers or ugly boilerplate.
Dynamic languages give us a very useful tool to be dynamic in the APIs we write. We can create functions that take a dictionary (an array in PHP) of options. We can extend our objects at runtime by just adding a property. And with PHP’s (way too) lenient data conversion rules, we can even do math with user supplied string data.
But can we read data from $_GET without boilerplate? No. Not in PHP. Can we use a dictionary of optional parameters? Not in PHP. PHP would require boilerplate.
If a language basically mandates retyping the same expression three times, then, IMHO, something is broken. And if all the workarounds are either crappy to read or have very bad runtime properties, then something is terribly broken.
So, I decided to just fix the problem (undefined variable access) but leave E_NOTICE where it is (off). There’s always git blame and I’ll make sure I will get a beer every time somebody lets another undefined variable slip in.
When you call serialize() in PHP to turn a value into something that you can store for later use with unserialize(), be very careful what you do with that data.
When you look at the output, you’d be tempted to assume that it’s text data:
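For instance, this is what serialize() emits:

<?php
echo serialize('hello'), "\n";         // s:5:"hello";
echo serialize(array('a' => 1)), "\n"; // a:1:{s:1:"a";i:1;}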
You will notice that the format encodes the string’s length together with the string. And because PHP is inherently not unicode-capable, it’s not encoding the string’s character length, but its byte length.
unserialize() checks whether the encoded length matches the actual delimited string’s length. This means that if you treat the serialized output as text and your database’s encoding changes along the way, the retrieved string can’t be unserialized any more.
I just learned that the hard way (even though it’s obvious in hindsight) while migrating PopScan from ISO-8859-1 to UTF-8:
The databases of existing systems contain a lot of output from serialize() which was run over ISO-8859-1 strings. Now that the client-encoding in the database client is set to UTF-8, the data is retrieved as UTF-8, and because the serialize() output was stored in a TEXT column, it happily gets UTF-8-encoded.
If we remove the database from the picture and express the problem in code, this is what’s going on:
unserialize(utf8_encode(serialize('data with 8bit chàracters')));
i.e. the data gets altered after serialization, and in a way that leaves unserialize() unable to deal with it any more.
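A runnable demonstration (assuming the source file itself is saved as ISO-8859-1, so the à is a single byte):

<?php
$s = serialize('data with 8bit chàracters'); // records the byte length: s:25:...
$t = utf8_encode($s);                        // 'à' becomes two bytes; the prefix still says 25
var_dump(unserialize($t));                   // bool(false) – the length check fails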
So, for everybody else not yet in this dead end:
The output of serialize() is binary data. It looks like textual data, but it isn’t. Treat it as binary. If you store it somewhere, make sure that the medium you store it to treats the data as binary. No transformation whatsoever must ever be made on it.
Of course, that leaves you with a problem later on if you switch character sets and you have to unserialize, but at least you get to unserialize then. I now have to go to great lengths to salvage the old data.
It does this without you ever having to manually run a compiler, without serving all your assets through some script (thus saving RAM) and without worries about stale copies being served. In fact, you can serve all static files generated with sacy with cache headers telling browsers to never revisit them!
All of this, using two lines of code (wrap as much content as you want in {asset_compile}…{/asset_compile})
Sacy has been around for a bit more than a year now and has since been in production use in PopScan. During this time, no single bug in Sacy has been found, so I would say that it’s pretty usable.
Coworkers have bugged me long enough about how much better less or sass are compared to pure CSS that I finally decided to update sacy to allow us to use less in PopScan.
Aside from consolidating and minimizing CSS and JavaScript, sacy can now also transform less and sass (or scss) files using the exact same method as before, just with a different mime-type:
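The usage presumably looks like this (a sketch – the exact type strings are defined by sacy and may differ):

{asset_compile}
<link rel="stylesheet" type="text/x-less" href="/styles/site.less" />
{/asset_compile}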
I have been patient. For months I hoped that Ubuntu would sooner or later get PHP 5.3, a release I’m very much looking forward to, mainly because of the addition of anonymous functions, which spell the death of create_function or even eval.
We didn’t get 5.3 for Karmic, and who knows about Lucid (it’s crazy that nearly one year after the release of 5.3, there is still debate about whether to include it in the next version of Ubuntu, which will be the current LTS release for the next four years – IMHO quite the disservice to PHP 5.3 adoption).
Anyways: We are in the process of releasing a huge update to PopScan that is heavily focussed on getting rid of cruft, increasing speed all over the place and increasing overall code quality. Especially the last part could benefit from having 5.3 and seeing that at this point PopScan already runs well on 5.3, I really wanted to upgrade.
In comes Al-Ubuntu-be, a coworker of mine with awesome Debian packaging skills: While there are already a few PPAs out there that contain a 5.3 package, Albe went the extra step and added not only PHP 5.3 but also quite a few other packages we depend upon that might be useful to my readers: packages like APC, memcache, imagick and xdebug for development.
While we can make no guarantees that these packages will be maintained heavily, they will get some security update treatment (though highly likely by version bumping as opposed to backporting).
So. If you are on Karmic (and later Lucid if it won’t get 5.3) and want to run PHP 5.3 with APC and Memcache, head over to Albe’s PPA.
Also, I’d like to take the opportunity to thank Albe for his efforts: Having a PPA with real .deb packages, as opposed to the self-compiled mess I would have created, gives us a much nicer way of updating existing installations to 5.3 and an even nicer path back to the original packages once they come out. Thanks a lot.
I guess it’s inevitable. Good ideas may fail. And good ideas may be years ahead of their time. And of course, sometimes, people just don’t listen.
But one never stops learning.
In the year 2000, I took part in a couple of guys’ plan to become the next Yahoo (Google wasn’t quite there yet back then), or, to use the words we used on the site,
For these reasons, we have designed an online environment that offers a truly new way for people to store, manage and share their favourite online resources and enables them to engage in long-lasting relationships of collaboration and trust with other users.
The idea behind the project, called linktrail, was basically what would much later on be picked up by the likes of twitter, facebook (to some extent) and the various community based news sites.
The whole thing went down the drain, but the good thing is that I was able to legally salvage the source code, install it on a personal server of mine and publish the source code. And now that so many years have passed, it’s probably time to tell the world about this, which is why I have decided to start this little series about the project. What is it? How was it made? And most importantly: Why did it fail? And consequently: What could we have done better?
But let’s first start with the basics.
As I said, I was able to legally acquire the database and code (which was mostly written by me anyway) and to install the site on a server of mine, so let’s get that out of the way to start with. The site is available at linktrail.pilif.ch. What you see running there is the result of 6 months of programming by myself, after a concept created by the guys I worked with.
What is linktrail?
If the tour we made back then is any good, then just taking it would probably be enough, but let me phrase it in my own words: The site is a collection of so-called trails, which in turn are small units, comparable to blogs, consisting of links, titles and descriptions. These micro-blogs are shown in a popup window (that’s what we had back then) beside the browser window to allow quick navigation between the different links in the trail.
Trails are made by users, either by each user on their own or as a collaborative work between multiple users. The owner of a trail can hand out permissions to everybody or their friends (using a system quite similar to what we currently see on facebook for example)
A trail is placed in a directory of trails, which was built around the directory structures we used back then, though by now we would probably do this much differently. Users can subscribe to trails they are interested in. In that case, they will be notified if a trail they are subscribed to is updated, either by the owner or anybody else with the rights to update the trail.
Every user (called expert in the site’s terms) has their profile page (here’s mine) that lists the trails they created and the ones they are subscribed to.
The idea was for you as a user to find others with similar interests and form a community around those interests to collaborate on trails. An in-site messaging system helped users communicate with each other: Aside from just sending plain text messages, it’s possible to recommend trails (for easy one-click subscription).
linktrail was my first real programming project, started basically 6 months after graduating from what the US would call high school. Combine that with the fact that it was created during the high times of the browser wars (year 2000, remember), with web standards basically non-existent, and you can imagine what a mess is running behind the scenes.
Still, the site works fine within those constraints.
In future posts, I will talk about the history of the project, about the technology behind the site, about special features and, of course, about why this all failed and what I would do differently – both in matters of code and organization.
If I’ve piqued your interest, feel free to have a look at the code of the site, which I just converted from CVS (I started using CVS about 4 months into development, so the first commit is HUGE) to SVN to git and put up on github for public consumption. It’s licensed under a BSD license, but I doubt you’d find anything useful in this mess of PHP3(!) code (though it runs unchanged(!) on PHP5 – topic of another post, I guess), HTML 3.2(!) tag soup and JavaScript hacks.
Oh, and if you can read German, I have also converted the CVS repository that contained the concept papers written over time.
In preparation of this series of blog-posts, I have already made some changes to the code base (available at github):
login after register now works
warning about unencrypted(!) passwords in the registration form