Automatic language detection

If you write a website, do not use Geolocation to determine the language to display to your user.

If you write a desktop application, do not use the region setting to determine the language to display to your user.

This is incredibly annoying for some of us, and especially for me, which is why I’m ranting here.

The moment Google released their (awful) German translation for their RSS reader, I was served the German version just because I have a Swiss IP address.

Here in Switzerland, we actually speak one of three (or four, depending on who you ask) languages, so defaulting to German is probably not of much help for the people in the French-speaking part.

Additionally, there are many users fluent in (at least reading) English. We always prefer the original language if at all possible because, generally, translations never quite work. Even if you have the best translators at work, translated texts never feel fluid – especially not when you are used to the original version.

So, Google, what were you thinking when you switched me over to the German version of the reader? I had been using the English version for more than a year, so clearly I understood enough of that language to be able to use it. More than 90% of the RSS feeds I’m subscribed to are, in fact, in English. Can you imagine how pissed I was to see the interface change?

This is even worse on the iPhone/iPod frontend because, there, you don’t even provide an option to change the language aside from manually hacking the URL.

Or take desktop applications. I live in the German-speaking part of Switzerland. True. So naturally I have set my locale to Swiss German. You know: I want the correct number formatting, I want my weeks to start on Mondays, I want the correct currency, and I want the 24-hour clock I’m used to.

Actually, I also want the German week and month names, because I will be using these in most of my letters and documents, which are, in fact, German too.

But my OS installation is English. I am used to English. I prefer English. Why do so many programs insist on using the locale setting to determine the display language? Do you developers think it’s funny to have a mish-mash of languages on the screen? Don’t you think that me using an English OS version may be an indication that I do not want to read your crappy German translation alongside the English user interface of my OS?

Don’t you think that it feels really stupid to have a button in a German dialog box open another, English, dialog (the first one is from Chrome, the one that opens once you click “Zertifikate verwalten” (Manage certificates) is from Windows itself)?

In Chrome, I can at least fix the language – once I had found the knob to turn. At first, it was easier for me to just delete the German localization file from the Chrome installation because, being completely unused to German UIs, I was unable to find the right setting.

This is really annoying, and I see this particular problem being neglected on an incredibly large scale. I know that I am part of a minority, but the problem is so terribly easy to fix:

  • All current browsers send an Accept-Language header. Unlike in the early days, it is nowadays correctly preset in all the common browsers. Use that. Don’t use my IP address.
  • Instead of reading the locale setting of my OS, ask the OS for its UI language and use that to determine which localization to load (this is actually the recommended way of doing things according to Microsoft’s guidelines, at least since Windows XP, which was released in 2001).

Using these two simple tricks, you help a minority without hindering the majority in any way and without additional development overhead!

Actually, you’ll be getting away a lot cheaper than before. GeoIP is expensive if you want it to be accurate (and you do want that, don’t you?), whereas there are ready-to-use libraries to determine the correct language from even the most complex Accept-Language header.
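
To illustrate how little code this actually takes (just a sketch – a real library would also handle wildcards, region fallbacks and malformed headers, and the helper function here is made up for this post), picking a language from the Accept-Language header in PHP could look like this:

<?php
// Minimal sketch of Accept-Language parsing. $available lists the
// languages the site actually provides; preferredLanguage() is a
// hypothetical helper, not part of any library.
function preferredLanguage(array $available, $default = 'en') {
    if (empty($_SERVER['HTTP_ACCEPT_LANGUAGE']))
        return $default;
    $candidates = array();
    foreach (explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']) as $entry) {
        $parts = explode(';', trim($entry));
        $lang = strtolower(substr($parts[0], 0, 2)); // 'de-CH' -> 'de'
        $q = 1.0;
        if (isset($parts[1]) && preg_match('/q=([0-9.]+)/', $parts[1], $m))
            $q = (float)$m[1];
        if (!isset($candidates[$lang]) || $q > $candidates[$lang])
            $candidates[$lang] = $q;
    }
    arsort($candidates); // highest q-value first
    foreach (array_keys($candidates) as $lang) {
        if (in_array($lang, $available))
            return $lang;
    }
    return $default;
}

echo preferredLanguage(array('en', 'de', 'fr')); // never the IP address
?>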

Asking the OS for the UI language isn’t harder than asking it for the locale, so no overhead there either.
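
To give a Windows example in Delphi (again just a sketch; depending on your Delphi version, GetUserDefaultUILanguage may or may not already be declared in the Windows unit, which is why it is imported manually here):

program UILanguageDemo;

{$APPTYPE CONSOLE}

uses Windows;

// Imported manually in case your Delphi version's Windows unit doesn't declare it.
function GetUserDefaultUILanguage: Word; stdcall; external 'kernel32.dll';

var
  Locale: LCID;   // use this one for number, date and currency formatting
  UILang: Word;   // use this one to decide which translation to load
begin
  Locale := GetUserDefaultLCID;
  UILang := GetUserDefaultUILanguage;
  WriteLn('Locale: ', Locale, ', UI language: ', UILang);
end.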

Please, developers, please have mercy! Stop the annoyance! Stop it now!

Dynamic object creation in Delphi

In a quite well-known pattern, you have a certain number of classes, all inheriting from a common base, and you have a factory that creates instances of these classes. Now let’s go further and assume that the factory has no knowledge of which classes will be available at run-time.

Each of these classes registers itself at run-time depending on a certain condition, and the factory will then create instances based on those registrations.

This post is about how to do this in Delphi. Keep in mind that this sample is heavily abstracted and the real-world application is quite a bit more complex, but it should be enough to demonstrate the point.

Let’s say, we have these classes:

type
  TJob = class(TObject)
    public
      constructor Create;
  end;

  TJobA = class(TJob)
    public
      constructor Create;
  end;

  TJobB = class(TJob)
    public
      constructor Create;
  end;

  TJobAA = class(TJobA)
    public
      constructor Create;
  end;

Each of these constructors does something to initialize the instance and thus calls its parent using ‘inherited’.

Now, let’s further assume that we have a Job-Repository that stores a list of available jobs:

type
  TJobRepository = class(TObject)
    private
      FAvailableJobs: TList;
    public
      procedure registerJob(cls: TClass);
      function getJob(Index: Integer): TClass;
   end;

Now we can register our jobs:

   rep := TJobRepository.Create;
   if condition then
     rep.RegisterJob(TJobAA);
   if condition2 then
     rep.RegisterJob(TJobB);

and so on. Now at runtime, depending on some condition, we will instantiate any of these registered jobs. This is how we’d do that:

  job := rep.getJob(0).Create;

Sounds easy. But this doesn’t work.

job in this example will be of type TJobAA (good), but its constructor will not be called (bad). The solution is to

  1. Declare the constructor of TJob as being virtual.
  2. Create a meta-class for TJob, because the constructor of TObject is NOT virtual, so when you dynamically instantiate an object from a TClass, only the constructor of TObject will be called.
  3. Override the inherited virtual constructor.

So in code, it looks like this:

type
  TJobClass = class of TJob;

  TJob = class(TObject)
    public
      constructor Create; virtual;
  end;

  TJobA = class(TJob)
    public
      constructor Create; override;
  end;

  TJobAA = class(TJobA)
    public
      constructor Create; override;
  end;

  TJobRepository = class(TObject)
    private
      FAvailableJobs: TList;
    public
      procedure registerJob(cls: TClass);
      function getJob(Index: Integer): TJobClass;
  end;

This way, Delphi knows that when you call

  job := rep.getJob(0).Create;

that you are creating an instance of a TJobAA object whose constructor overrides the virtual constructor of TJob, by virtue of the class of TJobAA being a class of TJob.

Personally, I would have assumed that this just works without the need to declare the meta-class and without the trickery of explicitly declaring the constructor as virtual. But seeing that Delphi is a compiled, statically typed language, I’m actually happy that this works at all.

OAuth signature methods

I’m currently looking into web services and different methods of request authentication, especially as what I’m aiming to end up with is something inherently RESTful, since that will give me the most flexibility when designing a frontend to the service. Generally, the arguments of the REST crowd convince me: it works like the human-readable web, it is inherently scalable, it enforces a clean structure of resources and, finally, it is easy to program against thanks to its “obvious” API.

As different services are going to communicate with each other, sometimes acting as users of their respective platforms, and because I’m not really inclined to pass credentials around (or make the user do one half of the tasks on one site and the other half on another site), I was looking into methods of authentication and authorization which work in a RESTful environment and without passing around user credentials.

The first thing I did was note down the requirements; then I quickly designed something using public key cryptography which would possibly have worked quite nicely (I’m no expert in this field, though).

Then I learned about OAuth which was designed precisely to solve my issues.

Eagerly, I read through the specification, but I was put off by one single fact: the default method for signing requests, the method that is most widely used and most widely supported, relies on a shared secret.

Even worse: the shared secret must be known in the clear on both the client and the server (using the common terminology here; OAuth speaks of consumers and providers, but I’m (still) more used to the traditional naming).

This is bad on multiple levels:

  • As the secret is stored in two places (client and server), it’s twice as likely to leak as if it were only stored in one place (the client).
  • If the token is compromised, the attacker can act in the name of the client with no way of detection.
  • Frankly, it’s a responsibility I, as a server designer, would not want to take on. If the secret is on the client and the client screws up and lets it leak, it’s their problem; if the secret is stored on the server and the server screws up, it’s my problem and I have to take responsibility.
    Personally, I’m quite confident that I would not leak secret tokens, but can I be sure? Maybe. Do I even want to think about this? Certainly not if there is another option.
  • If, god forbid, the whole table containing all the shared secrets is compromised, I’m really, utterly screwed as the attacker can use all services, impersonating any user at will.
  • As the server needs to know all shared secrets, the risk of losing all of them at once is created in the first place. If only the client knows the secret, an attacker has to compromise each client individually. If the server knows the secrets, it suffices to compromise the server to get at all clients.
  • As per the point above, the server becomes a really interesting target for attacks and thus needs extra securing, and it even needs to take measures against all kinds of more-or-less intelligent attacks (which usually end up DoSing the server, or worse).

In the end, HMAC-SHA1 is just repeating history. At first, we stored passwords in the clear, then we learned to hash them, then we even salted them, and now we’re exchanging them for tokens stored in the clear.

No.

What I need is something that keeps the secret on the client.

The secret should never ever need to be transmitted to the server. The server should have no knowledge at all of the secret.

Thankfully, OAuth contains a solution for this problem: RSA-SHA1, as defined in section 9.3 of the specification. Unfortunately, it leaves a lot to be desired. Whereas the rest of the specification is a pleasure to read and very, well, specific, section 9.3 contains the following phrase:

It is assumed that the Consumer has provided its RSA public key in a verified way to the Service Provider, in a manner which is beyond the scope of this specification.

Sure. Just specify the (IMHO) useless method based on shared secrets and leave out the interesting and, IMHO, only truly workable one.

Sure, transmitting a public key is a piece of cake (it’s public, after all), but this puts another burden on the writer of the provider documentation, and as the key exchange is unspecified, implementors will be forced to amend the existing libraries with custom code to transmit the key.
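
For what it’s worth, the signing itself is not the hard part. Here is a rough sketch (not taken from any particular OAuth library, glossing over the construction of the signature base string defined in section 9.1, and using made-up key file names) of what RSA-SHA1 looks like with PHP’s OpenSSL extension:

<?php
// $baseString stands in for the OAuth signature base string (HTTP method,
// URL and normalized parameters, percent-encoded and joined with '&').
$baseString = 'GET&http%3A%2F%2Fprovider.example%2Ffeed&oauth_consumer_key%3D...';

// Consumer side: the private key never leaves this machine.
$privateKey = openssl_pkey_get_private(file_get_contents('consumer_private.pem'));
openssl_sign($baseString, $signature, $privateKey, OPENSSL_ALGO_SHA1);
$oauthSignature = base64_encode($signature); // goes into the oauth_signature parameter

// Provider side: only the public key is needed to verify the request.
$publicKey = openssl_pkey_get_public(file_get_contents('consumer_public.pem'));
$valid = openssl_verify($baseString, base64_decode($oauthSignature),
                        $publicKey, OPENSSL_ALGO_SHA1) === 1;
?>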

Also, I’m unclear on header size limitations. As the server needs to know which public key was used for the signature (oauth_consumer_key), it must be sent with each request. While a manually generated public token can be small, a public key certainly isn’t. Is there a size limit for HTTP headers? I’ll have to check that.

I could just transmit a key ID (the key being known to the server) or the key fingerprint as the consumer key, but would that still be following the standard? I didn’t see this documented anywhere, and examples in the wild are very scarce.
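
If I were to go the fingerprint route (again: my own idea, not something the specification blesses, and the file name is made up), computing it would be trivial enough:

<?php
// SHA1 fingerprint of the consumer's public key, usable as a short,
// fixed-length oauth_consumer_key. The provider keeps a fingerprint -> key map.
$details = openssl_pkey_get_details(
    openssl_pkey_get_public(file_get_contents('consumer_public.pem')));
// strip the PEM armour and whitespace, decode to DER, then hash
$der = base64_decode(preg_replace('/-----[^-]+-----|\s/', '', $details['key']));
$consumerKey = sha1($der);
?>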

Well… as usual, the better solution just requires more work, and I can live with that, especially considering that, for now, I’ll be the person writing both server and client. But I can already feel the pain ahead should third-party consumers decide to hook up with that provider.

If you ask me what I would have done in the footsteps of the OAuth guys, I would only have specified RSA-SHA1 (and maybe PLAINTEXT) and not even bothered with HMAC-SHA1. And I would have specified a standard way for public key exchange between consumer and provider.

Now the train has left the station, and everyone interested in creating a really secure (and convenient – at least for the provider) solution is left with more work and non-standardized methods.

Beautifying commits with git

When you look at our Subversion log, you’ll often see revisions containing multiple topics, which is something I don’t particularly like. The main problem is merging patches: the moment you cram multiple things into a single commit, you are doomed should you ever decide to merge one of those things into another branch.

Because it’s such an easy thing to do, I began committing really, really often in git, but whenever I was writing the changes back to Subversion, I used merge --squash so as not to clutter the main revision history with abandoned attempts at fixing a problem or implementing a feature.

So in essence, this meant that by using git, I was working against my usual goals: the actual commits to SVN were larger than before, which is the exact opposite of how I’d want the repository to look.

I’ve lived with that, until I learned about the interactive mode of git add.

Beginners with git (at least those coming from Subversion and friends) always struggle to really get the concept of the index and usually just git commit -a when committing changes.

This does exactly the same thing as a svn commit would do: It takes all changes you made to your working copy and commits them to the repository. This also means that the smallest unit of change you can track is the state of the working copy itself.

To do finer grained commits, you can git add a file and commit that, which is the same as svn status followed by some grep and awk magic.

But even a file is too large a unit for a commit, if you ask me. When you implement feature X, it’s possible, if not very probable, that along the way you fix bugs a and b and extend interface I to make feature Y work – a feature on which X depends.

Bugfixes, interface changes, subfeatures. A git commit -a will mash them all together. A git add per file will mash some of them together – unless you are really, really careful and cleanly do only one thing at a time, but that’s not how reality works.

It may very well be that you discover bug b after having written a good amount of code for feature Y and that both Y and b touch the same file. Now you either have to back out b again, commit Y and reapply b, or you commit Y and b in one go, making it very hard to later merge just b into a maintenance branch, because you’d also get Y, which you don’t want.

But backing out already written code just to make a commit? That is not a productive workflow. I could never make myself do something like that, let alone my coworkers. Aside from that, it’s yet another source of errors.

This is where the git index shines. Git tracks content. The index is a staging area where you store content you wish to commit to the repository later. Content isn’t bound to a file; it’s just content. With the help of the index, you can incrementally collect individual changes in different files, assemble them into a complete package and commit that to the repository.

As the index is tracking content and not files, you can add parts of files to it. This solves the problems outlined above.

So once I have completed feature X, and assuming I could do it in one quick go, I run git add with the -i argument. Now I see a list of changed files in my working copy. Using the patch command, I can decide, hunk by hunk, whether it should be included in the index or not. Once I’m done, I exit the tool using 7 (quit). Then I run git commit1) to commit all the changes I’ve put into the index.
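
In practice, and assuming bugfix b and feature Y ended up in the same working copy, the session looks roughly like this (commit messages made up, obviously):

git add -i                    # choose 5 (patch), stage only the hunks belonging to bugfix b
git commit -m "fix bug b"
git add -i                    # choose 5 (patch) again, stage the hunks belonging to feature Y
git commit -m "implement feature Y"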

Remember: this is not done per file, but per line in the file. This way I can separate all the changes in my working copy – bugs a and b, features X and Y – into individual commits and commit them separately.

With a clean history like that, I can then merge the feature branch without --squash, keeping the history when dcommitting to Subversion and finally producing something that can easily be merged around and tracked.

This is yet another feature of git that, after you get used to it, makes this VCS shine even more than everything else I’ve seen so far.

Git is fun indeed.

1) and not git commit -a which would destroy all the fine-grained plucking of lines you just did – trust me: I know. Now.

Converting ogg streams into mp3 ones

This is just an announcement for my newest quick hack, which can be used to convert streams from web radios that use the Ogg/Vorbis format on the fly into the MP3 format, which is more widely supported by the various devices out there.

I have created a dedicated page for the project for those who are interested.

Also, I have really come to like github.com – not as the commercial service they intend to be (I’ve already written about the stupidity of hosting your company’s trade secrets at a company in a foreign country with foreign legislation), but as a place to quickly and easily dump some code you want to be publicly available without going through all the hassle otherwise associated with public project hosting.

This is why this little script is hosted there and not here. As I’m using git, even if github goes away, I still have the full repository around to either self-host or let someone else host for me, which is a crucial requirement for me to outsource anything.

Simplest possible RPCs in PHP

After spending hours trying to find out why a particular combination of SoapClient in PHP itself and SOAP::Server from PEAR didn’t consistently work together (sometimes, arrays passed around lost an arbitrary number of elements), I thought about what would be needed to make RPCs work from a PHP client to a PHP server.

I wanted nothing fancy, and I certainly wanted as little overhead as humanly possible.

This is what I came up with for the server:

<?php
header('Content-Type: text/plain');

require_once('a/file/containing/a/class/you/want/to/expose.php');

$method = str_replace('/', '', $_SERVER['PATH_INFO']);

if ($_SERVER['REQUEST_METHOD'] != 'POST'){
   sendResponse(array('state' => 'error', 'cause' => 'unsupported HTTP method'));
}

$s = new MyServerObject();
$params = unserialize(file_get_contents('php://input'));
if ( ($res = call_user_func_array(array($s, $method), $params)) === false)
   sendResponse(array('state' => 'error', 'cause' => 'RPC failed'));
if (is_object($res))
   $res = get_object_vars($res);
sendResponse($res);

function sendResponse($resobj){
    echo serialize($resobj);
    exit;

}

?>

The client, as shown below, is a bit more complex, mainly because it contains some HTTP protocol logic: logic which could probably be reduced to 2-3 lines of code if I used the cURL library, but the client in this case does not have the luxury of access to such functionality.

Also, I already had the function lying around (/me winks at domi), so that’s what I used (as opposed to file_get_contents with a pre-prepared stream context). This way, we DO get the advantage of learning a bit about how HTTP works, and we are totally self-contained.

<?php
class Client{
    function __call($name, $args){
        $req = $this->openHTTPRequest('http://localhost:5436/restapi.php/'.$name, 'POST', array('Content-Type' => 'text/plain'), serialize($args));
        $data = unserialize(stream_get_contents($req['handle']));
        fclose($req['handle']);
        return $data;
    }
    private function openHTTPRequest($url, $method = 'GET', $additional_headers = null, $data = null){
        $parts = parse_url($url);

        $fp = fsockopen($parts['host'], isset($parts['port']) ? $parts['port'] : 80);
        $path = $parts['path'].(isset($parts['query']) ? '?'.$parts['query'] : '');
        fprintf($fp, "%s %s HTTP/1.1\r\n", $method, $path);
        fputs($fp, "Host: ".$parts['host']."\r\n");
        if ($data){
            fputs($fp, 'Content-Length: '.strlen($data)."\r\n");
        }
        if (is_array($additional_headers)){
            foreach($additional_headers as $name => $value){
                fprintf($fp, "%s: %s\r\n", $name, $value);
            }
        }
        fputs($fp, "Connection: close\r\n\r\n");
        if ($data)
            fputs($fp, "$data\r\n");

        // read away header
        $header = array();
        $response = "";
        while(!feof($fp)) {
            $line = trim(fgets($fp, 1024));
            if (empty($response)){
                $response = $line;
                continue;
            }
            if (empty($line)){
                break;
            }
            list($name, $value) = explode(':', $line, 2);
            $header[strtolower(trim($name))] = trim($value);
        }
        return array('response' => $response, 'header' => $header, 'handle' => $fp);
   }

}

$client = new Client();
$result = $client->someMethod(array('data' => 'even arrays work'));

?>

What you can’t pass around this way is objects (at least objects which are not of type stdClass), as both client and server would need access to the class definition. Also, this seriously lacks error handling. But it generally works much better than anything SOAP could ever accomplish.

Naturally, I give up stuff when compared to SOAP or any «real» RPC solution:

  • This one works only with PHP
  • It has limitations on what data structures can be passed around, though that’s alleviated by PHP’s incredibly strong array support.
  • It relies heavily on PHP’s loosely typed nature and thus probably isn’t as robust.

Still, protocols like SOAP (or any protocol with either «simple» or «lightweight» in its name) tend to be so complicated that it’s incredibly hard, if not impossible, to create different implementations that still work together correctly in all cases.

In my case, I have to separate two pieces of the same application because of unstable third-party libraries which I would not want linked into every PHP instance running on that server. For that, the solution outlined above (plus some error handling code) works better than SOAP on so many levels:

  • it’s easily debuggable. No need for Wireshark or comparable tools
  • client and server are written by me, so they are under my full control
  • it works all the time
  • it relies on as little functionality of PHP as possible, and the functionality it depends on is widely used and tested, so I can assume that it’s reasonably bug-free (aside from my own bugs).
  • it’s a whole lot faster than SOAP, though this does not matter at all in this case.

Web service authentication

When I read an article about how to make Google Reader work with authenticated feeds, one big flaw behind all those Web 2.0 services sprang to mind: authentication.

I know that there are efforts underway to standardise on a common method of service authentication, but we are nowhere near there yet.

Take Facebook: they offer to let you enter your email account credentials into a form so they can send an invitation to all your friends. Or take the article I was referring to: they want your account data for an authenticated feed so they can make it available in Google Reader.

But think of what you are giving away…

For your service provider to be able to interact with that other service, they need to store your password, be it short term (Facebook, hopefully) or long term (any online feed reader with authentication support). They can (and do) assure you that they will store the data in encrypted form, but to be able to access the service in the end, they need the unencrypted password, which requires them not only to use reversible encryption, but also to keep the encryption key around.

Do you want a company in a country whose laws you are not familiar with to have access to all your account data? Do you want to give them the password to your personal email account? Or to everything else in case you share passwords?

People don’t seem to get this problem as account data is freely given all over the place.

Efforts like OAuth are clearly needed, but as a web-based technology, they can’t solve all the problems (what about email accounts, for example?).

But is this the right way? We can’t even trust desktop applications. Personally, I think the good old username/password combination is at the end of its usefulness (was it ever really useful?). We need new, better ways of proving our identity – something that is easily passed around and yet cannot be copied.

SSL client certificates feel like an underused but very interesting option. Let’s look at two examples: the first is your authenticated feed, the second is your SSL-enabled email server. Let’s say you want to give a web service revocable access to both services without ever giving away personal information.

For the authenticated feed, the external service presents the feed server with its client-side certificate, which you have signed. By checking your signature, the feed server knows your identity, and by checking your CRL it knows whether you have authorized the access or not. The service doesn’t know your password and can’t use your signature for anything but accessing that feed.

The same goes for the email server: the third-party service logs in with your username and the client certificate signed by you, but without a password. The service doesn’t need to know your password, and in case they do something wrong, you revoke your signature and are done with it (I’m not sure whether mail servers support client certificates, but I gather they do, as it’s part of the SSL spec).
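
To make the feed example a bit more concrete, here is a sketch of what the provider side could look like in PHP, assuming Apache with mod_ssl configured with SSLVerifyClient require and SSLOptions +StdEnvVars (the variables below are the standard mod_ssl environment variables):

<?php
// Reject requests that did not present a certificate our CA has signed.
// Revocation (the CRL mentioned above) is checked by the web server itself.
if (!isset($_SERVER['SSL_CLIENT_VERIFY']) || $_SERVER['SSL_CLIENT_VERIFY'] !== 'SUCCESS') {
    header('HTTP/1.1 403 Forbidden');
    exit('client certificate required');
}

// Identify the caller by the certificate's subject – no password involved.
$subject = $_SERVER['SSL_CLIENT_S_DN'];
echo "serving authenticated feed for $subject";
?>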

Client-side certificates already provide a standard means of secure authentication without ever passing a known secret around. Why aren’t they used far more often these days?

git branch in ZSH prompt

Screenshot of the terminal showing the current git branch

Today, I came across a little trick for displaying the current git branch in your bash prompt. This is very useful, though not quite as useful for me, since I’m using ZSH. Of course, I wanted to adapt the method (and use fewer backslashes :-) ).

Also, in my setup, I’m making use of ZSH’s prompt theme feature, from which I’ve chosen the theme “adam1”. So let’s use that as a starting point.

  1. First, create a copy of the prompt theme into a directory of your control where you intend to store private ZSH functions (~/zshfuncs in my case).
    cp /usr/share/zsh/4.3.4/functions/prompt_adam1_setup ~/zshfuncs/prompt_pilif_setup
  2. Tweak the file. I’ve adapted the prompt from the original article, but I’ve managed to get rid of all the backslashes (to actually make the regex readable) and to place it nicely in the adam1 prompt framework.
  3. Advise ZSH about the new ZSH function directory (if you haven’t already done so).
    fpath=(~/zshfuncs $fpath)
  4. Load your new prompt theme.
    prompt pilif

And here’s the adapted adam1 prompt theme:

# pilif prompt theme

prompt_pilif_help () {
  cat <<'EOF'
This prompt is color-scheme-able.  You can invoke it thus:

  prompt pilif [<color1> [<color2> [<color3>]]]

This is heavily based on adam1 which is distributed with ZSH. In fact,
the only change from adam1 is support for displaying the current branch
of your git repository (if you are in one)
EOF
}

prompt_pilif_setup () {
  prompt_adam1_color1=${1:-'blue'}
  prompt_adam1_color2=${2:-'cyan'}
  prompt_adam1_color3=${3:-'green'}

  base_prompt="%{$bg_no_bold[$prompt_adam1_color1]%}%n@%m%{$reset_color%} "
  post_prompt="%{$reset_color%}"

  base_prompt_no_color=$(echo "$base_prompt" | perl -pe "s/%{.*?%}//g")
  post_prompt_no_color=$(echo "$post_prompt" | perl -pe "s/%{.*?%}//g")

  precmd  () { prompt_pilif_precmd }
  preexec () { }
}

prompt_pilif_precmd () {
  setopt noxtrace localoptions
  local base_prompt_expanded_no_color base_prompt_etc
  local prompt_length space_left
  local git_branch

  git_branch=`git branch 2>/dev/null | grep -e '^\*' | sed -E 's/^\* (.+)$/(\1) /'`
  base_prompt_expanded_no_color=$(print -P "$base_prompt_no_color")
  base_prompt_etc=$(print -P "$base_prompt%(4~|...|)%3~")
  prompt_length=${#base_prompt_etc}
  if [[ $prompt_length -lt 40 ]]; then
    path_prompt="%{$fg_bold[$prompt_adam1_color2]%}%(4~|...|)%3~%{$fg_bold[white]%}$git_branch"
  else
    space_left=$(( $COLUMNS - $#base_prompt_expanded_no_color - 2 ))
    path_prompt="%{$fg_bold[$prompt_adam1_color3]%}%${space_left}<...<%~ %{$reset_color%}$git_branch%{$fg_bold[$prompt_adam1_color3]%} $prompt_newline%{$fg_bold_white%}"
  fi

  PS1="$base_prompt$path_prompt %# $post_prompt"
  PS2="$base_prompt$path_prompt %_&gt; $post_prompt"
  PS3="$base_prompt$path_prompt ?# $post_prompt"
}

prompt_pilif_setup "$@"

The theme file can be downloaded here

Shell history stats

It seems to be cool nowadays to post the output of a certain Unix command to one’s blog. So here I go:

pilif@celes ~
 % fc -l 0 -1 |awk '{a[$2]++ } END{for(i in a){print a[i] " " i}}'|sort -rn|head
467 svn
369 cd
271 mate
243 git
209 ssh
199 sudo
184 grep
158 scp
124 rm
115 ./clitest.sh

clitest.sh is a small wrapper around wget which I use for protocol-level debugging of the PopScan Server.

Hosted Code Repository?

Recently (yesterday), the Ruby on Rails project announced their switch to git for their revision control needs. They also announced that they will use the hosted service github as the place to host the main repository (even though git is decentralized, there is some sense in having a “main tree” which contains what is going to become the official releases).

I didn’t know github, so I had a look at their project.

What I don’t understand is that they seem to also target commercial entities with their offering. Think about it: suppose you are a commercial entity doing commercial software development. Would you send all your source code and its entire development history to another company?

Sure. They call themselves “Secure”. But what does that mean? Sure, they have SSL and SSH support, but frankly, I’m less concerned with patches travelling over the network unencrypted than I am with trusting anybody to host my code.

Even if they don’t screw up storage security (think: “accessing the code of your competition”), even if they are completely 100% trustworthy (think: “displeased employee selling out to your competition before leaving his employer”), there is still the issue of government/legal access.

When using an external hosting provider, you are storing your code (and history) in a foreign country with its own legislation. Are you prepared for that?

And finally, do you want the government of the country you’ve just sent your code (and its history) to, to really have access to all that data? Who guarantees that the hosting provider of your choice won’t cooperate as soon as the government comes knocking (it has happened before, even without any legal basis at all)?

All that is never worth the risk for a larger company (or for smaller ones – like ours).

So who exactly are these hosting companies (github is one, Code Spaces is another) targeting?

  • Free Software developers? Their code is open to begin with, so they have to face the problems I described anyway. But they are much harder to sue. Also, I’m not sure how compelling it is for a free software project to use a non-free tool (Rails being the exception, but we’ll talk about that later on)
  • Large companies? No way (see above)
  • Smaller companies? Probably not. Smaller companies are less of a target due to lower visibility, but suing them is more likely to quickly get you something in return, as they usually don’t dare engage in prolonged legal fights.