Ansible - gnegg

In the summer of 2012, I had the great oportunity to clean up our hosting
infrastructure. Instead of running many differently configured VMs, mostly one
per customer, we started building a real redundant infrastructure with two
really beefy physical database machines (yay) and quite many (22) virtual
machines for caching, web app servers, file servers and so on.

All components are fully redundant, every box can fail without anybody really
needing to do anything (one exception is the database – that’s also redundant,
but we fail over manually due to the huge cost in time to failback).

Of course you don’t manage ~20 machines manually any more: Aside of the fact
that it would be really painful to do for those that have to be configured in an
identical way (the app servers come to mind), you also want to be able to
quickly bring a new box online which means you don’t have time to manually go
through the hassle of configuring it.

So, In the summer of 2012, when we started working on this, we decided to go
with puppet. We also considered Chef but their server
was really complicated to set up and install and there was zero incentive for
them to improve because that would, after all, disincentivse people from
becoming customers of their hosted solutions (the joys of open-core).

Puppet is also commerically backed, but everything they do is available as open
source and their approach for the central server is much more «batteries
included» than what Chef has provided.

And finally, after playing around a bit with both Chef and puppet, we noticed
that puppet was way more bitchy and less tolerant of quick hacks around issues
which felt like a good thing for people dabbling with HA configuration of a
multi machine cluster for the first time.

Fast forward one year: Last autumn I found out about
ansible (linking to their github page –
their website reads like a competition in buzzword-bingo) and after reading
their documentation, I immediately was convinced:

No need to install an agent on managed machines
Trivial to bootstrap machines (due to above point)
Contributors don’t need to sign a CLA (thank you so much, ansibleworks!)
No need to manually define dependencies of tasks: Tasks are run requentially
Built-in support for cowsay by default
Many often-used modules included by default, no hunting for, say, a sysctl
module on github
Very nice support for rolling updates
Also providing a means to quickly do one-off tasks
Very easy to make configuration entries based on the host inventory (which requires puppetdb and an external database in the case of puppet)

Because ansible connects to each machine individually via SSH, running it
against a full cluster of machines is going to take a bit longer than with
puppet, but our cluster is small, so that wasn’t that much of a deterrent.

So last Sunday evening I started working on porting our configuration over from
puppet to Ansible and after getting used to the YAML syntax of the playbooks, I
made very quick progress.

progress

Again, I’d like to point out the excellent, built-in, on-by-default support for
cowsay as one of the killer-features that made me seriously consider starting
the porting effort.

Unfortunately though, after a very promising start, I had to come to the
conclusion that we will be sticking with puppet for the time being because
there’s one single feature that Ansible doesn’t have and that I really, really
want a configuration management system to have:

I’ts not possible in Ansible to tell it to keep a directory clean of files not
managed by Ansible in some way

There are, of course, workarounds, but they come at a price too high for me to
be willing to pay.

You could first clean a directory completely using a shell command, but this
will lead to ansible detecting a change to that folder every time it runs which
will cause server restarts, even when they are not needed.
You could do something like this stack overflow question
but this has the disadvantage that it forces you into a configuration file
specific playbook design instead of a role specific one.

What I mean is that using the second workaround, you can only have one playbook
touching that folder. But imagine for example a case where you want to work with
/etc/sysctl.d: A generic role would put some stuff there, but then your
firewall role might put more stuff there (to enable ip forwarding) and your
database role might want to add other stuff (like tweaking shmmax and shmall,
though that’s thankfully not needed any more in current Postgres releases).

So suddenly your /etc/sysctl.d role needs to know about firewalls and
databases which totally violates the really nice separation of concerns between
roles. Instead of having a firewall and a database role both doing something to
/etc/sysctl.d, you know need a sysctl-role which does different things
depending on what other roles a machine has.

Or, of course, you just don’t care that stray files never get removed, but
honestly: Do you really want to live with the fact that your /etc/sysctl.d, or
worse, /etc/sudoers.d can contain files not managed by ansible and likely not
intended to be there? Both sysctl.d and sudoers.d are more than capable of doing
immense damage to your boxes and this sneakily behind the watching eye of your
configuration management system?

For me that’s inacceptable.

So despite all the nice advantages (like cowsay), this one feature is something
that I really need and can’t have right now and which, thus, forces me to stay
away from Ansible for now.

It’s a shame.

Some people tell me that implementing my feature would require puppet’s feature
of building a full state of a machine before doing anything (which is error-
prone and frustrating for users at times), but that’s not really true.

If ansible modules could talk to each other – maybe loosly coupled by firing
some events as they do stuff, you could just name the task that makes sure the
directory exists first and then have that task register some kind of event
handler to be notified as other tasks touch the directory.

Then, at the end, remove everything you didn’t get an event for.

Yes. This would probably (I don’t know how Ansible is implemented internally)
mess with the decouplling of modules a bit, but it would be so far removed
from re-implementing puppet.

Which is why I’m posting this here – maybe, just maybe, somebody reads my plight
and can bring up a discussion and maybe even a solution for this. Trust me: I’d
so much rather use Ansible than puppet, it’s crazy, but I also want to make sure
that no stray file in /etc/sysctl.d will bring down a machine.

Yeah. This is probably the most words I’ve ever used for a feature request, but
this one is really, really important for me which is why I’m so passionate about
this. Ansible got so f’ing much right. It’s such a shame to still be left
unable to really use it.

Is this a case of xkcd1172? Maybe, but to me, my
request seems reasonable. It’s not? Enlighten me! It is? Great! Let’s work on
fixing this.

Share this: