Find relation sizes in PostgreSQL

Like so many times before, today I was yet again in the situation where I wanted to know which tables/indexes take the most disk space in a particular PostgreSQL database.

My usual procedure in this case was to dt+ in psql and scan the sizes by eye (this being on my development machine, trying to find out the biggest tables I could clean out to make room).

But once you’ve done that a few times and considering that dt+ does nothing but query some PostgreSQL internal tables, I thought that I want this solved in an easier way that also is less error prone. In the end I just wanted the output of dt+ sorted by size.

The lead to some digging in the source code of psql itself (src/bin/psql) where I quickly found the function that builds the query (listTables in describe.c), so from now on, this is what I’m using when I need to get an overview over all relation sizes ordered by size in descending order:

select
  n.nspname as "Schema",
  c.relname as "Name",
  case c.relkind
     when 'r' then 'table'
     when 'v' then 'view'
     when 'i' then 'index'
     when 'S' then 'sequence'
     when 's' then 'special'
  end as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner",
  pg_catalog.pg_size_pretty(pg_catalog.pg_relation_size(c.oid)) as "Size"
from pg_catalog.pg_class c
 left join pg_catalog.pg_namespace n on n.oid = c.relnamespace
where c.relkind IN ('r', 'v', 'i')
order by pg_catalog.pg_relation_size(c.oid) desc;

Of course I could have come up with this without source code digging, but honestly, I didn’t know about relkind s, about pg_size_pretty and pg_relation_size (I would have thought that one to be stored in some system view), so figuring all of this out would have taken much more time than just reading the source code.

Now it’s here so I remember it next time I need it.

Updating or replacing datasets

This is maybe the most obvious trick in the world but I see people not doing it all over the place, so I guess it’s time to write about it.

Let’s say you have a certain set of data you need to enter into your RDBMS. Let’s further assume that you don’t know whether the data is already there or not, so you don’t know whether to use INSERT or UPDATE

Some databases provide us with something like REPLACE or “INSERT OR REPLACE”, but others do not. Now the question is, how to do this efficiently?

What I always see is something like this (pseudo-code):

  1. select count(*) from xxx where primary_key = xxx
  2. if (count > 0) update; else insert;

This means that for every dataset you will have to do two queries. This can be reduced to only one query in some cases by using this little trick:

  1. update xxx set yyy where primary_key = xxx
  2. if (affected_rows(query) == 0) insert;

This method just goes ahead with the update, assuming that data is already there (which usually is the right assumption anyways). Then it checks if an update has been made. If not, it goes ahead and inserts the data set.

This means that in cases where the data is already there in the database, you can reduce the work on the database to one single query.

Additionally, doing a SELECT and then an UPDATE essentially does the select twice as the update will cause the database to select the rows to update anyways. Depending on your optimizer and/or query cache, this can be optimized away of course, but there are no guarantees.