If you find yourself in the position where you need a full vacuum, you’re probably already looking at the number of dead tuples to have reached that conclusion. The trouble is that bloated tables and (nearly) full disks tend to go hand in hand. In this case, you might be wondering: how much disk space do I need to complete a full vacuum?

Let’s create an example setup:

ivdl=# create table vacuum_test (id serial primary key, col1 bigint, col2 bigint); CREATE TABLE ivdl=# insert into vacuum_test (col1, col2) select x, x from generate_series(1, 1000000) x; INSERT 0 1000000 ivdl=# select count(*) from vacuum_test; count --------- 1000000 (1 row) ivdl=# alter table vacuum_test set (autovacuum_enabled = false); ALTER TABLE

Now we have an example table containing one million rows, with the autovacuum turned off, for the sake of this experiment. Let’s have a look at the statistics for our table:

ivdl=# select n_live_tup, n_dead_tup from pg_stat_all_tables where relname = 'vacuum_test'; n_live_tup | n_dead_tup ------------+------------ 1000000 | 0 (1 row)

This estimate exactly matches our expectations. Now let’s delete half of the rows:

ivdl=# delete from vacuum_test where id % 2 = 0; DELETE 500000 ivdl=# select n_live_tup, n_dead_tup from pg_stat_all_tables where relname = 'vacuum_test'; n_live_tup | n_dead_tup ------------+------------ 500000 | 500000 (1 row)

Perfect again. Let’s have a look at the size of our table:

ivdl=# \d+ List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+--------------------+----------+----------------+-------------+---------------+------------+------------- public | vacuum_test | table | ivdl | permanent | heap | 50 MB | public | vacuum_test_id_seq | sequence | ivdl | permanent | | 8192 bytes | (2 rows)

We have a table, occupying 50 MB on disk and containing one million rows, of which half are pending a cleanup. To estimate the disk space needed for a VACUUM FULL command, we have to take into account that PostgreSQL will make a complete copy of the table, only containing the live tuples, before deleting the old version of the table. Our estimation function is, therefore:

$$\begin{aligned} s_{max} &= \frac{n_d + 2(n_l)}{n_d + n_l}\cdot s_{current} \end{aligned}$$

Where $s_{max}$ represents the total estimated disk space needed for the operation, $n_d$ and $n_l$ represents the number of dead and live tuples, respectively, and $s_{current}$ the current disk space occupied by the table:

$$\begin{aligned} s_{max} &= \frac{50 000 + 2(50 000)}{50 000 + 50 000}\cdot 50\text{MB} \\\\ &= 1.5\cdot 50\text{MB} \\\\ &= 75\text{MB} \end{aligned}$$

And now we can run a VACUUM FULL and see the results of our estimation:

ivdl=# vacuum full vacuum_test; VACUUM ivdl=# \d+ List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+--------------------+----------+----------------+-------------+---------------+------------+------------- public | vacuum_test | table | ivdl | permanent | heap | 25 MB | public | vacuum_test_id_seq | sequence | ivdl | permanent | | 8192 bytes | (2 rows)

Quite neat! Our new table occupies exactly half of the original, so the combined size at the moment the deletion happened would have been 25MB + 50MB = 75MB . Of course, if your tuples vary greatly in size and you have TOASTed columns then things will vary more, but you can also take a look at the pg_stats view to formulate a more nuanced (or a worst-case) estimate:

ivdl=# select tablename, attname, avg_width, n_distinct from pg_stats where tablename = 'vacuum_test'; tablename | attname | avg_width | n_distinct -------------+---------+-----------+------------ vacuum_test | id | 4 | -1 vacuum_test | col1 | 8 | -1 vacuum_test | col2 | 8 | -1 (3 rows)

In our case it’s not very interesting, but if you have wider text or varchar columns you can use the avg_width in pg_stats to get some idea of the size of your columns, and then use a conservative estimate (e.g. a multiple of the average) as an added percentage of the total row size and then use that estimate to influence your $s_{current}$ if you have reason to believe that your live rows may be larger than your dead rows – perhaps if you have columns containing application payloads that increased over time.