Fixing size bloat in Greenplum tables

August 19, 2011

I noticed recently that some queries on a table were running very slow. Simple counts were taking longer on what appeared to be smaller tables. I say appeared because the tables had 1/5 the number of rows as other tables but queries were slower. The raw data files contained about 7 GB of data. To see what the Greenplum system had for table size I ran this query:

select pg_size_pretty(pg_relation_size('schema.table_name'));

The answer was 190GB! Clearly there were problems. This table does get reloaded every month with new data but I truncate the table before reloading it so you aren't supposed to run into issues. Anyway there turned out to be a couple of solutions. One was to run a vacuum full on the table. After running that the table size was reported as 5.635 MB. I did try a vacuum on the table but it had no impact on size. Another solution is to redistribute the data randomly then redistribute by the table key.

ALTER TABLE schema.table_name SET DISTRIBUTED RANDOMLY;
ALTER TABLE schema.table_name SET DISTRIBUTED BY (table_key);

The first statement will not actually redistribute the data. To make the system actually redistribute the data randomly run this:

ALTER TABLE schema.table_name SET WITH (REORGANIZE=true);

Redistributing by a column name will actually physically redistribute the data.

Search This Blog

Random bits of info

Fixing size bloat in Greenplum tables

Comments

Post a Comment

Popular posts from this blog

Database, schema, and table sizes in Greenplum

ssh-add could not open a connection

rsync to SD card recopying files