Fixing size bloat in Greenplum tables
I noticed recently that some queries on a table were running very slow. Simple counts were taking longer on what appeared to be smaller tables. I say appeared because the tables had 1/5 the number of rows as other tables but queries were slower. The raw data files contained about 7 GB of data. To see what the Greenplum system had for table size I ran this query:
select pg_size_pretty(pg_relation_size('schema.table_name'));
The answer was 190GB! Clearly there were problems. This table does get reloaded every month with new data but I truncate the table before reloading it so you aren't supposed to run into issues. Anyway there turned out to be a couple of solutions. One was to run a vacuum full on the table. After running that the table size was reported as 5.635 MB. I did try a vacuum on the table but it had no impact on size. Another solution is to redistribute the data randomly then redistribute by the table key.
ALTER TABLE schema.table_name SET DISTRIBUTED RANDOMLY;
ALTER TABLE schema.table_name SET DISTRIBUTED BY (table_key);
The first statement will not actually redistribute the data. To make the system actually redistribute the data randomly run this:
ALTER TABLE schema.table_name SET WITH (REORGANIZE=true);
Redistributing by a column name will actually physically redistribute the data.
select pg_size_pretty(pg_relation_size('schema.table_name'));
The answer was 190GB! Clearly there were problems. This table does get reloaded every month with new data but I truncate the table before reloading it so you aren't supposed to run into issues. Anyway there turned out to be a couple of solutions. One was to run a vacuum full on the table. After running that the table size was reported as 5.635 MB. I did try a vacuum on the table but it had no impact on size. Another solution is to redistribute the data randomly then redistribute by the table key.
ALTER TABLE schema.table_name SET DISTRIBUTED RANDOMLY;
ALTER TABLE schema.table_name SET DISTRIBUTED BY (table_key);
The first statement will not actually redistribute the data. To make the system actually redistribute the data randomly run this:
ALTER TABLE schema.table_name SET WITH (REORGANIZE=true);
Redistributing by a column name will actually physically redistribute the data.
Comments
Post a Comment