Distributed by and updates in Greenplum
I have an earlier post where I list how to update table A from table B. In that example the WHERE clause sets tableA.column=tableB.column. That works perfectly as long as your tables are DISTRIBUTED BY(column). If one, or both, are distributed randomly, you are out of luck. At least this applies to 3.3.x. I haven't tried this on a GP 4.x setup. I had tableA distributed randomly and when I tried to redistribute by column I would get gang errors. Here's the fix I implemented.
Run a pg_dump --schema-only on tableA. Run ALTER TABLE and rename tableA to tableA_orig. Edit the schema dump of tableA and change distributed by from randomly to (column). Run the schema dump to recreate tableA distributed by column. Populate the new tableA with, INSERT INTO tableA SELECT * FROM tableA_orig.
There's a few other things to be aware of. You'll need to redo the indexes on tableA or edit the schema and change the name of any indexes because the original indexes will now be associated with tableA_orig but they are not renamed. Any views that include tableA will need to be redone because they now point to tableA_orig. I believe that's all I had to do. I'll update this post if I run into more.
Run a pg_dump --schema-only on tableA. Run ALTER TABLE and rename tableA to tableA_orig. Edit the schema dump of tableA and change distributed by from randomly to (column). Run the schema dump to recreate tableA distributed by column. Populate the new tableA with, INSERT INTO tableA SELECT * FROM tableA_orig.
There's a few other things to be aware of. You'll need to redo the indexes on tableA or edit the schema and change the name of any indexes because the original indexes will now be associated with tableA_orig but they are not renamed. Any views that include tableA will need to be redone because they now point to tableA_orig. I believe that's all I had to do. I'll update this post if I run into more.
Comments
Post a Comment