Monday, August 29, 2011

Time sucking queries

Had a user submit a query recently:

select count(*) from (select * from viewA UNION select * from viewB)

viewA pulls from two tables with a combined 1.5 billion rows and around 60 columns. viewB pulls from a table with about 300 million rows. I rewrote the query as follows:

select (select count(*) from viewA) + (select count(*) from viewB) AS count

Same result but the first query has an explain plan with a cost 26X the second. Second query runs in about 3 minutes.

Terminal title in OS X

I did not come up with this. A quick google search turns up multiple ideas. I combined a few and trimmed things down to how I wanted the title to look. Here's my final result:

case $TERM in
  (xterm*)
  export PROMPT_COMMAND='echo -ne "\033]0;${USER}@$(hostname -s)\007"'
  ;;
esac

I added that code to ~/.bash_profile. Now my Terminal title changes when I open a Terminal or ssh into a machine then exit back to my Mac.

Saturday, August 27, 2011

Distributed by and updates in Greenplum

I have an earlier post where I list how to update table A from table B. In that example the WHERE clause sets tableA.column=tableB.column. That works perfectly as long as your tables are DISTRIBUTED BY(column). If one, or both, are distributed randomly, you are out of luck. At least this applies to 3.3.x. I haven't tried this on a GP 4.x setup. I had tableA distributed randomly and when I tried to redistribute by column I would get gang errors. Here's the fix I implemented.
Run a pg_dump --schema-only on tableA. Run ALTER TABLE and rename tableA to tableA_orig. Edit the schema dump of tableA and change distributed by from randomly to (column). Run the schema dump to recreate tableA distributed by column. Populate the new tableA with, INSERT INTO tableA SELECT * FROM tableA_orig.
There's a few other things to be aware of. You'll need to redo the indexes on tableA or edit the schema and change the name of any indexes because the original indexes will now be associated with tableA_orig but they are not renamed. Any views that include tableA will need to be redone because they now point to tableA_orig. I believe that's all I had to do. I'll update this post if I run into more.

Tuesday, August 23, 2011

Database, schema, and table sizes in Greenplum

Starting in the 4.x release, you get size info from the gp_toolkit schema. To get the size of all databases and their size in bytes:

select sodddatname, sodddatasize from gp_toolkit.gp_size_database;

To see the database size in GB, TB, and/or MB.

TB:
select sodddatname, (sodddatsize/1073741824)/1024 AS sizeinTB from gp_toolkit.gp_size_of_database;

GB:
select sodddatname, (sodddatsize/1073741824) AS sizeinGB from gp_toolkit.gp_size_of_database;

MB:
select sodddatname, (sodddatsize/1048576) AS sizeinMB from gp_toolkit.gp_size_of_database;


For schema sizes, connect to your database and run:

TB:
select sosdnsp, (sosdschematablesize/1073741824)/1024 AS schemasizeinTB from gp_toolkit.gp_size_of_schema_disk;

GB:
select sosdnsp, (sosdschematablesize/1073741824) AS schemasizeinGB from gp_toolkit.gp_size_of_schema_disk;

MB:
select sosdnsp, (sosdschematablesize/1048576) AS schemasizeinMB from gp_toolkit.gp_size_of_schema_disk;

If you want a specific schema only, add WHERE sosdnsp='SCHEMA'

Take a given schema and find table sizes:

GB:
select sotdschemaname as SCHEMA, sotdtablename,(sotdsize/1073741824) as tableGB FROM gp_toolkit.gp_size_of_table_disk WHERE sotdschemaname = 'yourschema' AND (sotdsize/1073741824) > 0 ORDER BY sotdtablename;

MB:
select sotdschemaname as SCHEMA, sotdtablename,round((sotdsize/1048576),2) as tableMB FROM gp_toolkit.gp_size_of_table_disk WHERE sotdschemaname = 'yourschema' AND (sotdsize/1048576) > 0 ORDER BY sotdtablename;

For an individual table you can get the size of the table as:

select pg_size_pretty(pg_relation_size('schema.table_name'));

You can get table and index size(total) with:

select pg_size_pretty(pg_total_relation_size('schema.table_name'));

For just index size use:
select (pg_size_pretty( pg_total_relation_size('schema.table_name') - pg_relation_size('schema.table_name'))) AS IndexSize;

Saturday, August 20, 2011

Unable to log into OS X

Ran into a problem with being unable to log into an OS X machine joined to an AD domain. I had previously logged in successfully with this account. Logged in as a local administrator on the machine and noticed the account was no longer listed in System Preferences/Accounts but other domain accounts were, odd.  Ran dscl . -list users and the account was listed. There was also a directory structure under /Users/userid. Ran dscl . -delete /Users/userid and was then able to successfully log in and the home directory was untouched so no data was lost.

Friday, August 19, 2011

Fixing size bloat in Greenplum tables

I noticed recently that some queries on a table were running very slow. Simple counts were taking longer on what appeared to be smaller tables. I say appeared because the tables had 1/5 the number of rows as other tables but queries were slower. The raw data files contained about 7 GB of data. To see what the Greenplum system had for table size I ran this query:

select pg_size_pretty(pg_relation_size('schema.table_name'));

The answer was 190GB! Clearly there were problems.  This table does get reloaded every month with new data but I truncate the table before reloading it so you aren't supposed to run into issues. Anyway there turned out to be a couple of solutions. One was to run a vacuum full on the table. After running that the table size was reported as 5.635 MB. I did try a vacuum on the table but it had no impact on size. Another solution is to redistribute the data randomly then redistribute by the table key.

ALTER TABLE schema.table_name SET DISTRIBUTED RANDOMLY;
ALTER TABLE schema.table_name SET DISTRIBUTED BY (table_key);

The first statement will not actually redistribute the data. To make the system actually redistribute the data randomly run this:

ALTER TABLE schema.table_name SET WITH (REORGANIZE=true);

Redistributing by a column name will actually physically redistribute the data.