Skip to main content

Creating a UUID function in Redshift

We all know the data hotspots have negative impact on the performance in any distributed data processing environment and engine.  This holds true for hadoop / MPP columnar and other databases.  One way to avoid hotspots is to use the UUID to generate unique Ids.

As defined by wikipedia a UUID is " A UUID is simply a 128-bit value. The meaning of each bit is defined by any of several variants."


By default there is no UUID function in AWS Redshift.  However with the python UDF you can easily create a UUID function in Redshift.

If you want random UUID

CREATE OR REPLACE FUNCTION public.fn_uuid()
RETURNS character varying AS
' import uuid
return uuid.uuid4().__str__()
'
LANGUAGE plpythonu VOLATILE;

If you want sequential UUID 

CREATE OR REPLACE FUNCTION public.fn_uuid()
RETURNS character varying AS
' import uuid
return uuid.uuid1().__str__()
'
LANGUAGE plpythonu VOLATILE;




Comments

  1. Phenomenal, thanks for coding this up! How would you alter a table to add this as a new column?

    ReplyDelete
  2. You cannot default the new column to uuid (Redshift doesnt allow udf as default). The workaround is to add a new column to the table (alter table add column datatype) and then update the value with uuid ( update table set column=fn_uuid() )

    ReplyDelete
  3. Was looking for this, didn't event think it was possible. Thanks.

    ReplyDelete

Post a Comment

Popular posts from this blog

AWS Aurora Performance Review

AWS Aurora is the only PaaS offering for a Relational DBMS based on MYSQL platform.  Aurora is a game changer for many companies in a way.  Having worked my way through figuring out the different aspects here are the observations : Good : Almost Full Stack Mysql compatibility.  The scale up is painless Cheaper and Better alternative to other other RDBMS Benefits of any of the PaaS offering The Read Speed can be increased by creating multi AZ(Availability Zone) configuration. Bad : The write speed is poor when compared to the read speed. Lack of bulk import functionality, that makes data ingestion painful Overall : It is a very good alternative to other RDS instances (SQL Server / Oracle) Cost effective and better alternative to dynamodb. If you have structured data that can be handled by traditional DBMS.

AWS Aurora Bulk Load Performance Issues - Resolved

We have had performance issues when loading the bulk data into the AWS Aurora.  The bulk load performance was so bad that it was nearly worthless pushing around 2 million rows in to AWS Aurora.  We were inserting about 1000 records per second.  This was much worse comparing with the other MySQL counterparts like MySQL, MariaDB etc. However a few tweaks to the parameter and it resolved most of the performance issues we faced in the bulk Load. The solution is to add two parameters when you connect to the AWS Aurora jdbc for bulk load. These two parameters are : useServerPrepStatmts =false rewriteBatchedStatements =true Your full JDBC connection string should look like “jdbc:mysql://host:3306/db? useServerPrepStmts=false & rewriteBatchedStatements=true ", "username", “password”” Once we changed these parameters, the performance was blazing fast.  We were able to load the 2 million rows in flat 3 minutes. The Aurora Sever used in the benchark was r3.xlar