We all know that data hotspots hurt performance in any distributed data processing engine. This holds true for Hadoop, MPP columnar databases, and other distributed stores. One way to avoid hotspots is to use UUIDs to generate unique IDs.
As Wikipedia puts it, "A UUID is simply a 128-bit value. The meaning of each bit is defined by any of several variants."
Out of the box, AWS Redshift has no UUID function. However, with a Python UDF you can easily create one.
If you want a random UUID (version 4):
CREATE OR REPLACE FUNCTION public.fn_uuid()
RETURNS character varying AS
'
import uuid
return str(uuid.uuid4())
'
LANGUAGE plpythonu VOLATILE;
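Once created, the UDF can be called like any scalar SQL function, for example to stamp every row with an ID at query time. A minimal usage sketch (my_events and its columns are hypothetical names):

-- Each row gets its own generated UUID
SELECT public.fn_uuid() AS id, event_name
FROM my_events;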
If you want a time-based UUID (version 1), which embeds the generation timestamp:
CREATE OR REPLACE FUNCTION public.fn_uuid()
RETURNS character varying AS
'
import uuid
return str(uuid.uuid1())
'
LANGUAGE plpythonu VOLATILE;
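Note that both examples use the same function name, so running the second CREATE OR REPLACE simply overwrites the first. If you want both variants available at once, a sketch with distinct (hypothetical) names:

-- Hypothetical names so both variants can coexist in the same schema
CREATE OR REPLACE FUNCTION public.fn_uuid4()
RETURNS character varying AS
'
import uuid
# random UUID (version 4)
return str(uuid.uuid4())
'
LANGUAGE plpythonu VOLATILE;

CREATE OR REPLACE FUNCTION public.fn_uuid1()
RETURNS character varying AS
'
import uuid
# time-based UUID (version 1)
return str(uuid.uuid1())
'
LANGUAGE plpythonu VOLATILE;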
Phenomenal, thanks for coding this up! How would you alter a table to add this as a new column?
You cannot default the new column to a UUID (Redshift doesn't allow a UDF as a column default). The workaround is to add a new column to the table (ALTER TABLE ... ADD COLUMN ... <datatype>) and then populate it with UUIDs (UPDATE ... SET column = fn_uuid()).
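A minimal sketch of that workaround, assuming a hypothetical table my_events and a new id column:

-- 1. Add the column (the string form of a UUID is 36 characters)
ALTER TABLE my_events ADD COLUMN id varchar(36);

-- 2. Backfill it with generated UUIDs
UPDATE my_events SET id = public.fn_uuid();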
Was looking for this, didn't even think it was possible. Thanks.
Thanks for this code.