
Posts

AWS Aurora Bulk Load Performance Issues - Resolved

We ran into performance issues when bulk loading data into AWS Aurora. Performance was so poor that pushing around 2 million rows into Aurora was barely workable: we were inserting about 1000 records per second, much worse than other MySQL-family databases such as MySQL and MariaDB. However, a few tweaks to the connection parameters resolved most of the performance issues we faced in the bulk load. The solution is to add two parameters to the JDBC connection you use for the bulk load: useServerPrepStmts=false and rewriteBatchedStatements=true. Your full JDBC connection call should look like "jdbc:mysql://host:3306/db?useServerPrepStmts=false&rewriteBatchedStatements=true", "username", "password". Once we changed these parameters, the load was dramatically faster: we were able to load the 2 million rows in 3 minutes flat, roughly 11,000 rows per second. The Aurora server used in the benchmark was r3.xlar
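To make the two parameters concrete, here is a minimal Python sketch. It is an illustration only: `jdbc_url` and `rewrite_batch` are hypothetical helpers, not part of any driver. The first builds the connection URL without stray spaces; the second shows the general idea behind batch rewriting, where many single-row INSERTs are collapsed into one multi-row INSERT so that fewer statements cross the network.

```python
def jdbc_url(host, database, port=3306, **params):
    """Build a MySQL-style JDBC URL; parameters are joined with '&' and no spaces."""
    query = "&".join(f"{key}={value}" for key, value in params.items())
    base = f"jdbc:mysql://{host}:{port}/{database}"
    return f"{base}?{query}" if query else base


def rewrite_batch(table, columns, rows):
    """Illustrate batch rewriting: one multi-row INSERT instead of len(rows) statements.

    This only mimics the shape of the rewritten SQL; it is not the driver's code.
    """
    placeholders = "(" + ", ".join(["?"] * len(columns)) + ")"
    values = ", ".join([placeholders] * len(rows))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values}"


url = jdbc_url(
    "host",
    "db",
    useServerPrepStmts="false",
    rewriteBatchedStatements="true",
)
print(url)
print(rewrite_batch("t", ["a", "b"], [(1, 2), (3, 4), (5, 6)]))
```

The table name `t` and columns `a`, `b` are placeholders chosen for the example.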
Recent posts

Google Cloud Spanner

Google recently released Cloud Spanner. Cloud Spanner promises to be the first and only relational database service that is both strongly consistent and horizontally scalable. It promises the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally to hundreds or thousands of servers, so it can handle the highest of transactional workloads. With automatic scaling, synchronous data replication, and node redundancy, Cloud Spanner delivers up to 99.999% (five 9s) availability for your mission-critical applications. You can get more details about Cloud Spanner at https://cloud.google.com/spanner/ .

Create Strip, LStrip, RStrip Functions in Redshift

There is no built-in Redshift function to strip a character from the start or end of a string. For example, if a numeric value is stored as a character string with leading zeros, such as '0000123', and you want to store, operate on, aggregate, or join it as a number, it is not possible with the built-in Redshift functions. What you can do is create the following strip UDFs and make use of them. lstrip: strips the leading instances of a character from a string. CREATE OR REPLACE FUNCTION public.fn_lstrip(str_in character varying, a character) RETURNS character varying AS ' try: return(str_in.lstrip(a)) except: return None' LANGUAGE plpythonu VOLATILE; e.g. select public.fn_lstrip('00001234','0') would result in 1234. rstrip: strips the trailing instances of a character from a string. CREATE OR REPLACE FUNCTION public.fn_rstrip(str_in character varying, a character) RETURNS character varying AS ' try: return(str_in.rstrip(a)) except: return None&
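Because the UDF bodies are plain Python, their behavior can be checked locally before creating the functions in Redshift. The sketch below mirrors the bodies shown above; `fn_strip` is an assumed both-ends counterpart suggested by the post title, not code taken from the excerpt.

```python
def fn_lstrip(str_in, a):
    """Strip leading occurrences of the character(s) in `a`; None on bad input."""
    try:
        return str_in.lstrip(a)
    except Exception:
        return None


def fn_rstrip(str_in, a):
    """Strip trailing occurrences of the character(s) in `a`; None on bad input."""
    try:
        return str_in.rstrip(a)
    except Exception:
        return None


def fn_strip(str_in, a):
    """Assumed counterpart: strip from both ends."""
    try:
        return str_in.strip(a)
    except Exception:
        return None


print(fn_lstrip("00001234", "0"))   # leading zeros removed
print(fn_rstrip("12340000", "0"))   # trailing zeros removed
print(fn_strip("0012300", "0"))     # both ends stripped
print(fn_lstrip(None, "0"))         # a NULL-like input falls through to None
print(repr(fn_lstrip("0000", "0"))) # all-zero input becomes an empty string
```

Two details are worth keeping in mind: a string made up only of the stripped character comes back empty, which will not cast to a number, and Python treats the second argument as a set of characters, so passing more than one character strips any of them.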

AWS Redshift Best Practices

Here are best practices for AWS Redshift, based on my experience. Data Loading: The best way to load data into AWS Redshift is the Redshift COPY command. COPY can load data from a variety of sources, including an AWS S3 bucket, an EMR cluster, or any host that can be accessed using SSH. COPY can load data from DynamoDB as well. COPY loads data in parallel from the source, and the data is imported and stored more efficiently than with the INSERT command. Keys: Make sure to add keys (sort key, distribution key) to the Redshift table. Sort Key: Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans. Distribution Key: When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a tabl

AWS Aurora Performance Review

AWS Aurora is the only PaaS offering for a relational DBMS based on the MySQL platform. Aurora is in some ways a game changer for many companies. Having worked through its different aspects, here are my observations. Good: Almost full-stack MySQL compatibility. Scaling up is painless. A cheaper and better alternative to other RDBMSs. The benefits of any PaaS offering. Read speed can be increased by creating a multi-AZ (Availability Zone) configuration. Bad: Write speed is poor compared to read speed. The lack of bulk import functionality makes data ingestion painful. Overall: It is a very good alternative to other RDS instances (SQL Server / Oracle). It is a cost-effective and better alternative to DynamoDB if you have structured data that can be handled by a traditional DBMS.

Creating a UUID function in Redshift

We all know that data hotspots have a negative impact on performance in any distributed data processing environment or engine. This holds true for Hadoop, MPP columnar, and other databases. One way to avoid hotspots is to use UUIDs to generate unique IDs. As defined by Wikipedia: "A UUID is simply a 128-bit value. The meaning of each bit is defined by any of several variants." By default there is no UUID function in AWS Redshift. However, with a Python UDF you can easily create one. If you want a random UUID: CREATE OR REPLACE FUNCTION public.fn_uuid() RETURNS character varying AS ' import uuid return uuid.uuid4().__str__() ' LANGUAGE plpythonu VOLATILE; If you want a sequential (time-based) UUID: CREATE OR REPLACE FUNCTION public.fn_uuid() RETURNS character varying AS ' import uuid return uuid.uuid1().__str__() ' LANGUAGE plpythonu VOLATILE;
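The two UDF bodies differ only in which function of Python's standard `uuid` module they call. A minimal local sketch of the difference, run outside Redshift:

```python
import uuid

# Version 4: generated from random bits, so values are spread evenly.
random_id = str(uuid.uuid4())

# Version 1: built from a timestamp plus a node identifier. Values generated
# close together share structure, but they are not a strict sequence.
time_id = str(uuid.uuid1())

print(random_id, uuid.UUID(random_id).version)
print(time_id, uuid.UUID(time_id).version)
```

Both forms are 36-character strings (32 hex digits and 4 hyphens), so a `character varying` return type holds either. For spreading rows evenly, the random version 4 form is the safer default, since version 1 values generated at nearly the same time resemble one another.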

Best Query Tool for Redshift

Amazon Redshift is an MPP columnar database offering from Amazon based on the ParAccel/PostgreSQL engine. It provides ODBC/JDBC connectivity options but does not have an editor or query tool of its own. In over a year of working with Redshift, I have used a plethora of tools, SQuirreL and SQL Workbench to name a few. I found Aginity Workbench to be the best tool to use with Redshift. You can register and get a fully functional free version from their website. Pros: - Intelligent typing - Scripting of objects - Nice grid for output Cons: - The generated DDL script is not formatted. - The abort operation is not smooth