AWS offers Amazon Redshift as its proprietary cloud-native platform for building your enterprise data warehouse. It comes with a number of features such as columnar data storage, data compression, zone maps, a massively parallel processing architecture and optimised RA3 compute instances. Together these technologies make it possible to build a high-performing data warehousing solution on top of the practically unlimited resources of the cloud.
However, to unleash the full potential of Amazon Redshift you need to follow the guidelines developed by the platform's experts. Even though Amazon Redshift is probably one of the best tools available on the market, final query performance is heavily determined by the quality of the table design used to store your data.
AWS recommends a set of practices aimed at increasing query performance while reducing the compute and storage resources required to calculate a result set. The following are two simple table design guidelines that can help you get the maximum performance from your Amazon Redshift data warehouse.
DATE and TIMESTAMP data types for date and time information
Quite often, date and time information is extracted from the source as a CHAR or VARCHAR data type. It can also be a deliberate design decision: such information is intentionally loaded into data warehouse landing tables with VARCHAR columns to mitigate possible data quality issues, for example to prevent load failures from data type mismatches caused by the absence of data governance at the source system or by problems with the data extract job itself.
However, Amazon Redshift stores information in DATE and TIMESTAMP data types more efficiently than in CHAR or VARCHAR. Running analytical queries over properly typed date and time values therefore scans fewer bytes and allows the query to finish in less time.
Furthermore, this approach can improve data integrity in your solution, as it prevents invalid date and time information from being populated into your tables. Only meaningful, valid data is permitted to enter your data warehouse, and any issues are detected at the earliest stage, giving you the ability to react and fix problems in a timely manner.
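As a minimal sketch of this pattern, a landing table might keep the raw VARCHAR values while the target table stores native types; the cast in the transform step surfaces any bad values early. The schema, table and column names and the timestamp format below are hypothetical, not taken from a real system:

```sql
-- Hypothetical landing table: raw values arrive as VARCHAR
CREATE TABLE landing.orders_raw (
    order_id     VARCHAR(20),
    order_ts_raw VARCHAR(30)   -- e.g. '2023-06-01 14:30:00'
);

-- Target table: date and time stored in a native type
CREATE TABLE dwh.orders (
    order_id BIGINT,
    order_ts TIMESTAMP
);

-- Casting during the transform step rejects invalid values
-- before they reach the target table
INSERT INTO dwh.orders (order_id, order_ts)
SELECT CAST(order_id AS BIGINT),
       TO_TIMESTAMP(order_ts_raw, 'YYYY-MM-DD HH24:MI:SS')
FROM landing.orders_raw;
```

If a row carries a malformed timestamp, the INSERT fails at load time rather than silently polluting downstream analytics.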
Maintaining minimal column size
One reason to introduce VARCHAR columns with the maximum possible size is to prevent data loads from failing because of insufficient column size. It can therefore be completely justified to use this approach for the landing tables in your data warehouse, so that the load part of an ELT job runs in a lossless manner.
However, it is not recommended to apply the same design to the final target tables. Even though Amazon Redshift implements a highly efficient data compression algorithm and the maximum column size has no significant impact on table size, the negative consequence is elevated memory and disk space consumption during query execution. This is because the platform does not compress the temporary tables created to store intermediate data sets while performing calculations for large analytic queries. As a result, your SQL code will run slower and consume more compute resources.
Finally, minimal column sizes bring another benefit: better control over the data that is actually stored in the data warehouse. By profiling the data up front, you can understand the valid range of values and design your tables so that they admit only genuine data and explicitly reject anomalies. This practice strengthens the ability of your data warehouse to act as the source of truth for your enterprise data. Additionally, you get an early-warning mechanism that identifies possible data quality issues in the source systems from which the data assets are extracted.
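The profiling exercise described above can be sketched in two steps: measure the longest value actually present, then size the target column accordingly. The landing.customers_raw and dwh.customers tables and their columns are hypothetical, used only to illustrate the idea:

```sql
-- Hypothetical profiling query: find the longest value actually
-- present before fixing the target column size
SELECT MAX(LEN(country_code)) AS max_len
FROM landing.customers_raw;

-- Right-sized target column: admits genuine two-character codes
-- and rejects anomalies at load time
CREATE TABLE dwh.customers (
    customer_id  BIGINT,
    country_code CHAR(2)   -- sized from the profiling result above
);
```

A value longer than two characters now fails the load instead of being silently stored, turning the column definition itself into an early-warning check on source data quality.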