In the previous part we discussed the importance of building your AWS S3 data lake in accordance with industry best practices to keep it functioning well and ready to deliver data insights. We mentioned object tagging as a simple and efficient technique capable of providing a number of benefits for your day-to-day business activity.
This time we are going to continue our coverage of AWS best practices and discuss data cataloguing as another effective approach for getting your data lake working. According to the Gartner report, a data catalogue could be a primary technology for data management, one that can put an end to the constant struggle of finding, inventorying and analysing vastly distributed and diverse data assets.
AWS provides a range of services that we can utilise to build a comprehensive data catalogue for keeping track of all of the raw assets as they are populated into the data lake, and then tracking all of the new data assets and versions generated by data transformation, data processing and analytics.
Below is a high-level architecture of a data cataloguing solution built with AWS S3, AWS Lambda, Amazon RDS and Amazon Elasticsearch Service.
Data cataloguing solution in AWS
In this architecture, data derived from different sources is loaded into the S3 data lake. Every load operation acts as an event trigger for an AWS Lambda function that writes object names and related metadata into a database served by Amazon RDS. Searching the data catalogue for specific assets, their related metadata and data classifications is handled by Amazon Elasticsearch Service.
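To make the Lambda step more concrete, here is a minimal sketch of what such a function might look like in Python. It is an illustration under stated assumptions, not the exact implementation behind the diagram: the function names, the `catalogue` table and its columns are hypothetical, and an in-memory SQLite database stands in for Amazon RDS so the example runs without any AWS resources. The event fields it reads (`Records`, `s3.bucket.name`, `s3.object.key`, `size`, `eTag`, `eventTime`) follow the standard S3 event notification layout.

```python
import sqlite3
from urllib.parse import unquote_plus


def extract_records(event):
    """Turn an S3 event notification into flat catalogue rows."""
    rows = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        obj = s3["object"]
        rows.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "object_key": unquote_plus(obj["key"]),
            "size": obj.get("size", 0),
            "etag": obj.get("eTag", ""),
            "event_time": record.get("eventTime", ""),
        })
    return rows


def save_records(conn, rows):
    """Upsert rows into a hypothetical `catalogue` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalogue ("
        "bucket TEXT, object_key TEXT, size INTEGER, "
        "etag TEXT, event_time TEXT, "
        "PRIMARY KEY (bucket, object_key))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO catalogue VALUES "
        "(:bucket, :object_key, :size, :etag, :event_time)",
        rows,
    )
    conn.commit()


# Assumption: in-memory SQLite is a stand-in for the Amazon RDS database.
# A real deployment would open a connection to the RDS endpoint instead.
CONN = sqlite3.connect(":memory:")


def handler(event, context):
    """Lambda-style entry point: catalogue every object in the event."""
    rows = extract_records(event)
    save_records(CONN, rows)
    return {"catalogued": len(rows)}
```

Keying the table on bucket and object name means that re-uploading the same object updates its catalogue entry rather than duplicating it. Indexing the same rows into Amazon Elasticsearch Service for search would be a further step that this sketch leaves out.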
While consulting our clients, we at Cloudten have seen how such an AWS-based data catalogue helps organisations break down the barriers to data lake adoption and get value from it. By implementing this solution, a company can secure a single source of truth about the objects in the data lake, delivered through a queryable interface to all raw assets stored in the S3 buckets. Furthermore, the better understanding of data that comes from cataloguing it makes it possible to implement a comprehensive information security policy, which is vital for organisations working with sensitive data.