
Get Your AWS S3 Data Lake Working. Part 2

Mykola Khivrych
|
Technical
|
August 31, 2020

In the previous part we discussed the importance of building your AWS S3 data lake in accordance with industry best practices to keep it well functioning and ready to deliver data insights. We mentioned object tagging as a simple and efficient technique capable of providing a number of benefits for your day-to-day business activity.

This time we are going to continue our coverage of AWS best practices and discuss data cataloguing as another effective approach for getting your data lake working. According to the Gartner report, a data catalogue can be a premier technology for data management, one that can put an end to the constant struggle of finding, inventorying and analysing vastly distributed and diverse data assets.


AWS provides a range of services that we can utilise to build a comprehensive data catalogue that keeps track of all of the raw assets as they are loaded into the data lake, and then tracks the new data assets and versions generated by data transformation, data processing and analytics.
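To make the idea of tracking raw assets and their derived versions concrete, here is a minimal sketch of what a catalogue table might look like. The schema, table name and column names are our own illustrative assumptions, not a fixed AWS layout; sqlite3 stands in here for the Amazon RDS database that would actually host it.

```python
import sqlite3

# Hypothetical catalogue schema; in the real solution this table would
# live in Amazon RDS. sqlite3 is used purely to illustrate the idea.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_catalogue (
        id          INTEGER PRIMARY KEY,
        bucket      TEXT NOT NULL,
        object_key  TEXT NOT NULL,
        version     INTEGER NOT NULL DEFAULT 1,
        size_bytes  INTEGER,
        stage       TEXT,            -- e.g. 'raw', 'transformed', 'analytics'
        loaded_at   TEXT,
        UNIQUE (bucket, object_key, version)
    )
""")

# Register a raw asset, then a transformed version of the same object key.
conn.execute(
    "INSERT INTO data_catalogue "
    "(bucket, object_key, version, size_bytes, stage, loaded_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("my-data-lake", "sales/2020/08/orders.csv", 1,
     10485760, "raw", "2020-08-31T10:00:00Z"),
)
conn.execute(
    "INSERT INTO data_catalogue "
    "(bucket, object_key, version, size_bytes, stage, loaded_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("my-data-lake", "sales/2020/08/orders.csv", 2,
     9437184, "transformed", "2020-08-31T11:00:00Z"),
)

# The catalogue can answer: what is the latest version of this asset?
latest = conn.execute(
    "SELECT version, stage FROM data_catalogue "
    "WHERE object_key = ? ORDER BY version DESC LIMIT 1",
    ("sales/2020/08/orders.csv",),
).fetchone()
print(latest)  # (2, 'transformed')
```

The point of the sketch is the uniqueness constraint on (bucket, object_key, version): every transformation appends a new versioned row rather than overwriting, so both the raw asset and everything derived from it stay queryable.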


Below is a high-level architecture of a data cataloguing solution built with AWS S3, AWS Lambda, Amazon RDS and Amazon Elasticsearch Service.


Data cataloguing solution in AWS

In this architecture, data derived from different sources is loaded into the S3 data lake. Every load operation acts as an event trigger for an AWS Lambda function that populates object names and related metadata into a database served by Amazon RDS. Searching the data catalogue for specific assets, their related metadata and data classifications is performed through Amazon Elasticsearch Service.
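The Lambda step of this flow can be sketched as below. The shape of the incoming event follows the documented S3 event notification structure; the output field names (bucket, key, size, event_time) and the handler split are our own illustrative choices, and the database write is left as a comment since it depends on your RDS engine and driver.

```python
import urllib.parse


def extract_catalogue_entries(event):
    """Turn an S3 put-event payload into catalogue records."""
    entries = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        entries.append({
            "bucket": s3["bucket"]["name"],
            # S3 URL-encodes object keys in event payloads
            # (spaces arrive as '+'), so decode before storing.
            "key": urllib.parse.unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size"),
            "event_time": record.get("eventTime"),
        })
    return entries


def lambda_handler(event, context):
    entries = extract_catalogue_entries(event)
    # In the real function, each entry would be INSERTed into the RDS
    # catalogue database here (e.g. via pymysql or psycopg2) and,
    # optionally, indexed into Amazon Elasticsearch Service.
    return {"catalogued": len(entries)}


# A trimmed-down sample event in the S3 notification shape:
sample_event = {
    "Records": [{
        "eventTime": "2020-08-31T10:00:00.000Z",
        "s3": {
            "bucket": {"name": "my-data-lake"},
            "object": {"key": "sales/2020/08/orders+report.csv",
                       "size": 1024},
        },
    }]
}
print(lambda_handler(sample_event, None))  # {'catalogued': 1}
```

Keeping the event-parsing logic in its own pure function makes the metadata extraction testable locally, without deploying to Lambda or standing up a database.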

While consulting for our clients, we at Cloudten have seen how such an AWS-based data catalogue helps organisations break down the barriers to data lake adoption and get value from it. By implementing this solution, a company can secure a single source of truth about the objects in the data lake, delivered through a queryable interface to all raw assets stored in the S3 buckets. Furthermore, the better understanding of data that comes from cataloguing it makes it possible to implement a comprehensive information security policy, which is vital for organisations working with sensitive data.
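As a small illustration of that queryable interface, the helper below composes an Elasticsearch bool query over catalogue documents. The field names (`key`, `classification`) are hypothetical examples of what a catalogue index might contain, not a fixed schema; the query DSL shape itself is standard Elasticsearch.

```python
def build_catalogue_query(text=None, classification=None):
    """Compose an Elasticsearch bool query over a hypothetical
    catalogue index with 'key' and 'classification' fields."""
    must = []
    if text:
        # Full-text match against the object key, e.g. 'orders'.
        must.append({"match": {"key": text}})
    if classification:
        # Exact filter on a security classification tag.
        must.append({"term": {"classification": classification}})
    # With no criteria, fall back to matching everything.
    return {"query": {"bool": {"must": must or [{"match_all": {}}]}}}


# Find sensitive assets whose key mentions 'orders':
body = build_catalogue_query(text="orders", classification="sensitive")
```

Combining a free-text match with a classification filter like this is exactly the kind of query that supports an information security policy: it lets you enumerate, for example, every sensitive asset in a given part of the lake.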


Mykola Khivrych
A Data Engineer for Cloudten Industries, he has over nine years of industry experience in large, technically complex environments, with exposure to a diverse array of IT products and platforms, predominantly in the areas of Data Warehousing, Cloud Technologies, Database Development and Administration, Networking and Information Security. He has worked for several IT product companies and governmental institutions in the fields of transport, international humanitarian law, stock exchanges and telecommunications.
