Transport for NSW - NSW Roads and Maritime Services (RMS)
May 2019 - Present
RMS had a large amount of structured, semi-structured and non-structured residing in a range of formats and locations (both internal and external). None of this data was centrally aggregated and it was consequently impossible to gain insights from the combined data. They wanted a cloud-based data lake solution that could scale up and down in terms of storage and capacity.
Cloudten were engaged to design and deliver a modern data-lake solution that incorporated cloud native services and commercial tools such as Matillion for ETL and Snowflake for the data warehouse component. We actively worked with business units to map out requirements and data integration points.
AWS was used as the hosting platform that was securely and privately connected to the internal network via Direct Connect. AWS PrivateLink was used to enable private connectivity to the Snowflake SaaS such that existing on-premises Tableau services could privately connect through to the data warehouse.
We were also responsible for ETL development to onboard datasources including Oracle and SQL Server endpoints, as well as facilitate a managed file transfer service. All data was staged and curated through a data-lake on AWS S3.
Other cloud native services such AWS Glue, Athena, Lambda andDynamoDB were also used in the solution.
Data transformation and load was implemented via Matillion whilst maintaining end to end encryption both in transit and at rest. This included scheduled batch tasks run on both time and event-based triggers.
As there were varying security classifications of the stored data, access to objects was performed via RBAC from a combination of data “tagging”and metadata management. IAM roles within AWS and Snowflake were ultimately controlled by RMS Active Directory group policy as we federated the solution toAzure AD (which was in turn federated to the on-premises AD).
This project involved working with different customer and 3rd party vendor teams and required a variety of different skillsets from data architecture and engineering to business analysis and ETL (Matillion) development.
The diagram below gives a high-level overview of the solution:
NSW Transport (RMS) were originally trying to use IBM Info sphere and data warehouse products to build out their Data Lake in AWS and had engaged a large multi-national systems integrator for the design and build of the environment. After 4 months of getting nowhere, they engaged Cloudten to come in, re-assess the design and offer cloud native alternatives. At this stage a full pivot had to be done with aggressive timelines and reduced budgets. Wealso had to work with existing vendor and business teams to refactor a number of design elements in line with requirements.
Cloudten was able to introduce Snowflake and Matillion and deliver a working pilot within a 6 week sprint. To do this we used a small focused team who were highly skilled and experienced with the products at hand. We have since been tasked with expanding the original pilot into a full blown environment that actively ingests production data from a range of sources.
AWS Direct Connect, AWS Direct Connect, AWS S3, Glue, Athena, Lambda, IAM Roles and DynamoDB.
Matillion and Snowflake.