A data lake is a complex product, usually custom-built for each business need. In the customer case presented here, the goal was to replace an on-premise data lake based on the Cloudera solution. The change was driven by recurring scalability problems and storage that had reached saturation. In this section, we review which cloud services and which project organisation were chosen to best meet the customer's needs.
Data lake projects often carry an additional layer of complexity because of the number of actors involved.
This project was a joint effort between the traditional IT department and the team dedicated to innovation. The client's first requirement was to start building the service from day one, so that they could achieve results quickly.
To address this challenge, we launched the build phase and the onboarding of the IT department in parallel. The objective was to build a data lake with Dataiku, without distributed computing capacity in order to keep the architecture simple, and then to migrate the data lake and its current use cases to the AWS Cloud.
If you want to learn more about:
- The project context
- The target architecture
- A use case
- The challenge of a scalable solution
- AWS Step Functions