Data lake vs. data warehouse: What's your option?
Looking to implement a big data solution, a business has to decide upon the right repository for data storage. There are two variants: battle-tested data warehouses and shiny-new data lakes that have only started gaining popularity.
Our AWS experts will be glad to make the difference between data lakes and data warehouses clear and help you choose the better fit.
What are a data lake and a data warehouse?
A data warehouse is a highly structured repository for storing current and historical data coming from business applications and transactional systems for the needs of operational reporting and analysis. The data structure is defined before it gets to the data warehouse, which allows a comprehensive report with unified data coming from various sources.
A data lake is a centralized repository (established either on-premises or in the cloud) used to store data in its natural or "raw" format, without applying any transformations to it. The data flows into a data lake from the line of business apps as well as from IoT devices, mobile apps, social media, etc. Such data may be further used for reporting, analytics, and machine learning.
As stated by ReportLinker, the data lakes market is expected to witness growth at a CAGR of 27.4% from 2019 to 2024. The main drivers of such growth are the proliferation of IoT devices and AI solutions that are generating and processing huge volumes of varied data, thus causing the need for speedy and efficient data manipulation. Furthermore, using a data lake helps companies to be innovative and get data-driven insights without the exorbitant costs of maintaining a big data solution.
Let's have a look at the similarities and differences between these two options.
Data lake or a data warehouse: 8 points to consider
#1 The state of data
As we've hinted at above, a data lake stores structured (organized in tables with defined relationships and data types), semi-structured (data in XML, JSON, CSV formats) and unstructured data (photos, PDF files, chat logs, emails) coming from diverse sources, such as IoT devices, social media, etc. Conversely, a data warehouse stores only structured data cleansed and organized to fit in the schema.
A data lake consists of the Landing zone with the data that has been cleansed, or filtered; the Staging zone, a storage repository with data from the Landing zone; or data coming from various data sources that doesn't need any filtering or cleansing. Finally, there's the Analytics sandbox used by data analysts for creating prototypes, performing their experiments, and testing new business hypotheses. Only a Staging zone is an obligatory element of a data lake, while others are optional, which adds to the data lake's flexibility.
A data warehouse has a rigid multi-level architecture (Layered Scalable Architecture, or LSA) that consists of a Primary data layer, Core data layer, Data mart layer, and a Service layer. The information from external systems such as ERP, CRM, etc. is transferred into the data warehouse's Primary data layer.
The normalization, deduplication, and cleansing of data take place on the Core data layer. After the data is transformed and brought up to the quality standards, it flows to the Data mart level where it's turned into structures utilized by analytic and reporting solutions. All the operations of data auditing, management, monitoring, and error handling, together with metadata, are on the Service layer, which manages all the above-mentioned layers.
#3 Approach to data processing
When working with data lakes, the data transformation process follows the Extract-Load-Transform (ELT) steps. It means that data is pulled from the original data source, loaded into a data lake, and data cleansing, enrichment, and transformation take place after that. ELT allows the transforming of only the data needed for particular analysis, while other irrelevant data in the data lake remains unaffected and can be further stored for other purposes.
Such flexibility is impossible in a data warehouse, where the Extract-Transform-Load (ETL) process is employed. The data is extracted from varied data sources, goes through cleansing, enrichment, and transformation, and after that is loaded into a data warehouse in a structured form. The old-fashioned ETL transformations were time-consuming and required enormous manual effort (updating the data warehouse with new info meant going through each stage repeatedly). Now, numerous cloud-based ETL solutions make this process quicker and easier.
#4 The purpose of data use
Data lakes aggregate data that doesn't necessarily cater to a specific purpose (e.g., reporting, visualization), but might hypothetically be used for real-time analytics, deep learning, and beyond. Conversely, data in data warehouses is thoroughly selected and structured for generating reports, helping to manage finances, make strategic decisions, and answer particular questions posed by the business.
Data warehouses serve the needs of business users such as managers and the C-suite. They need to have comprehensive data sets structured in a report or a spreadsheet to get more insights and support their decision-making process.
As for data lakes, their main users are data scientists, engineers, and analysts who drill down into unstructured data for deep analysis and experiments, which goes far beyond the data warehouse's capabilities.
Data lakes can be implemented using various technologies, either cloud or on-premises ones. The list of popular cloud platforms includes Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, and IBM Cloud Pak for Data. As for on-premises solutions, they're Hadoop and Apache Cassandra for data storage, and MapReduce or Spark for data processing.
As for the top cloud data warehouse solutions, they're Amazon Redshift, Google BigQuery, and SAP Data Warehouse Cloud. Vertica Analytics Platform, Greenplum, and SAP HANA are on-premises options for building a data warehouse.
A data warehouse has a tangible advantage regarding transactions since they run on IO-optimized storage that ensures maximum performance. Increased performance is also guaranteed by keeping structured data, which is either archived or deleted after its expiration period. Data lakes keep huge amounts of varied data, including the one for the future needs, so they're noted for the storage volume, not transaction speed.
#8 The cost of data storage
An opportunity to store unstructured data from manifold sources makes a data lake a cost-effective option since data requires minimum or no structuring before being uploaded into the data lake. A data warehouse, in its turn, is more costly since it requires thorough data processing and structuring before loading it into the data warehouse, and the process takes a lot of time and effort.
A quick recap
Data lake vs. data warehouse: Together or apart?
If you still haven't picked sides, we'd say that you don't have to choose.
A data lake is no substitute for a data warehouse because they serve different goals. Thus, building an effective big data solution with a data warehouse is inherent for leveraging the potential of data for urgent business needs. When those critical business needs are covered, adding a data lake helps spot untapped business opportunities and get more value from going beyond your data warehouse capabilities. For example, you can leverage the data in a data lake to evolve your current products/services in line with consumer behavior and feedback.
Still, there are some industry-specific cases when a data lake or a data warehouse can better cover the needs of managing data. For instance, data lakes can be of utmost help for healthcare organizations that deal with enormous amounts of unstructured data coming from various sources: patient records, medical apps and devices, physicians' notes, etc.
The same goes for logistics and transportation, where operations can be made more efficient with insights and predictions provided by data lakes. For example, data-driven predictions can help companies better plan and optimize routes, thus cutting costs and increasing efficiency. Another promising use-case for data lakes is elearning, where data-driven insights combined with AI tools allow the analysis of students' performance, spotting improvement opportunities, and creating personalized learning paths to make the learning process match the needs of a particular student.
As for a data warehouse, it's the best possible solution for the highly regulated financial services industry and the public sector, which need to have the varied info properly structured and standardized for company-wide use. In general, valuable applications of a data warehouse can be made for the needs of any industry that requires storing and processing data of the unified format; for instance, images for constructing maps based on aerial photographs. You can have a look at our latest project for Vexcel Imaging that serves as an example.
Now that you know the difference between a data lake and a data warehouse, you can see that using them separately isn't always the best option. To make your big data solution a help and not an obstacle, all that's left to do is analyze your data usage goals and main user-groups (business, or technical users), design your big data solution's architecture, and select the best-suited technical tools.