A recent trend in the data science and data engineering world is the “data lake”. According to Wikipedia:
A data lake is a system or repository of data stored in its natural/raw format. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
Why We Call It a “Lake”
Data lakes can be thought of as organic, like a lake in nature, because they can store data of any structure (structured, semi-structured, and unstructured).
Why Should We Use Data Lakes
The main reasons to use a data lake are:
- Increase operational efficiency
- Make data available from departmental silos
- Lower transactional costs
- Offload capacity from databases and warehouses
- Store data without thinking about its structure
Characteristics of Data Lakes
The main characteristics of data lakes are:
- They are data agnostic, meaning they are not limited to storing just one type of data. This is a key difference from data warehouses, which expect structured data.
- They are “future proof”: you may not have a specific question to answer today, but if your manager asks one in the future, you will most likely be able to answer it because you have the data in raw format.
- They support two kinds of processing: processing that occurs before or while data is ingested, and processing that occurs after the data has been stored, such as cleansing, aggregating, transforming, or merging with other datasets.
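The two kinds of processing can be illustrated with a minimal sketch. It assumes a toy in-memory "lake" (a Python list of raw JSON strings); the names `lake`, `ingest`, and `total_amount_by_source` are hypothetical and chosen only for illustration, not taken from any data lake product.

```python
import json
from collections import defaultdict

# Hypothetical in-memory "lake": a list of raw JSON strings.
lake = []

def ingest(record, source):
    """Processing during ingestion: tag the record with its source, then store it raw."""
    enriched = {"source": source, "payload": record}
    lake.append(json.dumps(enriched))

def total_amount_by_source():
    """Processing after storage: cleanse and aggregate the stored records."""
    totals = defaultdict(float)
    for line in lake:
        item = json.loads(line)
        amount = item["payload"].get("amount")
        if amount is None:  # cleansing step: skip records without an amount
            continue
        totals[item["source"]] += float(amount)
    return dict(totals)

# Records with different shapes are all accepted at ingestion time.
ingest({"amount": "10.5"}, source="web")
ingest({"amount": 4}, source="web")
ingest({"note": "no amount here"}, source="pos")
ingest({"amount": 2.5}, source="pos")

print(total_amount_by_source())  # {'web': 14.5, 'pos': 2.5}
```

Note that the record without an amount is still stored; it is only filtered out later, by the processing step that happens to need amounts.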
Data Lake Components
The four main components of a data lake are:
- Ingest and Store
- Catalog and Search
- Process and Serve
- Protect and Secure
Comparison of a Data Lake to a Data Warehouse
A data warehouse is usually a database optimized for analytical queries that lead to insights. Because it operates as an analytical database, you need to create tables and define their structure before adding data to the warehouse. When you create those tables, you have to set the columns and data types, in other words a data schema, which generally requires the information to be structured. When the schema must be determined before you write the data, you have what we call a schema-on-write architecture. Although schema-on-write is good for data normalization because it rejects data that does not fit the defined format, it is not ideal for flexibility, which is where data lakes really shine.
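As a rough illustration of schema-on-write, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a warehouse engine. The `sales` table, its columns, and the `try_insert` helper are hypothetical. Because SQLite is loosely typed by default, the sketch assumes a `CHECK` constraint is an acceptable way to enforce the column type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared before any data is written.
conn.execute(
    """
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount REAL NOT NULL CHECK (typeof(amount) IN ('integer', 'real'))
    )
    """
)

def try_insert(customer, amount):
    """Return True if the row fits the schema, False if the database rejects it."""
    try:
        conn.execute(
            "INSERT INTO sales (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("alice", 19.99)               # fits the schema
rejected_missing = try_insert(None, 5.0)            # violates NOT NULL
rejected_type = try_insert("bob", "not a number")   # violates the CHECK constraint
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print(accepted, rejected_missing, rejected_type, row_count)  # True False False 1
```

Only the row that matches the declared structure is stored; the other two are rejected at write time.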
Data lakes are what we call schema-on-read, and that is the first fundamental difference between data lakes and data warehouses. Data lakes can handle unstructured data and mainly operate in a schema-on-read fashion, which means that you do not need to concern yourself with the data schema while ingesting data into the lake. That allows you to deal with the schema only when you read the data for later processing, hence the name schema-on-read.
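A minimal sketch of schema-on-read, assuming the raw data is stored as JSON lines: records of different shapes are kept untouched, and a schema (here `user` as a string and `amount` as a float) is applied only when reading. The sample records and the `read_purchases` function are hypothetical.

```python
import json

# Raw records are stored as-is; their shapes differ and nothing is validated on write.
raw_lines = [
    '{"user": "alice", "amount": "19.99", "currency": "USD"}',
    '{"user": "bob", "amount": 5}',
    '{"user": "carol", "comment": "no purchase"}',
]

def read_purchases(lines):
    """Apply the schema (user: str, amount: float) only at read time."""
    purchases = []
    for line in lines:
        record = json.loads(line)
        if "user" not in record or "amount" not in record:
            continue  # this reader ignores records that lack the fields it needs
        purchases.append(
            {"user": str(record["user"]), "amount": float(record["amount"])}
        )
    return purchases

purchases = read_purchases(raw_lines)
print(purchases)
# [{'user': 'alice', 'amount': 19.99}, {'user': 'bob', 'amount': 5.0}]
```

A different reader could later apply a different schema to the same raw lines, for example one that only cares about the `comment` field.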
Another difference between data lakes and data warehouses is that data warehouses mostly use SQL as the query language. That limits what you can do, although some engines support user-defined functions and other features that extend it a little.
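To show what a user-defined function looks like in practice, here is a small sketch that again uses Python's built-in `sqlite3` module as a stand-in for a SQL engine. The `domain_of` function and the `users` table are hypothetical. Real warehouse engines each have their own UDF syntax.

```python
import sqlite3

def domain_of(email):
    """Plain Python function that will be exposed to SQL as a user-defined function."""
    return email.split("@")[-1].lower()

conn = sqlite3.connect(":memory:")

# Register the Python function under the SQL name "domain_of", taking one argument.
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("a@Example.com",), ("b@example.com",), ("c@test.org",)],
)

# The UDF can now be called from SQL like a built-in function.
rows = conn.execute(
    """
    SELECT domain_of(email), COUNT(*)
    FROM users
    GROUP BY domain_of(email)
    ORDER BY domain_of(email)
    """
).fetchall()

print(rows)  # [('example.com', 2), ('test.org', 1)]
```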
Finally, while data warehouses work with structured data only, data lakes work with both unstructured and structured data natively.