Benefits and Use Cases of a Data Lake
With the capability to store high-volume, high-velocity raw data in a centralized location, data lakes have become one of the most sought-after technologies for businesses seeking to reduce silos and maximize the value of their data. The pay-as-you-go pricing model, the low cost of storage, and advances in big data technology make data lakes affordable for projects of almost any size.
This article discusses the benefits of data lakes and some typical use cases. For an introduction to data lakes, see the article Data Lakes Explained.
Benefits of a Data Lake
Scales Infinitely
Thanks to inexpensive storage as a service, such as Simple Storage Service (S3) on Amazon Web Services (AWS), data lakes have no practical upper limit in size. Amazon, the online retailer, has over 175 fulfillment centers worldwide, over 1 million employees, and 200 million website visitors per month. To track its petabytes of data in a timely and accurate fashion, Amazon released its Galaxy data lake in 2019 [source].
Flexible Application Use Cases
Since data lakes do not require a pre-defined schema, they can house raw data even before you know what insights you might want to explore in the future. The same raw data could be used to find matching records, remove duplicate entries (de-duplication), clean data for external integrations, index text for search engines, or train statistical models for classification, clustering, detection, and prediction in machine learning (ML) pipelines.
Centralized Governance
When a data lake replaces many data silos, governance is also delivered through a single point of control. Governance covers encryption, access control, auditing/logging, backup/recovery, and compliance with regulations. The General Data Protection Regulation (GDPR) is an example of a user data protection law that requires granular control over access to personal data (see https://gdpr.eu/).
Democratizing Data
Large organizations traditionally work in siloed groups. A shared data lake brings disparate data together into one central location and opens opportunities for improved collaboration across teams. An example from the field of medicine is the data lake at NYU Langone Health, which enabled the organization to report on the effectiveness of clinical quality and safety measures. A repository of integrated cardiovascular patient data allowed the development of predictive statistical models to detect and manage patients with diabetes and hypertension.
Accelerating Data Strategy and Machine Learning
It is no surprise that centralized governance makes it easier to implement a global data strategy. Collecting data should be strategic and relate to a business purpose. Data is a strategic asset that enables innovation. The field of AI/ML depends on large amounts of quality data. A data lake with diverse data sets is the best place to get started.
Reducing Operating Costs
Traditional data warehouses for analytics and decision support systems have been in use for over 30 years. Enterprise-wide data was stored in carefully curated, cleaned, and processed form. Today, a data lake can deliver this and a lot more at a fraction of the cost of a data warehouse. This AWS whitepaper details a data lake serving healthcare data to 70,000 users at a total cost of ownership (TCO) of $24 per month.
Use Cases for Data Lakes
The following common use cases drive the main features of modern data lakes.
Data Ingestion and Integration
The landing zone of a data lake can serve a large variety of data ingestion pipelines, from file transfer protocol (FTP) uploads to file shares to relational databases. It should be easy and intuitive to connect diverse data sources to a lake. Integration products like Talend, Alteryx, Pentaho, Oracle GoldenGate, and Boomi offer quick connectors to externally hosted services like Salesforce, SAP, and NetSuite. AWS also offers powerful integration features in AWS Lake Formation and AWS Data Pipeline. The landing zone of a data lake is the best place to aggregate data across systems into a centralized repository.
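As a minimal sketch of what landing-zone ingestion can look like, the Python snippet below uses boto3 to upload an extracted file to an S3 prefix; the bucket name and key layout are illustrative assumptions, not a prescribed convention.

```python
import datetime
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Hypothetical landing-zone layout: s3://<bucket>/landing/<source>/<date>/<file>
bucket = "my-data-lake"                      # assumed bucket name
today = datetime.date.today().isoformat()
key = f"landing/salesforce/{today}/accounts.csv"

# Upload a locally extracted file into the landing zone of the lake
s3.upload_file("accounts.csv", bucket, key)
print(f"Uploaded to s3://{bucket}/{key}")
```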
Data Cleaning and Normalization
The raw data in a lake can be submitted to a data transformation job for cleaning and processing, traditionally known as extract, transform, load (ETL) jobs. Data cleaning and normalization usually follow custom business rules. Typical use cases are de-duplication, normalization, and imputation (backfilling missing values from statistical models).
Data de-duplication means keeping a single authoritative record of each entity and deleting the rest to avoid overcounting.
Normalization converts data until it fits the pre-defined schema of a database. Strings are converted to numbers, and timestamps to Coordinated Universal Time (UTC) dates. Strings can be scrubbed of unwanted characters (e.g., emojis) and truncated to a maximum length. Float variables can be assigned to the intervals of discrete buckets (e.g., a person's age into age groups).
Data is converted to simple data types (strings, numbers, dates, times) and written out to formats optimized for data analytics (ORC, Avro, Parquet).
Partitioning data by common keys (year, month, day, line-of-business (LOB) ID, etc.) and compressing files additionally speed up data access.
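The following Python sketch illustrates these steps with pandas: de-duplication on a key, UTC timestamp normalization, age bucketing, simple median imputation, and a partitioned Parquet write. All file paths and column names are assumptions for illustration.

```python
import pandas as pd  # requires pyarrow for partitioned Parquet output

# Load raw landing-zone data (hypothetical file and column names)
df = pd.read_csv("accounts.csv")

# De-duplication: keep one authoritative record per customer_id
df = df.drop_duplicates(subset=["customer_id"], keep="last")

# Normalization: parse timestamps as UTC and bucket ages into groups
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 120],
                         labels=["<18", "18-34", "35-49", "50-64", "65+"])

# Imputation: backfill missing revenue with a simple statistic (median)
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Write analytics-friendly, partitioned Parquet (year/month derived from created_at)
df["year"] = df["created_at"].dt.year
df["month"] = df["created_at"].dt.month
df.to_parquet("curated/accounts", partition_cols=["year", "month"], index=False)
```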
Finding Matching Records
Finding matching records between two data sets is a very common use case. If both data sets share a unique identifier, e.g., a Social Security number or an address, matching is as simple as zipping the streams together (a JOIN operation in SQL, if you are an analyst). If two data sets are only approximately similar, fuzzy matching algorithms can be used. Consumer profiles can thus be built from multiple streams of data (point-of-sale records, credit card transactions, account signups, social media posts, etc.).
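Here is a minimal Python sketch of both approaches, assuming small in-memory data sets and illustrative column names: an exact merge on a shared key, and a fuzzy name comparison with difflib.

```python
import difflib
import pandas as pd

crm = pd.DataFrame({"ssn": ["123-45-6789"], "name": ["Jane Q. Doe"]})
pos = pd.DataFrame({"ssn": ["123-45-6789"], "name": ["Jane Doe"], "total": [42.50]})

# Exact match on a shared unique identifier -- the SQL JOIN equivalent
exact = crm.merge(pos, on="ssn", suffixes=("_crm", "_pos"))

# Approximate (fuzzy) match when no shared key exists: compare name similarity
def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

matches = [(a, b) for a in crm["name"] for b in pos["name"] if similar(a, b)]
print(exact, matches, sep="\n")
```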
Data Enrichment
Adding metadata to raw data is an example of data enrichment. Other examples are computing a score from a predictive model or appending additional attributes from a lookup source. AWS Data Exchange offers many public data sets to enrich LOB data, e.g., demographics for a zip code.
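As a small illustration, assuming a demographics lookup table keyed by zip code (the table and the score rule are hypothetical), enrichment can be as simple as a left join plus a derived score:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "zip": ["10001", "94105"]})

# Hypothetical lookup table, e.g., demographics obtained via a data exchange
demographics = pd.DataFrame({
    "zip": ["10001", "94105"],
    "median_income": [72000, 125000],
})

# Enrichment: append lookup attributes and a derived score to each raw record
enriched = orders.merge(demographics, on="zip", how="left")
enriched["high_value_score"] = (enriched["median_income"] > 100000).astype(int)
print(enriched)
```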
Data Access Control
Data lakes can be built to give different access levels to different stakeholders without making copies of the data. AWS Lake Formation adds a security layer that enables full control of access permissions in a declarative way (e.g., a lake administrator can see everything, while a data analyst is restricted from seeing the personally identifiable information (PII) of actual customers).
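Below is a sketch of a declarative, column-level grant using the boto3 Lake Formation API; the database, table, column, and role names are assumptions, and real deployments would typically manage such grants through infrastructure-as-code.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role SELECT on a table while excluding PII columns.
# Database, table, column, and role names below are illustrative assumptions.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "customers",
            "ColumnWildcard": {"ExcludedColumnNames": ["ssn", "email", "phone"]},
        }
    },
    Permissions=["SELECT"],
)
```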
Data Compliance
Consumer privacy regulations like the GDPR and the CCPA (California) give customers the right to know what data is collected about them (right to know) and to request its deletion (right to delete). As the central repository for reporting and data integrations, a data lake is an effective foundation for regulatory compliance.
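As a simplified sketch, assuming curated tables stored as Parquet files that share a customer_id column (paths and column names are hypothetical), a "right to know" export and a "right to delete" rewrite might look like this:

```python
import pandas as pd

# "Right to know": collect every record the lake holds about one customer.
customer_id = "C-1042"
tables = ["curated/accounts.parquet", "curated/orders.parquet", "curated/tickets.parquet"]

export = {}
for path in tables:
    df = pd.read_parquet(path)
    export[path] = df[df["customer_id"] == customer_id]

# "Right to delete": rewrite each table without that customer's rows
for path in tables:
    df = pd.read_parquet(path)
    df[df["customer_id"] != customer_id].to_parquet(path, index=False)
```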
Reporting and Data Sharing
The cleaned-up tables in a data lake can be used to feed dashboards that track business performance metrics. Until now, this has been done with traditional enterprise data warehouses.
The buckets of a data lake can be shared with other cloud accounts or even opened up to the public. Alternatively, access to the data can be given through a data API.
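For example, a dashboard feed could query curated lake tables through Amazon Athena. The sketch below, with an assumed database, table, and result bucket, starts a query, waits for it to finish, and fetches the rows.

```python
import time
import boto3

athena = boto3.client("athena")

# Run a reporting query against curated lake tables (names are illustrative)
run = athena.start_query_execution(
    QueryString="SELECT year, month, SUM(revenue) AS revenue "
                "FROM sales.orders GROUP BY year, month",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)

# Poll until the query finishes, then fetch rows for a dashboard or export
qid = run["QueryExecutionId"]
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```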
Organizing and Searching for Data
Organizations with hundreds or thousands of different data sources need a repository to manage and work with all data streams. The data catalog of a data lake holds the table definitions, database names, and all metadata. Tags are an effective piece of metadata that can be attached to tables and even table columns. A taxonomy of tags can be used to search for data lake artifacts (databases, tables, columns). Data lakes have a search interface to find data by name, description, or tags. Tags can also be used for defining access permissions.
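As an illustration, on AWS the Glue data catalog can be searched programmatically; the sketch below looks for tables whose metadata mentions an assumed search term.

```python
import boto3

glue = boto3.client("glue")

# Search the data catalog for tables whose name, description, or properties
# mention "customer" (the search term is just an example)
resp = glue.search_tables(SearchText="customer", MaxResults=25)
for table in resp["TableList"]:
    print(table["DatabaseName"], table["Name"], table.get("Description", ""))
```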
Replacement of a Data Warehouse
A data lake can replace a data warehouse, with certain limitations. If the data never changes (read-only), a data lake can be used for reporting, thus replacing the warehouse. However, if updates and deletes do occur, a data lake needs to have some of its underlying data files replaced. Until recently, data lakes did not support transactional updates. AWS Lake Formation has since introduced governed tables that support transactions, so users can concurrently and reliably insert, delete, and modify data, just like in an actual database. These are significant developments, as a data lake becomes more like a data warehouse. With the cost of a data lake being a fraction of that of a warehouse, it is a good candidate for replacing the latter.
Streaming Data
A data lake can also be sourced from streaming data. Streams of video, audio, transaction logs, or messages are submitted to analytics pipelines built on data lakes. Continuous reporting can thus be applied to streaming data (e.g., for outlier detection). One of the most common uses of data lakes is near-real-time data collection for the Internet of Things (IoT).
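A minimal sketch of near-real-time ingestion, assuming an Amazon Kinesis Data Firehose delivery stream that lands records in the lake's S3 bucket (the stream name and payload fields are illustrative):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Push an IoT sensor reading into a delivery stream that lands in the lake.
reading = {"device_id": "sensor-17", "temperature_c": 21.4, "ts": "2021-06-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="iot-telemetry-to-s3",
    Record={"Data": (json.dumps(reading) + "\n").encode("utf-8")},
)
```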
Historic Data Repository
A data lake is a good place to store large amounts of historical data that is accessed less frequently. Data warehouses can handle the most recent business data, while historical records can be moved to a lake. This saves space and cost.
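One common approach is an S3 lifecycle rule that transitions older objects to archival storage; the sketch below assumes a bucket, prefix, and retention period chosen purely for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Move curated history older than one year to cheaper archival storage.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-history",
            "Filter": {"Prefix": "curated/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
        }]
    },
)
```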
Industries using Data Lakes
The oil and gas industry has been an early adopter of big data technologies. Companies collect terabytes of measurements and use them in predictive models for exploration, supply chain, and maintenance management.
Smart cities, conceived by governments, academia, and industry, promise to be more livable. The city of the future will be run on data from IoT sensors for traffic and more.
In medicine, continuous monitoring of weight, blood pressure, heart rate, temperature, enzymes, blood cell counts, etc., enhances treatments. A lake of patient data can be used to automate diagnostics.
In marketing, data lakes have long been used to build consumer profiles for advertising and even personalized campaigns.
Organizations in finance, insurance, logistics, and procurement, in short, any business that processes vast amounts of data, can benefit from the significant advances in data lakes.
Summary
Data lakes have many benefits and already serve as an important tool in many data-driven organizations. Their main disadvantage, the lack of support for transactionally consistent updates, is beginning to fade with the introduction of transaction support. The quick pace of innovation, new features for security and compliance, high versatility, and cost-effectiveness make them a viable alternative to a traditional data warehouse.