data lake [English]
- RT: data warehouse
n. ~ A collection of many large datasets kept in their original, unrefined formats, without normalization or transformation, allowing users to search, select, and analyze the data in its original context.
- Gartner IT Glossary (†298 s.v. "data lake"): A collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).
- Amazon Web Services, Data Lake 2017 (†867 ): A Data Lake allows an organization to store all of their data, structured and unstructured, in one centralized repository. Since data can be stored as-is, there is no need to convert it to a predefined schema and you no longer need to know what questions you want to ask of your data beforehand. ¶ A Data Lake should support the following capabilities: · Collecting and storing any type of data, at any scale and at low cost · Securing and protecting all of the data stored in the central repository · Searching and finding the relevant data in the central repository · Quickly and easily performing new types of data analysis on datasets · Querying the data by defining the data’s structure at the time of use (schema on read) ¶ Furthermore, a Data Lake isn’t meant to replace your existing Data Warehouses, but rather complement them. If you’re already using a Data Warehouse, or are looking to implement one, a Data Lake can be used as a source for both structured and unstructured data, which can be easily converted into a well-defined schema before ingesting it into your Data Warehouse. A Data Lake can also be used for ad hoc analytics with unstructured or unknown datasets, so you can quickly explore and discover new insights without the need to convert them into a well-defined schema. (†2607)
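The "schema on read" capability cited above can be illustrated with a minimal sketch: raw records land in the lake exactly as received, and a structure is imposed only at query time. This is a hypothetical illustration, not AWS's implementation; the file layout, field names, and `read_with_schema` helper are assumptions for the example.

```python
# Sketch of "schema on read": raw records are stored as-is
# (newline-delimited JSON) and a schema is applied only at query time.
import json
import os
import tempfile

# Land raw events in the "lake" without any predefined schema.
raw_events = [
    '{"user": "a", "amount": "12.50", "ts": "2017-01-01"}',
    '{"user": "b", "amount": "3.99"}',              # missing field is fine
    '{"user": "c", "amount": "7.25", "extra": 1}',  # extra field is fine
]
lake_dir = tempfile.mkdtemp()
path = os.path.join(lake_dir, "events.jsonl")
with open(path, "w") as f:
    f.write("\n".join(raw_events))

# At analysis time, impose only the structure the question needs;
# a different analysis could read the same raw files with a different schema.
def read_with_schema(path):
    for line in open(path):
        rec = json.loads(line)
        yield {"user": rec["user"], "amount": float(rec["amount"])}

total = sum(r["amount"] for r in read_with_schema(path))
print(round(total, 2))  # 23.74
```

Note that no record was rejected or transformed on write; the contrast with a data warehouse is that there, the schema would have been enforced at load time ("schema on write").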
- Campbell 2016 (†865 ): Pentaho CTO James Dixon has generally been credited with coining the term “data lake”. He describes a data mart (a subset of a data warehouse) as akin to a bottle of water…”cleansed, packaged and structured for easy consumption” while a data lake is more like a body of water in its natural state. Data flows from the streams (the source systems) to the lake. Users have access to the lake to examine, take samples or dive in. ¶ Data Lakes in a Modern Data Architecture ¶ This is also a fairly imprecise definition. Let's add a few specific properties of a data lake: · All data is loaded from source systems. No data is turned away. · Data is stored at the leaf level in an untransformed or nearly untransformed state. · Data is transformed and schema is applied to fulfill the needs of analysis. ¶ Next, let's highlight five key differentiators of a data lake and how they contrast with the data warehouse approach. 1. Data Lakes Retain All Data . . . . 2. Data Lakes Support All Data Types . . . . 3. Data Lakes Support All Users 4. Data Lakes Adapt Easily to Changes . . . . 5. Data Lakes Provide Faster Insights (†2602)
- Stein and Morrison 2014 (†866 p. 2): To solve the challenge the hospital faced with data storage, integration, and accessibility, the hospital created a data lake based on a Hadoop architecture, which enables distributed big data processing by using broadly accepted open software standards and massively parallel commodity hardware. ¶ Hadoop allows the hospital’s disparate records to be stored in their native formats for later parsing, rather than forcing all-or-nothing integration up front as in a data warehousing scenario. Preserving the native format also helps maintain data provenance and fidelity, so different analyses can be performed using different contexts. The data lake has made possible several data analysis projects, including the ability to predict the likelihood of readmissions and take preventive measures to reduce the number of readmissions. (†2603)
- Stein and Morrison 2014 (†866 p. 5): Many people have heard of data lakes, but like the term big data, definitions vary. Four criteria are central to a good definition: · Size and low cost: Data lakes are big. They can be an order of magnitude less expensive on a per-terabyte basis to set up and maintain than data warehouses. With Hadoop, petabyte-scale data volumes are neither expensive nor complicated to build and maintain. Some vendors that advocate the use of Hadoop claim that the cost per terabyte for data warehousing can be as much as $250,000, versus $2,500 per terabyte (or even less than $1,000 per terabyte) for a Hadoop cluster. Other vendors advocating traditional data warehousing and storage infrastructure dispute these claims and make a distinction between the cost of storing terabytes and the cost of writing or written terabytes. · Fidelity: Hadoop data lakes preserve data in its original form and capture changes to data and contextual semantics throughout the data lifecycle. This approach is especially useful for compliance and internal audit. If the data has undergone transformations, aggregations, and updates, most organizations typically struggle to piece data together when the need arises and have little hope of determining clear provenance. · Ease of accessibility: Accessibility is easy in the data lake, which is one benefit of preserving the data in its original form. Whether structured, unstructured, or semi-structured, data is loaded and stored as is to be transformed later. Customer, supplier, and operations data are consolidated with little or no effort from data owners, which eliminates internal political or technical barriers to increased data sharing. Neither detailed business requirements nor painstaking data modeling are prerequisites. · Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models. (†2604)
- Wikipedia (†387 s.v. "data lake"): A method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files. The idea of a data lake is to have a single store of all data in the enterprise, ranging from raw data (which implies an exact copy of source system data) to transformed data used for various tasks including reporting, visualization, analytics, and machine learning. The data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and even binary data (images, audio, video), thus creating a centralized data store accommodating all forms of data. (†2600)