dataeng

This article uses some images from Databricks and DataCamp.

Quick refresher:

Different kinds of data storage

In addition to Warehouses and Lakes, there are

Data Marts

Marts are structured databases that contain data from a few sources and logically cover a specific department's and/or task's requirements.

Typically, <100 GB in size.

Data Lakehouses

Lakehouses combine the benefits of both Warehouses and Lakes.

Data architecture

Data Mesh

Data Fabric

Business logic

High-level lifecycle

  • Planning
    • Requirements gathering. This phase is done to understand who is going to use the warehouse and how.
      • Roles: Data Analyst, Data Scientist and perhaps BI Analyst
    • Data modelling. Planning and organizing data transformation and/or integration from various, yet relevant, sources. It is crucial that the team understands and links these datasets.
      • Roles: Data Engineer, Database Administrator.
      • Support in business knowledge: Data Analyst, Data Scientist.
  • Implementation
    • ETL and Data pipeline design.
      • Roles: Data Engineer to create the pipeline and Database Administrator to extract the data.
    • BI Application development, so the user can extract information from the data. Tableau, Google Looker or perhaps Power BI may need to be set up.
      • Roles: Data Analyst, Data Scientist.
  • Support and Maintenance
    • Maintenance, make any necessary modifications.
      • Roles: Data Engineer.
    • Test and Deploy, where the Data Analyst and Data Scientist test that the business requirements are met, and the Data Engineer deploys the tool.
      • Roles: Data Engineer, Data Analyst, Data Scientist.

Data Layers


Briefly:

  1. Data Sourcing
    1. Involves spreadsheets, logs, transactional RDBMSs
  2. ETL > Staging
    1. Prepare data for storage: transform and clean it using temporary staging tables. Turn semi-structured data into structured data, in batches or in full.
  3. Data Storage
    1. Data can either be stored in the warehouse and then moved to marts, or vice versa
  4. Data Presentation
    1. BI tools, data mining tools, direct queries
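As a minimal sketch of the ETL/staging step, assuming a hypothetical source of semi-structured JSON log lines and an in-memory SQLite table standing in for the staging area:

```python
import json
import sqlite3

# Hypothetical semi-structured source: JSON log lines with inconsistent fields.
raw_logs = [
    '{"user": "ada", "event": "play", "song_id": 42, "ts": "2024-01-01T10:00:00"}',
    '{"user": "bob", "event": "play", "ts": "2024-01-01T10:05:00"}',  # missing song_id
]

def transform(line):
    """Clean one record into a fixed, structured shape for the staging table."""
    rec = json.loads(line)
    return (rec["user"], rec["event"], rec.get("song_id"), rec["ts"])

# Temporary staging table: structured rows, ready to load into the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_events (user TEXT, event TEXT, song_id INTEGER, ts TEXT)")
conn.executemany("INSERT INTO staging_events VALUES (?, ?, ?, ?)",
                 (transform(line) for line in raw_logs))
rows = conn.execute("SELECT COUNT(*) FROM staging_events").fetchone()[0]
print(rows)  # 2
```

In a real pipeline the staging table would live in the warehouse itself and the load would run in batches, but the shape of the step is the same: parse, normalize the schema, stage, then load.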

Architectures

Relates to Star and Snowflake schemas

Inmon - top-down

Assumes that the data warehouse contains all of the organization's data, stored in normalized form; data marts are then built from the warehouse.

Pros and cons

Pros:
  • Single source of truth
  • Normalization = less storage
  • Data Marts are easier to change

Cons:
  • More joins = slower response time
  • Lengthy upfront work (higher startup cost): unifying definitions across the org can take a lot of effort

Kimball - bottom-up

Data is denormalized into a Star Schema to make query writing fast and straightforward. Data definitions are gathered from one department of the org, the data is placed into a data mart and made available for reporting, and the cycle repeats for the next department. The data marts are later connected to a warehouse. An incremental approach.

Pros and cons

Pros:
  • Low upfront cost
  • Denormalized = user-friendly

Cons:
  • Higher ETL processing time
  • Greater possibility of duplicates because of denormalization
  • Ongoing development nature (never truly finished)
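To make "denormalized = user-friendly" concrete, here is a hypothetical example in plain Python: dimension attributes are joined directly into each fact row, producing the kind of wide, flat table a Kimball-style mart exposes to analysts.

```python
# Hypothetical rides example: normalized dimensions plus a fact list,
# flattened into one wide table so analysts can query without joins.

drivers = {1: {"name": "Ada", "city": "Berlin"}}                   # dimension
vehicles = {7: {"model": "Model 3", "type": "EV"}}                 # dimension
rides = [{"driver_id": 1, "vehicle_id": 7, "distance_km": 12.5}]   # facts

# Denormalize: copy the dimension attributes into each fact row.
wide = [
    {**ride,
     "driver_name": drivers[ride["driver_id"]]["name"],
     "vehicle_model": vehicles[ride["vehicle_id"]]["model"]}
    for ride in rides
]
print(wide[0]["driver_name"], wide[0]["distance_km"])  # Ada 12.5
```

The duplication is visible too: if Ada drove a thousand rides, her name would be stored a thousand times, which is exactly the duplicate/storage trade-off listed in the cons above.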

Kimball 4 step approach

To help users find answers to business questions

  1. Select the processes to track, like invoices, product quality or performance
  2. Decide on the grain, the lowest level of data stored in the fact table, like “song grain” or “line item grain”, so we can see which individual items did or did not perform well
  3. Dimension identification: time, location, user’s name or ID. Dimensions help describe the data and give it context
  4. Fact identification: a measurement or metric of the process, like the number of listens, rides, orders or completed tasks, travel distance, or time needed to finish a process.
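A tiny, hypothetical invoices example of why the grain choice matters: at line-item grain each fact row describes one product sold, so per-item questions remain answerable; roll the facts up to invoice grain and that detail is gone.

```python
# Grain: one row per invoice line item (the lowest level we store).
line_item_facts = [
    {"invoice_id": 1, "product": "guitar",  "qty": 1, "amount": 500.0},
    {"invoice_id": 1, "product": "strings", "qty": 3, "amount": 15.0},
]

# Rolling up to "invoice grain" keeps only the total and loses per-product detail:
invoice_total = sum(f["amount"] for f in line_item_facts)  # 515.0

# At line-item grain we can still ask which individual product performed best:
best = max(line_item_facts, key=lambda f: f["amount"])["product"]  # "guitar"
```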

Slowly changing dimensions

For example, the category “electric vehicle” is to be renamed to “electric-crossover” in the dimensions table.

Classic approach

  1. Type 1 - basic replacement of the value, which removes history.
  2. Type 2 - create a new row with a new ID and a validity timeframe
  3. Type 3 - add a new column, keeping the previous value alongside the new one

Modern approach

Make snapshots of the entire dimensions table (storage is cheap), then change the category name in place as in classic Type 1. Use previous snapshots for historical backtracking.
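A minimal Type 2 sketch for the rename above, assuming the dimension rows carry a surrogate key plus `valid_from`/`valid_to` columns (a common but not universal convention):

```python
from datetime import date

# SCD Type 2: expire the current row and insert a new versioned row,
# so history is preserved. valid_to=None marks the current version.
dim_category = [
    {"sk": 1, "name": "electric vehicle",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def scd2_rename(dim, old_name, new_name, change_date):
    """Close the current row for old_name and append a new row for new_name."""
    for row in dim:
        if row["name"] == old_name and row["valid_to"] is None:
            row["valid_to"] = change_date                      # expire old version
            dim.append({"sk": max(r["sk"] for r in dim) + 1,   # new surrogate key
                        "name": new_name,
                        "valid_from": change_date,
                        "valid_to": None})                     # open-ended current row
            break

scd2_rename(dim_category, "electric vehicle", "electric-crossover", date(2024, 6, 1))
current = [r for r in dim_category if r["valid_to"] is None]
print(current[0]["name"])  # electric-crossover
```

Fact rows written before the change keep pointing at the old surrogate key, which is exactly how Type 2 preserves history without snapshots.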

Row vs column data store

Relates to block storage concepts: Azure Storage, S3
TL;DR: the fewer blocks you use to write the data, the faster reads and writes will be.

Row storage, storing each row in a block, is an excellent option for transactional workloads: transactions arrive as rows, and each new transaction can be written to a new block.

Column storage stores each column in a block, which makes it good for analytical workloads: not every business question requires all columns, so fewer blocks need to be read to answer one. It also benefits from data compression, since all data in a column has the same type. The trade-off is that adding new rows takes longer.
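The trade-off can be illustrated with a toy in-memory model (plain Python lists standing in for storage blocks, not real block storage):

```python
# Row store: each record kept whole, like one row per block.
row_store = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Riga", "amount": 20.0},
]

# Column store: each column kept contiguously, like one block per column.
column_store = {
    "id": [1, 2],
    "city": ["Oslo", "Riga"],
    "amount": [10.0, 20.0],
}

# Analytical query "total amount": the column store touches a single column...
total = sum(column_store["amount"])  # 30.0
# ...while the row store must walk every full record:
total_rows = sum(rec["amount"] for rec in row_store)  # 30.0

# Appending a transaction is one step for the row store:
row_store.append({"id": 3, "city": "Pori", "amount": 5.0})
# ...but touches every column array in the column store:
for col, value in zip(column_store, (3, "Pori", 5.0)):
    column_store[col].append(value)
```

Real columnar engines add compression per column on top of this, since values of one type compress far better together, which is the second advantage mentioned above.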