This article uses some images from Databricks (Databricks: Leading Data and AI Solutions for Enterprises) and DataCamp (Learn Data Science and AI Online | DataCamp).
Quick refresher (from the Data storage note):
- Data is stored in Data Lakes and, after processing, it can be used as information in Data Warehouses or elsewhere. Data Lakehouses combine both concepts.
- A database is a general term that can include RDBMS and Data Warehouses.
- Data Lakes
- Do not enforce models on data
- schema-on-read
- Any data type can be stored, especially all raw data
- Generally much bigger than warehouses
- Harder to analyze.
- Optimized for cost efficiency
- Requires up-to-date catalog
- Data Warehouses
- Enforce model and structures
- Smaller than lakes
- Easier to analyze
- Generally use a star schema for dimensional modelling, which enables more effective OLAP queries.
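A minimal sketch of such a star schema and an OLAP-style query against it, using an in-memory SQLite database from Python; the table and column names (dim_date, dim_product, fact_sales) are made up for illustration:

```python
import sqlite3

# Star schema sketch: one fact table surrounded by denormalized dimension tables,
# so analytical queries need only a single level of joins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER,
                              quantity INTEGER, amount REAL);

    INSERT INTO dim_date    VALUES (20250101, 2025, 1), (20250201, 2025, 2);
    INSERT INTO dim_product VALUES (1, 'EV charger', 'electronics'), (2, 'Tyre set', 'parts');
    INSERT INTO fact_sales  VALUES (20250101, 1, 3, 900.0), (20250201, 2, 1, 400.0);
""")

# A typical OLAP query: aggregate facts, sliced by dimension attributes.
query = """
    SELECT d.year, d.month, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, d.month, p.category
"""
for row in conn.execute(query):
    print(row)
```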
Different kinds of Data storage
In addition to Warehouses and Lakes, there are:
Data Marts
Marts are structured databases that contain data from a few sources and logically cover the requirements of a specific department and/or task.
Typically, <100 GB in size.
Data Lakehouses
They combine the benefits of both Warehouses and Lakes.
Data architecture
Data Mesh
Data Fabric
Business logic
High-level lifecycle
![](/Attachments/Pasted-image-20250222232614.png)
- Planning
- Requirements gathering. This phase is done to understand the who and how of the people who are going to use the warehouse.
- Roles: Data Analyst, Data Scientist and perhaps BI Analyst
- Data modelling. Planning and organizing data transformation and/or integration from various relevant sources. It is crucial that the team understands and links these datasets.
- Roles: Data Engineer, Database Administrator.
- Support in business knowledge: Data Analyst, Data Scientist.
- Implementation
- ETL and Data pipeline design.
- Roles: Data engineer to create the pipeline and Database Administrator to extract the data.
- BI Application development, so that users can extract information from the data. Tools such as Tableau, Google Looker, or Power BI are set up.
- Roles: Data Analyst, Data Scientist.
- Support and Maintenance
- Maintenance, make any necessary modifications.
- Roles: Data Engineer.
- Test and Deploy, where the Data Analyst and Data Scientist test that the business requirements are met, and the Data Engineer deploys the tool.
- Roles: Data Engineer, Data Analyst, Data Scientist.
Data Layers
![](/Attachments/Pasted-image-20250223010716.png)
Briefly:
- Data Sourcing
- Involves spreadsheets, logs, transactional RDBMS
- ETL > Staging
- Prepare data for storage: transform/clean it using temporary staging tables. Turn semi-structured data into structured data, in batches or all at once (a sketch follows after this list).
- Data Storage
- Data can either be stored in the warehouse and then moved to a mart, or vice versa
- Data Presentation
- BI tools, data mining tools, direct queries
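A minimal batch ETL sketch for the staging step described above, assuming pandas and a SQLite database standing in for the warehouse; the file, column, and table names (stg_orders, fact_orders, etc.) are hypothetical:

```python
import sqlite3
import pandas as pd

def run_batch(source_csv: str, db_path: str = "warehouse.db") -> None:
    # Extract: pull raw data from a source system (spreadsheet export, log dump, ...).
    raw = pd.read_csv(source_csv)

    # Transform: clean and structure the data before storage.
    staged = raw.dropna(subset=["order_id"]).copy()
    staged["order_date"] = pd.to_datetime(staged["order_date"]).dt.strftime("%Y-%m-%d")
    staged["amount"] = staged["amount"].astype(float)

    with sqlite3.connect(db_path) as conn:
        # Load into a temporary staging table first...
        staged.to_sql("stg_orders", conn, if_exists="replace", index=False)
        # ...then append the cleaned rows into the warehouse table.
        conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (order_id, order_date, amount)")
        conn.execute("INSERT INTO fact_orders SELECT order_id, order_date, amount FROM stg_orders")
```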
Architectures
Relates to Star and Snowflake schemas
Inmon - top-down
Assumes that data warehouse contains all data of the organization.
![](/Attachments/Pasted-image-20250223022324.png)
Pros and cons
Pros
Single source of truth
Normalization = less storage
Data Marts are easier to change
Cons
More joins = slower response time
Lengthy upfront work (higher startup cost): unifying definitions across the organization can take a lot of effort.
![](/Attachments/Pasted-image-20250223022528.png)
Kimball - bottom-up
![](/Attachments/Pasted-image-20250223023416.png)
Pros and cons
Pros
Low upfront cost
Denormalized = user-friendly
Cons
Higher ETL processing time
Greater possibility of duplicates because of denormalization
Development is ongoing by nature.
![](/Attachments/Pasted-image-20250223024031.png)
Kimball 4-step approach
Helps users find answers to business questions:
- Select processes to track, like invoices, product quality or performance
- Decide on the grain, the lowest level of data stored in the fact table, like “song grain” or “line item grain”, so we can see which individual items did or did not perform well enough
- Dimension identification: time, location, user’s name or id. Helps to describe the data and get feedback
- Fact identification: a measurement or metric of the process, e.g. number of listens, rides, orders or completed tasks, travel distance, or time needed to finish the process.
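As a sketch, here is how the four steps might be written down for a hypothetical music-streaming “listens” process; every name below is an illustrative assumption:

```python
# Kimball's 4 steps applied to a hypothetical music-streaming service.
fact_listens_design = {
    # 1. Select the business process to track.
    "process": "song playback (listens)",
    # 2. Decide on the grain: the lowest level stored in the fact table.
    "grain": "one row per individual song played by a user",
    # 3. Identify the dimensions that describe each listen.
    "dimensions": ["dim_date", "dim_user", "dim_song", "dim_device", "dim_location"],
    # 4. Identify the facts: numeric measurements of the process.
    "facts": ["seconds_played", "completed_flag", "skip_count"],
}
```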
Slowly changing dimensions
For example, the category “electric vehicle” is to be renamed to “electric-crossover” in the dimension table.
Classic approach
- Type 1 - basic replacement of the value, which removes history.
- Type 2 - create a new row with a new ID and validity timeframe (sketched below)
- Type 3 - add a new column that keeps the previous value
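A minimal pandas sketch of the classic Type 2 change for the rename above; the surrogate key and column names are assumptions:

```python
import pandas as pd

# Current state of the (hypothetical) category dimension.
dim_category = pd.DataFrame([
    {"category_key": 1, "category_name": "electric vehicle",
     "valid_from": "2020-01-01", "valid_to": None, "is_current": True},
])

def scd_type2_rename(dim, old_name, new_name, change_date):
    """Close the current row and append a new row with a new surrogate key."""
    dim = dim.copy()
    current = (dim["category_name"] == old_name) & dim["is_current"]
    dim.loc[current, "valid_to"] = change_date   # close the old version
    dim.loc[current, "is_current"] = False
    new_row = {
        "category_key": int(dim["category_key"].max()) + 1,
        "category_name": new_name,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    }
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim_category = scd_type2_rename(dim_category, "electric vehicle",
                                "electric-crossover", "2025-02-22")
print(dim_category)  # both the old and the new category survive, with timeframes
```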
Modern approach
Make snapshots of the entire dimension table (storage is assumed to be cheap), then change the category name in place, as in classic Type 1. Use previous snapshots for historic backtracking.
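A minimal sketch of this snapshot approach, assuming cheap file storage; the directory and column names are made up:

```python
from datetime import date
from pathlib import Path
import pandas as pd

SNAPSHOT_DIR = Path("snapshots/dim_category")

def snapshot_then_rename(dim: pd.DataFrame, old_name: str, new_name: str) -> pd.DataFrame:
    """Keep a dated full copy of the dimension table, then overwrite in place (Type 1 style)."""
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    # Previous snapshots are what give the history back when backtracking is needed.
    dim.to_csv(SNAPSHOT_DIR / f"{date.today().isoformat()}.csv", index=False)
    dim = dim.copy()
    dim.loc[dim["category_name"] == old_name, "category_name"] = new_name
    return dim
```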
Row vs column data store
Relates to block storage concepts: Azure Storage, S3
TL;DR: the fewer blocks you use to write the data, the faster reads and writes will be.
In the case of row storage in blocks, it is an excellent option for transactional storage, since transactions arrive as rows and we can store each new transaction in a new block.
![](/Attachments/{CEC74A72-275E-4A4B-BB2E-69D74BFE59AA}.png)
In the case of column storage, you store each column in a block. This makes it good for analytical workloads: you can read fewer blocks to answer a business question, since not all questions require all columns. It also benefits from better data compression, because the data within a column is all of the same type. The trade-off is that adding new rows takes longer (see the sketch below the figure).
![](/Attachments/{BF86D91A-F058-4603-ACC3-07576F0495C6}.png)
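A toy illustration of the difference, using CSV as a row-oriented format and Parquet as a columnar one (assumes pandas with pyarrow installed; the data is made up):

```python
import pandas as pd

rides = pd.DataFrame({
    "ride_id":  [1, 2, 3],
    "city":     ["Tallinn", "Tartu", "Tallinn"],
    "distance": [4.2, 11.8, 7.5],
})

# Row-oriented file: each record is written as one line, so appending a new
# transaction only touches the end of the file (a new "block").
rides.to_csv("rides.csv", index=False)

# Column-oriented file: each column is stored (and compressed) separately.
rides.to_parquet("rides.parquet", index=False)

# An analytical question like "total distance" needs only one column,
# so the columnar file lets us read far fewer blocks.
total = pd.read_parquet("rides.parquet", columns=["distance"])["distance"].sum()
print(total)
```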