u.3.2 - data warehousing
// LESSON 1 - DATA WAREHOUSING CONCEPTS
data warehouse
- built on top of a database
- data inside data warehouse comes from elsewhere
- possibly dozens of data sources
- data are copied into data warehouse, not moved
why data warehouse
- for making data driven decisions
- one stop shop: the data we need is all in a single location rather than scattered across oltp databases
rules of data warehousing by bill inmon (1990)
- integrated: data from a number of different sources are sent into a data warehouse
- subject oriented: data is organized by subject
- time variant: data warehouses contain historical data not just current data
- non volatile: we periodically load data into the data warehouse (batch processing); between the last load and the next one, the data warehouse stays as is
data warehouse vs data lake vs data virtualization
data warehouse: if the incoming data is not big data, you can skip the data lake
- often built in an rdbms (relational database management system)
- sometimes built in an mdbms (multidimensional database management system)
data lake: can sit in front of the data warehouse if the incoming data qualifies as big data
- built on a big data environment rather than a traditional rdbms
- big data (the three v's are key to measuring big data and to seeing how different it is from old-fashioned data)
- volume: the sheer size of the data
- velocity: the speed at which data arrives and grows
- variety: different formats such as excel and csv, plus nontraditional ones like video, text, pdf, etc.
data virtualization
- data stays in the source systems and is queried in place, rather than being copied into a warehouse or data lake
// LESSON 2 - DATA WAREHOUSING ARCHITECTURE
architecture concepts
dw architectures
centralized data warehouse: using a single database as the home for your data to support business intelligence and analytics
independent data mart: small scale version of data warehouse
component based data warehousing: the opposite of a centralized dw; a data warehouse and dependent data marts operating together
optional components of a data warehousing architecture
cube: specialized type of database
operational data store: a variation of the data warehouse that you may want to include in your overall architecture
staging layer: segment of the data warehouse into which data initially gets loaded before the data is transformed
persistent
non persistent
data warehouse environment
centralized data warehouse
- using a single database as the home for your data to support business intelligence and analytics
- one stop shopping: all the data we need is in one place
- architecture: source -> etl -> data warehouse -> analytics
independent data mart
- you could think of it as a small scale data warehouse
- each data mart represents a subject or topic, and analytics is performed on those data marts
- architecture: source -> etl -> data marts -> analytics
component based data warehouse
- opposite of centralized data warehouse
- consists of a data warehouse and data marts
- architecture (top down): source -> etl -> data warehouse -> etl -> data marts -> analytics
- architecture (bottom up): source -> etl -> data marts -> analytics
                            data marts -> etl -> data warehouse -> analytics
optionals to a data warehouse environment
cube
- built on a multidimensional database management system (mdbms), not an rdbms
- specialized "dimensionally-aware" database
- best for smaller scale data warehouses, data marts
- sometimes cubes and rdbms go hand in hand
- e.g. data warehouse - rdbms, data marts - mdbms
- e.g. data warehouse - rdbms, data mart 1 - rdbms, data mart 2 and remaining marts - mdbms
- application in architecture (example): source -> etl -> data warehouse (built on an mdbms)
operational data store
- less popular today because of reduced latency in data processing (everything's basically much faster now)
- contains only current data, unlike a data warehouse, which contains both current and historical data
- updated frequently
- application in architecture (option 1): source -> rapid etl -> ods
                                          source -> batch etl -> data warehouse (parallel feeds)
- application in architecture (option 2): source -> rapid etl -> ods -> batch etl -> data warehouse
staging layer
- landing zone for the incoming data
- why? if something goes wrong during transformation or loading, we can restart from the staged copy instead of extracting from the sources all over again
- data lakes (the raw zone) and data warehouses both have a staging layer
types of staging layer
persistent
- data is kept even after it has been copied into the next layer
non persistent
- data is removed once it has been copied into the next layer
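a minimal sketch of the two staging types, assuming a simple file-based staging area (the paths and the promote function are made up for illustration):

import shutil
from pathlib import Path

STAGING = Path("staging")      # hypothetical landing zone for raw extracts
WAREHOUSE = Path("warehouse")  # hypothetical next layer

def promote(filename: str, persistent: bool) -> None:
    # copy a staged file into the next layer
    shutil.copy2(STAGING / filename, WAREHOUSE / filename)
    if not persistent:
        # non persistent staging removes the staged copy after promotion;
        # persistent staging keeps it around for reprocessing later
        (STAGING / filename).unlink()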
choosing a data warehouse architecture flow
centralized approach
- edw
  - relational
  - specialized
- data lake
  - specialized
  - hadoop
  - aws s3, etc.
component based approach
- architected
  - dw then dm (top down)
  - dm then dw (bottom up)
  - dm only
  - dw bus
- non architected
  - federated edw
// LESSON 3 - BRINGING DATA TO YOUR DATA WAREHOUSE/DATA LAKE
etl
- extract, transform, load
- the loading approach typically used for data warehouses
- schema on write: the schema must exist before data is loaded into the data warehouse
types of etl
initial etl (full etl)
- normally a one-time etl
- runs right before the data warehouse goes live
- redo the initial etl if the data warehouse "blows up" (gets corrupted, etc.)
incremental etl
- incrementally or continuously refreshes the data warehouse
four major incremental etl patterns
- append (common)
- in place update (common)
- complete replacement
- rolling append
flavors of incremental load
- stream data load: load data in real time
- batch data load: load data at set time intervals
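a minimal sketch of three of these patterns, using an in-memory dict keyed by primary key to stand in for a warehouse table (names and rows are made up for illustration):

# warehouse "table" keyed by primary key
warehouse: dict[int, dict] = {}

def append_load(rows: list[dict]) -> None:
    # append pattern: add new rows; existing rows are never touched
    for row in rows:
        warehouse.setdefault(row["id"], row)

def upsert_load(rows: list[dict]) -> None:
    # in place update pattern: insert new rows and overwrite changed ones
    for row in rows:
        warehouse[row["id"]] = row

def full_reload(rows: list[dict]) -> None:
    # complete replacement pattern: wipe the table and reload from scratch
    warehouse.clear()
    upsert_load(rows)

# rolling append would be append_load plus dropping rows that fall
# outside the retention window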
elt
- extract, load, transform
- the loading approach typically used for data lakes
- schema on read: the schema is applied when the data is read, after it has been loaded into the data lake
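a small sketch of the schema-on-write vs schema-on-read difference (the csv payload and field names are made up):

import csv, io

raw = "id,gpa\n1,3.7\n2,n/a\n"  # made-up raw extract

# schema on write (etl): types are enforced before the data lands in the
# warehouse, so a row that fails the schema never gets loaded
loaded = []
for row in csv.DictReader(io.StringIO(raw)):
    try:
        loaded.append({"id": int(row["id"]), "gpa": float(row["gpa"])})
    except ValueError:
        pass  # rejected at load time

# schema on read (elt): the raw text lands in the lake untouched, and the
# schema is applied only when someone reads it
lake_blob = raw
def read_gpas(blob: str) -> list[float]:
    gpas = []
    for row in csv.DictReader(io.StringIO(blob)):
        try:
            gpas.append(float(row["gpa"]))
        except ValueError:
            pass  # the bad value surfaces at read time instead
    return gpas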
transformations (common transformations you'll be doing): the main goal of data transformation is uniformity across the whole data warehouse
- data value, data type, and data size unification
  e.g. table 1 uses great/not great in column_name while table 2 uses g/ng; pick one representation for both tables - either (g and ng) or (great and not great)
- de-duplication: remove duplicate records
  e.g. table 1 (students) and table 2 (students) share some of the same students; when unifying the data from both tables, make sure each shared student appears only once
- dropping columns (vertical slicing): remove unnecessary columns - columns we don't need for our analytical purposes
- value based row filtering (horizontal slicing): remove unnecessary rows - rows we don't need
- correcting known errors: if you already know a value is wrong, fix it here
- fixing null and empty values (if ever they do need some fixing)
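a sketch of a few of these transformations in pandas; the tables mirror the great/not-great example above, everything else is illustrative:

import pandas as pd

t1 = pd.DataFrame({"student_id": [1, 2], "rating": ["great", "not great"]})
t2 = pd.DataFrame({"student_id": [2, 3], "rating": ["ng", "g"]})

# data value unification: pick one representation and apply it everywhere
t2["rating"] = t2["rating"].map({"g": "great", "ng": "not great"})

combined = pd.concat([t1, t2], ignore_index=True)

# de-duplication: student 2 appears in both tables; keep one copy
combined = combined.drop_duplicates(subset="student_id")

# fixing null and empty values: default any missing rating
combined["rating"] = combined["rating"].fillna("unknown")

# vertical slicing: combined.drop(columns=[...])
# horizontal slicing: combined[combined["rating"] == "great"]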
// LESSON 4 - DATA WAREHOUSING DESIGN: BUILDING BLOCKS | data modeling focus - dimensional
making data driven decisions
- one or more measurements
- facts
- something quantifiable
- e.g. annual faculty salary
- dimensional context for each measurement
- dimension
- e.g. by rank, by department, by academic year
- e.g. "$6,000 payment" alone is not enough dimensional context - payment for what, by whom, in which period?
bringing dimensional context to our measurement
dimensional context in wording: it almost always shows up through these two words
by: "sliced and grouped" by values of the entire dimension
for: one or more specific values from within the entire dimension
typical dimensional insight
- what is the average annual faculty salary (measurement) by (look, it's a by!) rank, by department, by academic year?
- what is the average student loan balance (measurement) by major (dimension), by class year, by campus?
- what is the average annual faculty salary (measurement) for assistant professors, by department, by academic year?
- what is the average student loan balance for engineering majors, by class year, by campus?
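a sketch of "by" vs "for" in pandas, using a made-up faculty table:

import pandas as pd

faculty = pd.DataFrame({
    "rank": ["assistant", "assistant", "full"],
    "department": ["math", "cs", "math"],
    "academic_year": [2024, 2024, 2024],
    "salary": [60000, 65000, 90000],
})

# "by": slice and group by every value of each dimension
by_all = faculty.groupby(["rank", "department", "academic_year"])["salary"].mean()

# "for": filter to specific value(s) of one dimension, then group by the rest
for_assistants = (
    faculty[faculty["rank"] == "assistant"]
    .groupby(["department", "academic_year"])["salary"]
    .mean()
)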
dimensional modeling
facts
- numeric and quantifiable
- measurement
- a fact is not the same as a fact table, but obviously we store our facts in a fact table
- not the same as a logical fact | e.g. of a logical fact - "i am now writing"
examples
- salary
- number of credits a student is taking in a semester
- dollar amount
- number of years
types of facts
additive
- a measurement you can add
- e.g. a faculty member's salary | you can add up all the faculty members' salaries and still get a meaningful result
- e.g. a student's credit hours completed
non-additive
- a measurement you cannot add, because adding them produces nonsensical information
- e.g. a student's gpa | it's nonsense to add up all the students' gpas
- typical example: margins, ratios, percentages, calculated average (e.g. gpa)
semi-additive
- a measurement that can be added across some dimensions but not others (e.g. an account balance can be added across accounts, but not across time)
- typically used in periodic snapshot fact tables
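a small worked example of why additivity matters: a calculated average like gpa must be re-derived from its additive parts rather than summed or averaged directly (the numbers are made up):

students = [  # made-up per-student records
    {"grade_points": 45.0, "credits": 12},
    {"grade_points": 30.0, "credits": 15},
]

# additive facts: grade points and credits can be summed safely
total_points = sum(s["grade_points"] for s in students)   # 75.0
total_credits = sum(s["credits"] for s in students)       # 27

# gpa is non-additive: re-derive the overall gpa from the additive parts
overall_gpa = total_points / total_credits                # ~2.78

# the naive mean of per-student gpas ignores credit weights and differs
naive = sum(s["grade_points"] / s["credits"] for s in students) / len(students)  # ~2.88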
dimension
- context for a fact
- a dimension is not the same as a dimension (dim) table, but obviously we store our dimensions in dim tables
examples
- academic department
- major
- student
- employee
schemas
star: the fact table joins directly to denormalized dimension tables
snowflake: like star, but the dimension tables are further normalized into sub-dimension tables
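a minimal sketch of a star schema's shape as plain records: one fact table holding measurements plus foreign keys into denormalized dimension tables (all names and values are made up):

# dimension tables: denormalized, one row per member (star schema)
dim_department = {10: {"name": "math", "college": "science"}}
dim_rank = {1: {"name": "assistant professor"}}

# fact table: a measurement (salary) plus foreign keys into each dimension
fact_salary = [
    {"department_id": 10, "rank_id": 1, "academic_year": 2024, "salary": 60000},
]

# a snowflake schema would normalize the dimensions further, e.g. moving
# "college" out of dim_department into its own dim_college table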