-
Notifications
You must be signed in to change notification settings - Fork 0
/
u.3.1 - data warehousing
186 lines (166 loc) · 9.58 KB
/
u.3.1 - data warehousing
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
// LESSON 1.1 - INTRO TO DATA WAREHOUSE
data warehousing
- rdbms that is designed for query and analysis rather than for transaction processing
- usually contains historical data derived from transactional data
operational database vs data warehouse
operational database
- oltp: online transactional processing | handles operational data, these are data contained in the operation of a particular system
- tables and joins are complicated since they are normalized for rdbms. This is done to reduce redundant files and to save storage space.
- optimized mostly for write operations
- datas are used to keep the business running
data warehouse
- olap: online analytical processing | handles historical data to perform analysis on these data
- tables and joins are accessible since they are de-normalized. This is done to minimize the response time for analytical queries.
- optimized for read operations
- datas are used for analysis
data warehouse architecture (using three tier architecture)
sources ->
etl with staging area(s3) ->
etl -> data warehouse ->
etl -> data marts ->
etl -> analysis
sources
- usually operational database, e.g. customer database, sales database, products database
staging area
- temporary location where a record from source systems is copied
data warehouse
- storage for cleaned and transformed data where we could extract data for analysis easily
data marts
- divide multiple groups within the data warehouse, this is a way to organize data warehouse
- they don't need to access all the tables in data warehouse
analysis
- data can be accessed efficiently, e.g. olap tools, reporting tools, data mining tools, etc.
types of data warehouses
host based data warehouses
lan based work group
single stage data warehouses
multi stage data warehouses
stationary data warehouses
distributed data warehouess
virtual data warehouses
data warehouse modeling
- different with operational databases systems data modeling
data warehousing modeling workflow
business requirements -> conceptual data modeling -> logical data modeling -> physical data modeling -> fulfilled business requirement for data storage
business requirements
- obviously determine business requirements
conceptual data model
- contains entities(tables) and the relationships among them
logical data model
- contains attributes(columns) and constraints(especially primary and foreign keys)
physical data model
- describes how the model will be presented in the database
data warehouse design: single data repository where a record from multiple data sources is integrated for olap | like in database design values for the data are already here, you're just going to avoid some anomalies in here
top down approach
- data at the lowest level of detail are stored in the data warehouse
- dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse
top down architecture
- sources(usually oltp) -> etl -> staging area -> etl -> data warehouse -> etl -> data mart
bottom up approach:
- data marts are first create to provide reporting and analytical capabilities for specific business processes
- these data marts can then be integrated to create a comprehensive data warehouse
UNFINISHED - bottom up architecture
// LESSON 1.2 - INTRO TO OPERATIONAL DATA STORES
definition
- is used for operational reporting and as a source of data for the enterprise data warehouse
operational data stores (ods) architecture
source ->
etl -> ods
-> management reports
-> web based applications
-> other applications
-> etl -> data warehouse
-> etl -> data marts
-> etl -> olap cubes
operational data stores(inmon and imhoff) vs data warehouse(kimball)
operational data store
- for operational reporting and supports current or near real time reporting requirements
- consist of only a short window of data
- typically detailed only
- used for detailed decision making and operational reporting
- serves as conduct for data between operational and analytics system
data warehouse
- intended for historical and trend analysis, usually reporting on a large volume of data
- includes the entire history of data
- contains summarized and detailed data
- used for long term decision making and management reporting
- serves as a repository for cleansed and consolidated data sets
// LESSON 1.3 - ETL, ELT AND EL
etl: extract, transform, load
- recurring method (daily,weekly,monthly) of a data warehouse system and needs to be agile, automated, and well documented
etl workflow
extraction(source) -> data cleansing(staging database) -> data transformation(staging database) -> loading(target database)
elt: extract, load, transform
elt workflow
extraction(source) -> load(target database) -> data cleansing(target database) -> data transformation(target database)
el: extract, load
- when you just want to move data but there is no transformation to be done
etl vs elt
etl
- transformation are performed in etl server
- typically used for small amount of data
- used in source to target transfer
- compute intensive transformations
elt
- transformation are performed in the target database
- typically used for big amount of data
// LESSON 1.4 - METADATA: data about data
- roadmap to a data warehouse
- metadata in a data warehouse defines the warehouse objects
// LESSON 1.5 - DATA MART
- data marts are derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is created from the union of organizational data marts
- fundamental use of a data mart is Business Intelligence (BI) applications
- holds only one subject area
// LESSON 2.1 - OLAP TECHNOLOGY
oltp: featured by a large number of short on-line transactions (INSERT,UPDATE,DELETE)
- main priority is very rapid query processing, maintaing record integrity in multi-access environments and effectiveness consistent by the number of transactions per second
olap: database in which business users get their data from
UNFINISHED - types of olap
rolap
molap
holap
others
wolap
dolap
molap
solap
// LESSON 3.1 - DIMENSIONAL MODELING
dimensional modeling: produce database architecture that is easy for end-clietns to understand and write queries
elements of dimensional modeling
fact table: event that is counted or measured
- numerical value
- e.g. a sale transaction can be converted into facts such as the number of products ordered and the price paid for the products
dimensional table: reference information that gives context to the facts or that describe the facts
- includes descriptive data about the numerical values in the fact table
- e.g. order data, user name, product number, order ship-to, bill-to locations, etc.
measure: numeric attribute(column) of a fact table, representing the performance or behavior of the business relative to the dimensions
additive: summarize across all dimensions
semi-additive: summarize across some of dimensions
non-additive: not summarize across all dimension
basic models used in dimensional modeling
- star model: one central fact table and a set of smaller tables (dimensions)
- snowflake model: conclusion of decomposing one or more of the dimensions (after performing normalization at star model - not sure)
multidimensional data modeling: represents data in the form of cube
- represents data in a data cube where you could see the three dimensional tables in one model
- this is less computationally intensive to create (i think? idk)
- UNFINISHED
// LESSON 3.2 - SCHEMAS
star schema: one central fact table and a set of smaller dimensioanl tables
- very simple schema
fact: only one fact table can contain in star schema
dimensional: doesn't contain another dimensional table under them unlike in snowflake
snowflake schema: still only contains one fact table and dimensional tables that contains another sub-dimensional tables under them
- when dimensional tables are all normalized in star schema it becomes a snowflake schema
star vs snowflake schema
star schema
- fact table will be at the center and is connected to the dimension tables
- tables are completely in a denormalized structure
- sql queries performance is good as there is less number of joins involved
- data redundancy is high and occupies more disk space
snowflake schema
- snowflake schema is an extension of star schema where the dimension tables are connected to one or more dimensions
- dimensional tables are normalized
- performacne of sql queries is a bit less when compared to star schema as more number of joins are involved
- data redundancy is low and occupies less disk space when compared to star schema
fact constellation: also called as galaxy schema
- when two or more fact tables sharing one or more dimensional table/s