1. What is the definition of data warehousing?
· A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process.
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter (sales, in this case) makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of historical data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
2. How many stages are there in data warehousing?
A data warehouse generally involves two stages:
· ETL
· Report Generation
ETL
· Extract -- the process of reading data from a source database.
· Transform -- the process of converting the extracted data from its previous form into the required form.
· Load -- the process of writing the data into the target database.
ETL is used to migrate data from one database to another, to build data marts and data warehouses, and to convert databases from one format to another. It retrieves data from various operational databases, transforms it into useful information, and finally loads it into the data warehousing system.
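The three ETL steps above can be sketched with Python's built-in sqlite3 module. The table names, columns, and transformation rules here are illustrative assumptions, not the behavior of any particular ETL tool:

```python
import sqlite3

# Hypothetical operational source database with raw sales rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER, amount TEXT, region TEXT)")
source.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                   [(1, "100.50", " north "), (2, "200.00", "SOUTH")])
source.commit()

# Hypothetical target warehouse database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL, region TEXT)")

# Extract: read rows from the source database.
rows = source.execute("SELECT id, amount, region FROM sales").fetchall()

# Transform: convert the amount to a number, standardize region names.
transformed = [(rid, float(amount), region.strip().title())
               for rid, amount, region in rows]

# Load: write the transformed rows into the target database.
target.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", transformed)
target.commit()

print(target.execute("SELECT * FROM fact_sales").fetchall())
# [(1, 100.5, 'North'), (2, 200.0, 'South')]
```

Real ETL tools add scheduling, error handling, and incremental loads on top of this basic extract-transform-load loop.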
Popular ETL tools:
1. Informatica
2. Ab Initio
3. DataStage
4. BODI
5. Oracle Warehouse Builder
Report Generation
In report generation, OLAP (online analytical processing) is used. It is a set of specifications which allows client applications to retrieve data for analytical processing.
An OLAP tool is a specialized tool that sits between a database and the user in order to provide various analyses of the data stored in the database.
An OLAP tool is a reporting tool which generates reports that are useful for decision support for top-level management.
1. Business Objects
2. Cognos
3. MicroStrategy
4. Hyperion
5. Oracle Express
6. Microsoft Analysis Services
Difference between OLTP and OLAP

| # | OLTP | OLAP |
|---|------|------|
| 1 | Application oriented (e.g., a purchase order is functionality of an application) | Subject oriented (subject in the sense of customer, product, item, time) |
| 2 | Used to run the business | Used to analyze the business |
| 3 | Detailed data | Summarized data |
| 4 | Repetitive access | Ad-hoc access |
| 5 | Few records accessed at a time (tens); simple queries | Large volumes accessed at a time (millions); complex queries |
| 6 | Small database | Large database |
| 7 | Current data | Historical data |
| 8 | Clerical user | Knowledge user |
| 9 | Row-by-row loading | Bulk loading |
| 10 | Time invariant | Time variant |
| 11 | Normalized data | De-normalized data |
| 12 | E-R schema | Star schema |
3. What are the types of data warehousing?
EDW (Enterprise Data Warehouse)
· It provides a central database for decision support throughout the enterprise.
· It is a collection of data marts.
Data Mart
· It is a subset of a data warehouse.
· It is a subject-oriented database which supports the needs of individual departments in an organization.
· It is called a high-performance query structure.
· It supports a particular line of business, like sales, marketing, etc.
ODS (Operational Data Store)
· It is defined as an integrated view of operational databases designed to support operational monitoring.
· It is a collection of operational data sources designed to support transaction processing.
· Data is refreshed near real-time and used for business activity.
· It is an intermediate layer between OLTP and OLAP which helps to create instant reports.
4. What types of data modeling are involved in data warehouse architecture?
5. What are the types of approaches in DWH?
Bottom-up approach: first we develop data marts, then we integrate these data marts into an EDW.
Top-down approach: first we develop the EDW, then from that EDW we develop data marts.
Bottom-up flow:
OLTP → ETL → Data mart → DWH → OLAP
Top-down flow:
OLTP → ETL → DWH → Data mart → OLAP
Top down
· Cost of initial planning & design is high.
· Takes a longer duration, often more than a year.
Bottom up
· Planning & designing the data marts proceeds without waiting for the global warehouse design.
· Immediate results from the data marts.
· Tends to take less time to implement.
· Errors in critical modules are detected earlier.
· Benefits are realized in the early phases.
· It is considered the best approach.
Data Modeling Types:
· Conceptual data modeling
· Logical data modeling
· Physical data modeling
· Dimensional data modeling
1. Conceptual Data Modeling
· The conceptual data model includes all major entities and relationships. It does not contain much detail about attributes and is often used in the INITIAL PLANNING PHASE.
· The conceptual data model is created by gathering business requirements from various sources like business documents, discussions with functional teams, business analysts, subject matter experts, and end users who do the reporting on the database. Data modelers create the conceptual data model and forward it to the functional team for review.
· Conceptual data modeling gives the functional and technical teams an idea of how the business requirements will be projected into the logical data model.
2. Logical Data Modeling
· This is the actual implementation and extension of the conceptual data model. The logical data model includes all required entities, attributes, key groups, and relationships that represent business information and define business rules.
3. Physical Data Modeling
· The physical data model includes all required tables, columns, relationships, and database properties for the physical implementation of the database. Database performance, indexing strategy, physical storage, and denormalization are important parameters of a physical model.
Logical vs. Physical Data Modeling

| Logical Data Model | Physical Data Model |
|--------------------|---------------------|
| Represents business information and defines business rules | Represents the physical implementation of the model in a database |
| Entity | Table |
| Attribute | Column |
| Primary Key | Primary Key Constraint |
| Alternate Key | Unique Constraint or Unique Index |
| Inversion Key Entry | Non-Unique Index |
| Rule | Check Constraint, Default Value |
| Relationship | Foreign Key |
| Definition | Comment |
Dimensional Data Modeling
· A dimensional model consists of fact and dimension tables.
· It is an approach to developing schema designs for a database.
Types of Dimensional Modeling
· Star schema
· Snowflake schema
· Star flake schema (or) hybrid schema
· Multi-star schema
The Star Schema
· A logical database design which contains a centrally located fact table surrounded by one or more dimension tables.
· Since the database design looks like a star, it is called a star schema.
· A dimension table contains primary keys and textual descriptions.
· It contains de-normalized business information.
· A fact table contains a composite key and measures.
· The measures are key performance indicators used to evaluate the enterprise's performance in terms of success and failure.
· E.g.: total revenue, product sales, discount given, number of customers.
· To generate a meaningful report, the report should contain at least one dimension and one fact table.
Advantages of the star schema
· Fewer joins
· Improved query performance
· Slicing down
· Easy understanding of data
Disadvantage:
· Requires more storage space
Example of Star Schema:
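The shape of a star-schema report can be sketched with Python's built-in sqlite3 module. The table and column names below are illustrative assumptions, with one dimension table shown for brevity:

```python
import sqlite3

# Illustrative star schema: a fact table joined to a dimension table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, quantity INTEGER, revenue REAL);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Desk', 'Furniture');
INSERT INTO fact_sales  VALUES (1, 10, 15.0), (1, 5, 7.5), (2, 1, 120.0);
""")

# A typical star-schema report: a single join from the fact table to the
# dimension, grouping a measure (revenue) by a dimension attribute (category).
report = con.execute("""
    SELECT d.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(report)  # [('Furniture', 120.0), ('Stationery', 22.5)]
```

The single-join pattern is what gives the star schema its "fewer joins" advantage over more normalized designs.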
Snowflake Schema
· In a snowflake schema, the dimension tables of a star schema are split into one or more additional dimension tables.
· The de-normalized dimension tables are split into normalized dimension tables.
Example of Snowflake Schema:
· In the snowflake schema, the example diagram shown below has 4 dimension tables, 4 lookup tables and 1 fact table. The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and separately.
· It increases the number of joins and gives poorer performance when retrieving data.
· A few organizations try to normalize the dimension tables this way to save space.
· Since dimension tables hold comparatively little space, the snowflake schema approach may be avoided.
· Bitmap indexes cannot be effectively utilized.
Important aspects of Star Schema & Snowflake Schema
· In a star schema, every dimension will have a primary key.
· In a star schema, a dimension table will not have any parent table.
· Whereas in a snowflake schema, a dimension table will have one or more parent tables.
· Hierarchies for the dimensions are stored in the dimension table itself in a star schema.
· Whereas in a snowflake schema, hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.
Star flake Schema (or) Hybrid Schema
· A hybrid schema is a combination of the star and snowflake schemas.
Multi-Star Schema
· Multiple fact tables share a set of dimension tables.
· Conformed dimensions are nothing but reusable dimensions.
· They are the dimensions which you use multiple times or in multiple data marts.
· They are common across different data marts.
Measure Types (or) Types of Facts
· Additive -- measures that can be summed up across all dimensions.
o Ex: sales revenue
· Semi-additive -- measures that can be summed up across some dimensions but not others.
o Ex: current balance
· Non-additive -- measures that cannot be summed up across any of the dimensions.
o Ex: student attendance
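The semi-additive case is the subtle one: a balance adds up across accounts but not across days. A small Python sketch (the accounts, dates, and figures are made up) illustrates why:

```python
# Hypothetical daily account balances, keyed by (account, day).
balances = {("acct_1", "2024-01-01"): 100.0,
            ("acct_1", "2024-01-02"): 120.0,
            ("acct_2", "2024-01-01"): 50.0,
            ("acct_2", "2024-01-02"): 50.0}

# Summing across the account dimension for a single day is meaningful:
# it gives the total money held on that day.
total_jan_2 = sum(v for (acct, day), v in balances.items()
                  if day == "2024-01-02")
print(total_jan_2)  # 170.0

# Summing the same measure across the time dimension is NOT meaningful
# (100 + 120 does not describe acct_1); a snapshot-style aggregate such
# as the latest balance is used instead.
latest_day = max(day for (acct, day) in balances if acct == "acct_1")
latest_acct_1 = balances[("acct_1", latest_day)]
print(latest_acct_1)  # 120.0
```

An additive measure like sales revenue would have been safe to sum along both dimensions; that is the whole distinction.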
Surrogate Key
· Joins between fact and dimension tables should be based on surrogate keys.
· Users should not be able to obtain any information by looking at these keys.
· These keys should be simple integers.
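A minimal sketch of surrogate key assignment in Python, assuming a simple in-memory lookup from natural keys to integers (the function name and the sample natural keys are hypothetical):

```python
import itertools

# Running counter and lookup table for already-assigned surrogate keys.
key_counter = itertools.count(1)
surrogate_keys = {}

def surrogate_key(natural_key):
    """Return the surrogate key for a natural key, allocating the next
    integer if this natural key has not been seen before."""
    if natural_key not in surrogate_keys:
        surrogate_keys[natural_key] = next(key_counter)
    return surrogate_keys[natural_key]

print(surrogate_key("CUST-US-0042"))  # 1
print(surrogate_key("CUST-DE-0007"))  # 2
print(surrogate_key("CUST-US-0042"))  # 1 -- same natural key, same surrogate
```

Note that the integers carry no business meaning on their own, which is exactly the property the bullets above ask for; a production warehouse would persist this mapping rather than keep it in memory.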
A sample data warehouse schema
Why do we need a staging area for a DWH?
· The staging area is used to clean operational data before loading it into the data warehouse.
· Cleaning in the sense that you are merging data which comes from different sources.
· It is the area where most of the ETL work is done.
Data Cleansing
· It is used to remove duplicates.
· It is used to correct wrong email addresses.
· It is used to identify missing data.
· It is used to convert data types.
· It is used to capitalize names & addresses.
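The cleansing steps listed above can be sketched in Python. The field names, sample rows, and the deliberately simple email pattern are illustrative assumptions:

```python
import re

# Hypothetical raw rows arriving in the staging area.
raw = [
    {"name": "alice smith", "email": "alice@@example.com", "city": "new york"},
    {"name": "alice smith", "email": "alice@@example.com", "city": "new york"},
    {"name": "bob jones",   "email": "bob@example.com",    "city": None},
]

# Very rough email shape check, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def cleanse(rows):
    seen, cleaned, problems = set(), [], []
    for row in rows:
        key = (row["name"], row["email"])
        if key in seen:                              # remove duplicates
            continue
        seen.add(key)
        if not EMAIL_RE.match(row["email"]):         # flag wrong email addresses
            problems.append(("bad email", row))
        if row["city"] is None:                      # identify missing data
            problems.append(("missing city", row))
        row = dict(row,
                   name=row["name"].title(),         # capitalize names
                   city=row["city"].title() if row["city"] else None)
        cleaned.append(row)
    return cleaned, problems

cleaned, problems = cleanse(raw)
print(len(cleaned), len(problems))  # 2 2
```

Here the duplicate row is dropped, the malformed email and the missing city are flagged for follow-up rather than silently fixed, and names are title-cased before loading.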
Types of Dimensions:
There are four types of dimensions:
· Conformed dimensions
· Junk dimensions (garbage dimensions)
· Degenerate dimensions
· Slowly changing dimensions
Garbage Dimension or Junk Dimension
· A conformed dimension is something which can be shared by multiple fact tables or multiple data marts.
· A junk dimension is a grouping of flag values.
· A degenerate dimension is something dimensional in nature but stored in the fact table (e.g., invoice number). It is neither a fact nor strictly a dimension attribute, yet is useful for some kinds of analysis. Such attributes are kept in the fact table and called degenerate dimensions.
Degenerate dimension: a column in the key section of the fact table that has no associated dimension table but is used for reporting and analysis; such a column is called a degenerate dimension or line-item dimension.
For example, suppose we have a fact table with customer_id, product_id, branch_id, employee_id, bill_no, and date in the key section, and price, quantity, and amount in the measure section. In this fact table, bill_no in the key section has no associated dimension table. Instead of creating a separate dimension table for that single value, we can include it in the fact table to improve performance. So here the column bill_no is a degenerate dimension or line-item dimension.