Friday 20 January 2017

What is a Data warehouse (DW)?,




What is a Data warehouse (DW)?
A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
-      WH Inmon
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis
This definition of the data warehouse focuses on data storage. Data warehouse arises in an organization’s need for reliable, consolidated, unique and integrated reporting and analysis of its data, at different levels of aggregation.
The practical reality of most organizations is that their data infrastructure is made up by a collection of heterogeneous systems. For example, an organization might have one system that handles customer-relationship, a system that handles employees, systems that handles sales data or production data, yet another system for finance and budgeting data etc. In practice, these systems are often poorly or not at all integrated and simple questions like: "How much time did sales person A spend on customer C, how much did we sell to Customer C, was customer C happy with the provided service, Did Customer C pay his bills" can be very hard to answer, even though the information is available "somewhere" in the different data systems.

Characteristics of Data Warehouse:     
                             A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Ø  Subject Oriented
Ø  Integrated
Ø  Nonvolatile
Ø  Time Variant
 


Subject Oriented:
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case makes the data warehouse subject oriented.

Integrated:
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile:
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant:
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.














OLTP vs. OLAP:

OLTP Or Operational Systems
OLAP Or Data Warehouse
Full Form
Online Transactional Processing
Online Analytical Processing
Purpose
OLTP systems are used to carry the day-to-day business operations.
OLAP is used for the analysis of the business.
Source of Data
OLTP’s are the original source of data
OLAP will get the data from the various OLTP Databases.
Type of Data
It will have the detailed data
It will have detailed and summarized data
End Users
Customers
Managers, Departmental Heads & Board of Directors.
No of End users
Millions
Hundreds to Thousands
Typical operations
Select, Inserts, Updates & Delete
Read Only ( i.e. Select)
Processing Speed
Very fast (Response time will be in ms)
Depending on the amount of data involved. Usually it will be slow.
Access Frequency
Large number of customers will access the OLTP applications for every second.
Rarely the OLAP databases are accessed by the managers. Example once in a day or monthly.
Volatile
In OLTP applications the data will be changing very fast (i.e. Volatile data)
In OLAP once the data is entered it is not going to change (i.e. Non Volatile Data)
Historical Data
OLTP’s will hold maximum of 30 to 90 Days of data.
OLAP will hold 3 to 5 Years of data.
Space Required
Need relatively less amount of space.
Need Large amount of space required since data warehouses will store historical data
DB Design
Database will be in Normalized form.
Database will be in Denormalized form.
DB Schema
E-R Model
Dimensional Model
Redundant Data
Less duplicate data
More Duplicate data
Indexes
Few
Many



Fact tables and dimension tables are the two types of objects commonly used in dimensional data warehouse schemas.

FACT TABLE:
Fact tables are the large tables in your warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit.
          Fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema or a snowflake schema, surrounded by dimension tables.
DIMENSION TABLE:
Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to contain queries. Dimension tables are usually textual and descriptive and you can use them as the row headers of the result set. Examples are customers or products.
The primary function of dimensions is threefold: to provide filtering, grouping and labeling. For example, in a data warehouse where each person is categorized as having a gender of male, female or unknown, a user of the data warehouse would then be able to filter or categorize each presentation or report by either filtering based on the gender dimension or displaying results broken out by the gender.
          Each dimension in a data warehouse may have one or more hierarchies applied to it. For the "Date" dimension, there are several possible hierarchies: "Day > Month > Year", "Day > Week > Year", "Day > Month > Quarter > Year", etc.

Hierarchy:
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.

Level:
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.

TYPES OF FACTS
Additive - Measures that can be added across all dimensions.
Non Additive - Measures that cannot be added across all dimensions.
Semi Additive - Measures that can be added across few dimensions and not with others.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

Date
Store
Product
Sales_Amount
         
The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.
Say we are a bank with the following fact table:

Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to
add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.

FACTLESS FACT TABLE:
          A factless fact table is a fact table that does not contain any facts (i.e., metrics). Factless fact tables are used to track events that did, or did not, happen.
          Use a factless fact table to track events of interest to the organization. For example, attendance at a cultural event can be tracked by creating a fact table containing the following foreign keys (i.e., links to dimension tables): event identifier, speaker/entertainer identifier, participant identifier, event type, date. This table can then be queried to find out information, such as which cultural events or event types are the most popular.
          Factless fact tables can also be used to track events that did not happen. Using the example above we can also identify the events that were not attended or the event types that were the least popular.
Factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events
Coverage information. Common examples of factless fact tables include:
- Identifying product promotion events (to determine promoted products that didn’t sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university.
TYPES OF DIMENSIONS
Conformed Dimension:
Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other. Most important, the row headers produced in the answer sets from two different conformed dimensions must be able to match perfectly.
Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension. Dimension tables are not conformed if the attributes are labeled differently or contain different values. Conformed dimensions come in several

Different flavors. At the most basic level, conformed dimensions mean the exact same thing
With every possible fact table to which they are joined. The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.

Junk Dimension:
A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, these flags and indicators are removed from the fact table while placing them into a useful dimensional framework.
Junk dimensions are dimensions that contain miscellaneous data (like flags and indicators) that do not fit in the base dimension table.
When developing a dimensional model, we often encounter miscellaneous indicators and flags that don't logically belong to the core dimension tables. These unattached attributes are usually too valuable to ignore or exclude. Designers sometimes want to treat them as facts (supposed textual facts) or clutter the design with numerous small dimensional tables. A third, less obvious but preferable, solution is to incorporate a junk dimension as a holding place for these flags and indicators.
An interesting use for a junk dimension is to capture the context of a specific transaction. While our common, conformed dimensions contain the key dimensional attributes of interest, there are likely attributes about the transaction that are not known until the transaction is processed.
For example, a healthcare insurance provide may need to capture the context surrounding their claims transactions. The grain for this key business process is one row for each line item on a claim. Due to the complexities of the healthcare industry, similar claims may be handled quite differently. They may design separate junk dimensions to capture the context of how the claim was processed, how it was paid, and the contractual relationship between the healthcare providers at the time of the claim.
There are two approaches for creating junk dimensions. The first is to create the junk dimension table in advance. Each possible, unique combination generates a row in the junk dimension table. The second approach is to create the rows in the junk dimension on the fly during the extract, transformation, and load (ETL) process. As new unique combinations are encountered, a new row with its surrogate key is created and loaded into the junk dimension table.
If the total number of possible rows in the junk dimension is relatively small, it is probably best to create the rows in advance. On the other hand, if the total number of possible rows in the junk dimension is large, it may be more advantageous to create the junk dimension as unique rows are encountered. One of the junk dimensions encountered in
The recent healthcare design had over 1 trillion theoretical rows, while the actual number of observed rows was tens of thousands.
IT Online Training @ Monstercourses
Obviously, it did not make sense to create all the theoretically possible rows in advance. If the number of rows in the junk dimension approaches or exceeds the number of rows in the fact table, the design should be clearly re-evaluated.
Since a junk dimension includes all valid combinations of attributes, it will automatically track any changes in dimension attributes. Therefore slowly changing dimension strategies do not need to be considered for junk dimensions.

Degenerated Dimension:
A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table's primary key.

Role-Playing Dimensions:
Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".

Data Mart (DM):
A Data Mart is the subset of the Datawarehouse Monstercourses (DW). Usually data marts are oriented or focused towards one Line of Business (LOB) or one Department or one Subject Area (Examples such as Sales, Marketing or Finance) where as Data Warehouse deals with multiple subject areas.


You can have two types of Data Marts.
Ø  Dependent Data Marts
Ø  Independent Data Marts

Dependent Data Mart: If the data marts are designed using conformed dimensions, then they will be related, Such Data Marts are called as Dependent Data Marts. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication costs resulting from local access of data relevant to a specific department.

Independent Data Mart: A data mart that is not dependent on the other data marts is called the independent data mart. The creation of independent data marts is often driven by the need to have a solution within a shorter time.

Bottom-up design:
          Ralph Kimball, a well-known author on data warehousing, is a proponent of an approach to data warehouse design frequently considered as bottom-up. In the so-called bottom-up approach data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data. These data marts can eventually be unioned together to create a comprehensive data warehouse. The combination of data marts is managed through the implementation of what Kimball calls "a Datawarehouse Monstercourses bus architecture"
          Business value can be returned as quickly as the first data marts can be created. Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse. The most important management task is making sure dimensions among data marts are consistent. In Kimball's words, this means that the dimensions "conform".



Top-down design:
Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise. Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the

Data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities.

Data Warehouse Vs Data Mart:

Data Warehouse
Data Mart
Subjects
Multiple subject areas data.
Single partial or multiple partial subject areas data.
Data Perspective
Will have the Historical Detailed Data.
Will have the Detailed data with some historical data.
Implement time Frame
9–18 Months for the first stage.
Multiple stage implementations.
4-12 Months
ROI(Return of Investment)
Will take more time to get the ROI
Will take less time to get the ROI.
Implementation Cost
Expensive
Relatively Cheap.
Change Management
It is very difficult to implement the change.
Easy to implement the Change.
Corporate Support
It is very difficult to get the continuous support from all the departments.
Can lead to independent data marts.
Database Design
Technical challenges in building the large databases.
We can easily build the databases.

Star Schema:
Star Schema is a relational database schema for representing multi-dimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple
Dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantages of star schema are slicing down, performance increase and easy understanding of data.












Snowflake Schema:
A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables. I.e. dimension table hierarchies are broken into simpler tables.
In star schema example we had 4 dimensions like location, product, time, organization and a fact table (sales).
Snowflake schema, the example diagram shown above has 4 dimension tables, 4 lookup tables and 1 fact table. The reason is that hierarchies (category, branch, state, and month) are being broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately. In OLAP, this Snowflake schema approach increases the number of joins and poor performance in retrieval of data. In few organizations, they try to normalize the dimension tables to save space. Since dimension tables hold less space, Snowflake schema approach may be avoided.


Surrogate Key:
A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
Data warehouses typically use a surrogate, (also known as artificial or identity key or Meaningless Key), key for the dimension tables primary keys. They can use Info sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key. It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users) but, not only can these change,
Indexing on a numerical value is probably better and you could consider creating a

Surrogate key called, say, AIRPORT_ID. This would be internal to the system and as far as
The client is concerned you may display only the AIRPORT_NAME.
Another benefit you can get from surrogate keys (SID) is: Tracking the SCD - Slowly Changing Dimension.

Slowly Changing Dimensions:
Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product price changes over time; People change their names for some reason; Country and State names may change over time. These are a few examples of Slowly Changing Dimensions since some changes are happening to them over a period of time.
Example:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:
Customer Key
Name
State
1001
Christina
Illinois

At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" issue.
There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois

After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table:

Customer Key
Name
State
1001
Christina
California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:

Customer Key
Name
State
1001
Christina
Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer Key
Name
State
Eff_Start_Date
End_Date
1001
Christina
Illinois
01-June-2011
05_June_2011
1005
Christina
California
05-June-2011
null
         
Advantages:
- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
Type 3: The original record is modified to reflect the change.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois
         
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
  • Customer Key
  • Name
  • Original State
  • Current State
  • Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):
Customer Key
Name
Original State
Current State
Effective Date
1001
Christina
Illinois
California
15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.

Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
When to use Type 3:
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.

Grain Level: Is the level at which data as to be captured in the fact table and the dimension table. Grain level should be as low as possible.
Data warehouses and their architectures vary depending upon the specifics organization's situation.
Three common architectures are:
Ø  Data Warehouse Architecture (Basic)
Ø  Data Warehouse Architecture (with a Staging Area)
Ø  Data Warehouse Architecture (with a Staging Area and Data Marts)
Data Warehouse Architecture (Basic)
Figure below shows a simple architecture for a data warehouse. End users directly
Access data derived from several source systems through the data warehouse.
In Figure, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data Warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.
Data Warehouse Architecture (with a Staging Area)
In Figure below, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a
Staging area instead. A staging area simplifies building summaries and general warehouse management. Figure illustrates this typical architecture.










Data Warehouse Architecture (with a Staging Area and Data Marts)
Although the architecture in Figure above is quite common, you may want to customize your warehouse’s architecture for different groups within your organization.
You can do this by adding data marts, which are systems designed for a particular line of business. Figure below illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.










There are three types of data warehouses:
  1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
  2. ODS (Operational Data Store) - This has a broad enterprise wide scope, but unlike the real entertprise data warehouse, data is refreshed in near real time and used for routine business activity. An ODS may contain 30 to 90 days of information.
  3. Data Mart - Datamart is a subset of data warehouse and it supports a particular region, business unit or business function.
IF ANY Video Tutorials or IT Online Training NEEDED PLEASE MAIL TO info@monstercourses.com, http://www.monstercourses.com/