Data Warehousing Tips: Tips

Showing posts with label Tips. Show all posts

Thursday, September 29, 2011

Interview Questions for Data Warehousing

What is Data warehousing?

A data warehouse can be considered as a storage area where interest specific or relevant data is stored irrespective of the source. What actually is required to create a data warehouse can be considered as Data Warehousing. Data warehousing merges data from multiple sources into an easy and complete form.

What are fact tables and dimension tables?

As mentioned, data in a warehouse comes from the transactions. Fact table in a data warehouse consists of facts and/or measures. The nature of data in a fact table is usually numerical. On the other hand, dimension table in a data warehouse contains fields used to describe the data in fact tables. A dimension table can provide additional and descriptive information (dimension) of the field of a fact table.

e.g. If I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my Dimension table will store the task and resource details. Hence, the relation between a fact and dimension table is one to many.

What is ETL process in data warehousing?

ETL stands for Extraction, transformation and loading. That means extracting data from different sources such as flat files, databases or XML data, transforming this data depending on the application’s need and loads this data into data warehouse.

Explain the difference between data mining and data warehousing?

Data mining is a method for comparing large amounts of data for the purpose of finding patterns. Data mining is normally used for models and forecasting. Data mining is the process of correlations, patterns by shifting through large data repositories using pattern recognition techniques.

Data warehousing is the central repository for the data of several business systems in an enterprise. Data from various resources extracted and organized in the data warehouse selectively for analysis and accessibility.

What is an OLTP system and OLAP system?

OLTP stands for OnLine Transaction Processing. Applications that supports and manges transactions which involve high volumes of data are supported by OLTP system. OLTP is based on client-server architecture and supports transactions across networks.

OLAP stands for OnLine Analytical Processing. Business data analysis and complex calculations on low volumes of data are performed by OLAP. An insight of data coming from various resources can be gained by a user with the support of OLAP.

What are cubes?

Multi dimensional data is logically represented by Cubes in data warehousing. The dimension and the data are represented by the edge and the body of the cube respectively. OLAP environments view the data in the form of hierarchical cube. A cube typically includes the aggregations that are needed for business intelligence queries.

What is snow flake scheme design in database?

Snow flake schema is one of the designs that are present in database design. Snow flake schema serves the purpose of dimensional modeling in data warehousing. If the dimensional table is split into many tables, where the schema is inclined slightly towards normalization, then the snow flake design is utilized. It contains joins in depth. The reason is that, the tables split further.

Explain the difference between star and snowflake schemas?

Star schema: A highly de-normalized technique. A star schema has one fact table and is associated with numerous dimensions table and depicts a star.

Snow flake schema: The normalized principles applied star schema is known as Snow flake schema. Every dimension table is associated with sub dimension table.

Differences:

* A dimension table will not have parent table in star schema, whereas snow flake schemas have one or more parent tables.

* The dimensional table itself consists of hierarchies of dimensions in star schema, where as hierarchies are split into different tables in snow flake schema. The drilling down data from top most hierarchies to the lowermost hierarchies can be done.

What is the difference between view and materialized view?

A view is created by combining data from different tables. Hence, a view does not have data of itself.

On the other hand, Materialized view usually used in data warehousing has data. This data helps in decision making, performing calculations etc. The data stored by calculating it before hand using queries. When a view is created, the data is not stored in the database. The data is created when a query is fired on the view. Whereas, data of a materialized view is stored.

What is junk dimension?

A single dimension is formed by lumping a number of small dimensions. This dimension is called a junk dimension. Junk dimension has unrelated attributes. The process of grouping random flags and text attributes in dimension by transmitting them to a distinguished sub dimension is related to junk dimension.

What is degenerate dimension table?

A degenerate table does not have its own dimension table. It is derived from a fact table. The column (dimension) which is a part of fact table but does not map to any dimension.

E.g. employee_id

What is conformed fact and conformed dimensions use for?

Conformed fact in a warehouse allows itself to have same name in separate tables. They can be compared and combined mathematically. Conformed dimensions can be used across multiple data marts. These conformed dimensions have a static structure. Any dimension table that is used by multiple fact tables can be conformed dimensions.

What is Virtual Data Warehousing?

A virtual data warehouse provides a compact view of the data inventory. It contains Meta data. It uses middleware to build connections to different data sources. They can be fast as they allow users to filter the most important pieces of data from different legacy applications.

What is active data warehousing?

An Active data warehouse aims to capture data continuously and deliver real time data. They provide a single integrated view of a customer across multiple business lines. It is associated with Business Intelligence Systems

What is the difference between dependent and independent data warehouse?

A dependent data warehouse stored the data in a central data warehouse. On the other hand independent data warehouse does not make use of a central data warehouse.

Difference between data modeling and data mining?

Data modeling aims to identify all entities that have data. It then defines a relationship between these entities. Data models can be conceptual, logical or Physical data models. Conceptual models are typically used to explore high level business concepts in case of stakeholders. Logical models are used to explore domain concepts. While Physical models are used to explore database design.

Data mining is used to examine or explore the data using queries. These queries can be fired on the data warehouse. Data mining helps in reporting, planning strategies, finding meaningful patterns etc. it can be used to convert a large amount of data into a sensible form.

What is the difference between ER Modeling and Dimensional Modeling?

ER modeling, that models an ER diagram represents the entire businesses or applications processes. This diagram can be segregated into multiple Dimensional models. This is to say, an ER model will have both logical and physical model. The Dimensional model will only have physical model.

What is Data Mart?

Data mart stores particular data that is gathered from different sources. Particular data may belong to some specific community (group of people) or genre. Data marts can be used to focus on specific business needs.

What are various methods of loading Dimension tables?

Conventional load: Here the data is checked for any table constraints before loading.

Direct or Faster load: The data is directly loaded without checking for any constraints.

What is the difference between OLAP and data warehouse?

A data warehouse serves as a repository to store historical data that can be used for analysis. OLAP is Online Analytical processing that can be used to analyze and evaluate data in a warehouse. The warehouse has data coming from varied sources. OLAP tool helps to organize data in the warehouse using multidimensional models.

Describe the foreign key columns in fact table and dimension table?

The primary keys of entity tables are the foreign keys of dimension tables.

The Primary keys of fact dimensional table are the foreign keys of fact tables.

Define the term slowly changing dimensions (SCD)?

SCD are dimensions whose data changes very slowly. An example of this can be city of an employee. This dimension will change very slowly. The row of this data in the dimension can be either replaced completely without any track of old record OR a new row can be inserted, OR the change can be tracked

What is a Star Schema?

A star schema comprises of fact and dimension tables. Fact table contains the fact or the actual data. Usually numerical data is stored with multiple columns and many rows. Dimension tables contain attributes or smaller granular data. The fact table in start schema will have foreign key references of dimension tables.

What is the difference between star and snowflake schema?

Star Schema: A de-normalized technique in which one fact table is associated with several dimension tables. It resembles a star.

Snow Flake Schema: A star schema that is applied with normalized principles is known as Snow flake schema. Every dimension table is associated with sub dimension table.

Explain the use lookup tables and Aggregate tables?

An aggregate table contains summarized view of data. Lookup tables, using the primary key of the target, allow updating of records based on the lookup condition.

What is real time data-warehousing?

In real time data-warehousing, the warehouse is updated every time the system performs a transaction. It reflects the businesses real time information. This means that when the query is fired in the warehouse, the state of the business at that time will be returned.

Define non-additive facts?

The facts that can not be summed up for the dimensions present in the fact table are called non-additive facts. The facts can be useful if there are changes in dimensions. For example, profit margin is a non-additive fact for it has no meaning to add them up for the account level or the day level.

Define BUS Schema?

A BUS schema is to identify the common dimensions across business processes, like identifying conforming dimensions. BUS schema has conformed dimension and standardized definition of facts.

What is data cleaning? How can we do that?

Data cleaning is the process of identifying erroneous data. The data is checked for accuracy, consistency, typos etc.

Data cleaning Methods:

Parsing – Used to detect syntax errors.

Data Transformation – Confirms that the input data matches in format with expected data.

Duplicate elimination – This process gets rid of duplicate entries.

Statistical Methods- values of mean, standard deviation, range, or clustering algorithms etc are used to find erroneous data.

What is the purpose of Fact less Fact Table?

Fact less tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don’t really have facts or any information but are more commonly used for tracking some information of an event.

Eg. To find the number of leaves taken by an employee in a month.

What is a level of Granularity of a fact table?

A fact table is usually designed at a low level of Granularity. This means that we need to find the lowest level of information that can store in a fact table.

E.g. Employee performance is a very high level of granularity. Employee_performance_daily, employee_perfomance_weekly can be considered lower levels of granularity.

What is Bit Mapped Index?

Bitmap indexes make use of bit arrays (bitmaps) to answer queries by performing bitwise logical operations.

They work well with data that has a lower cardinality which means the data that take fewer distinct values.

Bitmap indexes are useful in the data warehousing applications.

Bitmap indexes have a significant space and performance advantage over other structures for such data.

Tables that have less number of insert or update operations can be good candidates.

The advantages of Bitmap indexes are:

They have a highly compressed structure, making them fast to read.

Their structure makes it possible for the system to combine multiple indexes together so that they can access the underlying table faster.

The Disadvantage of Bitmap indexes is:

The overhead on maintaining them is enormous.

What is Data Cardinality?

Cardinality is the term used in database relations to denote the occurrences of data on either side of the relation.

There are 3 basic types of cardinality:

High data cardinality:Values of a data column are very uncommon.

e.g.: email ids and the user names

Normal data cardinality:Values of a data column are somewhat uncommon but never unique.

e.g.: A data column containing LAST_NAME (there may be several entries of the same last name)

Low data cardinality:Values of a data column are very usual.

e.g.: flag statuses: 0/1

Determining data cardinality is a substantial aspect used in data modeling. This is used to determine the relationships

Types of cardinalities:

The Link Cardinality – 0:0 relationships

The Sub-type Cardinality – 1:0 relationships

The Physical Segment Cardinality – 1:1 relationship

The Possession Cardinality – 0: M relation

The Child Cardinality – 1: M mandatory relationship

The Characteristic Cardinality – 0: M relationship

The Paradox Cardinality – 1: M relationship.

Thursday, January 22, 2009

CHAID in Data Mining

Chi Square Automatic Interaction Detection. A decision tree technique used for classification of a dataset.

Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome.

Segments a dataset by using chi square tests to create multi-way splits. Preceded, and requires more data preparation than, CART.

CART in Data Warehousing

Classification and Regression Trees. A decision tree technique used for classification of a dataset.

Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome.

Segments a dataset by creating 2-way splits. Requires less data preparation than CHAID.

Data Mining - Intro

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.

Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.

The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.

Data mining tools can answer business questions that traditionally were too time consuming to resolve.

They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Sunday, January 18, 2009

Advantages and Disadvantages of ROLAP

Advantages:

-Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

-Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

-Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

-Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

Advantages and Disadvantages of MOLAP

Advantages:

-Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing and dicing operations.

-Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

-Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

-Requires additional investment: Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

MOLAP Vs ROLAP Vs HOLAP

MOLAP : This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

ROLAP : This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

HOLAP : HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Data Warehousing Terms A to Z

Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use them go faster.

Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension.

Conformed Dimension: A dimension that has exactly the same meaning and content when being referred from different fact tables.

Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content.

Data Warehouse: A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process (as defined by Bill Inmon).

Data Warehousing: The process of designing, building, and maintaining a data warehouse system.

Dimension: The same category of information. For example, year, month, day, and week are all part of the Time Dimension.

Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table records all the "fact", or measures.

Dimensional Table: Dimension tables store records related to this particular dimension. No facts are stored in a dimensional table.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational database.

Drill Up: Data analysis to a parent attribute.

ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another.

Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions.

Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a hierarchy belong to the same dimension.

Metadata: Data about data. For example, the number of tables in the database is a type of metadata.

Metric: A measured value. For example, total sales is a metric.

MOLAP: Multidimensional OLAP. MOLAP systems store data in the multidimensional cubes.

OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick way of slicing and dicing the data.

ROLAP: Relational OLAP. ROLAP systems store data in the relational database.

Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more than a single dimension table.

Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by a single dimension table.

Business Intelligence

Business intelligence is a term commonly associated with data warehousing.

In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software.

There are other occasions where the two terms are used interchangeably.

Business intelligence usually refers to the information that is available for the enterprise to make decisions on.

A data warehousing (or data mart) system is the backend, or the infrastructural, component for achieving business intelligence.

Business intelligence also includes the insight gained from doing data mining analysis, as well as unstrctured data (thus the need fo content management systems).

For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.

Data Mart in Data Warehousing

A data mart (DM) is a specialized version of a data warehouse (DW).

Like data warehouses, data marts contain a snapshot of operational data that helps business people to strategize based on analyses of past trends and experiences.

The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data.

A data mart configuration emphasizes easy access to relevant information (Reference : Wiki).

Data Marts are designed to help manager make strategic decisions about their business.

Define : Surrogate Key

Surrogate key is a substitution for the natural primary key in Data Warehousing.

It is just a unique identifier or number for each row that can be used for the primary key to the table.

The only requirement for a surrogate primary key is that it is unique for each row in the table.

It is useful because the natural primary key can change and this makes updates more difficult.

Surrogated keys are always integer or numeric.

Bus Schema Vs Str Schema Vs Snow Flake Schema

Bus, Star, Snow flake Schemas in Data Warehousing

BUS Schema is composed of a master suite of confirmed dimension and standardized definition if facts.

Star schema is a type of organizing the tables such that we can retrieve the result from the database quickly in the warehouse environment.

Snowflake Schema, each dimension has a primary dimension table, to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table.

Define : Hybrid SCD

Hybrid Slowly Changing Dimension in Data Warehousing:

Hybrid SCDs are combination of both SCD 1 and SCD 2.

It may happen that in a table, some columns are important and we need to track changes for them i.e capture the historical data for them whereas in some columns even if the data changes, we don't care.

Define : SCD (Slowly Changing Dimensions)

SCD is abbreviation of Slowly changing dimensions in Data Warehousing.

SCD applies to cases where the attribute for a record varies over time. There are three different types of SCD.

a) SCD1 : The new record replaces the original record.
Only one record exist in database - current data.

b) SCD2 : A new record is added into the customer dimension table.
Two records exist in database - current data and previous history data.

c) SCD3 : The original data is modified to include new data.
One record exist in database - new information are attached with old information in same row.

Non-Additive Facts in Data Warehousing

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

However they are not considered as useless. If there is changes in dimensions the same facts can be useful.

Level of Granularity of a Fact Table

It means level of detail that you put into the fact table in a data warehouse.

Level of granularity would mean what detail are you willing to put for each transactional fact.

Define : Confirmed Fact

Conformed dimensions are the dimensions which can be used across multiple Data Marts in combination with multiple facts tables accordingly.

Conformed Dimensions in Data Warehousing

Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.

They are common to the cubes.

Define : Real Time Data warehousing

Data warehousing captures business activity data.

Real-time data warehousing captures business activity data as it occurs.

As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.

Full Load Vs Incremental (Refresh) Load in DW

Full Load: completely erasing the contents of one or more tables and reloading with fresh data.

Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule

Refresh Load: the table will be truncated and data will be loaded again. Here we use to load static dimension table or type tables using this method.

Incremental Load: It is a method to capture on the newly created or updated record. Based upon the falg or Date this load will be performed.

Full Load: when we are loading the data for first time, either it may be a base load or history all the set of records will be loaded at a strech depends upon the volume.