Thursday, September 29, 2011

Interview Questions for Data Warehousing

What is Data warehousing?

A data warehouse can be considered as a storage area where interest specific or relevant data is stored irrespective of the source. What actually is required to create a data warehouse can be considered as Data Warehousing. Data warehousing merges data from multiple sources into an easy and complete form.

What are fact tables and dimension tables?

As mentioned, data in a warehouse comes from the transactions. Fact table in a data warehouse consists of facts and/or measures. The nature of data in a fact table is usually numerical. On the other hand, dimension table in a data warehouse contains fields used to describe the data in fact tables. A dimension table can provide additional and descriptive information (dimension) of the field of a fact table.

e.g. If I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my Dimension table will store the task and resource details. Hence, the relation between a fact and dimension table is one to many.



What is ETL process in data warehousing?

ETL stands for Extraction, transformation and loading. That means extracting data from different sources such as flat files, databases or XML data, transforming this data depending on the application’s need and loads this data into data warehouse.



Explain the difference between data mining and data warehousing?

Data mining is a method for comparing large amounts of data for the purpose of finding patterns. Data mining is normally used for models and forecasting. Data mining is the process of correlations, patterns by shifting through large data repositories using pattern recognition techniques.

Data warehousing is the central repository for the data of several business systems in an enterprise. Data from various resources extracted and organized in the data warehouse selectively for analysis and accessibility.



What is an OLTP system and OLAP system?

OLTP stands for OnLine Transaction Processing. Applications that supports and manges transactions which involve high volumes of data are supported by OLTP system. OLTP is based on client-server architecture and supports transactions across networks.

OLAP stands for OnLine Analytical Processing. Business data analysis and complex calculations on low volumes of data are performed by OLAP. An insight of data coming from various resources can be gained by a user with the support of OLAP.



What are cubes?

Multi dimensional data is logically represented by Cubes in data warehousing. The dimension and the data are represented by the edge and the body of the cube respectively. OLAP environments view the data in the form of hierarchical cube. A cube typically includes the aggregations that are needed for business intelligence queries.



What is snow flake scheme design in database?

Snow flake schema is one of the designs that are present in database design. Snow flake schema serves the purpose of dimensional modeling in data warehousing. If the dimensional table is split into many tables, where the schema is inclined slightly towards normalization, then the snow flake design is utilized. It contains joins in depth. The reason is that, the tables split further.



Explain the difference between star and snowflake schemas?

Star schema: A highly de-normalized technique. A star schema has one fact table and is associated with numerous dimensions table and depicts a star.



Snow flake schema: The normalized principles applied star schema is known as Snow flake schema. Every dimension table is associated with sub dimension table.



Differences:



* A dimension table will not have parent table in star schema, whereas snow flake schemas have one or more parent tables.

* The dimensional table itself consists of hierarchies of dimensions in star schema, where as hierarchies are split into different tables in snow flake schema. The drilling down data from top most hierarchies to the lowermost hierarchies can be done.



What is the difference between view and materialized view?

A view is created by combining data from different tables. Hence, a view does not have data of itself.



On the other hand, Materialized view usually used in data warehousing has data. This data helps in decision making, performing calculations etc. The data stored by calculating it before hand using queries. When a view is created, the data is not stored in the database. The data is created when a query is fired on the view. Whereas, data of a materialized view is stored.



What is junk dimension?

A single dimension is formed by lumping a number of small dimensions. This dimension is called a junk dimension. Junk dimension has unrelated attributes. The process of grouping random flags and text attributes in dimension by transmitting them to a distinguished sub dimension is related to junk dimension.



What is degenerate dimension table?

A degenerate table does not have its own dimension table. It is derived from a fact table. The column (dimension) which is a part of fact table but does not map to any dimension.

E.g. employee_id



What is conformed fact and conformed dimensions use for?

Conformed fact in a warehouse allows itself to have same name in separate tables. They can be compared and combined mathematically. Conformed dimensions can be used across multiple data marts. These conformed dimensions have a static structure. Any dimension table that is used by multiple fact tables can be conformed dimensions.



What is Virtual Data Warehousing?

A virtual data warehouse provides a compact view of the data inventory. It contains Meta data. It uses middleware to build connections to different data sources. They can be fast as they allow users to filter the most important pieces of data from different legacy applications.



What is active data warehousing?

An Active data warehouse aims to capture data continuously and deliver real time data. They provide a single integrated view of a customer across multiple business lines. It is associated with Business Intelligence Systems



What is the difference between dependent and independent data warehouse?

A dependent data warehouse stored the data in a central data warehouse. On the other hand independent data warehouse does not make use of a central data warehouse.



Difference between data modeling and data mining?

Data modeling aims to identify all entities that have data. It then defines a relationship between these entities. Data models can be conceptual, logical or Physical data models. Conceptual models are typically used to explore high level business concepts in case of stakeholders. Logical models are used to explore domain concepts. While Physical models are used to explore database design.

Data mining is used to examine or explore the data using queries. These queries can be fired on the data warehouse. Data mining helps in reporting, planning strategies, finding meaningful patterns etc. it can be used to convert a large amount of data into a sensible form.



What is the difference between ER Modeling and Dimensional Modeling?

ER modeling, that models an ER diagram represents the entire businesses or applications processes. This diagram can be segregated into multiple Dimensional models. This is to say, an ER model will have both logical and physical model. The Dimensional model will only have physical model.



What is Data Mart?

Data mart stores particular data that is gathered from different sources. Particular data may belong to some specific community (group of people) or genre. Data marts can be used to focus on specific business needs.



What are various methods of loading Dimension tables?

Conventional load: Here the data is checked for any table constraints before loading.

Direct or Faster load: The data is directly loaded without checking for any constraints.



What is the difference between OLAP and data warehouse?

A data warehouse serves as a repository to store historical data that can be used for analysis. OLAP is Online Analytical processing that can be used to analyze and evaluate data in a warehouse. The warehouse has data coming from varied sources. OLAP tool helps to organize data in the warehouse using multidimensional models.



Describe the foreign key columns in fact table and dimension table?

The primary keys of entity tables are the foreign keys of dimension tables.

The Primary keys of fact dimensional table are the foreign keys of fact tables.



Define the term slowly changing dimensions (SCD)?

SCD are dimensions whose data changes very slowly. An example of this can be city of an employee. This dimension will change very slowly. The row of this data in the dimension can be either replaced completely without any track of old record OR a new row can be inserted, OR the change can be tracked



What is a Star Schema?

A star schema comprises of fact and dimension tables. Fact table contains the fact or the actual data. Usually numerical data is stored with multiple columns and many rows. Dimension tables contain attributes or smaller granular data. The fact table in start schema will have foreign key references of dimension tables.



What is the difference between star and snowflake schema?

Star Schema: A de-normalized technique in which one fact table is associated with several dimension tables. It resembles a star.

Snow Flake Schema: A star schema that is applied with normalized principles is known as Snow flake schema. Every dimension table is associated with sub dimension table.



Explain the use lookup tables and Aggregate tables?

An aggregate table contains summarized view of data. Lookup tables, using the primary key of the target, allow updating of records based on the lookup condition.



What is real time data-warehousing?

In real time data-warehousing, the warehouse is updated every time the system performs a transaction. It reflects the businesses real time information. This means that when the query is fired in the warehouse, the state of the business at that time will be returned.



Define non-additive facts?

The facts that can not be summed up for the dimensions present in the fact table are called non-additive facts. The facts can be useful if there are changes in dimensions. For example, profit margin is a non-additive fact for it has no meaning to add them up for the account level or the day level.



Define BUS Schema?

A BUS schema is to identify the common dimensions across business processes, like identifying conforming dimensions. BUS schema has conformed dimension and standardized definition of facts.



What is data cleaning? How can we do that?

Data cleaning is the process of identifying erroneous data. The data is checked for accuracy, consistency, typos etc.



Data cleaning Methods:

Parsing – Used to detect syntax errors.

Data Transformation – Confirms that the input data matches in format with expected data.

Duplicate elimination – This process gets rid of duplicate entries.

Statistical Methods- values of mean, standard deviation, range, or clustering algorithms etc are used to find erroneous data.



What is the purpose of Fact less Fact Table?

Fact less tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don’t really have facts or any information but are more commonly used for tracking some information of an event.

Eg. To find the number of leaves taken by an employee in a month.



What is a level of Granularity of a fact table?

A fact table is usually designed at a low level of Granularity. This means that we need to find the lowest level of information that can store in a fact table.

E.g. Employee performance is a very high level of granularity. Employee_performance_daily, employee_perfomance_weekly can be considered lower levels of granularity.



What is Bit Mapped Index?

Bitmap indexes make use of bit arrays (bitmaps) to answer queries by performing bitwise logical operations.

They work well with data that has a lower cardinality which means the data that take fewer distinct values.

Bitmap indexes are useful in the data warehousing applications.

Bitmap indexes have a significant space and performance advantage over other structures for such data.

Tables that have less number of insert or update operations can be good candidates.



The advantages of Bitmap indexes are:

They have a highly compressed structure, making them fast to read.

Their structure makes it possible for the system to combine multiple indexes together so that they can access the underlying table faster.



The Disadvantage of Bitmap indexes is:

The overhead on maintaining them is enormous.



What is Data Cardinality?

Cardinality is the term used in database relations to denote the occurrences of data on either side of the relation.



There are 3 basic types of cardinality:

High data cardinality:Values of a data column are very uncommon.

e.g.: email ids and the user names

Normal data cardinality:Values of a data column are somewhat uncommon but never unique.

e.g.: A data column containing LAST_NAME (there may be several entries of the same last name)

Low data cardinality:Values of a data column are very usual.

e.g.: flag statuses: 0/1



Determining data cardinality is a substantial aspect used in data modeling. This is used to determine the relationships

Types of cardinalities:

The Link Cardinality – 0:0 relationships

The Sub-type Cardinality – 1:0 relationships

The Physical Segment Cardinality – 1:1 relationship

The Possession Cardinality – 0: M relation

The Child Cardinality – 1: M mandatory relationship

The Characteristic Cardinality – 0: M relationship

The Paradox Cardinality – 1: M relationship.