Thursday, January 22, 2009

CHAID in Data Mining

Chi Square Automatic Interaction Detection. A decision tree technique used for classification of a dataset.

Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome.

Segments a dataset by using chi square tests to create multi-way splits. Preceded, and requires more data preparation than, CART.

CART in Data Warehousing

Classification and Regression Trees. A decision tree technique used for classification of a dataset.

Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome.

Segments a dataset by creating 2-way splits. Requires less data preparation than CHAID.

Data Mining - Intro

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.

Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.

The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.

Data mining tools can answer business questions that traditionally were too time consuming to resolve.

They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Sunday, January 18, 2009

Advantages and Disadvantages of ROLAP

Advantages:

-Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

-Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

-Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

-Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

Advantages and Disadvantages of MOLAP

Advantages:

-Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing and dicing operations.

-Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

-Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

-Requires additional investment: Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

MOLAP Vs ROLAP Vs HOLAP

MOLAP : This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

ROLAP : This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

HOLAP : HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Data Warehousing Terms A to Z

Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use them go faster.

Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension.

Conformed Dimension: A dimension that has exactly the same meaning and content when being referred from different fact tables.

Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content.

Data Warehouse: A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process (as defined by Bill Inmon).

Data Warehousing: The process of designing, building, and maintaining a data warehouse system.

Dimension: The same category of information. For example, year, month, day, and week are all part of the Time Dimension.

Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table records all the "fact", or measures.

Dimensional Table: Dimension tables store records related to this particular dimension. No facts are stored in a dimensional table.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational database.

Drill Up: Data analysis to a parent attribute.

ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another.

Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions.

Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a hierarchy belong to the same dimension.

Metadata: Data about data. For example, the number of tables in the database is a type of metadata.

Metric: A measured value. For example, total sales is a metric.

MOLAP: Multidimensional OLAP. MOLAP systems store data in the multidimensional cubes.

OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick way of slicing and dicing the data.

ROLAP: Relational OLAP. ROLAP systems store data in the relational database.

Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more than a single dimension table.

Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by a single dimension table.

Business Intelligence

Business intelligence is a term commonly associated with data warehousing.

In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software.

There are other occasions where the two terms are used interchangeably.

Business intelligence usually refers to the information that is available for the enterprise to make decisions on.

A data warehousing (or data mart) system is the backend, or the infrastructural, component for achieving business intelligence.

Business intelligence also includes the insight gained from doing data mining analysis, as well as unstrctured data (thus the need fo content management systems).

For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.

Data Mart in Data Warehousing

A data mart (DM) is a specialized version of a data warehouse (DW).

Like data warehouses, data marts contain a snapshot of operational data that helps business people to strategize based on analyses of past trends and experiences.

The key difference is that the creation of a data mart is predicated on a specific, predefined need for a certain grouping and configuration of select data.

A data mart configuration emphasizes easy access to relevant information (Reference : Wiki).

Data Marts are designed to help manager make strategic decisions about their business.

Define : Surrogate Key

Surrogate key is a substitution for the natural primary key in Data Warehousing.

It is just a unique identifier or number for each row that can be used for the primary key to the table.

The only requirement for a surrogate primary key is that it is unique for each row in the table.

It is useful because the natural primary key can change and this makes updates more difficult.

Surrogated keys are always integer or numeric.

Bus Schema Vs Str Schema Vs Snow Flake Schema

Bus, Star, Snow flake Schemas in Data Warehousing

BUS Schema is composed of a master suite of confirmed dimension and standardized definition if facts.

Star schema is a type of organizing the tables such that we can retrieve the result from the database quickly in the warehouse environment.

Snowflake Schema, each dimension has a primary dimension table, to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table.

Define : Hybrid SCD

Hybrid Slowly Changing Dimension in Data Warehousing:

Hybrid SCDs are combination of both SCD 1 and SCD 2.

It may happen that in a table, some columns are important and we need to track changes for them i.e capture the historical data for them whereas in some columns even if the data changes, we don't care.

Define : SCD (Slowly Changing Dimensions)

SCD is abbreviation of Slowly changing dimensions in Data Warehousing.

SCD applies to cases where the attribute for a record varies over time. There are three different types of SCD.

a) SCD1 : The new record replaces the original record.
Only one record exist in database - current data.

b) SCD2 : A new record is added into the customer dimension table.
Two records exist in database - current data and previous history data.

c) SCD3 : The original data is modified to include new data.
One record exist in database - new information are attached with old information in same row.

Non-Additive Facts in Data Warehousing

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

However they are not considered as useless. If there is changes in dimensions the same facts can be useful.

Level of Granularity of a Fact Table

It means level of detail that you put into the fact table in a data warehouse.

Level of granularity would mean what detail are you willing to put for each transactional fact.

Define : Confirmed Fact

Conformed dimensions are the dimensions which can be used across multiple Data Marts in combination with multiple facts tables accordingly.

Conformed Dimensions in Data Warehousing

Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.

They are common to the cubes.

Define : Real Time Data warehousing

Data warehousing captures business activity data.

Real-time data warehousing captures business activity data as it occurs.

As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly.

Full Load Vs Incremental (Refresh) Load in DW

Full Load: completely erasing the contents of one or more tables and reloading with fresh data.

Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule

Refresh Load: the table will be truncated and data will be loaded again. Here we use to load static dimension table or type tables using this method.

Incremental Load: It is a method to capture on the newly created or updated record. Based upon the falg or Date this load will be performed.

Full Load: when we are loading the data for first time, either it may be a base load or history all the set of records will be loaded at a strech depends upon the volume.

Define : ODS (Operation Data Source)

ODS - Operational Data Store.

ODS Comes between staging area & Data Warehouse.

The data is ODS will be at the low level of granularity.

Once data was poopulated in ODS aggregated data will be loaded into into EDW through ODS.

It is also a simalar small DWH which will help analyst to analysis the bussiness.

It will have data for less number of days. generally it will be around 30-45 days.

Like DWH here also we will primary keys will be genrated, error and reject handling will be done.

Define : Aggregate Tables in Data Warehousing

Aggregate table contains the summary of existing warehouse data which is grouped to certain levels of dimensions.

It is always easy to retrieve data from aggregated tables than visiting original table which has million records.

Aggregate tables reduces the load in the database server and increases the performance of the query and can retrieve the result quickly.

Define : Lookup Tables

A lookup table is the table placed on the target table based upon the primary key of the target, it just updates the table by allowing only modified (new or updated) records based on thelookup condition

Define : VLDB

VLDB is abbreviation of Very Large DataBase.

A one terabyte database would normally be considered to be a VLDB.

Typically, these are decision support systems or transaction processing applications serving large numbers of users.

Define : ETL

ETL is abbreviation of extract, transform, and load.

ETL is software that enables businesses to consolidate their disparate data while moving it from place to place, and it doesn't really matter that that data is in different forms or formats.

The data can come from any source.

ETL is powerful enough to handle such data disparities.

First, the extract function reads data from a specified source database and extracts a desired subset of data.

Next, the transform function works with the acquired data - using rules orlookup tables, or creating combinations with other data - to convert it to the desired state.

Finally, the load function is used to write the resulting data to a target database.

Define : ODS

ODS is abbreviation of Operational Data Store.

A database structure that is a repository for near real-time operational data rather than long term trend data.

The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as there operation databases.

Define : ER Diagram in Data Warehousing

Entity Relationship Diagrams are a major data modelling tool and will help organize the data in your project into entities and define the relationships between the entities.

This process has proved to enable the analyst to produce a good database structure so that the data can be stored and retrieved in a most efficient manner.

An entity-relationship (ER) diagram is a specialized graphic that illustrates the interrelationships between entities in a database.

A type of diagram used in data modeling for relational data bases. These diagrams show the structure of each table and the links between tables.

view Vs Materialized View

A view takes the output of a query and makes it appear like a virtual table and it can be used in place of tables.

A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.

Define : Data Mining

Data Mining is the process of analyzing data from different perspectives and summarizing it into useful information.

Foreign Key Columns in Fact and Dimension Table

Foreign keys of dimension tables are primary keys of entity tables.

Foreign keys of facts tables are primary keys of Dimension tables.

OLTP vs OLAP : Difference

Data Source OLTP: Operational data is from original data source of the data

OLAP: Consolidation data is from various source.

Process Goal OLTP: Snapshot of business processes which does fundamental business tasks

OLAP: Multi-dimensional views of business activities of planning and decision making Queries and Process Scripts OLTP: Simple quick running queries ran by users.

OLAP: Complex long running queries by system to update the aggregated data. Database

Design OLTP: Normalized small database. Speed will be not an issue due to smaller database and normalization will not degrade performance. This adopts entity relationship(ER) model and an application-oriented database design.

OLAP: De-normalized large database. Speed is issue due to larger database and de-normalizing will improve performance as there will be lesser tables to scan while performing tasks.

This adopts star, snowflake or fact constellation mode of subject-oriented database design. Back up and System Administration OLTP: Regular Database backup and system administration can do the job.

OLAP: Reloading the OLTP data is good considered as good backup option.

Define : OLAP

OLAP is abbreviation of Online Analytical Processing.

This system is an application that collects, manages, processes and presents multidimensional data for analysis and management purposes.

Define : OLTP

OLTP is abbreviation of On-Line Transaction Processing.

This system is an application that modifies data the instance it receives and has a large number of concurrent users.

Loading Dimension Tables in Data warehousing

Conventional (Slow) : All the constraints and keys are validated against the data before, it is loaded, this way data integrity is maintained.


Direct (Fast) : All the constraints and keys are disabled before the data is loaded. Once data is loaded, it is validated against all the constraints and keys.

If data is found invalid or dirty it is not included in index and all future processes are skipped on this data.

Define : Dimension Table

Dimensional table contains textual attributes of measurements stored in the facts tables.

Dimensional table is a collection of hierarchies, categories and logic which can be used for user to traverse in hierarchy nodes.

Fact Table in Data Warehousing

Fact table contains measurements of business process.

Fact table contains the foreign keys for the dimension tables. Example, if you are business process is "paper production", "average production of paper by one machine" or "weekly production of paper" will be considered as measurement of business process.

Dimensional Modeling in Data Warehousing

Dimensional data model concept involves two types of tables and it is different from the 3rd normal form.

This concepts uses Facts table which contains the measurements of the business and Dimension table which contains the context(dimension of calculation) of the measurements.

Fundamental Stages of Data Wareshousing

Offline Operational Databases -Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.

Offline Data Warehouse -Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.;

Datawarehousing : Intro

What is Data Warehousing?

A data warehouse is the main repository of an organization's historical data, its corporate memory.

It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems (Ref:Wikipedia).

Data warehousing collection of data designed to support management decision making.

Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time.

It is a repository of integrated information, available for queries and analysis.