ETL Tools as Components of Data Warehouse Efficiency

The efficiency and effectiveness of a Data Warehouse (DW) depend mainly on its data extraction, transformation, and loading (ETL) components. The design and implementation of ETL are regarded as a supporting task for the DW.

ETL refers to the software tools that automatically extract, transform, and load data into the data warehouse.

Data Scrubbing (DS), one of the ETL tools, is used to ensure Data Quality (DQ).

DQ is a significant topic in data warehousing, data mining, and information systems. Low data quality degrades the quality of results and analyses, and in turn the decisions made on the basis of them.

A main challenge in the DS process is the ambiguity of the scrubbing decisions that the scrubbing algorithms must make (e.g., deciding whether two records are duplicates). Existing data scrubbing systems handle this ambiguity by choosing a single option, based on some heuristic, and ignoring all the others, which gives a false sense that the ambiguity has been removed. Generally, restarting the DS process from scratch is unavoidable whenever new evidence has to be incorporated.
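To illustrate the ambiguity, the following minimal Python sketch scores how similar two records are and keeps an explicit "uncertain" outcome instead of forcing a yes/no answer. It is not one of the algorithms compared in this thesis. The function names and the default thresholds 0.6 and 0.9 are assumptions chosen for illustration; the score itself comes from the standard library's difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] after trimming and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def classify_pair(a: str, b: str, low: float = 0.6, high: float = 0.9):
    """Label a pair of records as duplicate, distinct, or uncertain.

    The thresholds are illustrative assumptions, not values from the thesis.
    Pairs that fall between them are reported as "uncertain" rather than
    being silently resolved by a single heuristic.
    """
    score = similarity(a, b)
    if score >= high:
        return "duplicate", score
    if score < low:
        return "distinct", score
    return "uncertain", score
```

Pairs labeled "uncertain" can be kept for later review, so that new evidence only requires revisiting those pairs rather than rerunning the whole process.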

This thesis presents a comparison table and an analysis of DS (data cleansing) algorithms. The measures used in the comparison are accuracy and time complexity. The table also lists the advantages and disadvantages of each algorithm. Additionally, we compare and analyze the DS frameworks and determine the best one.

This thesis focuses on two DS problems: missing values and duplicate records. Microsoft SQL Server and Microsoft SQL Server Analysis Services are used in this thesis. Microsoft SQL Server provides two key technologies: Query Analyzer and Data Transformation Services (DTS).

We have two approaches to solving these problems:

  1. Manual. This approach is time-consuming and may not be feasible for a large data set with poor data quality.
  2. Automated. This approach is suitable for dealing with the two data scrubbing problems, missing values and duplicate records.
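As a rough illustration of the automated approach applied to missing values, the sketch below counts missing entries per column and fills them with the most frequent value of that column. It is a simplified stand-in, not the algorithm proposed in this thesis. The set of tokens treated as missing, the function names, and the fill-with-most-frequent rule are all assumptions made for the example.

```python
from collections import Counter

# Assumption: these strings, compared case-insensitively, mean "no value".
MISSING_TOKENS = {"", "null", "n/a", "na", "?"}


def is_missing(value) -> bool:
    """Return True for None and for strings that match a missing token."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_TOKENS


def count_missing(rows, columns):
    """Count missing values per column in a list of dict records."""
    return {c: sum(is_missing(r.get(c)) for r in rows) for c in columns}


def fill_missing(rows, column):
    """Return new records with missing values in `column` replaced by the
    most frequent non-missing value of that column (an illustrative rule).
    The input records are left unchanged."""
    present = [r[column] for r in rows if not is_missing(r.get(column))]
    if not present:
        return [dict(r) for r in rows]
    fill = Counter(present).most_common(1)[0][0]
    return [
        {**r, column: fill} if is_missing(r.get(column)) else dict(r)
        for r in rows
    ]
```

Returning new records rather than editing the input in place keeps the original data untouched, which matches the requirement that scrubbing should not modify the source directly.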

In this thesis we propose two algorithms:

  1. An algorithm for detecting and treating missing values. It combines two approaches, Data Wrangler and An Enhanced Technique for Data Cleaning, which are applied one after the other to remove missing values.
  2. An algorithm for detecting and treating duplicate records. In practice, organizations do not allow the data scrubbing process to run directly on their databases. Data scrubbing appears as part of ETL, so we develop the procedure based on the one above. To avoid making any changes to the database directly, we propose the following steps:

After detecting duplicate records, add a temporary table, named the imagination table.

Move the data from the original table to the imagination table.

Remove the duplicate records from the imagination table instead of from the original table.

These steps are very important because problems can arise when there is more than one relationship between tables.
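The steps above can be sketched as follows. The thesis uses Microsoft SQL Server; this example uses Python's built-in sqlite3 module only so that it is self-contained, and it reads "move" as copying, so the original table stays intact. The function name, the "_imagination" suffix, and the rule of keeping the first occurrence of each duplicate group are assumptions made for illustration.

```python
import sqlite3


def deduplicate_via_imagination_table(conn, table, key_columns):
    """Copy `table` into a temporary table and remove duplicates there.

    Rows are duplicates when they agree on every column in `key_columns`;
    the first occurrence (lowest rowid) of each group is kept.
    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    imagination = f"{table}_imagination"  # hypothetical naming convention
    cols = ", ".join(key_columns)
    cur = conn.cursor()

    # Step 1: add the temporary (imagination) table.
    cur.execute(f"DROP TABLE IF EXISTS temp.{imagination}")
    # Step 2: copy the data from the original table into it.
    cur.execute(f"CREATE TEMP TABLE {imagination} AS SELECT * FROM {table}")
    # Step 3: remove duplicates from the copy, not from the original.
    cur.execute(
        f"DELETE FROM temp.{imagination} WHERE rowid NOT IN "
        f"(SELECT MIN(rowid) FROM temp.{imagination} GROUP BY {cols})"
    )
    conn.commit()
    return imagination
```

Because the deletions touch only the temporary copy, rows in other tables that reference the original table are unaffected until the cleaned data is deliberately loaded back.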
