ETL Process (Extract Transform Load)
Now a days by seeing current market situation, data processing becomes very crucial part of an effective business. Business partners are investing more in data processing since the amount (volume) and variety of data increases very rapidly.
To do effective business, data planning, processing and maintaining is playing very important role. Since last decade, ETL process became fruitful to flow business processes smoothly.
Data mining and Data Warehousing are the key concepts to analyze big data. Also to make best business decisions and strategies which can increase revenue and profit of organization.
Data Warehouse is the collection of historical data from multiple sources (homogeneous/heterogeneous data) which is used to analyze and querying data to retrieve some meaningful and knowledgeable information.
Data Warehousing is the process to store data from multiple sources to Data Warehouse which can be a text file, images, spreadsheet, operation data, relational/non-relational data etc., but heterogeneous data may contain ambiguity and redundancy called “Dirty Data“.
To deal with dirty data some cleaning processing must be applied to that heterogeneous data which is known as “Data Preprocessing“.
After data preprocessing, consistent data will be pulled from multiple sources and loaded to Data Warehouse, this entire process is called “ETL Process“.
ETL process mainly used for data cleaning, data processing, data loading. Here the loaded data is a meaningful data which will be used by end user for different purposes. Below are the main three stages shown of a ETL Process:
Main characteristic of ETL process is that extraction, transformation and loading stages can run in parallel.
Data is captured/extracted according to business requirement from source data files periodically. What to extract is depends on the available sources and business requirement.
To filter the data, validations rules are applied in Extraction process. Extraction process may have to modify validation rule to accept data for moving to the next phase based on business requirement. It is not one time process, changes are applied from source at regular time period.
There are various methods available to apply changes on extracted data:
- Notifying System
- Incremental Update
- Full Extraction
- Online Extraction
It is not necessary to store data directly to warehouse sometimes according to requirement it is stored into “Warehouse Staging Area” which is temporary storage part of Data Warehouse.
Extracted data are transformed from source file to Staging and Staging to Data Warehouse and Warehouse to Data Mart.
Extracted Data needs to be transformed into structured format which is compatible with pre-defined Data Warehouse format.
In this phase, set of rules are applied to convert data into meaningful and structured data which will be used to gain business processes.
It includes following steps:
- Cleaning: Mapping particular value by code (i.e. null value to 0, male to ‘m’, female to ‘f’).
- Deriving: Generate new values using available values (e.g. Total from price and quantity).
- Sorting: Sort data according to some specific value for fast retrieval.
- Joining: Join data of multiple sources and generate default values (lookup & merge).
- Generate surrogate key value for uniformity in semantics and performance.
- Validation: During refreshment of DW, some values may be changed which would cause inconsistency in DW. By applying validation rules, we can preserve old value in other field and replace original value with new value to handle “Slowly Change Dimension” problem. There also more solutions available for this problem now a days.
After completion of each step, data will be stored into staging area for further processing.
In case of any failure, transformation process will be resumed from staging.
Transformed (structured) data will be loaded to its appropriate table in Data Warehouse.
Data consistency must be maintained because records may be updated at loading time. Also referential integrity should be maintained to preserve data consistency.
Loading stage could also be used to load data to Data Marts (Subject Oriented Data).
Loading can be performed in two ways:
- Record by record
- Bulk Load
Bulk load is the most preferable strategy to load data in warehouse in order to improve performance since it provides less overhead than Record by record strategy.
To improve query performance data table should be partitioned while loading data. Table partitioning may depend on time interval strategy (interval like year, month, quarter etc.).
Available ETL tools in the market
Open Source Tools:
- Jaspersoft ETL
- Pentaho’s Data Integration (KETTLE)
- Talend – Talend Open Studio
- IKAN – ETL4ALL
- Informatica – Power Center
- IBM – Websphere DataStage(Formerly known as Ascential DataStage)
- SAP – BusinessObjects Data Integrator
- IBM – Cognos Data Manager (Formerly known as Cognos DecisionStream)
- Microsoft – SQL Server Integration Services
- Oracle – Data Integrator (Formerly known as Sunopsis Data Conductor)
- SAS – Data Integration Studio
- Oracle – Warehouse Builder
- AB Initio
- Information Builders – Data Migrator
- Embarcadero Technologies – DT/Studio
- IBM – DB2 Warehouse Edition
- Pervasive – Data Integrator
- ETL Solutions Ltd. – Transformation Manager
- Group 1 Software (Sagent) – DataFlow
- Sybase – Data Integrated Suite ETL
- Expressor Software – Expressor Semantic Data Integration System
- Elixir – Elixir Repertoire
- Business Objects Data Integrator