Data Transformation in Data Warehouse

What is data transformation?

When you extract the data from the source systems into the data warehouse, there is a need to perform various data transformations according to the standards. Data transformation means that you perform manipulations on the data to change the extracted data and convert them into usable information. Therefore, different types of issues need to be considered, and sufficient time and effort need to be allocated to design the transformations.

Data Transformation in Data Warehouse

Basic tasks involved in data transformation

  • In the selection, ? You can extract whole records or small parts from the whole records present in the source system.
  • Splitting/joining? You can split the records further, which you have selected from the source systems. Joining involves joining the parts which you have selected from many source systems from the data warehouse environment.
  • Summarization ? It is not feasible to place the data in the lowest level of their details. Hence there is a need to summarize the data.
  • Enrichment? It rearranges and simplifies the individual fields to make them more meaningful for the data warehouse environment.

Primary functions in data transformation

The primary functions in data transformation are as follows:

  • Format revisions ? You modifications to the data types and lengths of individual fields are included in format revisions.
  • Decoding of fields ? when you extract the data from multiple source systems makes it evident that you will encounter the same data items. E.g., coding for gender, Source System "A" might use 1 & 2 for male and female respectively, and source system "B" might use M & F. Hence there is a need to decode all these cryptic codes and convert these cryptic codes into values which would be helpful to the users in terms of understanding.
  • Splitting of single fields ? The legacy systems were used to store names, address of customers and all employees in a single large text field. Similarly, state, city, zip code were also stored in a single large text field. But there is a need to store individual components of names and address in different/separate fields because of the following reasons:
  • It improves the performance of the operating systems by indexing individual components.
  • Users might need to perform analysis that involves using individual components such as city, state, and zip code.

            (5) Merging of information ? It does not involve merging many fields to create a single data field. For Example, information associated with a product will come from different data sources. The product code and its description might come from one data source. The cost of the product might come from the different data sources. in this scenario, merging of information indicates the combination of product code, description, cost together into a single entity.

(6) Character set conversion ? The character set conversion relates to convert the character sets into an agreed standard character set for textual data represented in the data warehouse. For Example, if mainframe legacy systems were used as a source system, source data would be present in the form of "EBCDIC" characters. However, if your architecture choice is PC-based for your data warehouse, then there is a need to convert the "EBCDIC" characters to "ASCII" format. And if your source data is present on other hardware and operating systems, you need to perform similar set character conversions.

(7) conversion of units of measurements ? if your company contains overseas operations, then there is a conversion of the metrics. All the numbers are in one standard unit of measurement.

(8) data/time conversion ? the date/time conversion refers to data representation and time in a standard format. For Example, American and British date formats can be standardized to an international date format. E.g., October 11, 2008, can be written as 10/11/2008 in US format and 11/10/2008 in the British format. Both dates can be standardized as 11 OCT 2008.

(9) De-duplication ? A customer files consist of several records for the same customer The duplicate records exists because additional records have been created by mistake. If you want to keep a single record for one customer in your data warehouse and link all the duplicates in the source record to a single record, that process is called de-duplication.

(10) Summarization ? It involves creating summaries to be inserted into the data warehouse instead of inserting the most granular level of data.

(11) Key restructuring ? when you extract data from input sources, you need to note the primary keys of the extracted records. Therefore, you need to come up with the keys for the fact and dimension-based tables on the keys of the extracted records.