Data Extraction and Data Transformation

What is data Extraction?

Data extraction can be defined as the process of collecting or retrieving different types of data from different sources, many of which are semi-structured or completely unstructured.

Types of data extraction techniques:

There are two types of data extraction techniques.

  • Immediate data extraction
Data extraction and Data transformation

In this type of data extraction technique, the data extraction is real-time. It involves the collection of data through various transaction logs or database triggers.

Option 1: Capture data through transaction logs

Data extraction and Data transformation

Fig 1.1 shows the data present in the weblog

From the above figure, we can observe that in the weblog, various types of data can be collected such as IP address, Time, URL Visited, and the status about what activities have been performed by the user on the website (We can understand this with the help of ajax codes).

Option 2: Capture data through database triggers

Capturing data through database triggers occurs at the source, and hence this method can be trusted. For example, capturing before and after images. The drawback of this method is that the construction and maintenance of trigger programs puts a burden on the development team. This method is applicable only for the source data which is present in the database.

Option 3: capture data through source applications

This technique is also called application-assisted data capture. There is a need to modify relevant data programs that write to the source files and database. In this method, there is a need to revise the programs to write all adds, updates, deletes to the source files and database tables.

  • Deferred data extraction
Data extraction and Data transformation

In this technique, extraction of data does not take place immediately, it takes place in later stages. In this technique extraction of data is done by comparing various files or timestamps and extracting some meaningful information out of it.

Option 1: capture based on date and time stamp

Whenever a source record is newly formed or updated, it can be highlighted with a stamp. It describes the date and the time when it was created/updated. The timestamp helps us to retrieve records for data extraction. In this method, the capturing of data occurs at a later stage and does not occur while each source record is being created or updated. This technique works best if the number of revised records is less in number.

Data extraction and Data transformation

Fig 2.1 Capturing data based on date and time stamp

The above diagram displays the timestamp of data that is recorded. To find any newly inserted or updated data, the same can be done by viewing the timestamp of the data.

Option 2: Capture data by comparing files

This technique is also known as the "snapshot differential technique" because it compares two snapshots. This technique is feasible when the timestamps of the source data are not recorded and transaction logs are not available. This method is less complex and to the point which deals with the comparison of full rows in a large file. For example, if you want to perform this technique to make a note of changes to your product data, the following steps need to be followed:

Step 1: When you perform data extraction on today's data to note any changes in your product data, there is a need for you to do a complete comparison of files between today's file and yesterday's file of the product data file.

Step 2: compare the record keys to find inserts and deletes

Step 3: Capture any changes between the 2 files copies

What is data transformation?

Before transmitting the data from source systems into a data warehouse, you need to perform various kinds of data transformations according to the standards of the data. According to the standards, data transformation is derived from a wide range of different source systems. The database administrator must ensure that once every data is placed together, the combined data does not violate any business rules.

There are various challenges associated with the transformation of data, which involves data being manipulated to convert all the extracted data into meaningful information. While performing data transformations on data, we need to consider all different issues that may arise, and we need to allocate specific time and effort to design the transformations.