ETL systems are used by businesses to integrate data from multiple sources. These software systems are key components in ensuring that your company is processing its data efficiently, allowing your business to run smooth and without interruption. ETL systems are part of a company’s ‘data warehouse’ and to understand ETL you’ll need to understand the warehouse. ETL stands for ‘extraction, transforming, and loading.’ This is the process by which data is taken from various sources, such as database servers, and transformed to fit with your business’s operational needs. This step involves assessing the quality levels of the data. Once that information has been assessed, it is then loaded into your company’s database, referred to as a warehouse. The process of ETL allows a business to collect important data, change it to fit their goals and models, and then store it for use. In a world of digital enterprise, it is a critical part of running an effective and efficient business.
data is so important to a successful business, poor performance or inaccurate procedure can cost time and money. Therefore, ETL testing is designed to ensure that the person in charge of the ETL process has the best understanding of it possible. By asking questions of new applicants or current employees ETL testing will ensure the most efficient and up to date abilities of the person and the process.
ETL testing will also help to ensure the best work possible is being done and will confirm the correctness of data being transformed. This is measured against the rules and requirements of the company itself. It also verifies that the transformed data can loaded without any loss of information, is used to validate the accuracy of reports, ensures that the process meets performance requirements, and finally, is used to evaluate the entire reporting structure.
Depending on the type of business, the nature of the project have applied or been selected to work on, or any other requirements, the interviewer might dive very deep into your ETL knowledge in order to assess your skills. However, there are some basic testing concepts are mostly standardized across the board and that you should have some level of mastery over before attempting to interview for such a position.
What is a three-tier data warehouse?
Most data warehouses are considered to be a three-tier system. This is essential to their structure. The first layer is where the data lands. This is the collection point where data from outside sources is compiled. The second layer is known as the ‘integration layer.’ This is where the data that has been stored is transformed to meet company needs. The third layer is called the ‘dimension layer,’ and is where the transformed information is stored for internal use.
What is the difference between data mining and data warehousing?
Data warehousing comes before the mining process. This is the act of gathering data from various exterior sources and organizing it into one specific location: the warehouse. Data mining is when that data is analyzed and used as information for making decisions.
What is partitioning and what are some types of partitioning?
Partitioning is when an area of data storage is sub-divided to improve performance. Think of it as an organizational tool. If all your collected data is in one large space without organization the digital tools used for analyzing it will have a more difficult time finding the information in order to analyze it. Partitioning your warehouse will create an organizational structure that will make locating and analyzing easier and faster.
Two types of partitioning are round-robin partitioning and Hash Partitioning. Round-robin partitioning is when the data is evenly distributed among all partitions. This means that the number of rows in each partition is relatively the same. Hash partitioning is when the server applies a hash function in order to create partition keys to group data.
What are active and passive transformations?
In an active transformation, the number of rows that is created as output can be changed once a transformation has occurred. This does not happen during a passive transformation; the information passes through the same number given to it as input.
Explain the following: mapping, session, worklet, workflow, and mapplet.
- Mapping is the movement of data from the source to the destination.
- Session is the parameters set to instruct the data on during the above movement.
- Worklet represents a specific set of tasks given.
- A workflow is a set of instructions that tell the server how to execute tasks.
- A mapplet creates or arranges sets of transformation.
Why is it important to clean data?
The data in the warehouse is used for important analysis and reporting. Cleaning data means finding similarities in the information so that it can be categorized correctly. This will increase the efficiency of the analysis. For example, data that contains the word ‘store’ and ‘stores’ might fall into separate groupings because they are not exactly the same. Data cleansing is a way to ensure that categorically similar information is grouped together. This will make the information easier to find and read. This is an important step before loading. Cleaning the data after the fact is very difficult and time consuming.
What are ETL tester responsibilities?
This question is specific to those looking to take on the role of ETL tester. It will require the same knowledge as those who run the ETL system, but may also require knowledge about double-checking the system and how to go about that process.
An ETL tester is responsible for writing SQL queries for various scenarios. They run a number of tests including primary key, duplicate, default, and attribute tests of the process. In addition, they are in charge of running record count checks as well as reconciling records with source data. They also confirm the quality of the data and the loading process overall.
Some other general questions that interviewees may be asked include:
- What is data mart?
- Who are the participants of data warehouse testing?
- What are OLTP and OLAP?
- What is look up table?
- What is Master data management?
You also might be asked to give examples of real time data warehousing in order for the interviewer to better understand what kind of experience you’ve had and what your knowledge base is with ETL.
Be prepared to answer advanced questions as well. These will be more specific to the project or to the overall needs of the company, and will likely require extensive research on your part in order to answer appropriately. Here are two advanced questions as examples of what you might need to know.
If there are ten thousand records in a source system, how do we ensure that all ten thousand are loaded to the target without any dysfunctional values?
The issue here is that with such a large amount of transformed and loaded records, you cannot go through their entirety one by one to confirm their validity. You start by checking the number of records on the source and the target. Create a data type for each. Select the sums and compare the information. This is referred to as a ‘check sum’ method.
I have ten records in my source system but I need to load only 2 records to the target for each run. How do I do this?
Creating a mapping variable and a filtered transformation is one possible way to solve this issue. First, start by creating a mapping variable. This is usually assigned a default variable of zero. You might need to generate a sequence in order to have the specifically sorted record you require. From there you structure your expression and turn on your filters.