In my last post, I covered how to use an unpivot transformation in SSIS. There are, however, at least two other ways the same output can be achieved. In addition to the built in transformation, a multicast into a union all or a script task can be used. While all three methods produce the same dataset, the latter two are slower and one is much less efficient than the native transformation. If you are interested in each of the three methods, continue reading; otherwise, you can jump to the results.
To run the tests, I created a very large sample data set by concatenating the test data (see attached file) to itself until I had a file that was slightly larger than one gigabyte (around 11.5 million rows). As in the previous post, on the Error Output tab of the Source Editor I've set Truncation to Ignore failure.
Also, in each example below I am not writing out the resulting data set to avoid ending up with multiple gigabytes of useless data lying around. What follows is a discussion of each method.
As mentioned earlier, the built in Unpivot is the first option for moving multiple sets of columns into rows.
The second option is a Multicast into a Union All.
In order for this method to work, we need to create five additional columns containing a number to identify each set of data columns. Do this in a Derived Column transformation.
Drag five data flows from the Multicast to a Union All. Each set is then mapped to a different input of the Union All. Note that we only need to line up the SEQ number with the group, not the automatically assigned input number (the column headers).
Finally map the output of the Union All to the desired output. In the example, it's simply another Derived Column transformation.
A third option is to use a Script Component transformation.
After opening the Script Component, select every column as an input.
On the Inputs and Outputs pane add a new output containing each of the fields that will be in your unpivoted dataset, making sure that the data types match for each field.
Move on to the script pane and edit the script by clicking on Design Script. The full code of the component is attached. Below is a descriptive snippet from the code.
UnpivotBuffer.AddRow() UnpivotBuffer.SEQ = 1 UnpivotBuffer.Salesperson = Row.Salesperson UnpivotBuffer.Sales = Row.Sales UnpivotBuffer.Units = Row.Units UnpivotBuffer.Type = Row.Type
For each set of data columns to be moved to its own row Outputname
Buffer.AddRow() is called and each field in the output is assigned a value. Note that in this method the SEQ field is added in the script and does not need its own column before the script.
Running Tests and Monitoring
After creating a Data Flow task for each of methods 1-3, I ran each Data Flow task from Visual Studio. While this impacts performance, it should affect all three methods the same way.
While running each, I kept track of runtimes as well as a graph in Perfmon. Perfmon is the server performance monitor and can be opened from the Windows Run… dialog box.
I wanted to monitor % Processor Time and Avg. Disk Queue Length which were already present in the counters. Pages/sec isn't necessary so I removed it.
Right click in the counter pane to add one more relevant counter.
Select SQLServer:SSIS Pipeline from the Performance object dropdown, select Buffer Memory in the counter list, add it, and close the window.
The unpivot method was the fastest of the three at about 2.5 minutes. The processor time and buffer memory were both around 90% for the duration of the run. The spike you see in disk queue length was when a coworker helpfully sent me a message through communicator.
The multicast method came in second at 50 seconds slower. That's about 33% longer for a slight reduction in processor time.
Finally, the script method comes in dead last at nearly six times longer than the unpivot method. Each method used approximately the same SSIS Pipeline buffer memory but here we see a significant reduction in processor time. Unfortunately it comes with a fairly severe time cost.
There are two lessons to take away from these tests.
- First, there may be multiple ways to accomplish the same task and they are not always equally efficient. It can be advantageous to spend some time determining whether there are multiple methods, and if so, which is optimal for your setup.
- Second, the 'best' way may change depending on your needs. If your system is extremely processor bound but your processing window is wide open, the script method may be a better choice (though this seems farfetched).