Extending the Data Profiling Task
SQL Server 2008 RTM has been out a while now and yet, I am just now making my way to using the Data Profiling Task, new to SSIS in this release. Time and client demands are always a factor governing when and what I get to look at, however I had read up on the features and was hoping that what I had read was not all the new task could do.
The new task is a good start, but unfortunately, it is a little more limited than I would have hoped. Fortunately though, with the SSIS APIs open to us, we can use the class object as we please.
So first, what did I find limiting about the Data Profiling Task?
It appears that its intended design was to serve as a checkpoint during loads/transforms and that conditional logic can be applied based upon results; this is incredibly useful. My major concern is with activities that happen before the design phase – and quite directly, before any development begins at all.
Reference this great little tip from Kimball here on data profiling and its necessity according to his framework – profiling is tantamount to understanding the system that you are getting ready to work with in the course of a project.
To get to the heart of what I wanted to see the Data Profiling task do: I want to use it to generate an aggregate view of a given system so that I can view the results and learn more about the data within the database. Many of us have been given a database, custom built or extended far from the original software vendor’s spec, with little documentation.
Even for those who are not preparing to move this data for warehousing or reporting, anything that provides a reference to what the data is, how it is stored, how unique, etc. is a boon for developers.
The Data Profiling task can deliver this data — but only one table at a time?
There are no expression settings or variables that you can use to set which table and where the output goes. For two of the profiling tasks, I can see this as a bit of a problem – both the Candidate Key and Value Inclusion profile do not lend themselves to being easily generated dynamically via iterating over all the tables and views in a database.
But all the others are useful in this aggregated manner, so why, if I am looking at a client database that has 96 tables, must I build 96 tasks via Visual Studio and still not end up with an aggregate report (just 96 xml files)? I would like to be able to run this on the entire database, and start looking at the hotspots or data profile results that are the fat tails in the distribution chart.
My solution is to build the SSIS package programmatically, run it and then read and aggregate the results. I still need to clean up the solution a bit and finish some unit tests, so I will post it soon for everyone to download and enjoy.
My envisioned endgame for this project is for the output to be configurable for user desired thresholds and for that output to render in HTML or Word – still a little ways away from that, but I have the bulk of the SSIS work done. If anyone is interested in contributing, let me know and I will see about getting this on CodePlex.
Notes for anyone who wants to work with this project or attempt this on their own:
- This effort makes use of the Microsoft.DataDebugger.DataProfiling class, which is not necessarily a supported API. An example of using this class is on the SSIS Team Blog here.
- DLL references necessary for this project come from multiple places. For example, the Microsoft.SqlServer.DataProfilingTask is in ~\Program Files\Microsoft SQL Server\100\DTS\Tasks and Microsoft.SqlServer.DataProfiler is in ~\Program Files\Microsoft SQL Server\100\DTS\Binn
- It is a good idea to read up on building a package programmatically on MSDN as there are class objects, such as TaskHost, which are not immediately intuitive as they are not seen when using the designer