The Dirtiest of all ‘Dirty Jobs’: Data Warehousing Front Lines
Ok, so you probably already know that Mike Rowe never shot a ‘Dirty Jobs’ episode on data warehousing, but if he had there is no doubt that it would have been the episode that turned stomachs like no other. Mothers would have turned the channel to protect their sweet babies from being exposed to the disgusting source system data quality issues that many of us have hidden from civilized life for so long.
Editing the mountain of content down to 45 minutes would have been the biggest challenge that this episode would have encountered. Because I know that many of you reading this are currently knee deep in slop, I’m going to limit this post to three 15 minutes segments focused on some of the most egregious source system data violations out there. Feel free to pile on at your leisure.
Segment 1: Oops… Did I forget to tell you that record was modified?
If you have not watched the ‘Dirty Jobs’ episode titled Pig Slop Processor, then stop what you are doing and go watch this YouTube video right now! This image of food waist being dumped into a trough pretty much sums up how I feel about source systems that don’t have the common decency to include a modified date on all records. There may be some excuse for this when working with very old legacy systems, but any application built in the modern era which modifies records in place without marking them as such has some serious explaining to do. Sure, we have ways of dealing with this type of shenanigan, but the cost of doing so is often much higher than marking changes in the first place.
I realize that there is a very low chance of application developers actually reading this post, but in the off chance that one of you out there is building an application right now, save the future generation of data warehousing elves & add a modified date to all records. Oh, and if you are really thoughtful, it would be awesome if you could inactivate records instead of deleting them. Thank you so very much!
Segment 2: Spin the wheel of ambiguity
One of the slimiest tricks that source systems seem to get joy out playing is obviously parroted after the very popular Wheel of Fortune playbook. The trick is very simple. Create an object of your choice; employee, company, product, whatever. Next, add a few attributes; name, city, size, and so on. Now here comes the punch line. Find a way to store multiple versions of attribute values that are effective at the same point in time, but do not match. This is where the Wheel of Ambiguity comes in. We on the data integration front lines are left to spin the wheel of ambiguity and blindly pick an attribute value.
This trick reminds me of the episode titled The Island Snake Lady. Just when you think you have found all of the data quality issues, yet another snake turns its ugly head and bites you. Go ahead; go watch this piece of art while reminiscing about your own source system nightmares.
Segment 3: Oh, come on; that’s not important
In this final segment let’s shift our attention to the data that is long gone & never to be recreated. This is also an oldie but a goodie. Simply don’t store historical values for key data elements. For some reason source systems overlook the importance of knowing capacity amounts over time. When did the store open/close? When did the hospital wing open/close? Generally, demand is tracked fairly well, but capacity isn’t. So, yet again we are left using empirical information to try to piece together the bits of data that would have cost little to nothing to store in the first place.
That’s a wrap
There you have it; the episode that would have saved ‘Dirty Jobs’ from the chopping block. Here’s a virtual glass raised to all of you dirty data warehousing pros that (as Mike Rowe would say) "make civilized life possible for the rest of us."
Oh… I feel so much better now. Don’t get me wrong. I love turning a source system’s trash can of a database into a gold mine of information, but it would be nice to not need to do quite so much cleanup.
Go ahead, vent a little. There is nothing like knowing that I’m not in the septic tank alone.
If you are tired of reworking your SSIS packages each time you find a new data quality issue, then take a look at LeapFrogBI – agile data mart development platform. LFBI makes change management a cinch.