Great introductory session on Data Quality Service by Elad Ziklik in Tech-ed’11. I have highlighted the key learning from this presentation.
Data Quality represents the degree to which the data is suitable for business usage. Data Quality is built through People, Technology and Processes.
With Data Quality Services, both IT and Biz (Data Stewards) work together to create a knowledge driven solution. Data Quality Services relies on this knowledge to clean the data.
Knowledge Driven – It is based on Data Quality Knowledge Base (DQKB)
Semantics – Data Domains captures the Semantics of your Data
Knowledge Discovery – Acquires additional knowledge the more you use it
Open and Extendible – Supports use of user-generated knowledge and IP by 3rd party reference data providers
Easy to use – Designed for ease of use and increased productivity
SQL Server Data Quality Services UI:
Domain Management: Create domains for your data and assign data quality rules.
Account ID length is between 3 and 9.
Account ID cannot contain characters.
Domain Values: What are the different values that are expected for this attribute? The DQS tool lets you Map, Discover and Manage the Domain values.
Map Domain values:
Discover Domain Values:
Manage Domain Values: Note DQS identified that the domain “Revenue Type” contains a potential issue: Food & Beverage, Food & Beverages.
Save the acquired knowledge:
After the knowledge is acquired, it is time to clean the data. Build a new Data Quality Project and use the Knowledge base we just created.
Map and Cleanse the data:
See the corrected records inline and approve or reject the corrections manually. Nice!
DQS can suggest corrections:
Once the data is corrected, you can export the data to a database or a csv file:
Compare the reports pre & post running Data Quality Services:
Post-DQS: clean data
DQ Server is now part of the SQL Server Denali.You can use the DQ client to cleanse your data or you the new SSIS DQ component to clean your data at the time of ETL. In future we should see Excel and SharePoint support Data Cleansing capabilities added to it.
Use 3rd Party Data using a Reference Data Services to build the Knowledge needed to cleanse the data.
Map 3rd party schema with our schema:
Create a new DQS Project and create the mappings:
this time the data is sent to the Azure Data Market:
DQS comes back with suggested values:
SSIS in SQL Server Denali now comes with a DQS transformation which enables data corrections to be run in a batch mode:
Create Matching Policies to identify duplicates and create consolidated views:
Match the Artists based on the song. If the songs are similar, the artists are probably the same.
See the results:
See the details on why a Match was suggested:
DQS shows you the stats:
DQS Creates a Knowledge Base when you create the Matching Policy and run the project once which is saved and used whenever new data comes in.
Note the DQS now knows that B. Spears and Britney Spears is one and the same artist.