When you are involved in an SSIS project, you have to collect data from various data sources like from databases, flat files, excel files etc. Generally when you find that the data is not well formed, it’s of very poor quality for analyzing, lots of data are missing etc. etc. So before starting the project and designing a database structure if we can analyze the data properly it will be helpful for us to design our data ware house properly.
SSIS provides a wonderful control ‘Data Profiling Task’ to accomplish this task that helps us to provide the Meta information of how the data is organized.
Suppose you have received data in excel format as follows:
Now we need to analyze this data to get the insight and Meta information to design our data warehouse.
Firstly we need to import this excel data into Sql Server. The reason for this is currently ‘Data Profiling Task’ control only supports SQL Server ADO.NET connection type to import data. Hopefully in future Microsoft will give more option to connect excel file, text file etc. for profiling directly.
It’s basically a three steps process:
Step 1: Drag the control ‘Data Profiling Task’ in ‘Control Flow’ tab.
Step 2: Select the path where the profiling information should be stored.
E.g. ‘ProfileInfo’ is the file name in this case.
Step 3: Select the Server and Database name on which the profiling will be performed.
3.1 Click ‘Quick Profile’ in ‘General’ tab.
3.2 Select Server and Database
3.3 Select the table name and check all the ‘Compute’ options.
3.4 Run the Package.
Now the profile data has been stored in the file ‘ProfileInfo’ in this case, which is basically a XML file and looks like as follows if you open in web explorer:
But, the above file is not in a readable format. To read this file Microsoft has provide us a DataProfile Viewer that is stored in the path ‘C:Program Files (x86)Microsoft SQL Server110DTSBinn’
Open the DataProfile Viewer and open the file ‘ProfileInfo’(in this case) in it that looks like as follows:
In the above snap if you unfold the ‘Tables’ and then [dbo].[Sheet 1$], you will find seven profile that I will discuss one by one in coming Blogs elaborately.
- Candidate Key Profiles : Suggest which column has the unique values.
- Column Length Distribution Profiles : Suggest the minimum and maximum length of column.
- Column Null Ratio Profiles : Suggest the null percentage of each column value.
- Column Pattern Profiles : Suggest the pattern of column value.
- Column Statistics Profiles : Statistic for numeric field such as Minimum value, Maximum Value, Mean, Standard Deviation.
- Column Value Distribution Profiles : Give a repetition percentage of a column value.
- Functional Dependency Profiles : Imply if any column value is dependent on another column value.
In this blog I basically tried to explain how you can use the SSIS Data Profiling Task Control to get the insight of the received data. I hope this is helpful.