SQL Server 2008 has a great feature called change data capture, or CDC. CDC is a built-in functionality that tracks all changes on a table or a certain set of columns in a table and allows the user to query these changes. It is easy to see how this can be very useful for auditing, warehousing, reporting, even data recovery. Here is a quick guide to getting started and working with CDC.
First, the database needs to be enabled for CDC. This is as easy as issuing a simple system procedure call.
Second, you will need to do some analysis to determine which tables and which columns should be enabled for CDC. It is not a recommended approach to immediately begin tracking all tables and all columns with CDC unless the database is relatively small and/or experiences low usage. This could introduce a very large amount of overhead into the system and it is likely that not all tables will need change tracking. After analysis, tables can be enable for CDC as follows:
EXEC sys.sp_cdc_enable_table @schema, @TABLE, @capture_instance, @net_changes, @role_name, @index_name, @column_list, @filegroup, @allow_partition_switch
The default works for all parameters other than schema and table. These fields must be filled in. Here is a quick guide to some of the core parameters for enabling a table...
Capture instance is needed if the user desired to track different columns in the same table separately.
Net changes, which is on by default, allows the user to query for the end result of all changes in a given time period. When net changes is disabled, the user can only query for all changes to the table or a row within a time period. This makes sub-queries necessary in order to find only the most recent changes
Column list is a comma separated list of column names that will be tracked in this instance. Say, for example, that a client is only interested in tracking the changes to employees' salaries. The enable statement would look something like this.
EXEC sys.sp_cdc_enable_table @source_schema = dbo, @source_name = employee, @captured_column_list = 'salary'
At this point, everything is set up and ready to go.
Now that CDC has been enabled for the database and all necessary tables, the retention period should be defined. The retention period tells CDC how far back to keep records in the capture tables. SQL Server defaults this value to 3 days, but business requirements may mandate longer time frames and disc space limitations might require shorter ones. To set the retention rate, run the system procedure sys.sp_cdc_change_job as below with x being the number of minutes to keep records in the table. *Note: When defining the retention period, the job_type parameter must be 'cleanup'.
EXEC sys.sp_cdc_change_job @job_type = 'cleanup', @retention = x
WORKING WITH CDC
For a list of all built-in tables for CDC with links to all cdc system functions and stored procedures, go here... http://msdn.microsoft.com/en-us/library/bb500353.aspx.
Now, you have a basic start to begin working with CDC. The big question is, what do you do with it? Well, there are several great ways to get some real value out of CDC. One particular utilization that I have employed in the past is to use CDC when loading new data from one environment into another. In my case the destination was a data warehouse. Using CDC, I created procedures that could isolate only the changes within a source table that needed to be propagated to the warehouse. This eliminated all "full pulls" from the ETL process and greatly improved the overall efficiency of the process. Imagine a source system with an employee table in which there are thousands of records. If this table in the source system does not have auditing columns (created date, modified date, etc...), or worse, it allows hard deletes, then you are reliant on full table comparisons in the ETL process. The process must verify that each record is an insert an update or a delete. Using CDC, within either SSIS packages or stored procedures, it is fairly simple to write queries that extract only the records of interest and to identify the modification type.
Another valuable use of CDC is in reporting. For example, the business may want to run reports based on invoice totals being changed for a completed invoice. CDC can be configured to record changes to the invoice amount field alone thereby ignoring all other changes to the invoice. Then, since CDC logs updates in pairs, the report can easily show an old vs new values comparison after the change. Keep in mind that CDC allows only two capture instances per table. This means that recording only the invoice amount column uses up one of those capture instances. The alternative set up would be to use an existing capture instance and test for the invoice amount being changed using the CDC function sys.fn_cdc_is_bit_set and pass in the update mask for the change record and the ordinal position of the column of interest. In the illustration below, notice the __$update_type field. The old record has a value of three while a value of 4 denotes the record after the change. Values of 1 and 2 indicate deletes and inserts respectively.
One final example of the value of CDC is in data recovery and auditing. If a delete statement is errantly issued against a set of data in a table that is tracked by cdc, a user can run queries using CDC to find the data. Then it is simply a matter of deciding what to do with the data. This option is only available for as long as the retention period is configured for and should not replace other backup and recovery plans. It just makes for a convenient method of performing an "undo".
CDC is a great feature and can be used in several ways to help you manage your database. Perhaps you have other practical uses for this feature?