When I design a database, and local standards permit, I include both a surrogate primary key and audit columns on every table.
A surrogate primary key is a system-generated integer that increments by one with each new row inserted. Most DBMSs make it easy to add surrogate keys. Oracle uses a construct called a Sequence, SQL Server calls its variant an Identity.
The audit columns I like to include are these, shown with their SQL Server datatypes: CreationDatetime (datetime), UpdateDatetime (datetime), CreationUserId (varchar (30)), and UpdateUserid (varchar (30)). Most DBMS's offer the ability to update columns like these with either default values or triggers.
One often-heard objection to surrogate keys is that they are not needed for a normalized model – the natural data in the table includes unique business data that identifies the row. This is true, but sometimes natural keys that include dates or varchar columns, for example, aren't efficient from the DBMS perspective. A single integer is an excellent choice for a practical key format, and the natural key can still be represented as a unique index.
Audit columns make it possible to audit who added or changed a given row and when they did it. This has obvious benefit in operational systems where fraud, security breaches, or compliance to procedure is a concern. Managers can use these columns to see who may have inappropriately or incorrectly added or changed data.
System administrators can use these columns to better understand usage patterns by time period or user group in order to best optimize performance.
For data warehousing and data mart applications, CreationDatetime and UpdateDatetime make it easy for extract processes to identify added or changed rows. Otherwise, extract processes rely on complex Change Data Capture mechanisms or resource-intensive full table data compares to identify new or changed rows for extract purposes.
One difficulty with these columns shows up in very large tables. If there are hundreds of millions of rows, the four audit columns show up once for each of those hundreds of millions of rows, significantly adding to the size of the already huge table. The solution in such cases is to pull the audit columns into their own table and include a foreign key to the relevant audit row from the data table.
Another potential difficulty lies in managing tables with audit columns and surrogate keys. Administrators must take care not to change the audit data or key values with administrative actions like copying the table to a different database. In the case of surrogate keys, there's a risk of losing connections to child tables that have foreign keys that reference the table in question.
I've often worked on projects building data warehouses or operational data stores, and in reviewing the source systems found tables with no way to identify changed rows. In some of those cases it was hard to find business people who could accurately interpret data in the table so that we could design an efficient extract strategy, and we defaulted to the "brute force" method of comparing the entire table with the most recent extract to see if anything had changed – a time consuming, resource-intensive process. Surrogate keys and audit columns are a simple way to improve database efficiency, enable more accurate and efficient data integration, as well as improve ability to audit and correct your data.
This is one of a series of articles in which CapTech Data Management and Business Intelligence (DMBI) consultants explain the tips and tricks that make them successful on their projects.