There's consensus among data quality experts that, generally speaking, data quality is pretty bad (here, here, and here). Data quality approaches generally focus on profiling, managing, and correcting data after it is already in the system. This makes sense in a data science or warehousing context, which is often where quality problems surface. To quote William McKnight in the first of those sources:
"Data quality is no longer the domain of just the data warehouse. It is accepted as an enterprise responsibility. If we have the tools, experiences, and best practices, why, then, do we continue to struggle with the problem of data quality?"
So if the data quality problem is Garbage In, Garbage Out (GIGO), then I would expect data quality guidelines for app dev to be easy to find, and to be lightweight and helpful to those projects. Based on my research, there are few if any such sources. So, all that said, here's my cut at app dev data quality guidelines by project activity:
Requirements
Whether the project is agile or waterfall, good requirements are the foundation of a good application, and this early stage is where teams can set data quality requirements that nip potential data quality problems in the bud. Here's how:
Find the data quality stakeholders and follow their lead
As the project starts up, a key task is to find those affected, either positively or negatively, and make sure the requirements include their needs. In this activity, don't forget the data quality stakeholders. For example:
- What organizations and systems will use data entered and processed through the application? What data will they need and what are their interface requirements?
- Are there regulations requiring audit info or reporting from the application?
- Is there a data governance organization that sets standards applicable to the project?
Name and define business objects and their attributes consistently throughout your app
Every part of the application should identify and define business objects and events the same way. For example, if one app component hires an employee or contractor, and another administers employee benefits that don't apply to contractors, then the requirements should differentiate employees from contractors and be specific about which is applicable in each case.
One way to promote consistent object names and definitions is to make conceptual data modeling a part of the requirements process, as described here. Conceptual data modeling defines business objects and events and the relationships among them, providing a solid base not only for database design but also for functional requirements that treat those objects, events, and relationships consistently.
Understand the quality of incoming interfaced data
If the application draws data from other systems, then requirements analysts should profile the content of the data, not just review the layout. Among other things, profiling reveals the range of values that actually occur in each column. The requirements should specify how to handle unusual or out-of-bounds values within the application and, depending on reporting and outbound interfacing requirements, how to avoid passing a data quality problem on to the next outbound interface.
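As a rough illustration of what profiling an inbound feed can look like, here is a minimal sketch. The feed, the column name `worker_class`, and the allowed codes `EMP` and `CON` are all hypothetical; a real project would more likely use a profiling tool against the actual source.

```python
from collections import Counter

def profile_column(rows, column):
    """Summarize one column: row count, missing count, distinct values, min and max."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": Counter(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

def out_of_bounds(rows, column, allowed):
    """Return the rows whose value for `column` is not in the allowed set."""
    return [row for row in rows if row.get(column) not in allowed]

# Hypothetical extract from an upstream HR feed.
incoming = [
    {"worker_id": "1001", "worker_class": "EMP"},
    {"worker_id": "1002", "worker_class": "CON"},
    {"worker_id": "1003", "worker_class": "emp"},
    {"worker_id": "1004", "worker_class": ""},
]

profile = profile_column(incoming, "worker_class")
rejects = out_of_bounds(incoming, "worker_class", allowed={"EMP", "CON"})

print(profile["missing"], sorted(profile["distinct"]))
print([row["worker_id"] for row in rejects])
```

Here the profile surfaces a lowercase `emp` and a blank value that a layout review alone would not reveal. Whether to reject, correct, or default such rows is exactly the decision the requirements should record.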
Find and reuse reference and master data
Just as software reuse reduces application cost, reusing existing code values saves your app dev project the work of naming and defining codes, not to mention the data integration cost and headache for outbound interface destinations. If there's no data governance or data integration team, try to work with nearby teams to define things like employee class, product type, etc., the same way.
Be even more aggressive about reusing rather than inventing new layouts and definitions for customers, products, suppliers, and other business-critical master data. If at all possible, find a source you can load master data from. I know of one now-defunct Fortune 500 company that had hundreds of different records for each of their largest customers. So every month Bank of America literally received hundreds of invoices from this supplier. Don't be a part of that customer service nightmare!
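To show how duplicate master records like these can be spotted, here is a minimal sketch that groups customer records by a crude matching key. The customer names, IDs, and suffix list are made up for illustration, and real master data matching usually needs much more than name normalization.

```python
import re
from collections import defaultdict

# Assumed list of legal suffixes to ignore when matching names.
SUFFIXES = {"inc", "corp", "corporation", "co", "llc"}

def match_key(name):
    """Build a crude matching key: lowercase, drop punctuation and legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in SUFFIXES)

def find_duplicates(records):
    """Group customer IDs by matching key and keep keys with more than one ID."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record["name"])].append(record["customer_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

# Hypothetical customer master records.
customers = [
    {"customer_id": "C1", "name": "Acme Supply Co."},
    {"customer_id": "C2", "name": "ACME SUPPLY, INC."},
    {"customer_id": "C3", "name": "Globex Corporation"},
    {"customer_id": "C4", "name": "Acme Supply"},
]

duplicates = find_duplicates(customers)
print(duplicates)
```

Three of the four records collapse to the same key, which is the kind of finding that argues for loading customers from a shared master source instead of creating them locally.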
Design and Development
In any app dev project, the right place to think about data quality is in the requirements phase, but there are important follow-through steps in design and development:
Apply Variable Naming Best Practice
Every established programming language has generally accepted best practices for variable naming and use. Here's an example. Those guidelines always include using meaningful names, following installation-standard abbreviations, and not using the same name for different data elements. In addition to making code more readable, maintainable, and reusable, good naming practices mean better outbound interfaces and more understandable data for users of interfaced data.
Design for Data Integrity
Most efficient application databases are denormalized in some way, but careless denormalization can degrade data quality. For example, if we decide to keep Employee and Contractor data in separate tables, then we've degraded data quality if the two tables define their address columns differently.
The rule of thumb for denormalization is Codd's Rule of Reconstruction. Paraphrasing Richard Root's interpretation of the rule, a SQL command should be able to convert the normalized table to the denormalized table, and vice versa. A conceptual data model is a nice stand-in for a normalized database design, so database designers should keep the database design consistent with, but not identical to, the conceptual data model.
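One way to read that paraphrase is as a round-trip test, sketched here with Python's built-in sqlite3 module. The tables `department`, `worker`, and `worker_flat` and their contents are hypothetical, and this is only my illustration of the idea, not a formal statement of the rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
    CREATE TABLE worker (
        worker_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (10, 'Finance'), (20, 'Engineering');
    INSERT INTO worker VALUES (1, 'Ana', 10), (2, 'Ben', 20), (3, 'Cy', 20);

    -- Normalized to denormalized: one SQL statement builds the flat table.
    CREATE TABLE worker_flat AS
        SELECT w.worker_id, w.name, w.dept_id, d.dept_name
        FROM worker w JOIN department d ON d.dept_id = w.dept_id;
""")

def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Denormalized to normalized: each original table is recoverable by a query.
rebuilt_departments = rows(
    "SELECT DISTINCT dept_id, dept_name FROM worker_flat ORDER BY dept_id")
rebuilt_workers = rows(
    "SELECT worker_id, name, dept_id FROM worker_flat ORDER BY worker_id")

original_departments = rows(
    "SELECT dept_id, dept_name FROM department ORDER BY dept_id")
original_workers = rows(
    "SELECT worker_id, name, dept_id FROM worker ORDER BY worker_id")

print(rebuilt_departments == original_departments)
print(rebuilt_workers == original_workers)
```

The round trip works here only because every department has at least one worker. A department with no workers would drop out of the inner join, which is the sort of information loss the rule is meant to catch before the design ships.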
For structured data, this guideline still applies to Big Data and NoSQL. I recently attended Cassandra data modeling training, and the first third of the class focused on conceptual data modeling.
Test, Maintenance, and Operations
App dev teams should apply the same rigor in defect corrections and enhancements that they did in initial development, or they'll risk degrading data quality over time. They should also be responsive to enhancement requests, because in many cases data quality problems emerge when business needs for enhancement aren't met. For example, if the business needs to distinguish between temp workers and contractors, and the maintenance team can't prioritize the enhancement, then business users might decide to add a ".T" at the end of each temp worker's name -- a manual stopgap that in practice would be almost impossible to apply consistently.
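For contrast, here is a minimal sketch of what the enhancement itself might look like: an explicit worker type attribute, plus a one-time cleanup that moves the ".T" marker out of the name. The `WorkerType`, `Worker`, and `from_legacy` names and the sample records are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class WorkerType(Enum):
    EMPLOYEE = "employee"
    CONTRACTOR = "contractor"
    TEMP = "temp"

@dataclass(frozen=True)
class Worker:
    name: str
    worker_type: WorkerType

def from_legacy(name, legacy_type):
    """Convert a legacy record, moving a trailing '.T' marker into worker_type."""
    cleaned = name.strip()
    if cleaned.upper().endswith(".T"):
        # Drop the two-character marker and any space left in front of it.
        return Worker(cleaned[:-2].rstrip(), WorkerType.TEMP)
    return Worker(cleaned, WorkerType(legacy_type))

# Hypothetical legacy rows: (name as keyed by users, type stored by the app).
legacy = [
    ("Dana Lee.T", "contractor"),
    ("Sam Ortiz .t", "contractor"),
    ("Kim Roe", "contractor"),
]

workers = [from_legacy(name, kind) for name, kind in legacy]
for worker in workers:
    print(worker.name, worker.worker_type.value)
```

Even this small cleanup has to guess at variants such as a lowercase marker or a stray space, and it would misread a real name that happens to end in ".T". That fragility is the argument for making the distinction a proper attribute sooner rather than later.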
Hopefully this list fills a gap by helping app dev teams address the data quality problem at the source and reduce the Garbage Out of downstream data glitches and costly data quality remediation in data warehouses, marts, and lakes.