Measuring Data Quality


In today’s information-driven world, an effective data quality management (DQM) strategy cannot be overlooked. DQM refers to a business discipline that combines the right people, processes, and technologies, all with the common goal of improving the measures of data quality.




The subject is the single most important concept in the modern data quality approach. The subject is the entity that will be the target of the data quality investigation at the most granular level. Before we begin any data quality initiative, we must discover what the subject of the study is. Like most concepts in this approach, the subject is reflected in the data but not attached to any technical object.

For example, Employee Status, Hours, and Earnings belong to the subject "Employee". If we implement a telecom data warehouse, the subject areas might be Subscriber, Finance, and Marketing. Once identified, the subject becomes more than a concept: it defines the granularity at which you will measure data quality.

“We identified data quality issues with 20 percent of the subscribers in our database” is a more useful statement than “38 percent of the rows in the SUBSCRIBER table have a field that fails one of our data criteria.”
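The difference between the two statements above can be made concrete. The sketch below, using illustrative subscriber records and a hypothetical "critical fields present" rule, computes the same failure rate at the row level and at the subject (subscriber) level:

```python
# Sketch: the same quality rule measured at row granularity vs. subject
# granularity. The records and the rule are illustrative assumptions.
rows = [
    {"subscriber_id": 1, "email": "a@x.com", "phone": None},
    {"subscriber_id": 1, "email": "a@x.com", "phone": "555-0101"},
    {"subscriber_id": 2, "email": None,      "phone": "555-0102"},
    {"subscriber_id": 3, "email": "c@x.com", "phone": "555-0103"},
]

def row_has_issue(row):
    """A row fails if any critical field is missing."""
    return row["email"] is None or row["phone"] is None

# Row-level metric: share of rows with at least one failing field.
row_failures = sum(row_has_issue(r) for r in rows)
row_pct = 100 * row_failures / len(rows)

# Subject-level metric: share of distinct subscribers with any failing row.
subjects = {r["subscriber_id"] for r in rows}
bad_subjects = {r["subscriber_id"] for r in rows if row_has_issue(r)}
subject_pct = 100 * len(bad_subjects) / len(subjects)

print(f"{row_pct:.0f}% of rows fail; {subject_pct:.0f}% of subscribers are affected")
```

The two numbers differ (here 50% of rows but 67% of subscribers), which is why fixing the subject first matters: the row-level figure can understate or overstate how many real-world entities are actually affected.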



It is not the right approach to define a business rule by programming first, writing SQL statements that grab bad data. Instead, create a business rule that can be expressed in a simple sentence, agree on it, and then program it. Programming is only one property of the business rule. The rule should be independent of ties to any database, table, or field; these associations come later. Each business rule must be designed so that it is understood by the entire team.

Strong business rules are most effectively built by subject matter experts (SMEs), because they are most familiar with the data and know its history, lineage, problems, and nature.
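The idea of keeping the plain-language sentence separate from its implementation can be sketched as follows. The rule sentences and sample record are illustrative assumptions, not rules from the article:

```python
# Sketch: each business rule is agreed as a plain sentence first; the
# predicate is just one property attached to it. Rules are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BusinessRule:
    sentence: str                  # the agreed plain-language statement
    check: Callable[[dict], bool]  # returns True when a record passes

rules = [
    BusinessRule(
        "Every employee must have a status of Active or Terminated",
        lambda rec: rec.get("status") in {"Active", "Terminated"},
    ),
    BusinessRule(
        "Weekly hours must be between 0 and 80",
        lambda rec: rec.get("hours") is not None and 0 <= rec["hours"] <= 80,
    ),
]

employee = {"status": "Retired", "hours": 40}
failures = [r.sentence for r in rules if not r.check(employee)]
print(failures)  # the status rule fails for this record
```

Because the sentence carries the meaning, a failure report can quote the rule the whole team agreed on rather than a SQL fragment, and the binding to a specific table or field stays an implementation detail that can change later.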
  


A Data Quality (DQ) dimension is a recognized term used by data management professionals to describe a feature of data that can be measured or assessed against defined standards in order to determine the quality of the data.

Completeness – the percentage of data that includes the required values. It is important that critical data (such as customer names, phone numbers, and email addresses) be complete and accurate.

Uniqueness – when measured against other data sets, there is only one entry of its kind.

Timeliness – How much of an impact does date and time have on the data? This could be previous sales, product launches or any information that is relied on over a period of time to be accurate.

Validity – Does the data conform to the respective standards set for it?

Accuracy – How well does the data reflect the real-world person or thing that is identified by it?

Consistency – How well does the data align with a preconceived pattern? Birth dates share a common consistency issue, since in the U.S., the standard is MM/DD/YYYY, whereas in Europe and other areas, the usage of DD/MM/YYYY is standard.
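Several of the dimensions above reduce to simple ratios over a field. A minimal sketch, using an illustrative email column and a deliberately simple validity pattern (a real email validator would be stricter):

```python
import re

# Sketch: completeness, uniqueness, and validity computed over one field.
# The data and the regex are illustrative assumptions.
emails = ["a@x.com", "b@x.com", "a@x.com", None, "not-an-email"]

non_null = [e for e in emails if e is not None]

# Completeness: share of entries that have a value at all.
completeness = len(non_null) / len(emails)

# Uniqueness: share of populated entries that are distinct.
uniqueness = len(set(non_null)) / len(non_null)

# Validity: share of populated entries matching the agreed format.
valid = [e for e in non_null if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", e)]
validity = len(valid) / len(non_null)

print(completeness, uniqueness, validity)  # 0.8 0.75 0.75
```

Accuracy and timeliness usually cannot be computed this mechanically: they require comparison against a trusted external source or a freshness requirement, which is why they are typically the hardest dimensions to assess.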

A typical Data Quality Measurement approach might be:
1. Identify which data items need to be assessed for data quality; typically these will be the subject areas critical to business operations and the associated management reporting.
2. Assess which data quality dimensions to use and their associated weighting.
3. For each data quality dimension, define values or ranges representing good- and bad-quality data. Note that because a data set may support multiple requirements, several different data quality assessments may need to be performed.
4. Apply the assessment criteria to the data items.
5. Review the results and determine whether the data quality is acceptable.
6. Where appropriate, take corrective action, e.g. clean the data and improve data-handling processes to prevent future recurrences.
7. Repeat the above on a periodic basis to monitor trends in data quality.
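Steps 2–5 of the approach above amount to rolling weighted dimension scores into one figure and comparing it to an agreed threshold. A minimal sketch; the scores, weights, and threshold are illustrative assumptions:

```python
# Sketch of steps 2-5: combine per-dimension scores using agreed weights,
# then compare against an acceptance threshold. All numbers are illustrative.
dimension_scores = {"completeness": 0.95, "uniqueness": 0.99, "validity": 0.88}
weights = {"completeness": 0.5, "uniqueness": 0.2, "validity": 0.3}  # sum to 1

# Overall score: weighted average across the chosen dimensions.
overall = sum(dimension_scores[d] * weights[d] for d in weights)

THRESHOLD = 0.90  # acceptance level agreed with the business (assumption)
print(f"overall DQ score = {overall:.3f}, acceptable = {overall >= THRESHOLD}")
```

Re-running this calculation each period (step 7) produces the trend line that tells you whether corrective actions from step 6 are actually working.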


