More than 70% of revenue leaders surveyed in the InsideView Alignment Report 2020 rank data management as their highest priority, yet a Harvard Business Review study estimates that only 3% of companies’ data meets basic quality standards.
There is a major gap between what companies want in terms of data quality and what they are doing to fix it.
The first step in any data management plan is to test the quality of your data and identify the core issues that lead to poor data quality. Here’s a quick checklist-style guide to help IT managers, business managers, and decision-makers analyze the quality of their data and find the tools and frameworks that can make it accurate and reliable.
What is data quality and why does it matter?
Before we delve into the checklist, here’s a quick briefing on what data quality is and why it matters.
There is no single definition of data quality, and to give one would be to limit the scope of data itself. There are, however, benchmarks that can be used to assess the state of your data. For instance, data of high quality would mean:
Put simply, poor data, left neglected, impacts every aspect of your business process, from sales to marketing, customer support to customer service, and team efficiency. Data quality is no longer a back-burner process. It’s affecting businesses drastically, which makes it all the more important to treat data quality as a burning issue that needs a resolution before it endangers the growth plan of a business.
Before you can test your data efficiently, it is necessary to define and set the right expectations from the process and the data itself. Let’s look at what you should know before starting your data quality testing process.
Is it supposed to fuel your business intelligence process? Or help you identify new market opportunities and customer segments? Whatever the intended purpose of data is at your company, identify it. If you don’t understand what data can do for you, you’ll never be able to measure whether it is fulfilling its purpose.
You must understand the metrics that will help you measure data quality. This could be as simple as the ten critical data quality dimensions that we all know so well. But it is better if you make this a bit more specific to your use case. For example, the Date column in a dataset should contain only correctly formatted dates. But a well-formatted date can still be a garbage value if it is too old to be accurate. So, you could have your own, more specific definition of what accurate, complete, consistent, valid, timely, and unique mean to your company.
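As a rough illustration of a company-specific rule, the sketch below labels a date string as valid, malformed, or too old. The ISO format, the cutoff date, and the function name are assumptions made for this example, not a recommended standard.

```python
from datetime import date, datetime

# Hypothetical business rule: anything before this cutoff is treated as garbage.
EARLIEST_PLAUSIBLE = date(1990, 1, 1)

def classify_date(raw):
    """Label a raw value as 'valid', 'malformed', or 'too_old'."""
    try:
        # Assumes dates are expected in ISO format (YYYY-MM-DD).
        parsed = datetime.strptime(raw, "%Y-%m-%d").date()
    except (ValueError, TypeError):
        return "malformed"
    if parsed < EARLIEST_PLAUSIBLE:
        return "too_old"
    return "valid"

samples = ["2021-03-15", "15/03/2021", "1900-01-01"]
labels = [classify_date(s) for s in samples]
print(labels)
```

The third sample is correctly formatted but still fails, which is the kind of distinction a generic format check would miss.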
This is probably the most important information that you need prior to your data quality testing process. Metadata is the information that describes your data. It helps you to understand the descriptive and structural definition of each data field in your dataset, and hence measure its impact and quality.
Examples of metadata include the data’s creation date and time, the purpose of data, source of data, process used to create the data, creator’s name and so on. Metadata allows you to define why a data field is being captured in your dataset, its purpose, acceptable value range, appropriate channel and time for creation, etc., and use that while testing and measuring data for quality.
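One possible way to make metadata usable during testing is to record it as a small structure per field. The class below is a minimal sketch; its name, attributes, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FieldMetadata:
    """Descriptive and structural definition of one data field (illustrative)."""
    name: str
    purpose: str
    source: str
    required: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def accepts(self, value):
        """Return True when the value satisfies the recorded constraints."""
        if value is None:
            return not self.required
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True

age = FieldMetadata(
    name="Age",
    purpose="customer segmentation",
    source="signup form",
    min_value=0,
    max_value=120,
)
print(age.accepts(34), age.accepts(-5), age.accepts(None))
```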
Now here’s the part that you’ve been waiting for. Once you’ve prepared and set the broad testing criteria, you are now ready to begin your testing process.
There are multiple levels of data quality testing depending on the depth and perspective of the test plan you’re following.
Since data is captured from our surroundings, we can quickly validate its accuracy by comparing it with known truth. For example: does the Age column contain any negative values? Are required Name fields set to null? Do Address field values represent real addresses? Does the Date column contain correctly formatted dates?
This level of testing can be performed by generating a quick data profile of your dataset. It is a simple compare-and-label test where your dataset values are compared against your defined validations and some known, correct values, and classified as valid or invalid. Although it can be done manually, you can also use an automated tool that will run a quick profile test and show you where your data stands against the validation rules you defined.
But keep in mind that this level only tests the data itself, and not the metadata.
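The compare-and-label test described above can be sketched in a few lines. The column names, the rules, and the sample rows here are assumptions chosen for illustration.

```python
import re

# Hypothetical validation rules: each returns True when a value is acceptable.
RULES = {
    "name": lambda v: isinstance(v, str) and v.strip() != "",
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "date": lambda v: isinstance(v, str)
    and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
}

def profile(rows):
    """Count valid and invalid values per column."""
    report = {col: {"valid": 0, "invalid": 0} for col in RULES}
    for row in rows:
        for col, rule in RULES.items():
            label = "valid" if rule(row.get(col)) else "invalid"
            report[col][label] += 1
    return report

rows = [
    {"name": "Ana Ruiz", "age": 34, "date": "2021-03-15"},
    {"name": None, "age": -2, "date": "03/15/2021"},
    {"name": "Li Wei", "age": 51, "date": "2020-11-02"},
]
report = profile(rows)
print(report)
```

Note that the date rule only checks the shape of the string; checking that an address is real would need an external reference source.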
The level-1 testing is focused on validating each individual value present in the dataset. The next level requires you to consider and test your dataset more holistically. This means testing your dataset vertically as well as horizontally. This level of testing is very useful if implemented at data-entry level as it stops errors from cascading into your dataset.
It means computing the statistical distribution of each data attribute, and validating that all values are following the distribution. This allows you to continuously keep in check that the nature of new, incoming data is the same as the data residing within your dataset.
Furthermore, for this type of testing, you can determine the median and average values for each distribution, and set minimum and maximum thresholds. On every new entry to the dataset, you can check the probability that the new data belongs to this distribution. If the probability is high enough (approx. 95% or more), you can conclude that the data is valid and accurate.
You can also use the metadata of an attribute to compute a distribution and test incoming data against it. For example, the Name field usually contains 7 to 15 characters. If a new Name entry has only 2 characters, it can be flagged as a potential error because its length does not conform to the expected distribution.
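The threshold idea from the two paragraphs above can be sketched as follows. This version assumes the attribute is roughly normally distributed, so that the mean plus or minus 1.96 standard deviations covers about 95% of values; real data may call for a different model. The function names and sample data are hypothetical.

```python
import statistics

def fit_thresholds(values, z=1.96):
    """Summarize a numeric attribute and derive min/max thresholds.

    Assumes roughly normal data, where mean +/- 1.96 standard
    deviations covers about 95% of values.
    """
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "min": mean - z * spread,
        "max": mean + z * spread,
    }

def fits(value, thresholds):
    """Return True when a new value falls inside the expected range."""
    return thresholds["min"] <= value <= thresholds["max"]

# Testing the values themselves.
order_totals = [48.0, 52.5, 50.0, 49.5, 51.0, 47.5, 53.0, 50.5]
total_limits = fit_thresholds(order_totals)
print(fits(51.2, total_limits), fits(500.0, total_limits))

# Testing a metadata property (string length) of the Name field.
names = ["Ana Ruiz", "Li Wei Chen", "John Smith", "Maria Garcia",
         "Omar Haddad", "Priya Patel", "Tom Baker", "Sofia Rossi"]
length_limits = fit_thresholds([len(n) for n in names])
print(fits(len("Grace Lee"), length_limits), fits(len("Al"), length_limits))
```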
It means performing a holistic analysis to qualify the uniqueness of each record in your dataset. For this type of testing, you need to go row by row through a dataset and verify that all records represent uniquely identifiable entities and that no duplicates are present. This is a more complex form of testing, as it can be difficult to assess the uniqueness of a record in the absence of a unique key. For this purpose, fuzzy matching algorithms are used to determine probabilistic matches.
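To give a feel for fuzzy matching without a unique key, here is a minimal sketch that uses the similarity ratio from Python's standard `difflib` module. This is a simple stand-in, not the technique any particular product uses; the fields compared and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(record):
    """Build a lowercase, whitespace-collapsed comparison string."""
    text = f"{record['name']} {record['city']}"
    return " ".join(text.lower().split())

def likely_duplicates(records, threshold=0.85):
    """Return (index_a, index_b, score) for pairs that look like the same entity."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

records = [
    {"name": "Jonathan Smith", "city": "Boston"},
    {"name": "Jonathon  Smith", "city": "boston"},
    {"name": "Maria Garcia", "city": "Austin"},
]
duplicates = likely_duplicates(records)
print(duplicates)
```

Comparing every pair grows quadratically with the number of records, so larger datasets usually need some form of grouping before pairwise comparison.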
Level 3 testing is the same as level 2, but instead of considering only the current dataset, historical records are also used for computing row matches and field distributions. This is done so that any changes in data that happen over time are also considered while validating data values.
For example, yearly sales are expected to spike at the end of the year due to holidays and are comparatively slower in the seasons leading up to it. So, you can end up drawing incorrect conclusions about your data if you don’t take time into consideration. With this level, you can also run tests for detecting anomalies in your data. This is done by looking at the history of values in a data attribute and classifying current values as normal or abnormal.
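A minimal sketch of a time-aware check: instead of comparing a new monthly figure against all history, compare it against the same month in earlier years. The sales numbers, the data layout, and the three-standard-deviation cutoff are assumptions for illustration.

```python
import statistics

# Hypothetical monthly sales history: {year: {month: total}}.
history = {
    2020: {6: 100.0, 12: 310.0},
    2021: {6: 110.0, 12: 330.0},
    2022: {6: 105.0, 12: 320.0},
}

def is_anomalous(month, value, history, z=3.0):
    """Flag a value that is far from the same month in previous years.

    Needs at least two earlier observations for the given month.
    """
    past = [months[month] for months in history.values() if month in months]
    mean = statistics.mean(past)
    spread = statistics.stdev(past)
    return abs(value - mean) > z * spread

# The same figure is normal for December but abnormal for June.
print(is_anomalous(12, 325.0, history))
print(is_anomalous(6, 325.0, history))
```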
Now that we’ve covered the different levels of data quality testing, let’s look at the tools and frameworks available out there that can help you implement your testing process.
In traditional data warehouse environments, a data quality test is a manual verification process. Users manually verify values for data types, length of characters, formats, and whether the value falls within an acceptable range. This manual verification not only makes the process time-intensive but also makes the test results prone to human error.
This whitepaper highlights the kinds of challenges companies face while implementing a manual data management system, and how you can overcome these problems through automated solutions.
A number of open-source projects are available that can help you test your data using various coded functions. Many organizations find these solutions easily adaptable, but some require customization before they can be used for a specific use case. As these tools only offer the code for functional scripts, you may need a developer to complete the process of reporting test results or programming custom alerts for every time a data quality rule is violated.
It is very common for companies to decide on building a custom solution for any problem that they are facing, and it is no different for data quality testing. Management either outsources the project or uses a team of in-house developers to understand its data quality control issues and invests in the implementation of a custom solution. Although the idea of having a data quality control system built specifically for your organization’s use case seems attractive, it is usually very difficult to keep such scripts valid, as data quality definitions constantly need review and change.
More than 70% of all in-house software development projects fail. Read this whitepaper to understand why in-house data quality solutions end up being a major liability for businesses, and how you can leverage automated solutions.
As data quality challenges become more complex, modern problems require modern solutions. Data scientists and data analysts are spending 80% of their time testing data quality, and only 20% of their time extracting business insights. Automated data quality testing tools leverage advanced algorithms to free you from the manual labor of testing datasets for quality, or of maintaining coded solutions over time as data quality definitions evolve.
These tools are designed to be self-service and user-friendly so that anyone – business users, data analysts, IT managers – can generate quick data profiles as well as perform in-depth analysis of data quality through proprietary data matching techniques.
Normally, these tools specialize in offering two different types of testing engines – some come with only one and very few specialize in both types. Let’s take a look at them.
Rules-based testing tools allow you to configure rules for validating datasets against your custom-defined data quality requirements. You can define rules for different dimensions of a data field. For example, its length, allowed formats and data types, acceptable range values, required patterns, and so on. These tools quickly profile your data against configured rules, and offer a concise data quality summary report which covers the results of the test.
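The rules-based approach can be pictured as a configuration of rules plus a small engine that produces a summary report. The sketch below is a simplified illustration; the rule format, check names, and sample rows are hypothetical and do not reflect any specific tool.

```python
import re

# Hypothetical rule configuration: one entry per field dimension to validate.
RULES = [
    {"field": "zip", "check": "pattern", "arg": r"\d{5}"},
    {"field": "zip", "check": "max_length", "arg": 5},
    {"field": "age", "check": "range", "arg": (0, 120)},
]

CHECKS = {
    "pattern": lambda v, arg: isinstance(v, str) and re.fullmatch(arg, v) is not None,
    "max_length": lambda v, arg: isinstance(v, str) and len(v) <= arg,
    "range": lambda v, arg: isinstance(v, (int, float)) and arg[0] <= v <= arg[1],
}

def summarize(rows, rules):
    """Run every configured rule and report failing rows and pass rate."""
    summary = []
    for rule in rules:
        check = CHECKS[rule["check"]]
        failed = [
            i for i, row in enumerate(rows)
            if not check(row.get(rule["field"]), rule["arg"])
        ]
        summary.append({
            "rule": f'{rule["field"]}:{rule["check"]}',
            "failed_rows": failed,
            "pass_rate": 1 - len(failed) / len(rows),
        })
    return summary

rows = [
    {"zip": "02139", "age": 34},
    {"zip": "2139", "age": 150},
    {"zip": "021390", "age": 51},
    {"zip": "10001", "age": 28},
]
summary = summarize(rows, RULES)
for line in summary:
    print(line)
```

Keeping the rules as data rather than code means they can be edited as quality definitions change, without touching the engine.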
Suggestions-based testing tools are usually based on machine learning algorithms. They analyze your current and historical datasets to train models of data distribution. Next, they test every incoming data value against the model, and output a data quality suggestion based on the result. Instead of requiring you to configure data quality rules manually, suggestion-based tools tell you how well your data conforms to what they have learned. This is a very efficient way of analyzing and capturing anomalies at the data-entry level.
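To illustrate the idea of learning from history rather than configuring rules, here is a deliberately simple sketch. It is not machine learning in any serious sense: it learns how often each value "shape" appears in historical data and flags incoming values whose shape is rare. The shape encoding, the 5% cutoff, and the sample phone numbers are assumptions.

```python
from collections import Counter

def shape(value):
    """Reduce a string to its shape: digits become 9, letters become A."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in value
    )

def train(historical_values):
    """Learn the share of each shape seen in historical data."""
    counts = Counter(shape(v) for v in historical_values)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}

def suggest(value, model, min_share=0.05):
    """Return a suggestion based on how common the value's shape is."""
    share = model.get(shape(value), 0.0)
    if share >= min_share:
        return "looks consistent with history"
    return "review: unusual format"

model = train(["555-0101", "555-0102", "555-0199", "555-0142"])
print(suggest("555-0177", model))
print(suggest("5550177x", model))
```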
Data quality testing is not a static, one-time process. Right when you feel like you’ve got the quality of your dataset under control, invest in implementing a long-term plan for quality maintenance. There are different activities that need to be performed at regular intervals to ensure that the quality achieved is being maintained. Some of them include:
As new data enters your ecosystem, the overall quality of your data can deteriorate. This is why you need to implement data quality checks at the data entry or data integration level. You want to make sure that new data introduced into the system is accurate and unique, and is not a duplicate of any entity currently residing in your master record.
This is probably one of the most important post-testing activities. You need to continuously assess the state of your data. This requires you to run quick profile tests on your dataset at regular intervals to ensure resolution of errors on time. It is a good practice to store the results of these profiles over time as they would help you to understand at what point in time your data quality went south.
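Storing profile results over time can be as simple as keeping a dated list of error rates. The sketch below finds the first run where quality dropped below an acceptable level; the stored figures and the 5% tolerance are hypothetical.

```python
from datetime import date

# Hypothetical stored profile results: (run date, share of invalid values).
profile_history = [
    (date(2024, 1, 1), 0.02),
    (date(2024, 2, 1), 0.03),
    (date(2024, 3, 1), 0.11),
    (date(2024, 4, 1), 0.12),
]

def first_degradation(history, max_invalid_share=0.05):
    """Return the date of the earliest run that exceeded the tolerance."""
    for run_date, invalid_share in sorted(history):
        if invalid_share > max_invalid_share:
            return run_date
    return None

print(first_degradation(profile_history))
```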
Keep an eye on the kinds of errors your data profile reports usually contain. Do they mostly alert you to incorrect date formats? Are there null values present for required fields? Maybe you need to fix your data entry form validations. This activity will help you eliminate data quality errors at the root and will allow you to use data directly for its intended purpose.
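Spotting recurring error types can start with a simple tally across past reports. In this sketch the reported errors are made-up examples, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical entries collected from past profile reports: (field, error type).
reported_errors = [
    ("signup_date", "bad format"),
    ("email", "missing"),
    ("signup_date", "bad format"),
    ("age", "out of range"),
    ("signup_date", "bad format"),
    ("email", "missing"),
]

def top_error_sources(errors, n=2):
    """Return the n most frequent (field, error type) pairs with their counts."""
    return Counter(errors).most_common(n)

top = top_error_sources(reported_errors)
print(top)
```

Here the date format error dominates, which would point toward the entry form for that field rather than toward the data itself.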
Most companies don’t engage in data quality tests unless they are critical for a data migration or a merger, but by then it’s far too late to fix the problems caused by poor data. Test your data quality, define the criteria, and set benchmarks to drive improvement.
Luckily, you no longer have to put in the effort of manually testing your data as most ML-based data quality testing solutions today allow businesses to do that with a few easy steps. You’re choosing between 2 minutes vs 12 hours. And the choice doesn’t have to be daunting. Best-in-class solutions like DataMatch Enterprise allow free trials that you can benefit from. All you have to do is plug in your data source and let the software guide you through the process. You’ll be surprised at the hours and manual effort you’d be saving your team with an automated solution that also delivers more accurate results than manual methods.