Test Data Is Critical: How to Best Generate, Manage, and Use It

By Testim

Notwithstanding testing in production—which should be part of any mature QA strategy—you should avoid using production data directly. Instead, use test data.

This article explains how to safely generate test data, how to manage it, and finally how to use it. Here’s a summary of what we’ll cover:

  • What Is Test Data?
  • Challenges In Obtaining Test Data
  • The Case For Meaningful Test Data. Why it matters to have meaningful, realistic test data.
  • Generating Test Data Using Different Techniques. Techniques to produce quality test data.
  • 4 Common Pitfalls When Using Production Data as Test Data. Common problems you might face and how to avoid them.
  • Test Data Management: How Can You Do It? How to evolve to a proper test data management strategy.
  • How Can You Use Test Data? Learn how to use test data in the real world.

Let’s start.

What Is Test Data?

We’ll open the post by answering the most basic question about test data: the “what.” Once we have that out of the way, we’ll be free to go deeper on the topic. So, what is test data?

In short, test data is data used to feed automated software tests. That might sound simple, but provisioning test data is often a gigantic challenge. As you’ll soon see, it’s paramount that the data you feed your tests is of excellent quality, consistent, and available at the right time and in the right amounts.

Performing tests with subpar data sets will inevitably lead to subpar results. And since software testing is essential for any company that wants to ship high-quality code at a fast pace, test data, by consequence, is just as crucial.

Challenges In Obtaining Test Data

When it comes to obtaining test data, there are a few approaches you can use. However, each one of them comes with its set of challenges.

Synthetic generation, sometimes also called test data fabrication, is the process of generating “artificial” data only for testing purposes. When leveraging this approach, one of the biggest challenges is ensuring the validity of the data. In other words, the generated data needs to be realistic. More than that, it needs to be consistent and compliant with the business rules and domain logic of the system under test.

As the name suggests, production cloning is copying real data from the production environment. In this case, you don’t need to worry about the validity of the data. After all, we’re talking about production; it doesn’t get any more real than that. However, production cloning is problematic in different ways.

First, you don’t really want to copy all of the data, since that would incur high infrastructure costs. The recommended route is to copy a fraction of the data while keeping all the relationships between records intact, which can be difficult. Also, due to privacy concerns, it’s crucial that any production data is masked so that the personal information of real users is protected.
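To make the idea of copying a fraction of the data with its relationships intact more concrete, here is a minimal sketch in Python. It assumes a SQLite database with hypothetical `customers` and `orders` tables; the table names, the columns, and the `copy_slice` helper are illustrations only, not part of any specific tool. Masking is deliberately left out here and would still have to be applied to the copied rows.

```python
import sqlite3


def copy_slice(source: sqlite3.Connection, target: sqlite3.Connection, keep_every: int = 10) -> None:
    """Copy every `keep_every`-th customer plus only the orders that belong to them."""
    # Hypothetical schema, recreated in the target database.
    target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    target.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(id), total_cents INTEGER)"
    )

    # Slice the parent table first.
    customers = source.execute(
        "SELECT id, name FROM customers WHERE id % ? = 0", (keep_every,)
    ).fetchall()
    target.executemany("INSERT INTO customers VALUES (?, ?)", customers)

    # Copy only child rows whose parent was selected, so no order is orphaned.
    orders = source.execute(
        "SELECT o.id, o.customer_id, o.total_cents FROM orders AS o "
        "JOIN customers AS c ON c.id = o.customer_id WHERE c.id % ? = 0",
        (keep_every,),
    ).fetchall()
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()
```

Selecting the parent rows first and then following the foreign key is what keeps the slice consistent. A real schema with more tables would need the same walk for every relationship.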

The Case For Meaningful Test Data

Why do the accuracy and structure of test data matter? Well, it doesn’t make sense to test your software with completely meaningless data. This might sound harsh, but it’s probably better not to perform testing at all with such data, as you might be testing nothing.

Meaningless data doesn’t add any value to the quality of your application. Therefore, test data needs to be meaningful for you to perform worthwhile tests without revealing private information.

In addition, with the rise of automated testing, there’s little room for manual test data creation. Continuous testing has gained a lot of attention within the DevOps and software development communities. This part of the testing strategy involves generating test data on the fly while running test cases.

Basically, your testing strategy relies on a script that can generate the ideal testing data for you and your projects.

Generating Test Data Using Different Techniques

There are different techniques for obtaining test data. One of them is production cloning—i.e. copying the data from production servers.

It’s essential to mask or substitute any sensitive data to avoid disclosing any personally identifiable information. Additionally, you might want to adopt techniques such as slicing—that is, copying just a portion of the data from production. You don’t need all of the data, and copying just a small amount will improve performance and save costs.

Other techniques might include synthetic data generation, manual data generation—through a front-end—and even web-scraping.

It’s essential to have a mix of different strategies to generate test data. If you are only using a single strategy to generate test data, you might end up testing the same cases over and over.

If you are generating test data, the following three properties will make sure you generate high-quality test data each time:

  1. Accurate data: Data should be realistic and resemble real-life situations. You don’t want to fill in a date that lies 100 years in the future.
  2. Valid data: Data should match the purpose of your tool. If you have a webshop, don’t test for crazy scenarios where someone would buy 200 items. Make sure to simulate valid scenarios where a user buys one to ten products at once.
  3. Exceptions data: Make sure that your data also covers exceptions. For instance, a user returned a product to a webshop and received a coupon code for their next purchase. Make sure to cover the scenario where a user checks out using a coupon code only. It’s a clear exception scenario that deserves testing.
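The three properties above can be sketched as a small generator. This is only an illustration for the webshop example: the field names, the price range, and the rule that every fifth order is paid entirely by coupon are assumptions made for the sketch.

```python
import random
from datetime import date, timedelta


def make_order(rng: random.Random, today: date, coupon_only: bool = False) -> dict:
    """Build one hypothetical webshop order."""
    quantity = rng.randint(1, 10)                # valid: one to ten products
    unit_price_cents = rng.randint(500, 10_000)  # assumed price range, in cents
    subtotal_cents = quantity * unit_price_cents
    # Exception scenario: the coupon covers the whole order.
    coupon_cents = subtotal_cents if coupon_only else 0
    return {
        # Accurate: the order date is never in the future.
        "order_date": today - timedelta(days=rng.randint(0, 365)),
        "quantity": quantity,
        "subtotal_cents": subtotal_cents,
        "coupon_cents": coupon_cents,
        "amount_due_cents": subtotal_cents - coupon_cents,
    }


def make_orders(count: int, seed: int, today: date) -> list[dict]:
    """Generate `count` orders; every fifth one is a coupon-only checkout."""
    rng = random.Random(seed)  # seeded so a failing test can be reproduced
    return [make_order(rng, today, coupon_only=(i % 5 == 0)) for i in range(count)]
```

Calling `make_orders(50, seed=1, today=date.today())` would yield 50 orders, ten of which have nothing left to pay after the coupon is applied.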

Next, let’s explore four common pitfalls you may encounter when generating test data that’s based on your production data.

4 Common Pitfalls When Using Production Data as Test Data

Using your production data might be a smart approach for your organization to generate test data. However, many organizations forget about the limitations of this data. Here are four common pitfalls companies encounter when basing their test data on production data.

Pitfall 1: Missing Data

When the development team creates new functionality, this might introduce new data that’s being captured. This means that you have new tables in your database for which you don’t have any sample data. When you’re blindly copying production data to be used as test data, you might forget about these new data tables.

Therefore, analyze if any new data has been introduced that your testing engineers need to generate.

Pitfall 2: Production Data That Follows the Happy Path

For testing engineers, the “happy path” is a common term that refers to testing only the success scenarios. This happy path is also easy to find in production data, as every action that a user completes should be successful.

Considering this, your production data might not be the ideal data set to use for testing. You’ll likely have to create data for negative scenarios as well, so you can test failures.

Pitfall 3: Testing Edge Cases

Your production data often doesn’t represent any edge cases. Because the production data represents the happy path, you won’t find many edge cases or advanced flows in your data. This might be an issue if you want to test all possible scenarios to reach 100% test coverage. Your production data might test only 70% to 80% of the scenarios.

In short, you won’t reach 100% test coverage solely with production data. Your application requires additional data to represent advanced flows. Furthermore, generating this data might require a more manual approach.

Pitfall 4: Generating Data Without a Testing Purpose

Testing engineers often forget to determine the purpose of the data they are generating. What are you trying to test with the data you’ve generated? There are several clear purposes you can adopt when generating test data:

  • Data for white box testing: The test data should cover as many code paths as possible for your application, even negative paths. Therefore, you want to pass invalid parameters or invalid combinations to see how your application responds.
  • Data for security testing: There’s a big difference between data for testing all code paths and testing security issues. Data for security testing is often much more sophisticated to uncover security issues. For instance, you want to verify if only authorized people can access your system. It’s much different from generating data that verifies if the login form works as expected.
  • Data for black box testing: Black box testing focuses on verifying the application’s behavior without knowing anything about the application or code itself. Therefore, you want to generate a wide variety of data to test as many problems and cases as possible to find bugs or issues. For example, you want to generate different birth date formats to verify how a form reacts when you pass formats it doesn’t expect.
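For the birth date example, a sketch like the following could produce the variety of inputs you feed into the form. The list of formats and the `birth_date_variants` helper are assumptions chosen for illustration; which formats your form should accept or reject depends on your application.

```python
from datetime import date


def birth_date_variants(d: date) -> list[str]:
    """Return one birth date in several formats, plus a few invalid inputs."""
    return [
        d.isoformat(),            # ISO 8601, e.g. 1990-04-07
        d.strftime("%d/%m/%Y"),   # day first
        d.strftime("%m/%d/%Y"),   # month first
        d.strftime("%d.%m.%Y"),   # dotted
        d.strftime("%Y%m%d"),     # compact, no separators
        d.strftime("%B %d, %Y"),  # month name (depends on the active locale)
        "",                       # empty input
        "not-a-date",             # free text
        "1990-02-30",             # well-formed but nonexistent day
    ]
```

Each value can then be submitted to the form under test while you record whether it is accepted, rejected with a clear message, or causes an error.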

Test Data Management: How Can You Do It?

Test data management includes many aspects, such as removing personally identifiable information and performing data validity checks. Here are four approaches you should follow to manage your test data accordingly. Each approach is equally important when managing your test data.

Approach 1: Remove Any Personally Identifiable Information

First, check if your data contains any personally identifiable information (PII). If so, apply data masking techniques such as substitution, shuffling, or blurring. These techniques help you to make data non-identifiable.
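As a rough illustration of these three techniques, the sketch below masks a list of customer records. The record fields and the `mask_customers` helper are hypothetical, and "blurring" is interpreted here as adding a small random variance to a numeric value, which is one common reading of the term.

```python
import random


def mask_customers(rows: list[dict], seed: int = 0) -> list[dict]:
    """Return masked copies of hypothetical customer records."""
    rng = random.Random(seed)

    # Shuffling: keep the real values of a column but detach them from their owners.
    cities = [row["city"] for row in rows]
    rng.shuffle(cities)

    masked = []
    for index, (row, city) in enumerate(zip(rows, cities), start=1):
        masked.append({
            "id": row["id"],                          # keys stay so relationships still resolve
            "name": f"Customer {index}",              # substitution with a placeholder
            "email": f"customer{index}@example.com",  # substitution with a reserved test domain
            "city": city,                             # shuffled value
            "age": row["age"] + rng.randint(-3, 3),   # blurring: small random variance
        })
    return masked
```

Note that shuffling a small data set can leave some values in their original position, so it works best on larger columns and in combination with the other techniques.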

Next, check the validity of your test data regularly.

Approach 2: Perform a Data Validity Check

As development moves forward and you or your team members add new features, your data should move forward too. Therefore, perform test data audits regularly to find outdated data. Furthermore, validate if any data is missing to support new functionality.

As mentioned earlier in this article, you might end up introducing new features, meaning that you also need new data tables.
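Part of such an audit can be automated. The following sketch assumes a SQLite test database and a list of table names the current release is expected to use; both the helper name `find_tables_without_data` and the table names in the usage note are made up for the example.

```python
import sqlite3


def find_tables_without_data(conn: sqlite3.Connection, expected_tables: list[str]) -> dict[str, str]:
    """Report expected tables that are missing from the test database or contain no rows."""
    existing = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    report = {}
    for table in expected_tables:
        if table not in existing:
            report[table] = "missing"
        # The table name was confirmed against sqlite_master above before being quoted.
        elif conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0] == 0:
            report[table] = "empty"
    return report
```

Running it with `["customers", "coupons", "returns"]` against a database where `coupons` has no rows and `returns` does not exist would flag exactly those two tables, which tells your testing engineers where new data needs to be generated.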

Approach 3: Refresh Your Test Data Regularly

Besides checking the validity of your data, it’s important to regularly refresh your data. This process can be easily automated with scripts that help you generate new data. This buyer’s guide can help you evaluate automated testing solutions.

Refreshing your test data can improve the quality of your application. Different data might expose bugs that your team hasn’t discovered yet with previous test data. Therefore, it’s important to make the time to regularly update your test data.
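A refresh script does not have to be elaborate. The sketch below regenerates a hypothetical `users` data set from a seed and stores the seed next to the data, so a run that exposes a bug can be reproduced later. The file layout, the field names, and the `refresh_test_data` helper are assumptions for illustration.

```python
import json
import random
from pathlib import Path


def refresh_test_data(path: Path, seed: int, count: int = 100) -> list[dict]:
    """Regenerate a hypothetical user data set and record the seed that produced it."""
    rng = random.Random(seed)
    users = [
        {"id": i, "age": rng.randint(18, 90), "plan": rng.choice(["free", "pro", "team"])}
        for i in range(1, count + 1)
    ]
    # Storing the seed makes it possible to recreate exactly this data set.
    path.write_text(json.dumps({"seed": seed, "users": users}, indent=2))
    return users
```

One possible design is to pass a new seed on every scheduled refresh, for example the build number, and to reuse an old seed only when you need to replay a failure.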

Approach 4: Manage Data Access

Last of all, manage data access. Your organization needs to know how to access all important data. In addition, to ensure smooth testing, make sure your testing engineers always have access to the required data. You don’t want to slow down a release because of data inaccessibility.

Tip: Consider creating a list of data sources you need for testing and where they are located. This helps the testing engineers to easily find test data.

How Can You Use Test Data?

First of all, always make a copy of your test data before using it. That way, if something goes wrong, you can still access the original test dataset.

Next, you can use test data in various ways. Scripts can convert test data into different formats or insert the data into a database. For example, you may want to directly inject the data into a test database in order to test whether the application runs correctly.
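Both steps, copying the data first and then injecting it into a test database, can be sketched in a few lines. The example assumes a CSV file with `id` and `email` columns and a SQLite test database; the `users` table and the `load_working_copy` helper are hypothetical.

```python
import csv
import shutil
import sqlite3
import tempfile
from pathlib import Path


def load_working_copy(original_csv: Path, conn: sqlite3.Connection) -> Path:
    """Copy the test data file, then load the copy into a hypothetical `users` table."""
    # Work on a copy so the original data set is never modified.
    workdir = Path(tempfile.mkdtemp(prefix="testdata-"))
    working_copy = workdir / original_csv.name
    shutil.copy2(original_csv, working_copy)

    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT)")
    with working_copy.open(newline="") as handle:
        rows = [(int(record["id"]), record["email"]) for record in csv.DictReader(handle)]
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    conn.commit()
    return working_copy
```

The function returns the path of the working copy so that the caller can delete it once the test run is finished.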

After you run the test cases, you can do several things with your test data:

  • Store the final state of your database as a reference.
  • Delete all test data to avoid confusion about which is the original test data file. Also, clean the imported or outputted test files in your application. It can be a tricky process to clean everything accordingly. For example, output files may be hidden in several places in your tests. It’s easy to miss a couple of files when cleaning them up.
  • Use the end state of your database as input for further testing.
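One way to make the cleanup less error-prone is to keep everything a test run writes inside a single temporary directory. The sketch below does that and also stores the final database state as a reference before the directory is removed. The `run_with_cleanup` helper and its callback signature are assumptions for this example, not a feature of any particular test framework.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable


def run_with_cleanup(
    test_body: Callable[[sqlite3.Connection, Path], None],
    snapshot_path: Path,
) -> Path:
    """Run a test against a throwaway database and keep only a dump of its final state."""
    with tempfile.TemporaryDirectory(prefix="test-run-") as workdir:
        conn = sqlite3.connect(Path(workdir) / "test.db")
        try:
            # The test writes its database rows and output files inside workdir.
            test_body(conn, Path(workdir))
            conn.commit()
            # Store the final state as SQL text, outside the directory being deleted.
            snapshot_path.write_text("\n".join(conn.iterdump()))
        finally:
            conn.close()
    # Leaving the `with` block removed the database and every output file.
    return Path(workdir)
```

Because the snapshot is plain SQL, it can serve as a reference for comparison or be replayed as the starting state for further testing.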

Give Your Test Data Some Love!

As you can see, test data management can be quite complex. The most important question to solve is whether you should use production data to generate test data.

Using production data as test data saves your organization a lot of time, but has its downsides. Synthetic data generation comes in handy in these scenarios.

Now that you know more about test data generation and test data management, you’re probably interested in the tools at your disposal. Here’s a post that makes a nice follow-up to this one: The Top 5 Test Data Management Tools.

If you want to learn more about data masking tools, we’ve got your back as well: Which Data Masking Tools Should You Choose in 2021?

And even though it’s not a test data management tool in the traditional sense, Testim Automate, a powerful AI-based test automation solution, has features that make it easier to work with data sets when performing tests. If you don’t use Automate yet, create your account for free and check it out.

This post was written by Michiel Mulders. Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!