Test Data Management (TDM)

"Data isn't information, any more than fifty tons of cement is a skyscraper" - Clifford Stoll

Test data management is an important task in the testing life cycle. The enormous volume of data generated during testing needs to be managed, and common activities around it include identification, aging, masking and archiving. This post discusses the following questions in detail:

What is Test Data Management?
Why do we need TDM?
What are some of the indicators that your project needs TDM?
What are some of the major activities of TDM?
Test Data Management Challenges
Test Data Management strategy
What are some pros and cons of cloning production databases?
What are some challenges of using Production Data in Test Environment (Production Cloning)?
What are some pros and cons of generating synthetic data?
What are some pros and cons of Subsetting production databases?
What are some challenges in Data Subsetting?
Key features of TDM tool

What is Test Data Management (TDM)?
TDM consists of managing the provisioning of required test data efficiently and effectively, while ensuring compliance with regulatory and organizational standards. Below are some building blocks of TDM:

Data Subset – the process of slicing off part of the production database and loading it into the test database
Data Masking – the process of masking the sensitive fields in the complete data set
Data Archive – the process of storing a data snapshot so that it can be restored later for a given build, release or cycle
Test Data Refresh – the process of loading or refreshing the test data with the latest data from production
Test Data Aging – a process required for time-based testing. Depending on the scenario under test, the given dates are either backdated or forward-dated
Gold Copy – the baseline version of data that can be used for future releases
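Of these building blocks, test data aging is perhaps the easiest to illustrate. The following is only a minimal sketch, not how any particular TDM tool works: the record layout, the field names and the age_record helper are all hypothetical. It shifts every date field in a record by the same offset, so the gaps between dates are preserved.

```python
from datetime import date, timedelta

def age_record(record, days, date_fields):
    """Return a copy of `record` with each listed date field shifted by `days`.

    A negative `days` backdates the record, a positive one forward-dates it.
    """
    aged = dict(record)
    for field in date_fields:
        aged[field] = record[field] + timedelta(days=days)
    return aged

# Hypothetical policy record used only for illustration.
policy = {"id": 1, "start": date(2024, 1, 15), "renewal": date(2025, 1, 15)}

# Backdate by 30 days to test, for example, an "about to expire" scenario.
backdated = age_record(policy, -30, ["start", "renewal"])
print(backdated["start"], backdated["renewal"])
```

Shifting all date fields by one common offset is a design choice: it keeps the interval between related dates (here, start and renewal) unchanged, which time-based business rules usually rely on.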

Why do we need TDM?
Research shows that projects cancelled due to poor data quality are 15 percent more costly than successful projects of the same size and type. It has also been observed that around 10% of the defects raised in production are data-related and could easily have been caught during the various testing phases.

To create "right-sized" test databases that accurately reflect E2E business processes
To enable developers to correct defects early in the life cycle
To allow comprehensive non-functional tests to be executed
To create realistic and manageable test databases by applying data sub-setting techniques
To safeguard customer privacy/security by applying data privatization techniques
To refresh data in test environments quickly and easily
To empower test teams to select and book test data sets
To keep the data used in testing available so that reported bugs can be reproduced

What are some of the indicators that your project needs TDM?

Testing deadlines slipping because of data-related outages and/or data synchronization issues
Testers spending more time preparing test data than actually testing
Testers depending heavily on business analysts (BAs) to provide meaningful test data
High risk and penalties associated with not adhering to compliance and/or data privacy laws
Many false defects caused by data-related issues
Testers complaining about the complexity of creating test data for consumption
Test data as voluminous as production data, hindering performance
Test data not being reused, and instead being created from scratch every time (using the same process)
Long delays in providing test data while waiting for another system to be ready
Teams complaining about managing test data as projects grow
Outsourced and/or offshored testing services having access to customers' PII

What are some of the major activities of TDM?

Acquiring an initial understanding of the test data landscape, such as the list of test regions, applications, types of data stores, and the frequency of data requests for each application
Carrying out a data profiling exercise for each individual data store across the enterprise
Identify:

Data types
Data dependencies
Data sources and providers
Tools for data extraction, masking, creation, loading and so on
Who needs test data: a tester, a developer or a vendor
When to refresh the test data and when to clean it up
The phase of the cycle in which the test data will be used: unit, integration, system or UAT

Assigning a version number to existing data
Identify test region(s) where data need to be loaded or refreshed
Restore "used" data to original "unused" state
Carrying out masking
Test data preparation

Cloning production databases
Generating synthetic data
Subsetting production data

Distribute unused data from other projects
Load data dump (masked or unmasked) to target region
Take a backup of the new data (both databases and files) once the data is set up
Assign a version number to the backup and catalog it with a proper description
Refresh with data dumps (production slice or other regions)
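The backup, versioning and cataloging steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than a prescribed design: the BackupEntry structure, the catalog_backup helper and the in-memory list standing in for a catalog are all hypothetical, and a real setup would persist the catalog and store the dump files themselves.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class BackupEntry:
    version: int       # sequential version number assigned to the backup
    description: str   # human-readable note (release, masked/unmasked, ...)
    sha256: str        # checksum used to verify the dump before a restore

def catalog_backup(catalog, dump_bytes, description):
    """Assign the next version number to a data dump and record it in the catalog."""
    entry = BackupEntry(
        version=len(catalog) + 1,
        description=description,
        sha256=hashlib.sha256(dump_bytes).hexdigest(),
    )
    catalog.append(entry)
    return entry

# Hypothetical dumps; real ones would be database exports or files.
catalog = []
catalog_backup(catalog, b"customer,order\n1,100\n", "Release 1 gold copy (masked)")
catalog_backup(catalog, b"customer,order\n1,100\n2,200\n", "Release 2 after refresh")
print(json.dumps([asdict(e) for e in catalog], indent=2))
```

Recording a checksum next to each version makes it possible to confirm that the dump being restored is the one that was cataloged.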

Test Data Management Challenges

Data Requirements

How to synchronize and share test data among multiple applications and teams?
How to resolve contention for environments?
How to analyze existing data if it has not been profiled properly?
How to handle sudden and immediate requests for test data during test execution?
How to ensure proper data distribution so as to prevent redundant or unused data?
How to ensure data reuse?

Data Validity and Consistency

How can it be ensured that the data has not ‘aged’ and has not become obsolete?
How are you planning to refresh test data on a regular basis to avoid poor data quality and loss of data integrity?
How to manage complex, heterogeneous systems with different file formats and multiple touch points?
What is your strategy for proper versioning of data?
How to enable traceability across the end-to-end business process?
How to maintain traceability from test data to test cases to business requirements?

Data Privacy

How to mask sensitive personal information before migrating it to the test environment(s)?
Are you aware of the government mandates and regulations that stipulate that data must be masked, de-identified or encrypted?
How to enable auditing of data?

Data Selection and Subsetting

How to plan a smaller subset of data for a scaled-down, non-production environment without risking test data coverage?
How to plan subsets of data in different formats for different teams (DW, Performance, Functional, System etc.) without lengthening test cycles?

Data Storage and Safety

Is your company ready for high storage, license and maintenance costs when copies of full production data are required in a test environment?
How many test environments require copies of full production data?
What is the policy for version control, access-security and backup mechanisms?

Data Refresh

How to manage the impact of a data refresh on ongoing projects?

Effort

DBA-like skills are required for the team managing TDM
Is there a separate team for data engineering, data provisioning, data mocking, etc.?
Managing and maintaining referential integrity and data quality during data generation
How long does it take to copy huge volumes of production data to different environments?
How to strategize test data identification, extraction and conditioning?
Coordination with multiple stakeholders

Test Data Management strategy
Quality data is a must for testing business functionality in the test environment. However, managing data quality is often challenging because of complex relationships, limited infrastructure, data sensitivity, and a lack of data that conforms to business rules. A good test data management strategy not only improves development and testing efficiency, but also helps organizations identify and correct defects early in the development process, when they are cheapest and easiest to fix. Any test data management strategy must provide a steady supply of relevant test data to support ever-tightening development cycles, while avoiding testing bottlenecks.

Gathering and Analyzing test data

Does relevant production data exist that can be used as test data?
Test cases not covered by production data must be covered by newly created test data

Data Generation

Have you outlined a set of criteria for automatically generating data of the required quality?
Is the generated data reusable, or does it need to be generated every time?
Is the data generated from scratch or copied as a subset of production data?

Data de-identification

Mask corporate, client, employee, etc. information
Support compliance with government and industry regulations
Mask complete business objects (e.g. Customer Order) consistently
Who will have access to this data? All internal team members or vendors doing testing?
Does the data need to be encrypted?
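Masking a business object consistently means that the same original value maps to the same masked value wherever it appears, so that relationships between records survive. One common way to get this property is keyed hashing. The sketch below is only an illustration: the key, the mask_value helper and the sample customer and order records are hypothetical, and a real project would also need to decide on key management and on the format of the masked values.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key must be kept secret
# and must never be shipped to the test environment with the data.
SECRET_KEY = b"demo-key-not-for-real-use"

def mask_value(value, prefix):
    """Deterministically replace `value` with a keyed-hash pseudonym."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:10]}"

# The same customer name appears in two places of one business object.
customers = [{"customer_id": "C001", "name": "Alice Smith"}]
orders = [{"order_id": "O9001", "customer_name": "Alice Smith", "total": 42.50}]

masked_customers = [{**c, "name": mask_value(c["name"], "NAME")} for c in customers]
masked_orders = [
    {**o, "customer_name": mask_value(o["customer_name"], "NAME")} for o in orders
]

# Both records still refer to the same (now masked) customer.
print(masked_customers[0]["name"] == masked_orders[0]["customer_name"])
```

Because the mapping is deterministic, repeated masking runs over refreshed data also produce the same pseudonyms, as long as the key does not change.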

Data Planning

Capture the E2E business process and the associated data for testing
How to select a subset of data? How do you ensure that the selected data is relevant?
Do we need 5x data for the stress environment?
If cloning or migration of production data to test environments is required, should we clone all of it or 60%? How often should migration or cloning take place?
How much does the production database change, and how much does the application change?

Subset production data from multiple data sources

Subsetting creates realistic test databases small enough to support rapid test runs, but large enough to reflect the variety of production data
Create test data to force error and boundary conditions
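Production data rarely contains values that sit exactly on the edge of a validation rule, which is why boundary data usually has to be created. A minimal sketch of the idea follows; the boundary_values helper and the is_valid_quantity rule (an order quantity between 1 and 100) are hypothetical examples, not rules taken from any real system.

```python
def boundary_values(minimum, maximum):
    """Return values just outside, on, and just inside an inclusive integer range."""
    return [minimum - 1, minimum, minimum + 1, maximum - 1, maximum, maximum + 1]

def is_valid_quantity(qty, minimum=1, maximum=100):
    """Hypothetical business rule under test: quantity must be within the range."""
    return minimum <= qty <= maximum

# Pair each generated test value with the outcome the rule produces for it.
cases = [(q, is_valid_quantity(q)) for q in boundary_values(1, 100)]
for qty, accepted in cases:
    print(qty, "accepted" if accepted else "rejected")
```

The two values outside the range (0 and 101) are the ones that force the error path, while the remaining four confirm that the edges themselves are accepted.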

Data Reuse

Have you labeled test data to correlate them to specific test cases?
Are test data labeled for release / build / cycle?
Can we categorize test data according to different testing stages like functional, stress?

Data Maintenance

What should the schedule and frequency of test data refreshes be?
What is your plan for storing the data?
How often is it migrated to the test environment?

Data Refresh

Accommodate changing test requirements
Is it possible to automate data refresh?

Data Auditing

Can you trace the workflow from end to end?
Can you analyze the data from audit logs, and is it fit for purpose?

Cleaning up the test environment after testing is complete

How and when should test data be cleaned up once testing is complete?
Are there any instances where altered test data cannot be cleaned up?

Automate test data result comparison

Automate identification of data anomalies and inconsistencies

Use of central repository with version control

What are some pros and cons of cloning production databases?
Pros: It is relatively simple to implement
Cons:

Expensive in terms of hardware, license and support cost
Time-consuming: Increases the time required to run test cases due to large data volumes
Not agile: Developers, testers and QA staff can’t refresh the test data
Inefficient: Developers and testers can’t create targeted test data sets for specific test cases or validate data after test runs
Not collaborative between DBA and testing teams
Not scalable across multiple data sources or applications
Laborious: Production systems are typically large
Risky: Nonproduction environments might be compromised or misused (developers, testers and QA staff need realistic data to do their jobs—but they do not have a valid business reason to access sensitive data such as corporate secrets, revenue projections or customer information)

What are some challenges of using Production Data in Test Environment (Production Cloning)?

Data security is one of the most crucial challenges, as production data can contain a lot of sensitive information such as real customer details and vendor names. It can be addressed by data masking
The data volume that needs to be dealt with is huge. 100K customers each doing 5 transactions per hour generate 500K transactions per hour, which adds up to 12 million transactional records over a 24-hour day. Just imagine the scale of data that needs to be loaded into the test environment. It can be addressed by data subsetting
Data can come from various sources like flat files, different relational databases, excel, etc. and can be in various formats. Maintaining data relationships and data integrity is another challenge
Production cloning might force the test environment to have production-like infrastructure, which means higher costs
The additional cost of storing production data (e.g. 50TB) in different test environments
Increased load time from production to the test environment leaves less time for real testing
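The volume estimate in the list above is simple multiplication, and the daily total depends on how many active hours per day are assumed. As a rough sketch, taking the customer count and the per-customer rate from the example and assuming (this is an assumption, not a measured figure) that activity is spread evenly over 24 hours:

```python
customers = 100_000
transactions_per_customer_per_hour = 5
hours_per_day = 24  # assumption: activity is spread evenly over a full day

per_hour = customers * transactions_per_customer_per_hour
per_day = per_hour * hours_per_day

print(f"{per_hour:,} transactions per hour")
print(f"{per_day:,} transactions per day")
```

Changing hours_per_day to match the real usage pattern of the system gives a more realistic figure for how much data a full clone would have to carry.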

What are some pros and cons of generating synthetic data?
Pros: Safe, because no real production data is involved
Cons:

Resource-intensive: Requires a huge commitment from highly skilled DBAs with deep knowledge of the underlying database schema, as well as knowledge of implicit relationships that might not be formally detailed in the schema
Tedious: DBAs must intentionally include errors and set boundary conditions within the synthetic data set to ensure a robust testing process, which adds time to the test data creation process
Challenging: Despite the time and effort put forth by the DBA to generate synthetic test data, testers find it challenging to work with because synthetic test data doesn’t always reflect the integrity of the original data set or retain the proper context
Time-consuming: Process is slower and can be error-prone

What are some pros and cons of Subsetting production databases?
Pros: Less expensive compared to cloning or generating synthetic test data
Cons: Skill-intensive: Without an automated solution, requires highly skilled resources to ensure referential integrity and protect sensitive data

What are some challenges in Data Subsetting?

Maintaining referential integrity is the biggest challenge. Imagine fetching only 100 customer order records out of 1 million without losing any context
Maintaining the data integrity of the subset. Imagine that customer records are in an Oracle database but customer order records are in SQL Server
Maintaining data relationships across multiple sources. For example, a vendor might provide a data feed in flat file format for all customers’ orders
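To make the referential integrity problem concrete, here is a small sketch using an in-memory SQLite database. The customers and orders tables, their columns and the subset_orders helper are hypothetical, and a single-database example like this sidesteps the much harder cross-source case described above. The point it shows is that a subset of child rows is only usable if the parent rows they reference are pulled along with them.

```python
import sqlite3

# Build a small stand-in for a production database (hypothetical schema).
prod = sqlite3.connect(":memory:")
prod.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL
    );
""")
prod.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 201)],
)
prod.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, (i % 200) + 1, i * 1.5) for i in range(1, 1001)],
)

def subset_orders(conn, limit):
    """Pick `limit` orders, then fetch exactly the customers those orders reference."""
    orders = conn.execute(
        "SELECT id, customer_id, amount FROM orders ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    customer_ids = sorted({row[1] for row in orders})
    placeholders = ",".join("?" * len(customer_ids))
    customers = conn.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders})", customer_ids
    ).fetchall()
    return customers, orders

customers, orders = subset_orders(prod, 100)
print(len(orders), "orders referencing", len(customers), "customers")
```

Starting from the child table and walking up to its parents is one possible direction; a real schema with many levels of relationships needs the same walk repeated along every foreign key.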

Key features of TDM tool
TDM is about automating the provisioning of masked and synthetically generated data to meet the needs of test, development and QA teams. TDM is needed to minimize the risk of a data breach, and it helps in using production data safely in test or development environments. TDM can be deployed on premises, in the cloud, or in hybrid cloud configurations. Some of the tools in the TDM space are Datamaker, Optim and HP TDM. The key features of a TDM tool should be:

Automatic discovery of sensitive data (locations) across databases
Ability to create synthetic data where production data can’t be used or doesn’t exist
Ability to connect to distributed databases
Functionality that conformance and compliance teams are able to verify
Capability of masking data in place or while copying to a test, support or outsourced environment
Provisioning for smaller data set requirements
Support for packaged applications
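The first feature in this list, automatic discovery of sensitive data, can be approximated by sampling column values and matching them against known patterns. The sketch below is a simplified illustration and does not describe how any of the tools named above work: the two regular expressions, the 80% threshold and the discover_sensitive_columns helper are all assumptions chosen for the example.

```python
import re

# Hypothetical patterns; a real tool would ship a much larger, tuned set.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def discover_sensitive_columns(rows, threshold=0.8):
    """Flag columns where at least `threshold` of the sampled values match a pattern."""
    findings = {}
    if not rows:
        return findings
    for column in rows[0]:
        values = [str(r[column]) for r in rows if r[column] is not None]
        if not values:
            continue
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                findings[column] = label
    return findings

# Sampled rows from a hypothetical table; column names give no hint of the content.
sample = [
    {"id": 1, "contact": "a@example.com", "tax_ref": "123-45-6789"},
    {"id": 2, "contact": "b@example.org", "tax_ref": "987-65-4321"},
]
print(discover_sensitive_columns(sample))
```

Matching on values rather than column names is deliberate here, since sensitive data often hides in columns with generic names such as contact or tax_ref.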
