Already read this? Check out Part 2 to discover how to isolate errors in your conversion pipeline.
Is Data Certainty Possible?
How can there be certainty about data integrity during the conversion?
It is difficult, if not impossible, to assemble an accurate picture of data integrity once the conversion has already completed. Consequently, the ideal approach treats the conversion itself as an exercise in maintaining data integrity.
At each step of the data conversion process, think about how to validate the result. If an operation produces a non-deterministic result, or if its validation requirements are overly complex, break the operation into smaller, more manageable parts. The conversion should implement intermediate validations.
Conversions are Complex
Content migrations are challenging. A typical conversion project involves more content than a team can inspect within a reasonable timeframe; a single person could not hope to review it all in a lifetime. Moreover, the transient nature of data at volume obstructs granular comparisons between systems. Additional factors make it difficult to begin evaluating data integrity only at the end of conversion processing, such as:
- Data transformations
- Business rules
- Heterogeneous systems
- Data exclusions
- Complex data relationships
- Parent-child data
- Changing requirements
Remedy? Intermediate Validations
Conversions are a form of pipeline processing where data flows through a series of interconnected steps. Each step receives as input the output of the previous step, executes a set of instructions, and passes its output to the next step. Inline tests verify the results of each instruction set, ensuring the data matches expectations before it is passed to a subsequent step. Data that fails a validation can be set aside, or the conversion pipeline can be configured to accommodate expected error conditions (I’ll cover this as a separate topic).
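The flow above can be sketched in a few lines. This is a minimal illustration, not a specific library: the step, the validation checks, and the field names are all assumptions made for the example.

```python
# Minimal sketch of a pipeline step wrapped in inline validations.
# All step and field names here are illustrative assumptions.

def validate(records, check, failures):
    """Split records into those that pass a check; set the rest aside."""
    passed = []
    for record in records:
        if check(record):
            passed.append(record)
        else:
            failures.append(record)
    return passed

def normalize_dates(records):
    # Example transformation step: normalize a date field's separators.
    return [{**r, "created": r["created"].replace("/", "-")} for r in records]

failures = []
records = [
    {"id": 1, "created": "2023/01/15"},
    {"id": 2, "created": ""},  # fails the input test and is set aside
]

# Input test: the date element must contain a value before transforming.
records = validate(records, lambda r: bool(r["created"]), failures)
records = normalize_dates(records)
# Output test: the transformed value matches the expected shape.
records = validate(records, lambda r: r["created"].count("-") == 2, failures)

print(len(records), len(failures))  # → 1 1
```

Because each validation runs between steps, a failed record never reaches the next step, and the failure is attributable to the step that produced it.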
Tests in the pipeline can be applied to data either at the input or output of a pipeline step. Tests can also verify business logic.
Business logic tests validate assumptions about the data. For example:
- Isolate target data created in the last N years.
- Check that the item matches a defined mapping rule.
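The two rules above could be expressed as simple predicates. The field names (`created_year`, `doc_type`) and the mapping table below are hypothetical, chosen only to make the example concrete.

```python
from datetime import date

# Hypothetical business-rule checks; field names and the mapping
# table are assumptions for illustration.
TYPE_MAPPING = {"invoice": "billing", "memo": "correspondence"}

def within_last_n_years(record, n=5):
    """Isolate target data created in the last N years."""
    return record["created_year"] >= date.today().year - n

def matches_mapping_rule(record):
    """Check that the item matches a defined mapping rule."""
    return record["doc_type"] in TYPE_MAPPING

record = {"created_year": 2024, "doc_type": "invoice"}
print(matches_mapping_rule(record))  # → True
```

Keeping each rule in its own small function makes it easy to run the same checks at multiple points in the pipeline.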
Input tests check data prior to each step in the conversion pipeline. For example:
- Verify the item meets threshold requirements for the step.
- Verify the data value is formatted appropriately before applying the transformation.
- Check for reprocessing and handle subsequent processing correctly.
- Inspect date elements for unexpected values.
- Verify that all required elements contain a value and conform to formatting rules.
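Several of these input tests can be gathered into a single pre-step check. The required fields, date format, and reprocessing tracking below are assumptions made for the sketch:

```python
from datetime import date

# Hypothetical input checks run before a pipeline step; field
# names and the ISO date format are assumptions for illustration.
REQUIRED = ("id", "title", "created")

def input_checks(record, processed_ids):
    """Return a list of problems found before the step runs."""
    errors = []
    # All required elements must contain a value.
    for field in REQUIRED:
        if not record.get(field):
            errors.append(f"missing {field}")
    # Inspect the date element for unexpected values.
    if record.get("created"):
        try:
            date.fromisoformat(record["created"])
        except ValueError:
            errors.append("invalid date")
    # Detect reprocessing so it can be handled correctly.
    if record.get("id") in processed_ids:
        errors.append("already processed")
    return errors

print(input_checks({"id": 7, "title": "Memo", "created": "2023-13-45"}, set()))
# → ['invalid date']
```

A record that returns a non-empty error list is set aside before the step executes, so the step only ever sees data that meets its preconditions.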
Output tests check the results of an operation. For example:
- Verify that the number of affected items matches the expected amount based on input and reference data.
- Confirm that each item references the correct number of children.
- Verify that default values exist only where necessary.
- Identify null or empty elements where values are expected.
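These output tests follow the same pattern as the input tests, but run against a step's results. The result shape, `child_count` field, and expected counts below are hypothetical:

```python
# Hypothetical output checks run after a step completes; the result
# structure and field names are assumptions for illustration.

def output_checks(result, expected_count):
    """Return a list of problems found in a step's output."""
    errors = []
    # The number of affected items matches the expected amount.
    if len(result["items"]) != expected_count:
        errors.append("unexpected item count")
    for item in result["items"]:
        # The item references the correct number of children.
        if len(item["children"]) != item["child_count"]:
            errors.append(f"item {item['id']}: child count mismatch")
        # Identify null or empty elements where values are expected.
        if not item.get("title"):
            errors.append(f"item {item['id']}: empty title")
    return errors

result = {"items": [
    {"id": 1, "title": "Case A", "child_count": 2, "children": ["c1", "c2"]},
    {"id": 2, "title": "", "child_count": 1, "children": []},
]}
print(output_checks(result, expected_count=2))
# → ['item 2: child count mismatch', 'item 2: empty title']
```

Note that the expected count comes from outside the step (the input and reference data), so the output test catches items the step silently dropped or duplicated.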
Confidence in the Conversion Pipeline
The conversion pipeline is a complex process with steps often too numerous to monitor manually. Intermediate validations help ensure data integrity by checking the data at each step. Used in conjunction with other methods, they create confidence in the conversion product.
Keep a lookout for the next post in Converting Data with Integrity as we discuss how isolation is crucial for transformation.