Read Part 1, which covers data complexity and intermediate validations.
If intermediate validations built into the conversion pipeline help preserve data integrity, how does one put a pipeline together?
It is situational. The conversion pipeline is assembled based on requirements, which vary with every project. Though requirements are unique to each implementation, the approach to accommodating them is consistent. Moreover, many requirements recur across projects, so standardizing wherever possible reduces both implementation time and the likelihood of human error.
Consider a scenario wherein the database contains references to files on a network share. File paths are not stored explicitly; instead, each path is calculated from specific column values. The requirement states that for each path calculated from the database, the referenced file should be copied elsewhere in preparation for a subsequent pipeline flow.
The logical result might resemble:
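One way to picture that naive flow in code is a single copy step with one catch-all error check. This is a minimal sketch only; the helper name `build_path` and the column names are assumptions for illustration, not the actual implementation:

```python
import shutil
from pathlib import Path

def build_path(share_root: str, row: dict) -> Path:
    # Hypothetical path calculation from column values (assumed for illustration).
    return Path(share_root) / row["folder"] / f"{row['doc_id']}.{row['ext']}"

def naive_copy(rows, share_root, target_dir, errors):
    for row in rows:
        try:
            source = build_path(share_root, row)
            shutil.copy2(source, target_dir)
        except Exception:
            # One catch-all bucket: a path-calculation error and a failed
            # copy are indistinguishable after the fact.
            errors.append(row)
```

Note that any failure, whatever its cause, lands in the same undifferentiated bucket.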
An intermediate validation to check for an error is great. Would the file copy be the only point in the process that could generate an error? Is the failure to copy the file the only type of error that could occur? Could there be an error when calculating the file path? If so, does this flow allow easy distinction? Does a non-error necessarily mean that the file was deposited at the target location?
The pipeline flow itself does not answer questions like these, yet such questions will arise. This pipeline does not adequately validate the processing step, nor does it make the job of proving data integrity any easier.
Granular tracking of pipeline processes, especially at volume, affords confidence that the job was done well. Isolating data by error condition promotes clarity about the status of conversion content and eliminates much of the effort of exception resolution. A more complete pipeline flow leveraging the concept of error isolation could look like:
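The error-isolating version of the flow above can be sketched as a sequence of small steps, each routing failures into its own bucket. The bucket names and column names here are assumptions for illustration:

```python
import shutil
from pathlib import Path

def isolating_copy(rows, share_root, target_dir):
    # One bucket per error condition, plus one for fully validated items.
    buckets = {"bad_columns": [], "missing_source": [],
               "copy_failed": [], "not_at_target": [], "done": []}
    for row in rows:
        # Validate the column values the path calculation depends on.
        if not all(row.get(k) for k in ("folder", "doc_id", "ext")):
            buckets["bad_columns"].append(row)
            continue
        source = Path(share_root) / row["folder"] / f"{row['doc_id']}.{row['ext']}"
        # Validate that the calculated path points at a real file.
        if not source.is_file():
            buckets["missing_source"].append(row)
            continue
        try:
            shutil.copy2(source, target_dir)
        except OSError:
            buckets["copy_failed"].append(row)
            continue
        # Confirm the file actually landed at the target location.
        if not (Path(target_dir) / source.name).is_file():
            buckets["not_at_target"].append(row)
            continue
        buckets["done"].append(row)
    return buckets
```

An item in `missing_source` now tells a very different story from one in `copy_failed`, and an item in `done` has passed every check along the way.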
This pipeline flow breaks the requirement down into pieces. Each piece includes an intermediate validation that results in either the isolation of an error condition or the application of a subsequent operation. Distinguishing an item with one type of problem from an item with another becomes obvious. Items that reach the end can be trusted, having passed the various validations along the way.
The form of each conversion pipeline is a response to business requirements and facts discovered among the data. Pipeline flows vary in number and complexity from one conversion to the next. Breaking down each requirement into a set of steps that can be easily tested and clearly interpreted is a key attribute of converting data with integrity.
Stay tuned for Part 3.