To avoid this issue, Target decided to fill the Canada SAP systems with entirely new data rather than converting data from their US databases. Unfortunately for Target, this strategy did not pay off and led to many of the well-publicized failings that shuttered its Canadian stores for good.
Why was Target Canada so afraid of data conversion?
Done correctly, conversion allows for a smoother cutover between old and new systems – no telling customers they must recreate their online order accounts, for example – and saves countless labor hours that would otherwise be spent populating data by hand. For many companies, especially in retail, conversion is the only sensible choice. But despite best-laid plans, many implementations face uncertainty in the data conversion process. Late changes to functional requirements, large data volumes, and complicated data dependencies all present opportunities for data headaches leading up to and beyond go-live. And given the critical role data plays in modern business processes, a small conversion flaw can have a cascading effect that brings an entire system to its knees. This is why the best ERP implementations take a strong stance on ensuring the completeness and accuracy of data conversion. However, not all data validation strategies are created equal.
When “Good” isn’t Good Enough
Currently, many ERP implementations perform data validation with statistical sampling, where a subset of the total data volume is randomly selected and checked for accuracy. The primary advantage of this method is assumed to be time – it is faster to compare 5,000 records than 500,000. But for companies with exacting data quality requirements, this method falls flat. Questions remain about how to group the data for sampling and how many samples must be taken to detect problems. Furthermore, if statistical sampling indicates that there are issues with data conversion, a team of analysts still has to search through the entire database to find flawed records and fix them. Laborious as this sounds, it is often preferable to running the entire conversion process over again.
Checking the sample itself can also be a lengthy process. Typically, the sample data from the legacy system and the target system are loaded into Excel and compared using VLOOKUPs. This ties up project resources, is prone to errors, and must be repeated as many times as there are conversions (both test runs and “for real”). The size of data that Excel can validate is also limited – using VLOOKUPs on datasets much larger than a reasonable sample is impossible due to limits on the amount of memory Excel can use and hard caps on the number of rows a worksheet can contain. And all this is before considering the logistics of organizing the hundreds of spreadsheets involved.
Some companies choose to go with commercial data conversion solutions that include data validation as part of their functionality. Products like WinShuttle and BusinessObjects Data Services certainly make the data conversion process much smoother. However, these programs can be costly – both licensing and specialized resources are needed to support their implementation and use. Such an option may prove to be well worth the investment, especially for projects with complex, high-volume data conversion requirements. But for many projects needing a data validation solution, these products are overkill.
A Better Way
On a recent project, Excel validation had been the strategy of choice. Employees resented having to manipulate Excel workbooks for days at a time in order to get results of questionable value. The client needed something fast and accurate, but without the complexity and expense of a proprietary solution.
I reasoned that in this era of in-memory computing, statistical sampling was no longer necessary – if I took a programmatic approach and leveraged big data processing techniques, I could check every record quickly and provide far more granular results than were previously available. This strategy would eliminate any uncertainty about data quality and give the conversion team detailed diagnostic information. Altogether, programmatic data validation would eliminate hundreds of hours spent on Excel validation, reduce the time the conversion team spent hunting down errors, and provide absolute certainty about the data the client would be going live with.
Funny Name, Powerful Tool
To accomplish this task, I selected the Python language and Wes McKinney’s excellent Pandas library. Both are free, open-source software. The combination of the two allows for simple, easy-to-understand code that leverages high-performance, in-memory computing strategies to process large volumes of data.
The first step in the script was to import the data. This project used .csv flat files to move data between the legacy and SAP systems. With a single line of code, I loaded the client’s entire legacy article master – over half a million records with close to 100 fields each – into memory. I then repeated the same step for the client’s article master in SAP. Total runtime: less than a minute.
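The loading step looks roughly like the following sketch. The file names, columns, and sample rows here are invented stand-ins, since the article does not show the actual extracts; the real project read half a million records from .csv files on disk with the same call.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real legacy extract file.
# In the actual project this would be pd.read_csv("legacy_extract.csv", ...).
legacy_csv = io.StringIO(
    "ARTICLE,DESCRIPTION,CREATED\n"
    "1001,Widget,01/15/2014\n"
    "1002,Gadget,03/02/2014\n"
)

# dtype=str preserves leading zeros in article numbers and codes;
# keep_default_na=False keeps empty strings as "" instead of NaN,
# which simplifies the later field-by-field comparison.
legacy = pd.read_csv(legacy_csv, dtype=str, keep_default_na=False)

print(legacy.shape)  # (2, 3)
```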
Comparing the two datasets at this point would have been an exercise in futility – the legacy data format was different from the data format used by SAP. In order to compare them, I had to transform the data. Thankfully, in-memory data transformations are extremely fast. Pandas makes them even easier by removing a lot of the dirty work involved – simply point it to what you want changed and it does the rest.
For example: The client’s legacy systems had dates in the format MM/DD/YYYY whereas SAP stored dates in the format YYYYMMDD. I told Pandas to split the legacy date string by the “/” character, giving me three groups of numbers. I then told it to reassemble these groups of numbers into a new string in the required order. One line of code applied this entire change to hundreds of thousands of records simultaneously. Total runtime: two minutes.
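The date transformation described above can be sketched as a single vectorized expression. The sample dates are made up for illustration:

```python
import pandas as pd

# Legacy dates in MM/DD/YYYY format (illustrative values).
dates = pd.Series(["01/15/2014", "03/02/2014", "12/31/2013"])

# Split each string on "/" into three columns (0=MM, 1=DD, 2=YYYY),
# then reassemble them in SAP's YYYYMMDD order.
parts = dates.str.split("/", expand=True)
sap_dates = parts[2] + parts[0] + parts[1]

print(sap_dates.tolist())  # ['20140115', '20140302', '20131231']
```

Because pandas applies the string operations to the whole column at once, the same expression scales to hundreds of thousands of records without an explicit loop.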
I coded the dataset comparison with an eye to the results that would be of the most use to the business and to the conversion team. I decided on a three-tiered matching system based on the primary key of each record and its non-primary fields.
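One plausible reading of a key-plus-fields tiered match is: (1) key present in both systems with all fields equal, (2) key present in both but one or more fields differing, (3) key missing from one side entirely. A minimal sketch of that scheme with pandas, using invented data and column names:

```python
import pandas as pd

# Illustrative datasets keyed on ARTICLE; the real comparison ran over
# the full article master with ~100 fields per record.
legacy = pd.DataFrame({"ARTICLE": ["1001", "1002", "1003"],
                       "DESC": ["Widget", "Gadget", "Gizmo"]})
sap = pd.DataFrame({"ARTICLE": ["1001", "1002", "1004"],
                    "DESC": ["Widget", "Gadget XL", "Doodad"]})

# Outer merge on the primary key; indicator=True labels each row as
# "both", "left_only", or "right_only".
merged = legacy.merge(sap, on="ARTICLE", how="outer",
                      suffixes=("_legacy", "_sap"), indicator=True)

# Tier 3: key missing from one system.
key_missing = merged[merged["_merge"] != "both"]

# Tiers 1 and 2: key found in both; compare the non-key fields.
matched_keys = merged[merged["_merge"] == "both"]
field_mismatch = matched_keys[matched_keys["DESC_legacy"] != matched_keys["DESC_sap"]]
full_match = matched_keys[matched_keys["DESC_legacy"] == matched_keys["DESC_sap"]]

print(len(full_match), len(field_mismatch), len(key_missing))  # 1 1 2
```

The mismatch rows carry both the legacy and SAP values side by side, which is exactly the diagnostic detail a conversion team needs to track down a flawed record.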