Data Staging

Staging high-quality data starts with authenticating and verifying which data is correct, then replacing erroneous attributes with values that are right and “clean.”

Cleaning data means discovering, authenticating, and verifying correct, clean, and complete data to replace what is incorrect, dirty, and incomplete. The merging process itself is a separate topic not addressed in this document.

Paradata excels at proving what is real, so it can merge clean, complete, and correct data into the master data record for that entity type and increase overall dataset quality.

Why you should care

Customers come to Paradata with piles of data more awful than you could imagine, and we love it because it shows off the value of our Trusted Discovery Technology in solving hard problems.

In plain English, this technology is about producing high-quality data and getting rid of your mess once and for all. To make it concrete: one of our enterprise customers recently came to us with the challenge of mapping 35,000 suppliers within a dataset of over 6,000,000 suppliers. We tend to think of that as 6,000,000 right-hand black gloves having to be matched to 35,000 unique left-hand black gloves. That would be frustrating if you ran a glove company and had to fill an order for complete glove pairs, right?

Our customer’s supplier data was continuously corrupted because all of their operating units had data entry problems, partial information, incorrect information, and a host of other issues across all SAP instances. One of the biggest problems in this huge mess was that suppliers that had been removed from the approved vendor list were able to change their names and addresses and make their way back onto the approved vendor list. Obviously, not good.

We fixed that.

The science behind it

Paradata knows what is real and true about any master entity. Because of this knowledge, we can merge clean, complete, and correct data into existing record sets and increase the dataset’s overall quality. After the merge completes, Trusted Discovery Technology automatically continues to increase the Authenticity Quotient (AQ) for each attribute and the overall Scaled Authenticity Quotient (SAQ) until they asymptotically approach perfection.

Paradata authenticates and verifies each attribute of the master data profile for an entity. In the next illustration, the reader sees data sources polled about the entity; the raw data is then munged into a decisioneering process where attributes are extracted and persona engineering commences, using a variety of statistical and stochastic models together with other proprietary heuristics. Inside the decisioneering process, master data contributors are classified, modeled, and simulated into an ideal form of the master entity type profile. Next, Paradata correlates all contributors discovered and verified. Those results are combined with intelligently derived causality results to harden the master entity type profile through a blind-annealing methodology until the result is “true and correct.” Finally, a verdict on the quality result is reached: fully passed or fully failed, with a performance guarantee expressed in the form of a Scaled Authenticity Quotient.
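To make the flow concrete, here is a minimal Python sketch of the staging stages described above. The function names (poll_sources, extract_attributes, decide) and the simple majority-vote decision rule are illustrative assumptions for this document, not Paradata’s actual API or models.

    from collections import Counter

    def poll_sources(entity_id, sources):
        """Gather raw records about the entity from every contributing source."""
        return [record for source in sources for record in source.get(entity_id, [])]

    def extract_attributes(raw_records):
        """Tally candidate values for each attribute across the raw records."""
        candidates = {}
        for record in raw_records:
            for attr, value in record.items():
                candidates.setdefault(attr, Counter())[value] += 1
        return candidates

    def decide(candidates):
        """Keep the most frequently attested value per attribute -- a simple
        stand-in for the statistical models and heuristics described above."""
        return {attr: votes.most_common(1)[0][0] for attr, votes in candidates.items()}

    # Usage: three sources disagree about one supplier; the pipeline settles it.
    sources = [
        {"S-100": [{"name": "Acme Corp", "city": "Austin"}]},
        {"S-100": [{"name": "ACME Corporation", "city": "Austin"}]},
        {"S-100": [{"name": "Acme Corp", "city": "Austn"}]},  # typo in one source
    ]
    profile = decide(extract_attributes(poll_sources("S-100", sources)))
    print(profile)  # {'name': 'Acme Corp', 'city': 'Austin'}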

Now, let’s keep on keeping it real. For example, imagine every first name instance of an entity type Person. A specific person has many instances of their name, because we have no control over how people choose to save us in their separate mobile phone address books. Blue, Slinker, Scott, Scott My Love, Dad, Hey Blue...but which name is correct? Well, it depends on who is asking. These are all correct first names, depending on how you are known to one another and the intent of that association. Neighbors might know me as Scott. Professional baseball umpires may know me as Hey Blue...and so on.
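As a concrete sketch, the same person’s first name can legitimately carry a different value per contributor. The entries below are hypothetical address-book instances used only to illustrate the point.

    # Hypothetical address-book entries contributed by different people.
    # Every value is a "correct" first-name instance in the context of that
    # relationship; none is wrong, they simply differ by association and intent.
    name_instances = {
        "neighbor":       "Scott",
        "daughter":       "Dad",
        "spouse":         "Scott My Love",
        "umpire_crew":    "Hey Blue",
        "work_directory": "Slinker",
    }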

Below is a diagram showing Scott Slinker (in the middle) with a self-organized mapping of 5 persons, each of whom has been separately authenticated, verified, and associated using Trusted Discovery Technology. The SAQ of each juror is further modified by common field data they may share with the target, communications log details, and location-based service proximity data. Some persons in the jury are logically “closer” to Scott than others, hence the varying link distances illustrated by the red lines. Scott’s SAQ is computed by weighting the SAQ of every linked profile by the strength of its link.
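A minimal sketch of that juror-weighted computation, assuming a simple inverse-distance weighting; the text says each link modifies the juror’s contribution but does not specify the exact scheme, so the numbers and weighting below are illustrative only.

    # Sketch of the juror-weighted SAQ computation described above, assuming
    # an inverse-distance weighting: logically "closer" jurors count more.
    jurors = [
        # (juror SAQ, link distance to Scott; smaller = closer)
        (99.2, 0.10),
        (98.7, 0.15),
        (97.5, 0.30),
        (96.1, 0.55),
        (94.8, 0.80),
    ]

    weights = [1.0 / distance for _, distance in jurors]
    scott_saq = sum(saq * w for (saq, _), w in zip(jurors, weights)) / sum(weights)
    print(round(scott_saq, 4))  # a distance-weighted blend of the five jurors' SAQs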

SIDE BAR: The authenticity quotient for an attribute (AQ) is mathematically the frequency of the attribute discovery multiplied by the reliability of the contributing source, as illustrated below.

AQ = F × RCS

The Authenticity Quotient of each attribute is computed, and the overall entity record receives a Scaled Authenticity Quotient: the scaled sum of the AQs of all attributes for the entity. It is scaled because not all attributes are weighted equally. For example, a prefix such as Ms. or Mrs. may not be weighted as heavily as a mobile number or e-mail address, because on its own a prefix does not offer a great deal of specificity.
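Here is a small worked example of the sidebar’s arithmetic; the discovery frequencies, source reliabilities, and importance weights are hypothetical values chosen only to show how AQ = F × RCS rolls up into a weight-scaled SAQ.

    # Worked example of the sidebar's arithmetic. The frequencies, source
    # reliabilities, and importance weights below are hypothetical.

    def authenticity_quotient(frequency, source_reliability):
        # AQ = F x RCS
        return frequency * source_reliability

    attributes = {
        # attribute: (discovery frequency F, source reliability RCS, weight)
        "mobile_number": (0.95, 0.98, 0.40),
        "email":         (0.90, 0.95, 0.35),
        "first_name":    (0.85, 0.90, 0.20),
        "prefix":        (0.60, 0.70, 0.05),  # low weight: little specificity alone
    }

    # SAQ: the weight-scaled sum of every attribute's AQ.
    saq = sum(w * authenticity_quotient(f, rcs) for f, rcs, w in attributes.values())
    print(round(saq, 4))  # weights sum to 1.0, so the SAQ stays on the AQ scale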

Let’s look at Scott’s dataset quality from another point of view: a much larger self-organizing map of 65 persons, with each contributor plotted by our new social graphing function, revealing the available authenticity quotient and the applied authenticity quotient for each “juror.”

This self-organizing map is computed for people in Scott’s community, but the same techniques apply to other entity types such as things, locations, and events (anything with a temporal characteristic).

What the reader is seeing is a new type of social graph of people who know Scott and have a measurable, active, and intimate relationship with him. The graph is force-ranked and sorted. At the 12:01 clock position is the person who knows Scott best. At the 11:59 clock position is the person who knows Scott least (relative to the other persons in the dataset). All of these people are separate and independent contributors of data related to Scott, with the validity of their contributed data correlated against other independently derived data.
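A rough sketch of that force-ranking, assuming an intimacy score built from call/text volume and relationship duration; the text names these signals but not the actual formula, so the scoring and contributor data below are illustrative.

    # Sketch of the force-ranked "clock" layout. The intimacy score combining
    # call/text volume and years known is an assumed formula for illustration.
    contributors = [
        # (name, calls, texts, years known)
        ("Juror A", 320, 1500, 12),
        ("Juror B",  40,  200,  3),
        ("Juror C", 110,  650,  7),
        ("Juror D",   5,   30,  1),
    ]

    def intimacy(calls, texts, years):
        return calls + texts + 100 * years  # assumed scoring

    # Rank from best known (12:01 position) to least known (11:59 position).
    ranked = sorted(contributors, key=lambda c: intimacy(*c[1:]), reverse=True)
    for i, (name, *_rest) in enumerate(ranked):
        angle = 360 * (i + 0.5) / len(ranked)  # sweep clockwise from 12 o'clock
        print(f"{name}: {angle:.0f} degrees past 12 o'clock")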

Mathematically, the Scaled Authenticity Quotient across all contributor attributes for Scott’s cluster of friends is > 98.1804, which is exceedingly high. The number attests to how well this specific grouping of people knows Scott as he is actually known, for example his real first and last name versus any nicknames or other variants. Because they are reliable and they attest to the frequency with which Scott is really known, they are credible witnesses and adjudication principals who can verify who Scott really is. To fake new data about Scott, a fraudster would have to replicate the intimacy (e.g., knowledge quantity and quality), the mediums (e.g., calls, texts, emails), and the interaction patterns (e.g., directionality and duration) for every juror in Scott’s social graph. Before Paradata, a fraudster could spoof Scott’s identity quite easily, but now they would have to spoof Scott’s entire social community and know how it changes every day. The probability is astronomically low.

The point is that the data quality for Scott is very high because of the jury: the knowledge and intimacy computed about each juror make their ability to adjudicate a quality verdict on Scott’s data extremely reliable and scalable.