Structuring Big Data

Customers come to Paradata because they need to store, process, and analyze enormous volumes of structured and unstructured data, and to exploit that data for insights into their business.

Our customers usually face two options: 1) building their own in-house capabilities and attracting highly talented personnel to invent everything themselves, or 2) leveraging a vendor’s intellectual property, business model, and infrastructure, and exploiting scalable customer-proprietary and public datasets.

Complicating the customer’s decision even more is the need to implement a complex enterprise-class application that leverages all this trustworthy data. As Dr. Bruce Lindsay, Paradata’s Chief Data Scientist, is fond of stating, “It's not what data you have, it's what you do with it that extracts its value.”

The science behind it

Paradata not only provides the platform for structuring large datasets; its Trusted Discovery Technology also draws correlations and detects causality across large, distributed datasets. The illustration below shows how the platform is architected.

At the far left, the harvesters use dataset configuration inputs to discover and return raw data, which is then housed in the T1 harvest silos. The next illustration shows the Discovery Engine Optimizer’s inputs and outputs. Once configured, the harvesters are “trained” and ready for Discovery: they poll and discover raw data from Organic (public) and Custom (private) sources, then format it into attribute groups that are stored in the harvest silos.
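As a rough sketch of this ingest step (every name below is hypothetical; Paradata’s actual harvester APIs are not described here), a harvester takes a dataset configuration, polls a source for raw records, and emits attribute groups bound for a T1 silo:

```python
from dataclasses import dataclass

@dataclass
class DatasetConfig:
    """Hypothetical configuration a harvester is 'trained' with."""
    source_url: str
    source_kind: str   # "organic" (public) or "custom" (private)
    attributes: list   # attribute names to discover per entity

def harvest(config: DatasetConfig, raw_records: list) -> list:
    """Format raw records into attribute groups for a T1 harvest silo."""
    silo = []
    for record in raw_records:
        # Keep only the attributes the configuration asks for
        group = {name: record.get(name) for name in config.attributes}
        group["_source"] = config.source_kind  # provenance tag
        silo.append(group)
    return silo

config = DatasetConfig("https://example.com/data", "organic", ["name", "sku"])
print(harvest(config, [{"name": "Widget", "sku": "W-1", "extra": 9}]))
# → [{'name': 'Widget', 'sku': 'W-1', '_source': 'organic'}]
```

The key design point the sketch illustrates is that the configuration, not the harvester code, determines which attributes are discovered and how entity records are shaped.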

Next, the cognitive authenticity engines (hereafter referred to as Trusted Discovery Technologies) complete, correct, clean, authenticate, and verify each attribute discovered about the harvested entities. The curated entity attributes are then integrated into a structured dataset, ready to be accessed and utilized by real-time applications. From here, users access the Update Sync Engines through the Paradata Portal to look inside the T3 database(s), where they review, accept (or reject), and import new curated attributes directly into their applications. At the far right, T4 is a protected database that houses private customer data and is logically partitioned from other customers’ data.
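The curation stages named above can be pictured as a pipeline of passes over each harvested entity. The sketch below uses invented rules and function names purely for illustration; the actual Trusted Discovery Technology internals are not described here:

```python
def clean(value):
    """Illustrative cleaning pass: normalize whitespace and casing."""
    return value.strip().title() if isinstance(value, str) else value

def verify(entity):
    """Illustrative verification rule: require the core attributes."""
    return all(entity.get(k) is not None for k in ("name", "sku"))

def curate(raw_entities):
    """Hypothetical curation pass: clean every attribute, then keep
    only the entities that verify successfully."""
    curated = []
    for entity in raw_entities:
        cleaned = {k: clean(v) for k, v in entity.items()}
        if verify(cleaned):
            curated.append(cleaned)
    return curated

print(curate([{"name": "  widget  ", "sku": "w-1"},
              {"name": "gadget", "sku": None}]))
# → [{'name': 'Widget', 'sku': 'W-1'}]
```

In this sketch the second entity is dropped at the verify step, which mirrors the accept/reject decision users make before importing curated attributes into their applications.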

Discovery Engine Optimizer

Below is a screenshot of the Discovery Engine Optimizer. Its purpose is to “train” harvesters to parse public websites and private databases. Today, access to the DEO is limited to Paradata Analysts; in the future, access will extend to Subscribers as the interface is made more intuitive.

DEO Inputs

The Discovery Engine Optimizer captures critical configuration inputs from Users and Paradata Analysts working together. These inputs “instruct” and “train” our harvesters on where to discover attributes and how entity records are to be structured. We call this stage “ingest” because it discovers and loads new raw data into the platform for advanced machine learning.
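At this ingest stage, a DEO input essentially pairs a source with parsing instructions and a target record structure. A minimal, hypothetical representation (the real DEO input schema is internal to Paradata) might look like:

```python
# Hypothetical DEO configuration: all keys and selectors are invented
deo_input = {
    "source": "https://example.com/catalog",   # where to discover attributes
    "selectors": {                             # how to parse the source
        "name": "h1.product-title",
        "price": "span.price",
    },
    "entity_schema": ["name", "price"],        # how entity records are structured
}

def validate_deo_input(cfg: dict) -> bool:
    """Check that every schema attribute has a parsing rule (illustrative)."""
    return all(attr in cfg["selectors"] for attr in cfg["entity_schema"])

print(validate_deo_input(deo_input))  # → True
```

A check like this, run when Analysts and Users finish configuring a harvester, would catch a schema attribute that has no parsing rule before the harvester is put into service.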