Aggregate

Data Aggregation

It sounds trivial, but in fact, aggregating relevant data is a rather complex task. Usually, such data is distributed over several different servers, storage systems, and databases, and is stored in incompatible file formats and different database structures. To make use of the data, it must all be accessible in a single system such as Athinia®, in standardized file and database formats.
Athinia uses Palantir HyperAuto, also known as Software-Defined Data Integration (SDDI): a suite of capabilities designed to provide end-to-end data integration out of the box on top of the most common, mission-critical systems at organizations. It empowers you to autonomously create valuable workflows with ERP, CRM, and other organization-critical data. You can sync, integrate, and structure data to immediately build new workflows on top of it.
Systems supported are:
  • SAP
  • Salesforce
  • Oracle NetSuite
  • HubSpot

HyperAuto consists of three components designed to integrate data from raw sources to ontology with minimal effort:

  • Connectors enable transfer of large-scale data in a secure and optimized way from and to source systems.
  • Source exploration allows rapid, guided data discovery and a "shopping cart" experience for bulk data sync creation and configuration (sketched after this list).
  • Automatic pipeline generation transforms raw data into curated Foundry datasets and object types in the Ontology using automatically generated data pipelines.
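
As a purely illustrative sketch of the "shopping cart" style bulk sync described above, the Python snippet below shows what such a selection could conceptually look like. The structure, table names, and options are assumptions made for illustration, not the actual HyperAuto or Foundry configuration format.

    # Hypothetical bulk-sync selection; not the real HyperAuto/Foundry format.
    # Table names are examples of common SAP tables (material master, sales orders).
    sap_sync_request = {
        "source": "SAP",                     # one of the supported source systems
        "tables": ["MARA", "VBAK", "VBAP"],  # tables added to the "shopping cart"
        "schedule": "daily",                 # how often the connector pulls data
        "incremental": True,                 # transfer only new or changed rows
    }

    for table in sap_sync_request["tables"]:
        print(f"queue sync: {sap_sync_request['source']}.{table}")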

Pipeline Generation

Automatic pipeline generation creates out-of-the-box data pipelines for integrating common source systems. These pipelines prepare the data so that it can be used by ontologies and workflows. Because pipeline generation ships with embedded knowledge about each source system, using this feature increases efficiency and removes the need to fully understand the intricacies of each underlying system.

Generated pipelines include four major steps:

  • Preprocessor (source-specific)
  • Cleaning libraries
  • Core generator (generic)
  • Derived elements (cross-source)

[Figure: SDDI Automatic Pipeline Generator]
Source-specific preprocessing generates a set of metadata datasets with a pre-defined schema. These metadata datasets contain the information needed to understand data from the source system. Cleaning libraries apply standardized data cleaning steps to all datasets, ensuring that best practices are followed for every piece of data flowing into the system.
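
As a minimal sketch (using pandas) of the kind of standardized cleaning such a library might apply, the snippet below trims strings, normalizes empty values, and removes duplicate rows. The specific rules and column names are assumptions, not the actual HyperAuto cleaning libraries.

    import pandas as pd

    def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
        # Apply a few generic, standardized cleaning steps to any dataset.
        df = df.copy()
        # Trim whitespace in string columns and treat empty strings as missing.
        for col in df.select_dtypes(include="object").columns:
            df[col] = df[col].str.strip().replace("", pd.NA)
        # Drop fully empty rows and exact duplicate rows.
        return df.dropna(how="all").drop_duplicates()

    raw = pd.DataFrame({"material_id": [" M-100 ", "M-101", "M-101"],
                        "plant": ["P01", "", "P02"]})
    print(clean_dataset(raw))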
Core generation performs data enrichment, column renaming, de-duplication, and data unioning to produce curated data that can be used for analysis, reporting, and workflows in the ontology. Derived elements provide pre-defined support for advanced workflows, such as generating join tables, time series datasets, and enriched columns that feed rich derived information into the ontology.
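
The hedged sketch below illustrates core-generation style steps (column renaming, de-duplication, and unioning of per-plant extracts) followed by one possible derived element, a join table. All dataset and column names are made up for illustration.

    import pandas as pd

    # Two raw extracts of the same table from different plants (illustrative).
    plant_a = pd.DataFrame({"MATNR": ["M-100", "M-100"], "QTY": [5, 5]})
    plant_b = pd.DataFrame({"MATNR": ["M-101"], "QTY": [3]})

    # Core generation: union the extracts, rename cryptic source columns to
    # curated names, and remove duplicate rows.
    curated = (
        pd.concat([plant_a, plant_b], ignore_index=True)
          .rename(columns={"MATNR": "material_id", "QTY": "quantity"})
          .drop_duplicates()
    )

    # Derived element: a join table linking materials to the lots that used them
    # (the lots dataset is a made-up example).
    lots = pd.DataFrame({"lot_id": ["L-1", "L-2"], "material_id": ["M-100", "M-101"]})
    material_lot_join = curated.merge(lots, on="material_id", how="inner")
    print(material_lot_join)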

Ontology Creation

After pipelines are generated, ontology creation is likewise supported automatically. This completes the data integration process, allowing you to immediately begin searching, analyzing, and even building applications on top of the data, thanks to the broad set of ontology-aware applications in Foundry.
© 2023 Palantir Technologies Inc. All rights reserved. Taken from the Palantir homepage.
Figure 1: Example of a data pipeline between supplier and IDM after being obfuscated and merged.

Data pipeline in Athinia

Data exists in various forms, sources, and ranges. Process manufacturing datasets and the variables attributed to different processes may differ significantly in magnitude and units of measurement. Hence, in certain use cases the data must be transformed to enable the development of a reliable and accurate model. A simplified example of the data platform between a supplier and an Integrated Device Maker (IDM) is illustrated in Figure 1.
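
As a minimal, hypothetical example of such a transformation, the sketch below harmonizes two process variables with very different units and magnitudes and then standardizes them so that neither dominates a model purely because of its scale. Variable names and conversion factors are illustrative only.

    import pandas as pd

    df = pd.DataFrame({
        "trace_metal_ppb": [12.0, 80.0, 45.0],      # parts per billion
        "batch_volume_l": [1200.0, 950.0, 1430.0],  # litres
    })

    # Convert to shared conventions (ppm, cubic metres), then z-score each
    # column so magnitudes become comparable across variables.
    df["trace_metal_ppm"] = df["trace_metal_ppb"] / 1000.0
    df["batch_volume_m3"] = df["batch_volume_l"] / 1000.0
    standardized = (df - df.mean()) / df.std()
    print(standardized.round(2))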

Data Normalization

In the data pipeline shown in Figure 1, normalization is a critical step for two reasons: protecting data owners' privacy and supporting model building. We implement standard, peer-reviewed techniques to normalize and obfuscate data, providing data security in a collaborative environment while preventing data leakage. In addition, we combine feature engineering techniques with the normalization process to support machine learning model development. It is therefore important to consider the available normalization methods and implement the set that best balances security and model performance.
Figure 2: Examples of standard normalization techniques.
Examples of data normalization techniques are shown in Figure 2. The main topics described below include quantization/discretization, scaling, transformation, feature creation, ranking, and examples from academic research.
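
For illustration, the snippet below applies a few of the technique families named above (scaling, transformation, quantization/discretization, and ranking) to a toy series. These are generic textbook implementations, not Athinia's specific normalization or obfuscation methods.

    import numpy as np
    import pandas as pd

    values = pd.Series([0.2, 1.5, 3.7, 8.9, 20.4, 55.0])

    min_max = (values - values.min()) / (values.max() - values.min())  # scale to [0, 1]
    z_score = (values - values.mean()) / values.std()                  # standardization
    log_x   = np.log1p(values)                                         # variance-stabilizing transform
    bins    = pd.cut(values, bins=3, labels=["low", "mid", "high"])    # quantization/discretization
    ranks   = values.rank()                                            # ranking hides raw magnitudes

    print(pd.DataFrame({"raw": values, "min_max": min_max.round(3),
                        "z_score": z_score.round(3), "log1p": log_x.round(3),
                        "bin": bins, "rank": ranks}))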