Data pipeline in Athinia
Data exists in various forms, comes from various sources, and spans various ranges. Process manufacturing datasets, and the variables attributed to different processes, may differ significantly in magnitude and units of measurement. In certain use cases it is therefore essential to transform the data before a reliable and accurate model can be developed. A simplified example of the data platform between a Supplier and an Integrated Device Maker (IDM) is illustrated in Figure 1.
Figure 1: Example of an Athinia data pipeline between supplier and IDM after being obfuscated and merged.
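To make the magnitude problem concrete, the sketch below applies min-max scaling, one common transformation, to two hypothetical sensor series recorded in very different units. The function name and sample values are illustrative, not part of the Athinia platform.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers to the [0, 1] range.

    The result no longer carries the original magnitude or unit of
    measurement, so variables from different processes become
    directly comparable model inputs.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span for v in values]

# Hypothetical readings on very different scales.
pressure_pa = [101_000, 101_500, 102_000]  # pascals
temperature_c = [20.0, 22.5, 25.0]         # degrees Celsius

print(min_max_scale(pressure_pa))    # [0.0, 0.5, 1.0]
print(min_max_scale(temperature_c))  # [0.0, 0.5, 1.0]
```

After scaling, both series occupy the same [0, 1] range, so neither dominates a model purely because of its unit of measurement.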
Data Normalization
In the data pipeline shown in Figure 1, normalization is a critical step for two reasons: protecting data owners' privacy and supporting model building. We implement standard, peer-reviewed techniques to normalize and obfuscate data, providing data security in a collaborative environment while preventing data leakage. In addition, we combine feature engineering techniques with the normalization process to support machine learning model development. It is therefore important to select the appropriate normalization methods and implement an optimal set that balances security and model performance. Solutions to this challenge are discussed in the subsequent chapters of this white paper.
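One way a single transformation can serve both goals is z-score standardization: the output hides the absolute process values and their units (a mild form of obfuscation for the data owner) while preserving the relative structure a model needs. This is a minimal illustrative sketch, not the specific obfuscation scheme used in the Athinia pipeline.

```python
import statistics

def z_score(values):
    """Standardize a sequence to zero mean and unit variance.

    The transformed series does not reveal the original magnitudes,
    but distances between points are preserved up to a scale factor,
    which is what most models actually consume.
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

raw = [4.0, 5.0, 6.0, 7.0, 8.0]  # hypothetical process values
scaled = z_score(raw)
# scaled has mean ~0 and standard deviation ~1; the raw values
# cannot be recovered without knowing mu and sigma.
```

Note that the data owner must keep the fitted parameters (here, mu and sigma) private, since together with the scaled series they would allow the raw values to be reconstructed.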
Data normalization techniques
Examples of data normalization techniques are shown in Figure 2. The main topics described below are quantization/discretization, scaling, transforming, feature creation, and ranking, together with examples from academic research.
Figure 2: Examples of standard normalization techniques.
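As a brief illustration of two of the techniques listed above, the sketch below implements equal-width quantization (discretization) and rank transformation in plain Python. The function names, binning strategy, and sample data are assumptions for illustration; the following sections describe the techniques in detail.

```python
def discretize(values, n_bins):
    """Equal-width binning: map each value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls in the last bin, not past it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def rank(values):
    """Replace each value with its rank (0 = smallest).

    Ranking discards magnitudes entirely and keeps only ordering,
    which also makes it robust to outliers.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

readings = [3.1, 9.7, 4.4, 8.2]  # hypothetical measurements
print(discretize(readings, 2))   # [0, 1, 0, 1]
print(rank(readings))            # [0, 3, 1, 2]
```

Both transformations reduce how much of the raw measurement is exposed: binning reveals only a coarse level, and ranking reveals only the ordering.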