By: Sanjay Debnath, Sr. Architect, IMSS Innovation Office at Happiest Minds Technologies
We are living in the age of data-science-driven IT decision making, where automated root cause analysis and failure prediction inform operations. Cluster analysis is a widely used method for partitioning large data sets and discerning useful information in the form of patterns, which subsequently find use in many areas of analysis and decision making. Many domains and industries apply different clustering methods and techniques to data analysis. However, the conventional clustering methods, algorithms, and measures available today focus on clustering data with one predominant attribute type, typically either numerical or categorical. These methods have limitations in real-life scenarios, where a vast number of data sets are essentially a mix of numerical and categorical data. This mix of numerical and categorical attributes complicates the clustering process, the separation of clusters, and the derivation of meaningful insights.
Distance or similarity measurement is a major aspect of the prominent data clustering methods. In a mixed data set, the numeric attributes display continuous behavior while the categorical attributes display discrete behavior. To handle this challenge, a distance measurement is used for the numerical attributes whereas a similarity measurement is used for the categorical attributes. Even though this is a viable approach, it can distort the data element evaluation process. Another approach to addressing mixed data sets is to transform the entire dataset into a completely numeric one with standard or custom transformation method(s) and use the transformed dataset as the target. This, too, has the potential to distort the data characteristics to a large extent. Here we discuss an approach to clustering a mixed dataset based on unsupervised learning with artificial neural networks, without the express need to transform the dataset.
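To make the first approach concrete, below is a minimal sketch of one way to combine a distance measure for numeric attributes with a similarity (mismatch) measure for categorical attributes, loosely in the spirit of Gower's distance. The function name, column indices, and ranges are illustrative assumptions, not part of the method described in this paper.

```python
import numpy as np

def mixed_distance(a, b, numeric_idx, categorical_idx, ranges):
    """Gower-style distance between two mixed-type records.

    Numeric attributes contribute a range-normalized absolute
    difference; categorical attributes contribute a simple
    mismatch (0 if equal, 1 otherwise). All contributions are
    averaged so each attribute carries equal weight.
    """
    contributions = []
    for i in numeric_idx:
        # Normalize by the attribute's observed range so no single
        # numeric column dominates the distance.
        contributions.append(abs(a[i] - b[i]) / ranges[i])
    for i in categorical_idx:
        contributions.append(0.0 if a[i] == b[i] else 1.0)
    return float(np.mean(contributions))

# Hypothetical records: (age, income, department)
x = (34, 72000.0, "finance")
y = (29, 65000.0, "engineering")
ranges = {0: 50.0, 1: 100000.0}  # assumed ranges of the numeric columns
d = mixed_distance(x, y, numeric_idx=[0, 1], categorical_idx=[2], ranges=ranges)
print(round(d, 3))  # 0.39
```

Note how the categorical mismatch contributes a full 1.0 while the numeric terms contribute small fractions; this imbalance is one concrete way the combined treatment can distort the evaluation, as noted above.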
How Do ART/ART-2 Neural Network Models Address the Challenges with Clustering Models?
ART (Adaptive Resonance Theory)/ART-2 neural network models work on the principle that the identification of objects/vectors into classes arises from two sources of information: top-down retained knowledge and expectation (long-term memory), and bottom-up inputs and information (short-term memory). Comparing long-term and short-term memory yields a classification or categorization of data vectors. As long as the mismatch in this comparison does not exceed a defined threshold (termed the Vigilance Parameter), the input is considered a member of a class already defined in long-term memory. However, if the mismatch exceeds the Vigilance Parameter, the object/vector is considered to belong to a class not encountered previously, and the vector's properties are learned into long-term memory. Through this interplay of long-term and short-term memory, the ART family of networks can retain knowledge over time and grow its knowledge base. ART networks thus offer a framework for retaining old knowledge while gaining new knowledge, i.e., they address the stability/plasticity dilemma faced by learning systems. The primary difference between the ART-1 and ART-2 models is that ART-1 can handle only binary inputs, whereas ART-2 extends ART-1 to support continuous inputs.
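The following is a simplified, illustrative sketch of the vigilance mechanism described above, not a full ART-2 implementation: it uses cosine similarity as a stand-in for ART-2's normalization and match rule and omits gain control and the reset subsystem. The test is written as match >= vigilance, the mirror image of the "mismatch does not exceed the threshold" framing used above.

```python
import numpy as np

def art_like_cluster(inputs, vigilance=0.9, learning_rate=0.5):
    """Illustrative vigilance-based clustering in the spirit of ART.

    Each prototype acts as long-term memory; each input is the
    bottom-up, short-term signal. If the best-matching prototype
    resonates with the input (match >= vigilance), the prototype
    is nudged toward the input (learning). Otherwise the input
    seeds a new class, so old knowledge is never overwritten.
    """
    prototypes, labels = [], []
    for x in inputs:
        x = np.asarray(x, dtype=float)
        best, best_match = None, -1.0
        for j, p in enumerate(prototypes):
            # Cosine similarity stands in for the resonance test here;
            # real ART-2 uses a specific normalization/match rule.
            match = np.dot(x, p) / (np.linalg.norm(x) * np.linalg.norm(p))
            if match > best_match:
                best, best_match = j, match
        if best is not None and best_match >= vigilance:
            # Resonance: refine the stored class (update long-term memory).
            prototypes[best] = (1 - learning_rate) * prototypes[best] + learning_rate * x
            labels.append(best)
        else:
            # Mismatch exceeds what vigilance allows: learn a new class.
            prototypes.append(x.copy())
            labels.append(len(prototypes) - 1)
    return labels, prototypes

data = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels, protos = art_like_cluster(data, vigilance=0.95)
print(labels)  # [0, 0, 1, 1]
```

Raising the vigilance parameter makes the resonance test harder to pass, producing more, finer-grained classes; lowering it merges inputs into fewer, coarser classes.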
Auto-encoder Neural Network Model
Auto-encoders are a family of neural network models/architectures focused on transforming one representation of a dataset (e.g., raw high-dimensional data) into another (e.g., a low-dimensional representation). The primary motivation behind auto-encoders is to produce a low-dimensional representation of a high-dimensional data space. Many methods are available for reducing the dimensionality of data; one popular method is Principal Component Analysis (PCA). However, PCA is better suited to numerical datasets than to categorical or mixed datasets. Another recent and emerging method for determining the vector components that provide significant variation is Factor Analysis of Mixed Data (FAMD); however, FAMD is still in an active research phase of development. As indicated in the Data Clustering Process with ART-2 Network section above, as part of the larger data clustering process, the dimensionality of the dataset was reduced with an auto-encoder after clustering, and the results were plotted for more intuitive identification and validation of the clusters.
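Below is a minimal Keras sketch of that plotting step: a dense auto-encoder compressing the feature space to a 2-D bottleneck whose outputs can be scatter-plotted per cluster. The layer sizes, epoch count, and randomly generated data are illustrative assumptions, not the configuration used in the study described above.

```python
import numpy as np
from tensorflow.keras import layers, Model

def build_autoencoder(input_dim, bottleneck_dim=2):
    """Dense auto-encoder that compresses input_dim features down to
    a 2-D bottleneck suitable for plotting cluster assignments."""
    inputs = layers.Input(shape=(input_dim,))
    encoded = layers.Dense(16, activation="relu")(inputs)
    bottleneck = layers.Dense(bottleneck_dim, activation="linear")(encoded)
    decoded = layers.Dense(16, activation="relu")(bottleneck)
    outputs = layers.Dense(input_dim, activation="linear")(decoded)

    autoencoder = Model(inputs, outputs)   # trained to reconstruct its input
    encoder = Model(inputs, bottleneck)    # used for the low-dimensional view
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Hypothetical usage on an already-clustered dataset X (n_samples x n_features):
X = np.random.rand(200, 10).astype("float32")
autoencoder, encoder = build_autoencoder(input_dim=10)
autoencoder.fit(X, X, epochs=50, batch_size=16, verbose=0)
X_2d = encoder.predict(X, verbose=0)  # 2-D points to scatter-plot per cluster
```

Because the network is trained only to reconstruct its input, no labels are needed; the bottleneck learns a compressed view of the data on its own, which is what makes it a reasonable companion to the unsupervised ART-2 clustering step.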
The ability of artificial neural networks with unsupervised learning to cluster data is already recognized in the evolving field of data science and machine learning. The ART-2 neural network model provides an adequate means for data clustering alongside the other popular methods, and auto-encoders offer a viable means of dimensionality reduction for higher-dimensional datasets.