Research into explainable artificial intelligence (XAI) methods has exploded over the past five years. It is essential to synthesize and categorize this research, and, for this purpose, multiple systematic reviews on XAI have mapped out the landscape of existing methods. To understand how these methods have developed and been applied, and what evidence has been accumulated through model training and analysis, we carried out a tertiary literature review that takes as input systematic literature reviews published between 1992 and 2023. We evaluated 40 systematic literature review papers and present binary tabular overviews of the researched XAI methods and their characteristics, such as scope, scale, input data, explanation data, and the machine learning models studied. We identified seven distinct characteristics and organized them into twelve specific categories, culminating in the creation of comprehensive research grids. Within these research grids, we systematically documented the presence or absence of research mentions for each pairing of characteristic and category. We identified 14 combinations that are open to research. Our findings reveal a significant gap, particularly in categories such as the cross-section of feature graphs and numerical data, which are notably absent or insufficiently addressed in the existing body of research and thus represent a future research roadmap.
(This article belongs to the Special Issue Machine Learning in Data Science)
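As a rough illustration of the research-grid idea described above, the sketch below records presence or absence of research mentions in a small binary matrix; the characteristic and category names are illustrative placeholders, not the ones used in the review.

```python
# A minimal sketch of a binary "research grid": rows are XAI method characteristics,
# columns are categories, and a 1 marks that at least one reviewed paper pairs them.
# Characteristic/category names here are illustrative placeholders, not the paper's.
import pandas as pd

characteristics = ["scope", "scale", "input data", "explanation data", "ML model"]
categories = ["feature graphs", "numerical data", "image data", "text data"]

# Start with an empty grid (all zeros = no research mention found).
grid = pd.DataFrame(0, index=characteristics, columns=categories)

# Record mentions extracted from the reviewed literature.
grid.loc["input data", "numerical data"] = 1
grid.loc["explanation data", "feature graphs"] = 1

# Open research combinations are simply the cells that remain zero.
open_gaps = grid.stack()[lambda s: s == 0].index.tolist()
print(grid)
print("Combinations open to research:", open_gaps)
```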
Standard ML relies on ample data, but limited availability poses challenges. Transfer learning offers a solution by leveraging pre-existing knowledge. Yet many methods require access to the model's internals, limiting their applicability to white-box models. To address this, Tsai, Chen and Ho introduced Black Box Adversarial Reprogramming for transfer learning with black-box models. While tested primarily in image classification, this paper explores its potential in time series classification, particularly predictive maintenance. We develop an adversarial reprogramming concept tailored to black-box time series classifiers. Our study focuses on predicting the Remaining Useful Life of rolling bearings. We construct a comprehensive ML pipeline, encompassing feature engineering and model fine-tuning, and compare the results with traditional transfer learning. We investigate the impact of hyperparameters and training parameters on model performance, demonstrating the successful application of Black Box Adversarial Reprogramming to time series data. The method achieved a weighted F1-score of 0.77, although it exhibited significant stochastic fluctuations, with scores ranging from 0.3 to 0.77 due to randomness in gradient estimation.
(This article belongs to the Section Learning)
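A minimal sketch of the reprogramming-plus-zeroth-order-gradient idea is given below; the black-box classifier, the label mapping, and all hyperparameters are illustrative stand-ins, not the authors' exact setup.

```python
# Minimal sketch of black-box adversarial reprogramming for a time-series classifier.
# The black-box model, label mapping, and hyperparameters are illustrative assumptions;
# only the structure (reprogramming mask + zeroth-order gradient estimate) is the point.
import numpy as np

rng = np.random.default_rng(0)

def black_box_predict(x):
    """Stand-in for an inaccessible pretrained classifier: returns class probabilities."""
    logits = np.array([x.mean(), x.std(), x.max() - x.min()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reprogram(x_target, theta, mask):
    """Embed the target series into the source input space and add a learned perturbation."""
    return x_target + mask * theta

def loss(theta, x_target, y_target, mask, label_map):
    probs = black_box_predict(reprogram(x_target, theta, mask))
    # Map source-model classes onto target classes and use negative log-likelihood.
    p_target = probs[label_map[y_target]]
    return -np.log(p_target + 1e-12)

def estimate_gradient(theta, x, y, mask, label_map, q=10, sigma=0.1):
    """Zeroth-order (random perturbation) gradient estimate, since gradients are unavailable."""
    grad = np.zeros_like(theta)
    base = loss(theta, x, y, mask, label_map)
    for _ in range(q):
        u = rng.normal(size=theta.shape)
        grad += (loss(theta + sigma * u, x, y, mask, label_map) - base) / sigma * u
    return grad / q

# Toy data: one target series, target label 0 mapped onto source class 2 (assumed mapping).
x, y = rng.normal(size=128), 0
mask = np.ones_like(x)          # which positions the reprogramming may modify
label_map = {0: 2, 1: 1}
theta, lr = np.zeros_like(x), 0.05

for step in range(50):
    theta -= lr * estimate_gradient(theta, x, y, mask, label_map)
print("final loss:", loss(theta, x, y, mask, label_map))
```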
Achieving carbon neutrality by 2050 requires unprecedented technological, economic, and sociological changes. With time as a scarce resource, it is crucial to base decisions on relevant facts and information to avoid misdirection. This study aims to help decision makers quickly find relevant information related to companies and organizations in the renewable energy sector. In this study, we propose fine-tuning five RNN and transformer models trained for French on a new category, “TECH”. This category is used to classify technological domains and new products. In addition, as the models are fine-tuned on news related to startups, we note an improvement in the detection of startup and company names in the “ORG” category. We further explore the capacity of the most effective model to accurately predict entities using a small amount of training data, showing its progression from being trained on several hundred to several thousand annotations. This analysis demonstrates the potential of these models to extract insights without large corpora and to shorten the long process of annotating custom training data. The approach is used to automatically extract new company mentions as well as the technologies and technology domains currently being discussed in the news, in order to better analyze industry trends. It further allows us to group mentions of specific energy domains together with the companies that are actively developing new technologies in those fields.
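The sketch below shows one way such a fine-tuning setup could be initialized with an added "TECH" entity type; the base checkpoint ("camembert-base") and the BIO label scheme are assumptions, since the abstract does not name the exact models used.

```python
# Minimal sketch of extending a French token-classification model with a new "TECH" label.
# The checkpoint and BIO label set are assumptions for illustration; in practice the model
# is fine-tuned on the annotated startup-news corpus with a standard Trainer loop.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-ORG", "I-ORG", "B-TECH", "I-TECH"]   # "TECH" added to the usual scheme
id2label = dict(enumerate(labels))
label2id = {l: i for i, l in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "camembert-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
# Fine-tuning then proceeds over token-aligned annotations, starting from a few hundred
# examples and growing the corpus, as described in the abstract above.
```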
17 pages, 2683 KiB, Open Access Article by Sairoel Amertet and Girma Gebresenbet. Mach. Learn. Knowl. Extr. 2024, 6(3), 1936-1952; https://doi.org/10.3390/make6030095 - 26 Aug 2024
In farming technologies, it is difficult to provide the accurate crop nutrients required by the respective crops, and farmers consequently experience considerable problems. Although various types of machine learning (deep learning and convolutional neural networks), as well as image-processing-based crop classification, have been used to identify crop diseases, they have failed to forecast accurate crop nutrients for various crops, because crop nutrient data are numerical rather than visual. Neural networks represent an opportunity for the precision agriculture sector to forecast crop nutrition more accurately: recent technological advancements have begun to provide greater precision in pattern recognition, and neural networks are well suited to numerical data problems. The aim of the current study is to estimate the right crop nutrients for the right crops based on the collected data using an artificial neural network. The crop data were collected from the MNIST dataset. To forecast the precise nutrients for the crops, ANN models were developed, and the entire system was simulated in a MATLAB environment. The obtained results for forecasting accurate nutrients were 99.997%, 99.996%, and 99.997% for validation, training, and testing, respectively. Therefore, the proposed algorithm is suitable for forecasting accurate crop nutrients for these crops.
(This article belongs to the Section Network)
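The study itself was implemented in MATLAB; purely as a hedged illustration of the ANN-regression idea, a scikit-learn version on synthetic placeholder data might look like this.

```python
# Sketch of an ANN regressor for crop-nutrient forecasting; the original study used MATLAB,
# so this scikit-learn version only mirrors the idea on synthetic placeholder data.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))                                    # soil/crop measurements (assumed)
y = X @ rng.normal(size=6) + rng.normal(scale=0.05, size=500)    # nutrient requirement (assumed)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```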
In this paper, a method is introduced to control the dark knowledge values, also known as soft targets, with the purpose of improving training by knowledge distillation for multi-class classification tasks. Knowledge distillation effectively transfers knowledge from a larger model to a smaller model to achieve efficient, fast, and generalizable performance while retaining much of the original accuracy. The majority of deep neural models used for classification tasks append a SoftMax layer to generate output probabilities, and it is usual to take the highest score as the model's prediction, while the remaining probability values are generally ignored. The focus here is on those probabilities as carriers of dark knowledge, and our aim is to quantify the relevance of dark knowledge, not heuristically as provided in the literature so far, but with an inductive proof on the SoftMax operational limits. These limits are further pushed by using an incremental decision tree with an information-gain split. The user can set a desired precision and accuracy level to obtain a maximal temperature setting for a continual classification process. Moreover, by fitting both the hard targets and the soft targets, one obtains an optimal knowledge distillation effect that better mitigates catastrophic forgetting. The strengths of our method come from the possibility of controlling, non-heuristically, the amount of distilled knowledge transferred, and from the agnostic application of this model-independent study.
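For context, the sketch below shows the standard knowledge-distillation loss that mixes temperature-scaled soft targets with hard labels; the paper's decision-tree-based selection of the maximal temperature is not reproduced here, and the tensors are toy placeholders.

```python
# Standard knowledge-distillation loss with temperature-scaled soft targets (PyTorch).
# Only the generic hard/soft-target mix is shown, not the paper's temperature-selection step.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.5):
    # Soft targets: temperature-scaled teacher probabilities vs. student log-probabilities.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class problem.
s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```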
28 pages, 7677 KiB, Open Access Article by Mohammed Elhenawy, Ahmad Abutahoun, Taqwa I. Alhadidi, Ahmed Jaber, Huthaifa I. Ashqar, Shadi Jaradat, Ahmed Abdelhay, Sebastien Glaser and Andry Rakotonirainy
Mach. Learn. Knowl. Extr. 2024, 6(3), 1894-1920; https://doi.org/10.3390/make6030093 - 13 Aug 2024
Multimodal Large Language Models (MLLMs) harness comprehensive knowledge spanning text, images, and audio to adeptly tackle complex problems. This study explores the ability of MLLMs to visually solve the Traveling Salesman Problem (TSP) and the Multiple Traveling Salesman Problem (mTSP) using images that portray point distributions on a two-dimensional plane. We introduce a novel approach employing multiple specialized agents within the MLLM framework, each dedicated to optimizing solutions for these combinatorial challenges. We benchmarked our multi-agent solutions against Google OR-Tools, which served as the baseline for comparison. The results demonstrated that both multi-agent models (Multi-Agent 1, which includes the initializer, critic, and scorer agents, and Multi-Agent 2, which comprises only the initializer and critic agents) significantly improved solution quality for TSP and mTSP problems. Multi-Agent 1 excelled in environments requiring detailed route refinement and evaluation, providing a robust framework for sophisticated optimizations. In contrast, Multi-Agent 2, focusing on iterative refinements by the initializer and critic, proved effective for rapid decision-making scenarios. These experiments yield promising outcomes, showcasing the robust visual reasoning capabilities of MLLMs in addressing diverse combinatorial problems. The findings underscore the potential of MLLMs as powerful tools in computational optimization, offering insights that could inspire further advancements in this promising field.
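Purely as a structural illustration of the initializer/critic loop (not an actual multimodal model call), the sketch below uses a hypothetical `ask_mllm` stub that proposes and perturbs tours, plus a plain tour-length check; every name and behavior in it is an assumption.

```python
# Structural sketch of an initializer/critic loop for visual TSP solving.
# `ask_mllm` is a hypothetical stub standing in for a real multimodal LLM call;
# only the agent loop and the tour-length evaluation are illustrative.
import math
import random

def tour_length(points, order):
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def ask_mllm(role, prompt, points, current=None):
    """Hypothetical MLLM call. Here it just proposes or perturbs a tour at random."""
    if role == "initializer" or current is None:
        order = list(range(len(points)))
        random.shuffle(order)
        return order
    # "critic": propose a small modification (2-opt style segment reversal).
    i, j = sorted(random.sample(range(len(points)), 2))
    return current[:i] + current[i:j + 1][::-1] + current[j + 1:]

points = [(random.random(), random.random()) for _ in range(20)]
best = ask_mllm("initializer", "Propose a tour for the plotted points.", points)
for _ in range(200):
    candidate = ask_mllm("critic", "Improve the current tour.", points, best)
    if tour_length(points, candidate) < tour_length(points, best):
        best = candidate
print("best tour length:", round(tour_length(points, best), 3))
```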
23 pages, 4393 KiB, Open Access Article by Masoumeh Hashemi, Richard C. Peralta and Matt Yost. Mach. Learn. Knowl. Extr. 2024, 6(3), 1871-1893; https://doi.org/10.3390/make6030092 - 9 Aug 2024
An artificial intelligence-based geostatistical optimization algorithm was developed to upgrade a test Iranian aquifer's existing groundwater monitoring network. For that aquifer, a preliminary study revealed that a Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) determined temporally averaged water table elevations more accurately than geostatistical kriging, spline, and inverse distance weighting. Because kriging is usually used in that area for water table estimation, the developed algorithm used the MLP-ANN to guide kriging and a Genetic Algorithm (GA) to determine locations for new monitoring wells. For possible annual fiscal budgets allowing 1–12 new wells, 12 sets of optimal new well locations are reported. Each set contains the locations of new wells that would minimize the squared difference between the time-averaged heads developed by kriging versus the MLP-ANN. Also, to simultaneously consider local expertise, the algorithm used fuzzy inference to quantify an expert's satisfaction with the number of new wells. Then, the algorithm used symmetric bargaining (Nash, Kalai–Smorodinsky, and area monotonic) to present an upgrade strategy that balanced professional judgment and heuristic optimization. In essence, the algorithm demonstrates the systematic application of relatively new computational practices to a situation common worldwide.
(This article belongs to the Special Issue Sustainable Applications for Machine Learning)
Securing the structural safety of blades has become crucial, owing to the increasing size and weight of blades resulting from the recent development of large wind turbines. Composites are primarily used for blade manufacturing because of their high specific strength and specific stiffness. However, in composite blades, joints may experience fractures from the loads generated during wind turbine operation, leading to deformation caused by changes in structural stiffness. In this study, 7132 debonding damage cases, classified by damage type, position, and size, were selected to predict debonding damage based on natural frequency. The change in natural frequency caused by debonding damage was acquired through finite element (FE) modeling and modal analysis. Synchronization between the FE analysis model and the manufactured blades was achieved through modal testing and data analysis. Finally, the relationship between debonding damage and the change in natural frequency was examined using artificial neural network techniques.
(This article belongs to the Section Network)
Machine learning models play a critical role in applications such as image recognition, natural language processing, and medical diagnosis, where accuracy and efficiency are paramount. As datasets grow in complexity, so too do the computational demands of classification techniques. Previous research has achieved high accuracy but required significant computational time. This paper proposes a parallel architecture for ensemble machine learning models, harnessing multicore CPUs to expedite performance. The primary objective is to enhance machine learning efficiency through parallel computing without compromising accuracy. This study focuses on benchmark ensemble models including Random Forest, XGBoost, AdaBoost, and K-Nearest Neighbors. These models are applied to tasks such as wine quality classification and fraud detection in credit card transactions. The results demonstrate that, compared to single-core processing, machine learning tasks run 1.7 times and 3.8 times faster for small and large datasets on quad-core CPUs, respectively.
(This article belongs to the Section Learning)
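As a hedged illustration of CPU-parallel ensemble training, the sketch below relies on scikit-learn's built-in `n_jobs` parallelism on a synthetic dataset; the speed-ups quoted above were measured by the authors and are not reproduced here.

```python
# Sketch of CPU-parallel ensemble training with scikit-learn's built-in n_jobs option.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50_000, n_features=30, random_state=0)

for jobs in (1, -1):   # single core vs. all available cores
    clf = RandomForestClassifier(n_estimators=200, n_jobs=jobs, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"n_jobs={jobs}: {time.perf_counter() - start:.1f}s, "
          f"train accuracy={clf.score(X, y):.3f}")
```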
Studying gene regulatory networks (GRNs) is paramount for unraveling the complexities of biological processes and their associated disorders, such as diabetes, cancer, and Alzheimer's disease. Recent advancements in computational biology have aimed to enhance the inference of GRNs from gene expression data, a non-trivial task given the networks' intricate nature. The challenge lies in accurately identifying the myriad interactions among transcription factors and target genes, which govern cellular functions. This research introduces EGRC (Effective GRN Inference applying Graph Convolution with Self-Attention Graph Pooling), a technique that conceptualizes GRN reconstruction as a graph classification problem, where the task is to discern the links within subgraphs that encapsulate pairs of nodes. By leveraging Spearman's correlation, we generate potential subgraphs that bring nonlinear associations between transcription factors and their targets to light, and we enhance this with mutual information to capture a broader spectrum of gene interactions. Our methodology bifurcates these subgraphs into 'Positive' and 'Negative' categories. 'Positive' subgraphs are those where a transcription factor and its target gene are connected, including interactions among their neighbors. 'Negative' subgraphs, conversely, denote pairs without a direct connection. To classify these subgraphs, EGRC utilizes dual graph convolutional network (GCN) models that exploit node attributes from gene expression profiles and graph embedding techniques. The performance of EGRC is substantiated by comprehensive evaluations using the DREAM5 datasets. Notably, EGRC attained an AUROC of 0.856 and an AUPR of 0.841 on the E. coli dataset; on the in silico dataset it achieved an AUROC of 0.5058 and an AUPR of 0.958; and on the S. cerevisiae dataset it recorded an AUROC of 0.823 and an AUPR of 0.822. These results underscore the robustness of EGRC in accurately inferring GRNs across various organisms. The advanced performance of EGRC represents a substantial advancement in the field, promising to deepen our comprehension of intricate biological processes and their implications in both health and disease.
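To make the graph-classification formulation concrete, here is a generic subgraph classifier combining graph convolution with self-attention graph pooling in PyTorch Geometric; the layer sizes are arbitrary and this is a sketch of the general technique, not EGRC's exact architecture.

```python
# Generic subgraph classifier: graph convolution + self-attention graph pooling (PyG).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, SAGPooling, global_mean_pool

class SubgraphClassifier(torch.nn.Module):
    def __init__(self, in_dim, hidden=64, classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.pool1 = SAGPooling(hidden, ratio=0.5)   # self-attention graph pooling
        self.conv2 = GCNConv(hidden, hidden)
        self.lin = torch.nn.Linear(hidden, classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x, edge_index, _, batch, _, _ = self.pool1(x, edge_index, batch=batch)
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)               # one embedding per subgraph
        return self.lin(x)                           # 'Positive' vs. 'Negative' logits
```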
20 pages, 671 KiB, Open Access Article by Seyum Abebe, Irene Poli, Roger D. Jones and Debora Slanzi. Mach. Learn. Knowl. Extr. 2024, 6(3), 1798-1817; https://doi.org/10.3390/make6030088 - 30 Jul 2024
In medicine, dynamic treatment regimes (DTRs) have emerged to guide personalized treatment decisions for patients, accounting for their unique characteristics. However, existing methods for determining optimal DTRs face limitations, often due to reliance on linear models unsuitable for complex disease analysis and a focus on outcome prediction over treatment effect estimation. To overcome these challenges, decision tree-based reinforcement learning approaches have been proposed. Our study aims to evaluate the performance and feasibility of such algorithms: tree-based reinforcement learning (T-RL), DTR-Causal Tree (DTR-CT), DTR-Causal Forest (DTR-CF), stochastic tree-based reinforcement learning (SL-RL), and Q-learning with Random Forest. Using real-world clinical data, we conducted experiments to compare algorithm performance. Evaluation metrics included the proportion of patients correctly assigned to recommended treatments and the empirical mean, with standard deviation, of the expected counterfactual outcomes under the estimated optimal treatment strategies. This research not only highlights the potential of decision tree-based reinforcement learning for dynamic treatment regimes but also contributes to advancing personalized medicine by offering nuanced and effective treatment recommendations.
(This article belongs to the Section Learning)
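For one of the compared approaches, Q-learning with a Random Forest function approximator, a minimal single-stage sketch on synthetic placeholder data might look like this; the covariates, outcome model, and treatment coding are all assumptions.

```python
# Minimal sketch of Q-learning with a Random Forest approximator for a single-stage
# treatment decision; data and columns are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 1000
covariates = rng.normal(size=(n, 4))          # patient characteristics (assumed)
treatment = rng.integers(0, 2, size=n)        # observed treatment 0/1
outcome = covariates[:, 0] * (2 * treatment - 1) + rng.normal(scale=0.3, size=n)

# Fit Q(x, a) ~ outcome, then recommend the treatment maximising the predicted outcome.
q_model = RandomForestRegressor(n_estimators=300, random_state=0)
q_model.fit(np.column_stack([covariates, treatment]), outcome)

q0 = q_model.predict(np.column_stack([covariates, np.zeros(n)]))
q1 = q_model.predict(np.column_stack([covariates, np.ones(n)]))
recommended = (q1 > q0).astype(int)
print("fraction recommended treatment 1:", recommended.mean())
```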
SynerGNet is a novel approach to predicting drug synergy against cancer cell lines. In this study, we discuss in detail the construction process of SynerGNet, emphasizing its comprehensive design tailored to handle complex data patterns. Additionally, we investigate a counterintuitive phenomenon in which integrating more augmented data into the training set results in an increase in testing loss alongside improved predictive accuracy, shedding light on the nuanced dynamics of model learning. Further, we demonstrate the effectiveness of strong regularization techniques in mitigating overfitting, ensuring the robustness and generalization ability of SynerGNet. Finally, the continuous performance enhancements achieved through the integration of augmented data are highlighted. By gradually increasing the amount of augmented data in the training set, we observe substantial improvements in model performance. For instance, compared to models trained exclusively on the original data, the integration of the augmented data can lead to a 5.5% increase in the balanced accuracy and a 7.8% decrease in the false positive rate. Through rigorous benchmarks and analyses, our study contributes valuable insights into the development and optimization of predictive models in biomedical research.
(This article belongs to the Special Issue Machine Learning in Data Science)
The field of sports analytics has grown rapidly, with a primary focus on performance forecasting, enhancing the understanding of player capabilities, and indirectly benefiting team strategies and player development. This work aims to forecast and comparatively evaluate players' goal-scoring likelihood in four elite football leagues (Premier League, Bundesliga, La Liga, and Serie A) by mining advanced statistics from 2017 to 2023. Six types of machine learning (ML) models were developed and tested individually through experiments on the comprehensive datasets collected for these leagues. We also tested the top 30th percentile of players, based on their performance in the last season, with varied feature sets evaluated to enhance prediction accuracy in distinct scenarios. The results offer insights into the forecasting capabilities within these leagues, identifying the best forecasting methodologies and the factors that contribute most to the prediction of players' goal-scoring. XGBoost consistently outperformed other models in most experiments, yielding the most accurate results and leading to a well-generalized model. Notably, when applied to Serie A, it achieved a mean absolute error (MAE) of 1.29. This study provides insights into ML-based performance prediction, advancing the field of player performance forecasting.
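The following sketch illustrates the kind of XGBoost regressor evaluated with MAE that such a pipeline could use; the feature matrix is a synthetic placeholder, not the leagues' advanced statistics.

```python
# Sketch of an XGBoost regressor for goal forecasting, evaluated with mean absolute error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 10))                                       # per-player stats (assumed)
y = np.clip(X[:, 0] * 3 + X[:, 1] + rng.normal(size=800), 0, None)   # goals next season (assumed)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=400, max_depth=4, learning_rate=0.05)
model.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```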
42 pages, 16635 KiB, Open Access Article by Mustafa Pamuk and Matthias Schumann. Mach. Learn. Knowl. Extr. 2024, 6(3), 1720-1761; https://doi.org/10.3390/make6030085 - 27 Jul 2024
Financial institutions are increasingly turning to artificial intelligence (AI) to improve their decision-making processes and gain a competitive edge. Due to the iterative nature of AI development, it is essential to have a structured process in place, from the design to the deployment of AI-based services in the finance industry, including the required validation and coordination with regulatory authorities. An appropriate dashboard can help to shape and structure the process of model development, e.g., for credit assessment in the finance industry. In addition, the analysis of datasets must be included as an important part of the dashboard to understand the reasons for changes in model performance. Furthermore, a dashboard can undertake documentation tasks to make the process of model development traceable, explainable, and transparent, as required by regulatory authorities in the finance industry. This offers a comprehensive solution for financial companies to optimize their models, improve regulatory compliance, and ultimately foster sustainable growth in an increasingly competitive market. In this study, we investigate the requirements and provide a prototypical dashboard to create, manage, compare, and validate AI models to be used in the credit assessment of private customers.
(This article belongs to the Special Issue Sustainable Applications for Machine Learning)
The use of transfer learning (TL) techniques has become common practice in fields such as computer vision (CV) and natural language processing (NLP). By leveraging prior knowledge gained from data with different distributions, TL offers higher performance and reduced training time, but it has yet to be fully utilized in machine learning (ML) and deep learning (DL) applications related to wireless communications, a field loosely termed radio frequency machine learning (RFML). This work examines whether existing transferability metrics, used in other modalities, might be useful in the context of RFML. Results show that the two metrics tested, Log Expected Empirical Prediction (LEEP) and Logarithm of Maximum Evidence (LogME), correlate well with post-transfer accuracy and can therefore be used to select source models for radio frequency (RF) domain adaptation and to predict post-transfer accuracy.
(This article belongs to the Section Learning)
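As background, the LEEP score can be computed from a source model's class probabilities on the target data and the target labels; the sketch below follows the commonly cited formulation, with random stand-in inputs rather than RF data.

```python
# Sketch of the LEEP transferability score from source-model probabilities and target labels.
# The inputs here are random stand-ins; in practice `source_probs` are the pretrained
# source model's class probabilities evaluated on the target dataset.
import numpy as np

def leep(source_probs, target_labels, n_target_classes):
    """source_probs: (n, Z) pseudo-label distribution; target_labels: (n,) ints."""
    n, Z = source_probs.shape
    # Empirical joint distribution P(y, z) and conditional P(y | z).
    joint = np.zeros((n_target_classes, Z))
    for y in range(n_target_classes):
        joint[y] = source_probs[target_labels == y].sum(axis=0) / n
    cond = joint / joint.sum(axis=0, keepdims=True)          # P(y | z)
    # LEEP = average log-likelihood of the target label under the transferred classifier.
    p_y = source_probs @ cond.T                              # (n, n_target_classes)
    return np.mean(np.log(p_y[np.arange(n), target_labels] + 1e-12))

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=200)                  # fake source-model outputs
labels = rng.integers(0, 3, size=200)                        # fake target labels
print("LEEP:", leep(probs, labels, 3))
```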
Accurate forecasting of inbound visitor numbers is crucial for effective planning and resource allocation in the tourism industry. Previous forecasting algorithms primarily focused on time series analysis, often overlooking influential factors such as economic conditions, while regression models face challenges when dealing with high-dimensional data. Earlier autoencoders for feature selection do not incorporate feature and target information simultaneously, potentially limiting their effectiveness in improving predictive performance. This study presents a novel approach that combines a target-concatenated autoencoder (TCA) with ensemble learning to enhance the accuracy of tourism demand predictions. The TCA method integrates the prediction target into the training process, ensuring that the learned feature representations are optimized for specific forecasting tasks. Extensive experiments conducted on the Taiwan and Hawaii datasets demonstrate that the proposed TCA method significantly outperforms traditional feature selection techniques and other advanced algorithms in terms of the mean absolute percentage error (MAPE), mean absolute error (MAE), and coefficient of determination (R²). The results show that TCA combined with XGBoost achieves MAPE values of 3.3947% and 4.0059% for the Taiwan and Hawaii datasets, respectively, indicating substantial improvements over existing methods. Additionally, the proposed approach yields better R² and MAE metrics than existing methods, further demonstrating its effectiveness. This study highlights the potential of TCA in providing reliable and accurate forecasts, thereby supporting strategic planning, infrastructure development, and sustainable growth in the tourism sector. Future research is advised to explore real-time data integration, expanded feature sets, and hybrid modeling approaches to further enhance the capabilities of the proposed framework.
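A minimal sketch of the target-concatenation idea is given below: the decoder reconstructs the input features and the forecasting target jointly, so the latent code is shaped by both. The layer sizes, data, and training setup are placeholders, not the paper's configuration.

```python
# Minimal sketch of a target-concatenated autoencoder (PyTorch); sizes and data are assumed.
import torch
import torch.nn as nn

class TargetConcatAutoencoder(nn.Module):
    def __init__(self, n_features, latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent))
        # Decoder output = reconstructed features + reconstructed target (one extra unit).
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_features + 1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

n_features = 20
model = TargetConcatAutoencoder(n_features)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, n_features)
y = torch.randn(256, 1)
for _ in range(100):
    out = model(x)
    loss = nn.functional.mse_loss(out, torch.cat([x, y], dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# The learned encoder(x) then feeds an ensemble regressor (e.g., XGBoost) for forecasting.
```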