In this article, Wang explores the challenge of training neural networks with limited and unlabeled datasets. He discusses different approaches to addressing this challenge, including transfer learning and semi-supervised learning. Wang also provides practical tips for implementing these approaches in real-world scenarios, emphasizing the importance of balancing model accuracy with computational efficiency.
Article contributed by Dr Zhuowei Wang, IEEE Young Professional.
Artificial intelligence refers to computer programs, or algorithms, that use data to make decisions or predictions. To build an algorithm, scientists might create a set of rules, or instructions, for the computer to follow so it can analyze data and make a decision. With other artificial intelligence approaches, like machine learning, the algorithm teaches itself how to analyze and interpret data. As such, machine learning algorithms may pick up on patterns that are not readily discernible to the human eye or brain. And as these algorithms are exposed to more new data, their ability to learn and interpret the data improves.
In the traditional binary classification problem, the classifier is trained on a dataset containing both positive and negative instances. In practice, negative samples can be hard to obtain or label for two reasons: (i) they are difficult to acquire, and (ii) they are too diverse to classify. For example, while we can identify clients who have watched at least one Harry Potter movie as being interested in this category, we cannot be certain about the interests of a client who has never watched one. Unlike traditional supervised learning, where a classifier is learned from both positive and negative samples, Positive and Unlabeled Learning (PU learning) trains a binary classifier with only positive and unlabeled data. PU learning frees practitioners from exhaustively collecting and annotating negative samples, and it could be widely adopted in areas such as disease diagnosis, deceptive review detection, and web data mining.
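To make the PU setting concrete, the sketch below shows one common way to simulate it from a fully labeled binary dataset: keep the label on a fraction of the true positives and treat everything else as unlabeled. The function name `make_pu_split` and the `labeled_fraction` parameter are hypothetical, chosen for illustration only, and are not taken from the paper.

```python
import random


def make_pu_split(labels, labeled_fraction=0.3, seed=0):
    """Simulate a PU dataset from fully labeled binary data.

    labels: list of 0/1 ground-truth labels (1 = positive).
    labeled_fraction: share of true positives that keep their label
        (an illustrative assumption, not a value from the paper).
    Returns (labeled_positive_indices, unlabeled_indices).
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    n_labeled = int(len(positives) * labeled_fraction)
    labeled_pos = set(rng.sample(positives, n_labeled))
    # Everything else, including the remaining positives and all
    # negatives, is treated as unlabeled.
    unlabeled = [i for i in range(len(labels)) if i not in labeled_pos]
    return sorted(labeled_pos), unlabeled


labels = [1] * 10 + [0] * 10
labeled_pos, unlabeled = make_pu_split(labels, labeled_fraction=0.3)
print(len(labeled_pos), len(unlabeled))
```

Note that the unlabeled set mixes hidden positives with negatives, which is what makes the problem harder than ordinary supervised classification.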
At the Australian Artificial Intelligence Institute (AAII) at the University of Technology Sydney, we conducted a preliminary experiment on CIFAR-10. We plotted a histogram of the positive probability that previous state-of-the-art models predicted for each unlabeled sample, and observed an interesting phenomenon. In the late stage of training, the predicted distributions for positive and negative samples become less polarized, and more negative samples are wrongly classified as positive. This suggests that model accuracy may be improved by adjusting the training process to account for this phenomenon.
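As a rough illustration of the kind of diagnostic described above, the following sketch bins predicted positive probabilities into a histogram. The helper `probability_histogram` and the sample values are hypothetical and do not reproduce the experiment's data or plotting code.

```python
def probability_histogram(probs, n_bins=10):
    """Count how many predicted probabilities fall into each equal-width bin.

    probs: iterable of floats in [0, 1] (predicted positive probabilities).
    n_bins: number of bins covering [0, 1]; a probability of exactly 1.0
        is placed in the last bin.
    """
    counts = [0] * n_bins
    for p in probs:
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up predictions: mass near 0 and 1 indicates a polarized model.
example_probs = [0.05, 0.1, 0.95, 1.0, 0.5]
print(probability_histogram(example_probs, n_bins=5))
```

A polarized model concentrates counts in the first and last bins; as predictions become less polarized, more of the mass drifts toward the middle bins.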
In our paper "Positive Unlabeled Learning by Semi-supervised Learning", published at IEEE ICIP 2022, we address this problem. We propose first identifying confident positive and negative samples in the unlabeled set, and then training on these confident samples together with the rest of the unlabeled data using semi-supervised learning (SSL). Using SSL in this way can improve model accuracy and yield more effective machine learning models for a variety of real-world scenarios such as COVID-19 diagnosis, medical analysis, healthcare, and IoT.
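The first step, picking out confident samples from the unlabeled set, can be sketched as follows. This is a minimal illustration that assumes a simple fixed-threshold rule on predicted positive probabilities; the article does not specify the paper's actual selection criterion, and the function `select_confident` and the thresholds `low` and `high` are hypothetical.

```python
def select_confident(probs, low=0.1, high=0.9):
    """Split unlabeled samples by prediction confidence.

    probs: list of predicted positive probabilities, one per unlabeled sample.
    low, high: illustrative thresholds (assumptions, not from the paper).
    Returns three index lists: confident positives (p >= high),
    confident negatives (p <= low), and the remaining unlabeled samples.
    """
    confident_pos = [i for i, p in enumerate(probs) if p >= high]
    confident_neg = [i for i, p in enumerate(probs) if p <= low]
    remaining = [i for i, p in enumerate(probs) if low < p < high]
    return confident_pos, confident_neg, remaining


probs = [0.02, 0.95, 0.5, 0.08, 0.91, 0.4]
pos_idx, neg_idx, rest_idx = select_confident(probs)
print(pos_idx, neg_idx, rest_idx)  # [1, 4] [0, 3] [2, 5]
```

In such a pipeline, the confident positives and negatives would serve as the labeled set for an SSL method, while the remaining indices stay unlabeled.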
Article Contribution: Dr Zhuowei Wang is currently a Postdoctoral Researcher at CSIRO, Australia. He graduated with a PhD in Computer Science from the University of Technology Sydney, with research interests in computer vision, machine learning, and deep learning. He has published numerous papers in top-tier conferences and journals and has received several awards for his work. He is currently developing new machine learning algorithms for medical image analysis at CSIRO. He aims to make machine learning more accessible and effective for researchers and practitioners in a variety of fields.