• Training Neural Networks with Limited and Unlabeled Dataset

    March 2, 2023, Noor E Karishma Shaik

    Article by Zhuowei Wang, IEEE Young Professional

    Artificial intelligence refers to computer programs, or algorithms, that use data to make decisions or predictions. To build an algorithm, scientists might create a set of rules, or instructions, for the computer to follow so it can analyze data and make a decision. With other artificial intelligence approaches, like machine learning, the algorithm teaches itself how to analyze and interpret data. As such, machine learning algorithms may pick up on patterns that are not readily discernible to the human eye or brain. And as these algorithms are exposed to new data, their ability to learn and interpret that data improves.

    In the traditional binary classification problem, the classifier is trained on a dataset with both positive and negative instances. In practice, negative samples may be difficult to obtain or label for two reasons: (i) negative samples are too difficult to acquire, and (ii) negative samples are too diverse to classify. For example, while we can identify clients who have watched at least one Harry Potter movie as being interested in this category, we cannot be certain about the interests of a client who has never watched a Harry Potter movie. Unlike traditional supervised learning, where a classifier is learned by taking full advantage of both positive and negative samples, Positive and Unlabeled Learning (PUL) targets training a binary classifier with only positive and unlabeled data. PU Learning frees humans from exhaustively collecting and annotating negative samples, and could be widely adopted in disease diagnosis, deceptive review detection, and web data mining.
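    To make the PU setting concrete, the following is a minimal sketch of what such a dataset looks like: every sample carries a PU label s that says only whether it is a known positive (s = 1) or unlabeled (s = 0); unlabeled samples may be either true positives or true negatives. All names and the toy distributions are illustrative assumptions, not from the paper.

```python
import numpy as np

# Toy PU dataset: features X and PU labels s, where s = 1 marks a
# known positive and s = 0 marks an unlabeled sample (of unknown class).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(20, 2))   # samples known to be positive
X_unl = rng.normal(loc=0.0, size=(80, 2))   # unlabeled mix of both classes
X = np.vstack([X_pos, X_unl])
s = np.concatenate([np.ones(20), np.zeros(80)])  # PU labels, not true classes
```

    Note that s is not the true class label: a standard binary classifier trained directly on s would wrongly treat every unlabeled sample as negative.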

    Researchers at the Australian Artificial Intelligence Institute (AAII) at the University of Technology Sydney conducted a preliminary experiment on CIFAR-10. They plotted the histogram of the positive probability of each unlabeled sample predicted by the previous state-of-the-art models and observed an interesting phenomenon: at the late stage of training, the distribution of positive and negative predictions becomes less polarized, and more negative samples are wrongly regarded as positive ones.

    Dr Zhuowei Wang, the author of the paper entitled Positive Unlabeled Learning by Semi-supervised Learning published in IEEE ICIP, 2022, discussed the related problems. He proposed to first identify confident positive and negative samples in the unlabeled set, and then handle these confident samples and the remaining unlabeled data by semi-supervised learning (SSL). More specifically, this method first leverages the current network to estimate the positive probability of each sample in the original unlabeled set, based on which it incrementally identifies the most confident positive and negative samples. Those chosen confident samples then form the labeled set, while the remaining samples are treated as the unlabeled set. In this way, the problem is transferred to an SSL setting so that an SSL backbone can be used to leverage the unlabeled samples. This method can be applied to various aspects of our life, such as COVID-19 diagnosis, medical analysis, healthcare, and IoT.
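    The selection step described above can be sketched as follows: given the network's predicted positive probabilities for the unlabeled set, promote the highest-scoring samples to confident positives and the lowest-scoring ones to confident negatives, leaving the rest unlabeled for the SSL backbone. The function name, the fraction-based thresholds, and the toy probabilities below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_confident_samples(probs, pos_frac=0.1, neg_frac=0.1):
    """Split unlabeled samples into confident positives/negatives and the rest.

    probs: positive probabilities predicted by the current network.
    pos_frac / neg_frac: fractions promoted per round (assumed hyperparameters;
    in practice these could grow dynamically as training progresses).
    """
    n = len(probs)
    order = np.argsort(probs)                        # indices, ascending by score
    n_neg, n_pos = int(n * neg_frac), int(n * pos_frac)
    neg_idx = order[:n_neg]                          # lowest scores -> confident negatives
    pos_idx = order[n - n_pos:]                      # highest scores -> confident positives
    unlabeled_idx = order[n_neg: n - n_pos]          # left for the SSL backbone
    return pos_idx, neg_idx, unlabeled_idx

# Toy probabilities standing in for a network's predictions on 10 samples.
probs = np.array([0.95, 0.05, 0.5, 0.9, 0.1, 0.6, 0.4, 0.85, 0.15, 0.55])
pos_idx, neg_idx, unlabeled_idx = select_confident_samples(
    probs, pos_frac=0.2, neg_frac=0.2
)
```

    The confident indices and their pseudo-labels would then seed the labeled set of a standard SSL method, with the remaining indices kept as its unlabeled set.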

    Article Contribution: Dr Zhuowei Wang is currently working as a postdoctoral researcher at CSIRO, Australia. He graduated with a PhD in Computer Science from the University of Technology Sydney.