Estimating the post-click conversion rate (CVR) accurately in ranking systems is crucial for industrial applications. However, this task faces challenges of data sparsity and selection bias, which hinder accurate ranking. Previous approaches attempting to address these challenges often involve a trade-off: either training models on the entire exposure space without unbiased CVR estimation, or providing unbiased CVR estimation without modeling CVR across the entire exposure space. To overcome this trade-off, we propose the Entire-space Weighted Area Under the Curve (EWAUC) framework, which formulates the CVR estimation task as an AUC optimization problem. EWAUC leverages sample reweighting techniques to handle selection bias, and handles data sparsity by employing a pairwise AUC risk, which incorporates more information from limited clicked data than cross-entropy does. In order to model CVR across the entire exposure space in an unbiased manner, EWAUC treats the exposure data as both conversion data and non-conversion data. The properties of AUC risk guarantee the unbiased nature of the entire-space modeling. We provide comprehensive theoretical analysis to validate the unbiased nature of our approach. Additionally, extensive experiments conducted on real-world datasets demonstrate that our approach outperforms state-of-the-art methods in terms of ranking performance for the CVR estimation task.
@inproceedings{liu2024recsys,author={Liu, Yu and Jia, Qinglin and Shi, Shuting and Wu, Chuhan and Du, Zhaocheng and Xie, Zheng and Tang, Ruiming and Zhang, Muyu and Li, Ming},booktitle={The 18th ACM Conference on Recommender Systems},title={Ranking-Aware Unbiased Post-Click Conversion Rate Estimation via AUC Optimization on Entire Exposure Space},year={2024},}
ICML
Ambiguity-Aware Abductive Learning
Hao-Yuan He, Hui Sun, Zheng Xie, and Ming Li
In The 41st International Conference on Machine Learning, 2024.
Abductive Learning (ABL) is a promising framework for integrating sub-symbolic perception and logical reasoning through abduction. In this case, the abduction process provides supervision for the perception model from the background knowledge. Nevertheless, this process naturally contains uncertainty, since the knowledge base may be satisfied by numerous potential candidates. This implies that the result of the abduction process, i.e., a set of candidates, is ambiguous; both correct and incorrect candidates are mixed in this set. The prior art of abductive learning selects the candidate that has the minimal inconsistency of the knowledge base. However, this method overlooks the ambiguity in the abduction process and is prone to error when it fails to identify the correct candidates. To address this, we propose Ambiguity-Aware Abductive Learning (A3BL), which evaluates all potential candidates and their probabilities, thus preventing the model from falling into sub-optimal solutions. Both experimental results and theoretical analyses prove that A3BL markedly enhances ABL by efficiently exploiting the ambiguous abduced supervision.
@inproceedings{he2024wsabl,author={He, Hao-Yuan and Sun, Hui and Xie, Zheng and Li, Ming},booktitle={The 41st International Conference on Machine Learning},title={Ambiguity-Aware Abductive Learning},year={2024},}
TPAMI
Weakly Supervised AUC Optimization: A Unified Partial AUC Approach
Zheng Xie, Yu Liu, Hao-Yuan He, Ming Li, and Zhi-Hua Zhou
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
We propose WSAUC, a unified and robust framework for weakly supervised AUC optimization. The framework covers multiple scenarios, including noisy-label AUC optimization, positive-unlabeled AUC optimization, multi-instance AUC optimization, and semi-supervised AUC optimization with or without noise. The framework achieves robust AUC optimization through a novel variant of AUC, i.e., rpAUC. Theoretical and empirical results validate the effectiveness of the framework.
@article{xie2023wsauc,title={Weakly Supervised AUC Optimization: A Unified Partial AUC Approach},journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},author={Xie, Zheng and Liu, Yu and He, Hao-Yuan and Li, Ming and Zhou, Zhi-Hua},year={2024},volume={46},number={7},pages={4780--4795}}
AAAI
AUC Optimization from Multiple Unlabeled Datasets
Zheng Xie, Yu Liu, and Ming Li
In The 38th AAAI Conference on Artificial Intelligence, 2024.
Weakly supervised learning aims to empower machine learning when perfect supervision is unavailable, and has drawn great attention from researchers. Among various types of weak supervision, one of the most challenging cases is to learn from multiple unlabeled (U) datasets with only a little knowledge of the class priors, or U^m learning for short. In this paper, we study the problem of building an AUC (area under the ROC curve) optimization model from multiple unlabeled datasets, which maximizes the pairwise ranking ability of the classifier. We propose U^m-AUC, an AUC optimization approach that converts the U^m data into a multi-label AUC optimization problem, and can be trained efficiently. We show that the proposed U^m-AUC is effective theoretically and empirically.
@inproceedings{xie2024umauc,author={Xie, Zheng and Liu, Yu and Li, Ming},booktitle={The 38th AAAI Conference on Artificial Intelligence},title={{AUC} Optimization from Multiple Unlabeled Datasets},year={2024},}
FCS
Top Pass: Improve Code Generation by Pass@k-Maximized Code Ranking
@article{topaz,author={Lyu, Zhi-Cun and Li, Xin-Ye and Xie, Zheng and Li, Ming},journal={Frontiers of Computer Science},title={{Top Pass}: Improve Code Generation by Pass@k-Maximized Code Ranking},year={2024},note={in press},}
ARXIV
Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation
Code generation aims to automatically generate source code from high-level task specifications, which can significantly increase the productivity of software engineering. Recently, approaches based on large language models (LLMs) have shown remarkable code generation abilities on simple tasks. However, generating code for more complex tasks, such as competition-level problems, remains challenging. In this paper, we introduce the Brainstorm framework for code generation. It leverages a brainstorming step that generates and selects diverse thoughts on the problem to facilitate algorithmic reasoning, where the thoughts are possible blueprints for solving the problem. We demonstrate that Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems, resulting in a more than 50% increase in the pass@k metrics for ChatGPT on the CodeContests benchmark, achieving state-of-the-art performance. Furthermore, our experiments conducted on LeetCode contests show that our framework boosts the ability of ChatGPT to a level comparable to that of human programmers.
@inproceedings{brainstorm,author={Li, Xin-Ye and Xue, Jiang-Tian and Xie, Zheng and Li, Ming},booktitle={preprint},title={Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation},year={2024},}
2023
ICDM
Beyond Lexical Consistency: Preserving Semantic Consistency for Program Translation
Yali Du, Yi-Fan Ma, Zheng Xie, and Ming Li
In The 23rd IEEE International Conference on Data Mining, 2023.
Program translation aims to convert the input programs from one programming language to another. Automatic program translation is a prized target of software engineering research, which leverages the reusability of projects and improves the efficiency of development. Recently, thanks to the rapid development of deep learning model architectures and the availability of large-scale parallel corpora of programs, the performance of program translation has been greatly improved. However, the existing program translation models are still far from satisfactory in terms of the quality of translated programs. In this paper, we argue that a major limitation of the current approaches is the lack of consideration of semantic consistency. Beyond lexical consistency, semantic consistency is also critical for the task. To make the program translation model more semantically aware, we propose a general framework named Preserving Semantic Consistency for Program Translation (PSCPT), which considers semantic consistency with regularization in the training objective of program translation and can be easily applied to all encoder-decoder methods with various neural networks (e.g., LSTM, Transformer) as the backbone. We conduct extensive experiments in 7 general programming languages. Experimental results show that with CodeBERT as the backbone, our approach outperforms not only the state-of-the-art open-source models but also the commercial closed large language models (e.g., text-davinci-002, text-davinci-003) on the program translation task. Our replication package (including code, data, etc.) is publicly available at https://github.com/duyali2000/PSCPT .
@inproceedings{du2023beyond,author={Du, Yali and Ma, Yi-Fan and Xie, Zheng and Li, Ming},booktitle={The 23rd IEEE International Conference on Data Mining},title={Beyond Lexical Consistency: Preserving Semantic Consistency for Program Translation},year={2023},}
AAAI
Cooperative and Adversarial Learning: Co-Enhancing Discriminability and Transferability in Domain Adaptation
Hui Sun, Zheng Xie, Xin-Ye Li, and Ming Li
In The 37th AAAI Conference on Artificial Intelligence, 2023.
We propose the CALE framework to unify and enhance the two main objectives of domain adaptation: discriminability and transferability. To achieve this, CALE swaps the cooperative examples of the two objectives, enabling the learning of discriminability and transferability to mutually benefit each other. Additionally, adversarial examples are utilized to enhance the robustness of the two objectives themselves. The framework can be applied to improve current domain adaptation approaches and has been shown to outperform existing state-of-the-art methods.
@inproceedings{sun2023cale,author={Sun, Hui and Xie, Zheng and Li, Xin-Ye and Li, Ming},booktitle={The 37th AAAI Conference on Artificial Intelligence},title={Cooperative and Adversarial Learning: Co-Enhancing Discriminability and Transferability in Domain Adaptation},year={2023},}
AAAI
Semi-Supervised Learning with Support Isolation by Small-Paced Self-Training
Zheng Xie, Hui Sun, and Ming Li
In The 37th AAAI Conference on Artificial Intelligence, 2023.
In this paper, we address a special scenario of semi-supervised learning, where the missing labels are caused by a preceding filtering mechanism, i.e., an instance can enter a subsequent process in which its label is revealed if and only if it passes the filtering mechanism. The rejected instances are prohibited from entering the subsequent labeling process for economic or ethical reasons, making the supports of the labeled and unlabeled distributions isolated from each other. In this case, classical semi-supervised learning approaches are prone to fail. We propose a Small-Paced Self-Training framework, which iteratively discovers labeled and unlabeled instance subspaces with bounded Wasserstein distance. We theoretically prove that such a framework may achieve provably low error on the pseudo labels during learning, and validate the approach through experiments.
@inproceedings{xie2023spst,author={Xie, Zheng and Sun, Hui and Li, Ming},booktitle={The 37th AAAI Conference on Artificial Intelligence},title={Semi-Supervised Learning with Support Isolation by Small-Paced Self-Training},year={2023},}
2018
IJCAI
Cutting the Software Building Efforts in Continuous Integration by Semi-Supervised Online AUC Optimization
Zheng Xie and Ming Li
In The 27th International Joint Conference on Artificial Intelligence, 2018.
In this paper, we propose a semi-supervised online AUC optimization algorithm, namely SOLA. This algorithm is suitable for tasks that involve streaming data, label scarcity, and class imbalance. We apply it to build outcome prediction in software continuous integration, where it achieves superior performance.
@inproceedings{xie2018onlinesemiauc,author={Xie, Zheng and Li, Ming},booktitle={The 27th International Joint Conference on Artificial Intelligence},title={Cutting the Software Building Efforts in Continuous Integration by Semi-Supervised Online AUC Optimization},year={2018},}
AAAI
Semi-Supervised AUC Optimization without Guessing Labels of Unlabeled Data
Zheng Xie and Ming Li
In The 32nd AAAI Conference on Artificial Intelligence, 2018.
We prove theoretical properties of AUC optimization under semi-supervised and positive-unlabeled learning scenarios, and propose a simple yet effective algorithm for semi-supervised and positive-unlabeled AUC optimization. Our algorithm outperforms more elaborate approaches to semi-supervised and positive-unlabeled AUC optimization.
@inproceedings{xie2018semiauc,author={Xie, Zheng and Li, Ming},booktitle={The 32nd AAAI Conference on Artificial Intelligence},title={Semi-Supervised AUC Optimization without Guessing Labels of Unlabeled Data},year={2018},}
2017
JOS
Cost-Sensitive Margin Distribution Optimization for Software Bug Localization
The software bug localization problem suffers from data imbalance and from the heterogeneous structure of code and natural language. To tackle this problem, we propose a cost-sensitive margin distribution optimization method to improve classification under imbalanced scenarios, and design a network architecture for processing programming and natural language. Experimental results validate the effectiveness of our method.
@article{xie2017costsensitive,author={Xie, Zheng and Li, Ming},journal={Journal of Software},number={11},title={Cost-Sensitive Margin Distribution Optimization for Software Bug Localization},volume={28},year={2017},}
CCML
Cost-Sensitive Margin Distribution Optimization for Software Bug Localization
The software bug localization problem suffers from data imbalance and from the heterogeneous structure of code and natural language. To tackle this problem, we propose a cost-sensitive margin distribution optimization method to improve classification under imbalanced scenarios, and design a network architecture for processing programming and natural language. Experimental results validate the effectiveness of our method.
@inproceedings{xie2017ccml,author={Xie, Zheng and Li, Ming},booktitle={China Conference on Machine Learning},title={Cost-Sensitive Margin Distribution Optimization for Software Bug Localization},year={2017},}
ICMC
Music Style Analysis among Haydn, Mozart and Beethoven: an Unsupervised Machine Learning Approach
Ru Wen, Zheng Xie, Kai Chen, Ruoxuan Guo, Kuan Xu, Wenmin Huang, Jiyuan Tian, and Jiang Wu
In The 43rd International Computer Music Conference, 2017.
We propose an unsupervised music analysis method. We introduce a feature extraction method for consecutive note pitch patterns and use clustering methods to mine music styles. We apply our method to a newly built corpus of Haydn, Mozart, and Beethoven. The discovered patterns fit the Implication-Realization theory, which confirms the validity of our approach.
@inproceedings{xie2017music,author={Wen, Ru and Xie, Zheng and Chen, Kai and Guo, Ruoxuan and Xu, Kuan and Huang, Wenmin and Tian, Jiyuan and Wu, Jiang},booktitle={The 43rd International Computer Music Conference},title={Music Style Analysis among Haydn, Mozart and Beethoven: an Unsupervised Machine Learning Approach},year={2017},}