Recent Publications
Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data
Antonis Antoniades, Xinyi Wang, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang
ICLR, 2025
Paper
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Fränken, Nick Haber, Chelsea Finn
Preprint
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
Alex Havrilla, Andrew Dai, Laura O’Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, … 14 more authors
Preprint
Generative Reward Models
Dakota Mahan*, Duy Van Phung*, Rafael Rafailov*, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak*
Preprint
Understanding and Improving Language Models Through a Data-Centric Lens
Alon Albalak
Dissertation, 2024
Paper
A Survey on Data Selection for Language Models
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
TMLR, 2024
Paper Github
DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, … 14 authors, … Alon Albalak, … 40 more authors
NeurIPS 2024, Datasets and Benchmarks Track
Preprint
The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Shayne Longpre, Stella Biderman, Alon Albalak, … 20 more authors
TMLR, 2024
Paper Cheatsheet Website
A Mathematical Framework, a Taxonomy of Modeling Paradigms, and a Suite of Learning Techniques for Neural-Symbolic Systems
Charles Dickens, Connor Pryor, Changyu Gao, Alon Albalak, Eriq Augustine, William Wang, Stephen Wright, Lise Getoor
Preprint
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Bo Peng*, Daniel Goldstein*, Quentin Anthony*, Alon Albalak, … 23 more authors
COLM 2024
Paper
Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
Alon Albalak, Colin Raffel, William Yang Wang
NeurIPS 2023
Paper Code Presentation
Efficient Online Data Mixing For Language Model Pre-Training
Alon Albalak, Liangming Pan, Colin Raffel, William Yang Wang
NeurIPS 2023, Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models
Paper
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng*, Eric Alcaide*, Quentin Anthony*, Alon Albalak, … 26 more authors
EMNLP 2023
Paper Code
Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning
Liangming Pan, Alon Albalak, Xinyi Wang, William Yang Wang
EMNLP 2023
Paper Code
CausalDialogue: Modeling Utterance-level Causality in Conversations
Yi-Lin Tuan, Alon Albalak, Wenda Xu, Michael Saxon, Connor Pryor, Lise Getoor, William Yang Wang
ACL 2023
Paper
NeuPSL: Neural Probabilistic Soft Logic
Connor Pryor, Charles Dickens, Eriq Augustine, Alon Albalak, William Yang Wang, Lise Getoor
IJCAI 2023
Paper
Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains
Alon Albalak, Sharon Levy, William Yang Wang
EACL 2023, Demonstration Track
Paper Code
FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue
Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor Pryor, Luke Yoffe, Deepak Ramachandran, Lise Getoor, Jay Pujara, William Yang Wang
EMNLP 2022
Paper Code Benchmark Website
Making Something out of Nothing: Building Robust Task-oriented Dialogue Systems from Scratch
Zekun Li, Hong Wang, Alon Albalak, Yingrui Yang, Jing Qian, Shiyang Li, Xifeng Yan
Alexa Prize Taskbot Challenge 2022
Paper
D-REX: Dialogue Relation Extraction with Explanations
Alon Albalak, Varun Embar, Yi-Lin Tuan, Lise Getoor, William Yang Wang
ACL 2022, ConvAI Workshop
Paper Code