About Me - Alon Albalak

I am the Data Team Lead at SynthLabs, where I focus on post-training for large foundation models. I received my Ph.D. in the NLP Group at the University of California, Santa Barbara, advised by professors William Yang Wang and Xifeng Yan. During the first year of my Ph.D. I was gratefully supported by an NSF IGERT Fellowship. While pursuing my Ph.D. I took a year off from research to work at a financial technology startup, Theta Lake. Prior to my Ph.D. I received my B.S. in mathematics at Wayne State University, with research advised by Gang George Yin.

The primary focus of my research has been on applying ML methods to NLP to improve data quality and model performance. In my research I have explored the use of methods including multi-armed bandits, data selection, multitask learning, transfer learning, reinforcement learning, and neuro-symbolic methods. Additionally, I have a wide array of interests in other topics including model efficiency, logic and reasoning, conversational AI, multilinguality, and other modes of information (vision, robotics, multimodal).

NEWS

[02/2025] New Preprint: “Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models”

This work presents Big-Math, a dataset designed to push the boundaries of reinforcement learning (RL) for reasoning in large language models (LLMs). Comprising over 250,000 curated math problems (an order of magnitude larger than existing datasets), Big-Math bridges the gap between data quality and quantity. Our rigorous filtering pipeline ensures that problems are open-ended and have uniquely verifiable, closed-form solutions, making this dataset ideal for RL-based training. Additionally, we introduce Big-Math-Reformulated, a subset of 47,000 multiple-choice problems transformed into open-ended formats. Open-sourced to accelerate community-driven innovation, Big-Math provides the foundation for scaling experiments, algorithm benchmarking, and advancing mathematical reasoning in AI.

[01/2025] “Generalization vs. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data” will be presented at ICLR 2025

[01/2025] New Preprint: “Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought”

In this work, we introduce a novel framework called Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by modeling the underlying reasoning process. We present a pipeline for training models to produce Meta-CoTs, using process supervision, synthetic data, and search algorithms, and highlight open research questions and the potential for more advanced, human-like reasoning in AI. This work marks the next step in reasoning, moving from teaching LMs what to think, to teaching them how to think.

[12/2024] New Preprint: “Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models”

Our new paper proposes evaluating synthetic data generation algorithms based on three key characteristics: quality, diversity, and complexity (QDC), which are crucial for model generalization and performance. We highlight the importance of balancing QDC in synthetic data and argue that focusing solely on quality limits model diversity, which is critical for self-improvement and reinforcement learning algorithms.

[12/2024] “The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources” will appear in the Transactions of Machine Learning Research (TMLR)

This cheatsheet compiles resources and tools for the full lifecycle of model development including: data collection, preprocessing, and documentation; model pretraining and finetuning; environmental impact estimation; assessing risks and harms; as well as model documentation, release, and licensing.
Check out my blog for more details!
Also, check out the up-to-date cheatsheet here.

[10/2024] New Preprint: “Generative Reward Models”

In this work, we propose GenRM, an iterative algorithm that combines Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) to improve the alignment of synthetic preference labels with human judgments. Our empirical results demonstrate that GenRM achieves comparable or better performance than traditional Bradley-Terry models on in- and out-of-distribution tasks, highlighting the potential of this hybrid approach for enhancing the quality of synthetic preference labels in LLM training.

[09/2024] “DataComp-LM: In search of the next generation of training sets for language models” will be presented at NeurIPS 2024

[08/2024] “Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence” will be presented at COLM 2024

[07/2024] “A Survey on Data Selection for Language Models” will appear in the Transactions of Machine Learning Research (TMLR)

This survey presents a comprehensive review of data selection methods and related areas, providing a taxonomy of existing approaches that allows us to point out gaps in the research and propose promising avenues for future work. The aim of this resource is to accelerate progress on data-centric research, for both new and established researchers!
We also compiled a paper list.

[04/2024] I started as the Data Team Lead and a Member of Technical Staff at SynthLabs

[04/2024] My dissertation, “Understanding and Improving Language Models Through a Data-Centric Lens”, was accepted by the University of California, Santa Barbara!

[10/2023] “Efficient Online Data Mixing For Language Model Pre-Training” was accepted as a spotlight at the R0-FoMo workshop at NeurIPS

This work presents an extremely efficient online data mixing algorithm that reaches the same model perplexity as the next best method (DoReMi) with 19% fewer iterations, and improves downstream performance by 1.9% while adding a minuscule 0.000007% overhead.
Check out the pre-print.

[10/2023] Accepted to EMNLP 2023 - “RWKV: Reinventing RNNs for the Transformer Era”

RWKV is a new model architecture that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs.
Check out the paper and code for more information.

[10/2023] Accepted at EMNLP 2023 - “Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning”

This work demonstrates that combining Large Language Models (LLMs) with symbolic solvers makes for a strong method for solving logical problems.
Check out the paper and the code.

[09/2023] Accepted to NeurIPS 2023 - “Improving Few-shot Generalization by Exploring and Exploiting Auxiliary Data”

This work presents two methods for few-shot learning with auxiliary data, inspired by multi-armed bandits. These methods show a significant improvement (9%) over multitask training followed by fine-tuning.
Check out the paper and the code for more information.
