About Me - Alon Albalak

I am the Data Team Lead at SynthLabs, where we focus on post-training for large foundation models. I received my Ph.D. in the NLP Group at the University of California, Santa Barbara, advised by professors William Yang Wang and Xifeng Yan. I am grateful to have been supported by an NSF IGERT Fellowship during the first year of my Ph.D. While pursuing my Ph.D., I took a year off from research to work at a financial technology startup, Theta Lake. Prior to my Ph.D., I received my B.S. in mathematics from Wayne State University, with research advised by Gang George Yin.

The primary focus of my research has been on applying ML methods to NLP to improve data quality and model performance. In my research I have explored the use of methods including multi-armed bandits, data selection, multitask learning, transfer learning, reinforcement learning, and neuro-symbolic methods. Additionally, I have a wide array of interests in other topics including model efficiency, logic and reasoning, conversational AI, and multilingual models.


** NEWS **

[10/2024] New preprint: “Generative Reward Models”

[09/2024] “DataComp-LM: In search of the next generation of training sets for language models” will be presented at NeurIPS 2024

[08/2024] “Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence” will be presented at COLM 2024

[07/2024] “A Survey on Data Selection for Language Models” will appear in Transactions on Machine Learning Research (TMLR)

This survey presents a comprehensive review of data selection methods and related areas, providing a taxonomy of existing approaches that lets us identify gaps in the research and propose promising avenues for future work. The aim of this resource is to accelerate progress on data-centric research, for both new and established researchers!
We also compiled a paper list.

[04/2024] I started as Data Team Lead and Member of Technical Staff at SynthLabs

[04/2024] My dissertation, “Understanding and Improving Language Models Through a Data-Centric Lens”, was accepted by the University of California, Santa Barbara!

[02/2024] New resource on foundation model best practices, The Foundation Model Development Cheatsheet

This cheatsheet compiles resources and tools for the full lifecycle of model development including: data collection, preprocessing, and documentation; model pretraining and finetuning; environmental impact estimation; assessing risks and harms; as well as model documentation, release, and licensing.
Check out my blog for more details!
Also, check out the up-to-date cheatsheet here, or a static PDF of the cheatsheet here.

[10/2023] “Efficient Online Data Mixing For Language Model Pre-Training” was accepted as a spotlight at the r0-FoMo workshop at NeurIPS

This work presents an extremely efficient online data mixing algorithm that reaches the same model perplexity as the next best method (DoReMi) in 19% fewer iterations, and improves downstream performance by 1.9% while adding a minuscule 0.000007% overhead.
Check out the preprint.

[10/2023] Accepted to EMNLP 2023 - “RWKV: Reinventing RNNs for the Transformer Era”

RWKV is a new model architecture that combines the efficient, parallelizable training of Transformers with the efficient inference of RNNs.
Check out the paper and code for more information.

[10/2023] Accepted to EMNLP 2023 - “Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning”

This work demonstrates that combining Large Language Models (LLMs) with symbolic solvers is an effective approach to solving logical reasoning problems.
Check out the paper and the code.

[09/2023] Accepted to NeurIPS 2023 - “Improving Few-shot Generalization by Exploring and Exploiting Auxiliary Data”

This work presents two methods for few-shot learning with auxiliary data, inspired by multi-armed bandits. These methods show a significant 9% improvement over multitask learning followed by fine-tuning.
Check out the paper and the code for more information.