About Me - Alon Albalak

I am currently on the job market
I am interested in applying my data-centric background to many areas of ML research, including LLM pretraining, alignment, fine-tuning, tool use, retrieval-augmentation, and many more!
Please reach out by email if my skills match up with your team’s research goals.



I am a fifth-year Ph.D. candidate in the NLP Group at the University of California, Santa Barbara, advised by professors William Yang Wang and Xifeng Yan. During the first year of my Ph.D. I was gratefully supported by an NSF IGERT Fellowship. While pursuing my Ph.D. I took a year off from research to work at a financial technology startup, Theta Lake. Prior to my Ph.D. I received my B.S. in mathematics at Wayne State University, with research advised by Gang George Yin.

The primary focus of my PhD research has been on applying ML methods to NLP to improve data efficiency and model performance. In my research I have explored the use of methods including multi-armed bandits, data selection, multitask learning, transfer learning, reinforcement learning, and neuro-symbolic methods. Additionally, I have a wide array of interests in other topics including model efficiency, logic and reasoning, conversational AI, and multilingual models.


** NEWS **

[02/2024] New survey paper! “A Survey on Data Selection for Language Models”

This survey presents a comprehensive review of data selection methods and related areas, providing a taxonomy of existing approaches that allows us to identify gaps in research and propose promising avenues for future work. The aim of this resource is to accelerate progress on data-centric research, for both new and established researchers!
We also compiled a paper list.

[02/2024] New resource on foundation model best practices, The Foundation Model Development Cheatsheet

This cheatsheet compiles resources and tools for the full lifecycle of model development including: data collection, preprocessing, and documentation; model pretraining and finetuning; environmental impact estimation; assessing risks and harms; as well as model documentation, release, and licensing.
Check out my blog for more details!
Also, check out the up-to-date cheatsheet here, or a static pdf of the cheatsheet here.

[10/2023] “Efficient Online Data Mixing For Language Model Pre-Training” was accepted as a spotlight to the r0-FoMo workshop at NeurIPS

This work presents an extremely efficient online data mixing algorithm that reaches the same model perplexity as the next best method (DoReMi) with 19% fewer iterations, and improves downstream performance by 1.9% while adding a minuscule 0.000007% overhead.
Check out the pre-print

[10/2023] Accepted to EMNLP 2023 - “RWKV: Reinventing RNNs for the Transformer Era”

RWKV is a new model architecture that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs.
Check out the paper and code for more information.

[10/2023] Accepted at EMNLP 2023 - “Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning”

This work demonstrates that combining Large Language Models (LLMs) with symbolic solvers makes for a strong method for solving logical problems.
Check out the paper and the code.

[09/2023] Accepted to NeurIPS 2023 - “Improving Few-shot Generalization by Exploring and Exploiting Auxiliary Data”

This work presents two methods for few-shot learning with auxiliary data, inspired by multi-armed bandits. These methods show significant improvement over multi-tasking followed by fine-tuning (9% improvement).
Check out the paper and the code for more information.

[05/2023] Accepted to ACL 2023 - “Modeling Utterance-level Causality in Conversations”

Check out the paper for more details.

[04/2023] Accepted at IJCAI 2023 - “NeuPSL: Neural Probabilistic Soft Logic”

NeuPSL is a neuro-symbolic framework that unites the powerful symbolic reasoning of PSL with the representation learning of deep neural networks.
Check out the paper for more.

[02/2023] The FETA benchmark on task transfer will be a shared task at the NLP for ConvAI workshop at ACL ’23!

*Awards* The FETA benchmark will have prizes for top scorers and most innovative approaches!

*Purpose* The FETA benchmark shared task aims to bring together researchers from a variety of backgrounds and compare their best ideas for task transfer. The benchmark allows for comparing many different methods including: instruction/prompt fine-tuning, source-task selection, multitask learning, continued pre-training, meta-learning, and many more!

See detailed rules, starter code, and submission instructions on the website.

[02/2023] “Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains” was accepted at EACL 2023!

This work addresses cross-lingual open-retrieval QA, the low-resource question-answering setting where supporting documents may not be in the same language as the query. This is an especially important problem in emergent domains, where the majority of supporting documents are likely to be available in only a limited number of languages. Check out the paper and code for more information.

[01/2023] The Transfer Learning for NLP Workshop (TL4NLP) is available to watch!

TL4NLP explored insights and advances on transfer learning, including insightful talks from our guest speakers and hot takes from our debaters.
TL4NLP features talks from Mike Lewis, Percy Liang/Ananya Kumar, Graham Neubig, David Adelani, and Jonas Pfeiffer, as well as a debate between Sara Hooker and Kyunghyun Cho.
Check out the talks, topics, and more at tl4nlp.github.io.
Find recorded talks here.

[10/2022] My paper on benchmarking task transfer will be at EMNLP ’22: FETA

FETA is the largest NLP benchmark for intra-dataset task transfer, where task transfer is isolated from domain shift.
Check out the paper and our GitHub repo for more.