David Heineman

Hey! I'm David 👋

I'm a pre-doctoral young investigator at the Allen Institute for AI, working to improve language model pre-training and evaluation.


Research interests

Building language models can, and should, be a rigorous science: I believe our field's biggest bottleneck is the quality of our experimental methodology [1] and the strength of our evaluation signal [2]. Improving both requires better interpretations of our existing measures of capability [3], new tools for observing how language models express behavior [4], and evaluation tasks that meaningfully connect to our ability to learn and generate language [5, 6].

I work on these problems at Ai2 as part of the Open Language Model (OLMo) project, advised by Kyle Lo and Jesse Dodge. Previously, I completed my undergrad at Georgia Tech 🐝, where I was fortunate to be advised by Prof. Wei Xu and to work with Yao Dou and Mounica Maddela. I've also spent a few summers as an intern at AWS and at Patientco, a healthcare startup. I enjoy reading, hiking, and making homebrew nitrogen cold brew. ☕️ ⛰️


Publications & Preprints ✒️

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation [code, data]

David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
preprint, 2025

2 OLMo 2 Furious [code, models, data]

Pete Walsh*, Luca Soldaini*, Dirk Groeneveld*, Kyle Lo*, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, ..., David Heineman, ..., Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
COLM, 2025

Establishing Task Scaling Laws via Compute-Efficient Model Ladders [code]

Akshita Bhagia*, Jiacheng Liu*, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
COLM, 2025

Evaluating LLMs on Chinese Idiom Translation

Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, Wei Xu
COLM, 2025

DataDecide: How to Predict Best Pretraining Data with Small Experiments [code, models]

Ian Magnusson*, Nguyen Tai*, Ben Bogin*, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
ICML, 2025

Improving Minimum Bayes Risk Decoding with Multi-Prompt [code]

David Heineman, Yao Dou, Wei Xu
EMNLP, 2024

Towards a Path Dependent Account of Category Fluency [code]

David Heineman, Reba Koenen, Sashank Varma
CogSci, 2024

Thresh: Unified, Customizable and Deployable Fine-Grained Text Evaluation [live tool]

David Heineman, Yao Dou, Wei Xu
EMNLP Demo, 2023

Edit-level Simplification Evaluation using SALSA 💃 [code/data, metric]

David Heineman, Yao Dou, Mounica Maddela, Wei Xu
EMNLP, 2023

LENS: A Learnable Evaluation Metric for Text Simplification [code/data, metric]

Mounica Maddela*, Yao Dou*, David Heineman, Wei Xu
ACL, 2023

* = equal contribution


My past work 🌳

Recommendations

A few interesting corners of the internet (and bookshelf) that I find worth checking out!

... to flip through


Games, Puzzles, and Computation by Erik Demaine

The Corrections by Jonathan Franzen

Society Must be Defended by Michel Foucault

Oblivion by David Foster Wallace


I also enjoy trying new coffee shops. Here are some recommendations across Atlanta from my undergrad years, and a growing list across Seattle.