Projects

Harbor

A framework that enables agent evaluation and rollout. I worked on many aspects of this project, including establishing a universal trajectory format (*ATIC*), integrating multiple agents (supporting both installation-based and mounting-based setups), curating a multi-turn oracle dataset for SFT, integrating with SkyRL for RL, and developing a terminal-native agent.

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

A terminal benchmark used in the system cards of popular LLMs such as Claude 4.5 and GLM 4.5. I contributed some tasks, reviewed most tasks, and developed much of the evaluation harness.

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

TheAgentCompany is the first benchmark that examines AI’s ability to complete consequential real-world tasks. I designed a reproducible, extensible evaluation framework from scratch, led a team of 10+ software engineers on the implementation, and co-authored the paper as one of the primary authors.

OpenHands - An Open Platform for AI Software Developers as Generalist Agents

OpenHands is the most popular open-source coding agent. I am one of its early cofounders and have been an active maintainer since 2024. It has been downloaded more than 7 million times, covered by multiple media outlets, and cited by 300+ academic papers.

JanusGraph - leading open-source graph database

Graph databases are a fundamental building block of AI applications that leverage private domain knowledge. Since 2019, I have served on the Technical Steering Committee and led the development of JanusGraph, the most popular open-source distributed graph database. It has been downloaded more than 500k times.

PLOVER: Virtualized State Machine Replication System

PLOVER is the first Virtualized SMR (VSMR) system that achieves fast, multi-core scalable virtual machine fault tolerance.