David Snyder Profile
David Snyder

@das_princeton

Followers: 25
Following: 0
Media: 4
Statuses: 13

PhD Student in the IRoM Lab at Princeton University, working on safety and generalization assurances for robots.

Joined May 2025
@das_princeton
David Snyder
4 months
(13/13) Very grateful to Haruki Nishimura and Masha Itkina in the TLU (Trustworthy Learning under Uncertainty) team at the Toyota Research Institute (TRI), as well as many additional collaborators at TRI and in the IRoM Lab at Princeton!
0
0
1
@das_princeton
David Snyder
4 months
(11/13) STEP can be thought of as a sequentialized, resource-aware version of Barnard’s Test, improving small-sample efficiency over state-of-the-art (SOTA) sequential methods in the literature, including work by Lai and recent work on safe, anytime-valid inference (SAVI).
1
0
0
@das_princeton
David Snyder
4 months
(10/13) STEP constructs decision rules by solving an offline convex optimization problem, which yields near-optimal multidimensional decision boundaries for Nmax up to ~500-1000. During evaluation, STEP can be used almost like a look-up table!
1
0
0
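For intuition, here is a minimal sketch of what the evaluation-time "look-up table" interface in (10/13) could look like. The decision table below is a crude hand-made margin rule, not the output of STEP's offline convex program, and N_MAX and the threshold are made-up numbers; only the lookup pattern at evaluation time is the point.

```python
import numpy as np

N_MAX = 200   # evaluator's rollout budget per policy (illustrative)

# Placeholder decision table: reject[n, a, b] == True means "after n paired
# rollouts, with a successes for the new policy and b for the baseline,
# declare the new policy better".  In STEP this table would come from the
# offline convex program; the crude margin rule below is only a stand-in.
n_ax = np.arange(N_MAX + 1)[:, None, None]
a_ax = np.arange(N_MAX + 1)[None, :, None]
b_ax = np.arange(N_MAX + 1)[None, None, :]
reject = (a_ax - b_ax) >= np.maximum(3, 0.3 * n_ax)

def evaluate(rollout_pairs):
    """Aggregate paired rollouts one by one; stop at the first lookup that rejects."""
    n = a = b = 0
    for x_new, x_base in rollout_pairs:   # each x is 1 (success) or 0 (failure)
        n, a, b = n + 1, a + x_new, b + x_base
        if reject[n, a, b]:               # O(1) table lookup per rollout
            return n, True                # stop early: improvement detected
    return n, False                       # budget exhausted, no decision

# Illustrative usage with synthetic rollouts from a better new policy.
rng = np.random.default_rng(0)
pairs = zip(rng.binomial(1, 0.8, N_MAX), rng.binomial(1, 0.5, N_MAX))
print(evaluate(pairs))
```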
@das_princeton
David Snyder
4 months
(9/13) Why Nmax? Policy evaluation is expensive due to limited hardware availability and limited resources for human supervision. STEP near-optimally accounts for this practical constraint and gives the evaluator significant leeway to set a conservative Nmax.
1
0
0
@das_princeton
David Snyder
4 months
(8/13) Because STEP is sequential, instead of the batch size N, the evaluator sets Nmax: the greatest number of rollouts (per policy) they are willing to run in order to detect an improvement. STEP then automatically adapts the stopping time to the difficulty of the problem.
1
0
0
@das_princeton
David Snyder
4 months
(7/13) STEP acts as a statistically rigorous evaluation procedure which adapts to the difficulty of the specific comparison instance. In essence, it is a principled way to allow the evaluator to 'peek at the data' without compromising statistical assurances!
1
0
0
@das_princeton
David Snyder
4 months
(6/13) Yes! We propose STEP, a sequential test which aggregates evaluation rollouts one-by-one and stops automatically when a desired significance level is reached. It stops quickly when the performance gap is large, and waits if the gap is small.
1
0
0
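To make the stop-when-significant pattern in (6/13) concrete, here is a minimal sketch of a generic anytime-valid sequential test on paired rollouts, in the "betting" style of the SAVI literature mentioned in (11/13). It is not STEP's optimized decision rule; alpha = 0.05 and the fixed bet lam = 0.5 are illustrative choices.

```python
import numpy as np

def sequential_compare(rollouts, alpha=0.05, lam=0.5):
    """Anytime-valid sequential comparison of paired success/failure rollouts.

    Tests H0: p_new <= p_base.  The wealth process
        W_t = prod_{i <= t} (1 + lam * (x_new_i - x_base_i))
    is a nonnegative supermartingale under H0 (each factor has mean <= 1 and
    stays positive for lam < 1), so stopping the first time W_t >= 1/alpha
    keeps the false-positive rate below alpha by Ville's inequality.
    """
    wealth, n = 1.0, 0
    for x_new, x_base in rollouts:            # one paired rollout at a time
        n += 1
        wealth *= 1.0 + lam * (x_new - x_base)
        if wealth >= 1.0 / alpha:
            return n, True                    # large gap: stops quickly
    return n, False                           # small gap: ran out of budget

# Illustrative usage: Nmax = 200 paired rollouts with a real performance gap.
rng = np.random.default_rng(0)
pairs = zip(rng.binomial(1, 0.8, 200), rng.binomial(1, 0.5, 200))
print(sequential_compare(pairs))
```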
@das_princeton
David Snyder
4 months
(5/13) This induces costly inefficiencies. Choosing a large N means that many (unnecessary) trials must be run on weak baselines; conversely, choosing a small N risks the failure to accumulate sufficiently compelling evidence of improvement. Can we do better?
1
0
0
@das_princeton
David Snyder
4 months
(4/13) … because acting on any observation of partial results invalidates statistical assurances of the test. In other words: stopping early because the results appear ‘promising enough’ or running additional trials beyond the allotted N breaks the statistical guarantee!
1
0
0
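A toy simulation makes the point in (4/13) concrete: both policies below have the same success rate, but an evaluator who re-runs a fixed-level test every 10 rollouts and stops as soon as p < 0.05 will falsely declare an improvement far more often than the nominal 5%. Fisher's exact test stands in for Barnard's here only because it is faster; all the constants are illustrative.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
N_MAX, ALPHA, PEEK_EVERY, N_SIMS = 100, 0.05, 10, 500
P_TRUE = 0.6                    # both policies share this rate, so H0 is true

false_rejects = 0
for _ in range(N_SIMS):
    new = rng.binomial(1, P_TRUE, N_MAX)
    base = rng.binomial(1, P_TRUE, N_MAX)
    for n in range(PEEK_EVERY, N_MAX + 1, PEEK_EVERY):
        # 2x2 table after n rollouts of each policy: (successes, failures).
        table = [[new[:n].sum(), n - new[:n].sum()],
                 [base[:n].sum(), n - base[:n].sum()]]
        _, p = fisher_exact(table, alternative="greater")
        if p < ALPHA:           # "promising enough": stop and claim a win
            false_rejects += 1
            break

print(f"False-positive rate with peeking: {false_rejects / N_SIMS:.2%}"
      f" (nominal {ALPHA:.0%})")
```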
@das_princeton
David Snyder
4 months
(3/13) The standard evaluation procedure in robotics is batch testing: run N trials of each policy, then apply a statistical test (e.g., Barnard’s Test). This requires the evaluator to choose N prior to the experiment and stick to it. But this is very limiting.
1
0
0
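For concreteness, here is the batch recipe from (3/13) using SciPy's implementation of Barnard's exact test; N and the success counts are made up.

```python
from scipy.stats import barnard_exact

N = 50                                    # fixed in advance and never revisited
new_successes, base_successes = 41, 32    # made-up rollout outcomes

# 2x2 contingency table: rows = policies, columns = (successes, failures).
table = [[new_successes, N - new_successes],
         [base_successes, N - base_successes]]

# One-sided alternative: the new policy (first row) succeeds more often.
res = barnard_exact(table, alternative="greater")
print(f"Barnard's exact test p-value: {res.pvalue:.4f}")
```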
@das_princeton
David Snyder
4 months
(2/13) Most robotics papers rely on empirical performance gains — i.e., “we outperform the baseline” — as evidence of methodological efficacy. Such comparisons must be made rigorous to ensure reproducible science. STEP aims to ensure these comparisons are sound and efficient.
1
0
0
@das_princeton
David Snyder
4 months
(1/13) How should we rigorously compare robot policies? Comparison is central to robotics research, but is inherently expensive. We introduce STEP, a flexible and data-efficient method for statistically rigorous policy comparison. Accepted at RSS 2025:
1
7
25