David Snyder Profile
David Snyder

@das_princeton

Followers: 25
Following: 0
Media: 4
Statuses: 13

PhD Student in the IRoM Lab at Princeton University, working on safety and generalization assurances for robots.

Joined May 2025
@das_princeton
David Snyder
4 months
(13/13) Very grateful to Haruki Nishimura and Masha Itkina in the TLU (Trustworthy Learning under Uncertainty) team at the Toyota Research Institute (TRI), as well as many additional collaborators at TRI and in the IRoM Lab at Princeton!
0
0
1
@das_princeton
David Snyder
4 months
(11/13) STEP can be thought of as a sequentialized, resource-aware version of Barnard’s Test, improving small-sample efficiency over state-of-the-art (SOTA) sequential methods in the literature, including work by Lai and recent work on safe, anytime-valid inference (SAVI).
1
0
0
@das_princeton
David Snyder
4 months
(10/13) STEP constructs decision rules by solving an offline convex optimization problem, which yields near-optimal multidimensional decision boundaries for Nmax up to ~500-1000. During evaluation, STEP can be used almost like a look-up table!
1
0
0
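For intuition, here is a minimal sketch of what the evaluation-time "look-up table" interface in (10/13) could look like. The decision table below is a crude hand-made margin rule, not the output of STEP's offline convex program, and N_MAX and the threshold are made-up numbers; only the lookup pattern at evaluation time is the point.

```python
import numpy as np

N_MAX = 200   # evaluator's rollout budget per policy (illustrative)

# Placeholder decision table: reject[n, a, b] == True means "after n paired
# rollouts, with a successes for the new policy and b for the baseline,
# declare the new policy better".  In STEP this table would come from the
# offline convex program; the crude margin rule below is only a stand-in.
n_ax = np.arange(N_MAX + 1)[:, None, None]
a_ax = np.arange(N_MAX + 1)[None, :, None]
b_ax = np.arange(N_MAX + 1)[None, None, :]
reject = (a_ax - b_ax) >= np.maximum(3, 0.3 * n_ax)

def evaluate(rollout_pairs):
    """Aggregate paired rollouts one by one; stop at the first lookup that rejects."""
    n = a = b = 0
    for x_new, x_base in rollout_pairs:   # each x is 1 (success) or 0 (failure)
        n, a, b = n + 1, a + x_new, b + x_base
        if reject[n, a, b]:               # O(1) table lookup per rollout
            return n, True                # stop early: improvement detected
    return n, False                       # budget exhausted, no decision

# Illustrative usage with synthetic rollouts from a better new policy.
rng = np.random.default_rng(0)
pairs = zip(rng.binomial(1, 0.8, N_MAX), rng.binomial(1, 0.5, N_MAX))
print(evaluate(pairs))
```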
@das_princeton
David Snyder
4 months
(9/13) Why Nmax? Policy evaluation is expensive due to limited hardware availability and limited resources for human supervision. STEP near-optimally accounts for this practical constraint and gives the evaluator significant leeway to set a conservative Nmax.
1
0
0
@das_princeton
David Snyder
4 months
(8/13) Because STEP is sequential, instead of the batch size N, the evaluator sets Nmax: the greatest number of rollouts (per policy) they are willing to run in order to detect an improvement. STEP then automatically adapts the stopping time to the difficulty of the problem.
1
0
0
@das_princeton
David Snyder
4 months
(7/13) STEP acts as a statistically rigorous evaluation procedure which adapts to the difficulty of the specific comparison instance. In essence, it is a principled way to allow the evaluator to 'peek at the data' without compromising statistical assurances!
1
0
0
@das_princeton
David Snyder
4 months
(6/13) Yes! We propose STEP, a sequential test which aggregates evaluation rollouts one-by-one and stops automatically when a desired significance level is reached. It stops quickly when the performance gap is large, and waits if the gap is small.
1
0
0
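To make the stop-when-significant pattern in (6/13) concrete, here is a minimal sketch of a generic anytime-valid sequential test on paired rollouts, in the "betting" style of the SAVI literature mentioned in (11/13). It is not STEP's optimized decision rule; alpha = 0.05 and the fixed bet lam = 0.5 are illustrative choices.

```python
import numpy as np

def sequential_compare(rollouts, alpha=0.05, lam=0.5):
    """Anytime-valid sequential comparison of paired success/failure rollouts.

    Tests H0: p_new <= p_base.  The wealth process
        W_t = prod_{i <= t} (1 + lam * (x_new_i - x_base_i))
    is a nonnegative supermartingale under H0 (each factor has mean <= 1 and
    stays positive for lam < 1), so stopping the first time W_t >= 1/alpha
    keeps the false-positive rate below alpha by Ville's inequality.
    """
    wealth, n = 1.0, 0
    for x_new, x_base in rollouts:            # one paired rollout at a time
        n += 1
        wealth *= 1.0 + lam * (x_new - x_base)
        if wealth >= 1.0 / alpha:
            return n, True                    # large gap: stops quickly
    return n, False                           # small gap: ran out of budget

# Illustrative usage: Nmax = 200 paired rollouts with a real performance gap.
rng = np.random.default_rng(0)
pairs = zip(rng.binomial(1, 0.8, 200), rng.binomial(1, 0.5, 200))
print(sequential_compare(pairs))
```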
@das_princeton
David Snyder
4 months
(5/13) This induces costly inefficiencies. Choosing a large N means that many (unnecessary) trials must be run on weak baselines; conversely, choosing a small N risks the failure to accumulate sufficiently compelling evidence of improvement. Can we do better?
1
0
0
@das_princeton
David Snyder
4 months
(4/13) … because acting on any observation of partial results invalidates statistical assurances of the test. In other words: stopping early because the results appear ‘promising enough’ or running additional trials beyond the allotted N breaks the statistical guarantee!
1
0
0
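A toy simulation makes the point in (4/13) concrete: both policies below have the same success rate, but an evaluator who re-runs a fixed-level test every 10 rollouts and stops as soon as p < 0.05 will falsely declare an improvement far more often than the nominal 5%. Fisher's exact test stands in for Barnard's here only because it is faster; all the constants are illustrative.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
N_MAX, ALPHA, PEEK_EVERY, N_SIMS = 100, 0.05, 10, 500
P_TRUE = 0.6                    # both policies share this rate, so H0 is true

false_rejects = 0
for _ in range(N_SIMS):
    new = rng.binomial(1, P_TRUE, N_MAX)
    base = rng.binomial(1, P_TRUE, N_MAX)
    for n in range(PEEK_EVERY, N_MAX + 1, PEEK_EVERY):
        # 2x2 table after n rollouts of each policy: (successes, failures).
        table = [[new[:n].sum(), n - new[:n].sum()],
                 [base[:n].sum(), n - base[:n].sum()]]
        _, p = fisher_exact(table, alternative="greater")
        if p < ALPHA:           # "promising enough": stop and claim a win
            false_rejects += 1
            break

print(f"False-positive rate with peeking: {false_rejects / N_SIMS:.2%}"
      f" (nominal {ALPHA:.0%})")
```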
@das_princeton
David Snyder
4 months
(3/13) The standard evaluation procedure in robotics is batch testing: run N trials of each policy, then apply a statistical test (e.g., Barnard’s Test). This requires the evaluator to choose N prior to the experiment and stick to it. But this is very limiting.
1
0
0
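For concreteness, here is the batch recipe from (3/13) using SciPy's implementation of Barnard's exact test; N and the success counts are made up.

```python
from scipy.stats import barnard_exact

N = 50                                    # fixed in advance and never revisited
new_successes, base_successes = 41, 32    # made-up rollout outcomes

# 2x2 contingency table: rows = policies, columns = (successes, failures).
table = [[new_successes, N - new_successes],
         [base_successes, N - base_successes]]

# One-sided alternative: the new policy (first row) succeeds more often.
res = barnard_exact(table, alternative="greater")
print(f"Barnard's exact test p-value: {res.pvalue:.4f}")
```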
@das_princeton
David Snyder
4 months
(2/13) Most robotics papers rely on empirical performance gains — i.e., “we outperform the baseline” — as evidence of methodological efficacy. Such comparisons must be made rigorous to ensure reproducible science. STEP aims to ensure these comparisons are sound and efficient.
1
0
0
@das_princeton
David Snyder
4 months
(1/13) How should we rigorously compare robot policies? Comparison is central to robotics research, but is inherently expensive. We introduce STEP, a flexible and data-efficient method for statistically rigorous policy comparison. Accepted at RSS 2025:
1
7
25