Why Success Benchmarks Are Misleading (And How We Set Benchmarks for Success)