SWE-bench is a benchmark, with a public leaderboard, that evaluates AI models by testing their ability to solve real-world coding tasks drawn from GitHub issues. It is a collaboration between Princeton and Stanford universities.
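
As a rough illustration of how a leaderboard of this kind might score a model, the sketch below computes the share of tasks counted as solved. It assumes a task counts as solved when the repository's tests pass after the model's patch is applied; `TaskResult`, `resolved_rate`, and the task IDs are hypothetical names chosen for this example, not part of the project's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical structure, for illustration only)."""
    task_id: str
    tests_passed: bool  # True if the repository's tests pass after the model's patch


def resolved_rate(results: list[TaskResult]) -> float:
    """Return the fraction of tasks whose tests pass after applying the patch."""
    if not results:
        return 0.0
    return sum(r.tests_passed for r in results) / len(results)


# Made-up outcomes for four placeholder tasks.
results = [
    TaskResult("example-task-1", True),
    TaskResult("example-task-2", False),
    TaskResult("example-task-3", True),
    TaskResult("example-task-4", False),
]
print(f"{resolved_rate(results):.0%} resolved")  # prints "50% resolved"
```

A single pass/fail flag per task keeps the score easy to compare across models: each task either counts as solved or it does not, and the headline number is just the solved share.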