By researchers at Princeton.
This benchmark tests whether a model can resolve real GitHub issues. It appears to contain 2,294 real-world GitHub issues and their corresponding pull requests.
Evaluation is a simple pass/fail check: does the proposed fix (the model-generated patch) make a set of pre-written, initially failing test cases pass?
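
The pass/fail criterion described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the function name, argument names, and test names are all hypothetical, and the extra requirement that previously passing tests keep passing is an assumption about how such a check is usually made strict.

```python
from typing import Dict, Iterable


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results_after_patch: Dict[str, bool],
) -> bool:
    """Return True if a candidate patch counts as a fix (assumed criterion).

    results_after_patch maps a test name to whether that test passed once
    the candidate patch was applied. A test missing from the map is treated
    as failed.
    """
    # Every test that failed before the fix must now pass.
    target_fixed = all(results_after_patch.get(t, False) for t in fail_to_pass)
    # Assumption: tests that passed before the fix must still pass.
    nothing_broken = all(results_after_patch.get(t, False) for t in pass_to_pass)
    return target_fixed and nothing_broken


# Hypothetical test names, for illustration only.
results = {"test_parse_empty": True, "test_parse_basic": True}
print(is_resolved(["test_parse_empty"], ["test_parse_basic"], results))
```

Treating a missing test result as a failure is a deliberately conservative choice here: a patch that prevents the test suite from running at all should not count as a fix.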
The dataset appears to be available on Hugging Face at huggingface.co/datasets/princeton-nlp/SWE-bench, in Parquet format.
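
For anyone wanting to look at the data, a rough sketch of loading it with the Hugging Face `datasets` library follows. It assumes `datasets` is installed and network access is available. The helper names `load_swe_bench` and `main` are made up for this note, and the `"test"` split name is an assumption worth checking against the dataset card.

```python
DATASET_ID = "princeton-nlp/SWE-bench"  # repository id taken from the URL above


def load_swe_bench(split: str = "test"):
    """Download one split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def main() -> None:
    ds = load_swe_bench()
    print(len(ds))          # number of task instances in the split
    print(ds.column_names)  # inspect which fields each instance carries


# Call main() to run the download; it is not invoked automatically here.
```

The library handles the Parquet files behind the scenes, so there is no need to hard-code individual file paths.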