I'm quite sure that GPT-4 has not degraded over time in the cognitive evals (exams, MMLU, etc). The problem is that they only test prime numbers and not composite numbers. The Mar model always says prime, and the June model always says not prime, but you would get 100% on their