Whatever happened to error bars and sample sizes? I have to assume those GLUE test results are n=1 for each row, which means they could easily have cherry-picked the best run for each model.
I'm sure the algorithm is impressive, but seriously, come on, people. It's called data science, not datum science.
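
For what it's worth, here is a rough sketch of the kind of reporting I mean: run each model with several random seeds and report the mean, the spread, and n. The `summarize` helper and the scores below are made up purely for illustration; they are not real GLUE numbers and not anything from the paper.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation, n) for per-run scores."""
    n = len(scores)
    if n < 2:
        # A single run gives no estimate of run-to-run variance.
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores), n

# Hypothetical placeholder scores for five seeds (NOT real results).
runs = [0.81, 0.79, 0.83, 0.80, 0.82]
mean, std, n = summarize(runs)
print(f"{mean:.3f} ± {std:.3f} (n={n})")
```

Even a line like that per row would make it much harder to get away with reporting only the best seed.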