Testing the GMATPrep scoring algorithm for Quant section

Master | Next Rank: 500 Posts
Posts: 126
Joined: Sun Jun 24, 2012 10:11 am
Location: Chicago, IL
Thanked: 36 times
Followed by:7 members
The goal of this thread is to investigate the behavior of the GMATPrep scoring algorithm for the Quant section in different scenarios.

I will post results from my experiments with the GMATPrep Quant section. Anyone else is welcome to post their experimental results too.

My natural level in Quant is around 51 (above the 98th percentile).
The GMATPrep I am testing is version 2.1.279 for Windows.
My computer runs Windows 7, 64 bit.
Skype / Chicago quant tutor in GMAT / GRE
https://gmat.tutorchicago.org/

by tutorphd » Tue Jul 10, 2012 9:13 pm
Experiment 1:
--------------
- used "Test 1" in GMATPrep
- answered the odd questions correctly
- answered the even questions randomly: used Mathematica to generate a uniformly distributed random sequence of numbers between 1 and 5, inclusive, translated to letters: BCEAA ECAAC AAEDD CAD
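
The random-answer step can be reproduced without Mathematica. The following is a minimal Python sketch, assuming the mapping 1 -> A through 5 -> E; the function names and the seed are illustrative only, and the output will not match the sequence used in this thread.

```python
import random


def random_answer_key(n_questions, seed=None):
    """Return n_questions letters, each drawn uniformly from A-E."""
    rng = random.Random(seed)
    # Draw integers between 1 and 5, inclusive, then translate 1->A, ..., 5->E.
    draws = [rng.randint(1, 5) for _ in range(n_questions)]
    return "".join("ABCDE"[d - 1] for d in draws)


def in_groups_of_five(key):
    """Format a key in blocks of five letters, as in the post."""
    return " ".join(key[i:i + 5] for i in range(0, len(key), 5))


if __name__ == "__main__":
    # 18 letters, one per even-numbered question (2, 4, ..., 36).
    print(in_groups_of_five(random_answer_key(18, seed=2012)))
```

Fixing the seed keeps the same answer sequence across repeated runs, which is what reusing one sequence for several experiments requires.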

Results:
- odd questions 19 total: 2 incorrect (fell in stupid traps), 17 correct
- even questions 18 total: the software does not save how many were correct/incorrect, probably around 5 correct by randomness
- scaled score: abysmally low 21 corresponding to 21%
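
As a rough check on the "around 5 correct by randomness" estimate: if each of the 18 random answers is an independent guess with a 1-in-5 chance of being right (an assumption, since the software does not report the count), the number of lucky hits follows a binomial distribution with mean 18/5 = 3.6. A short Python sketch of that calculation:

```python
from math import comb

N_GUESSES = 18      # even-numbered questions answered at random
P_CORRECT = 1 / 5   # five answer choices, assumed equally likely to be right


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


expected_correct = N_GUESSES * P_CORRECT
p_at_least_5 = sum(
    binomial_pmf(k, N_GUESSES, P_CORRECT) for k in range(5, N_GUESSES + 1)
)

if __name__ == "__main__":
    print(f"expected correct guesses: {expected_correct:.1f}")
    print(f"P(at least 5 correct):    {p_at_least_5:.3f}")
```

Under these assumptions, anywhere from 2 to 5 correct guesses is unremarkable.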

Notes:
I knew the test was going badly during the whole test: the questions were way too easy. I did not expect such an abysmal score though. Later tests showed the reason is probably the incorrectly answered odd questions 1 and 7. Such sensitivity to only two questions at the beginning gave me the idea for Experiment 4 later.

by tutorphd » Tue Jul 10, 2012 9:22 pm
Experiment 2:
--------------
- used "Test 1" in GMATPrep
- answered the odd questions correctly
- answered the even questions randomly: used Mathematica to generate a uniformly distributed random sequence of numbers between 1 and 5, inclusive, translated to letters: BCEAA ECAAC AAEDD CAD

Results:
- odd questions 19 total: 0 incorrect (this time no errors), 19 correct
- even questions 18 total: 13 incorrect, 5 correct by chance
- scaled score: 40 corresponding to 55%

Notes:
The Quant section didn't start with the same question as in Experiment 1, so the software apparently chooses the first question randomly from the questions of average difficulty. I would say the questions were of average difficulty overall. The significant improvement in percentile over Experiment 1 suggests that the incorrectly answered questions 1 and 7 in Experiment 1, surrounded by the randomized and probably incorrect answers to the even questions 2, 6, and 8, biased the scoring algorithm towards lower percentiles.

by tutorphd » Tue Jul 10, 2012 9:28 pm
Experiment 3:
--------------
- used "Test 1" in GMATPrep
- answered the odd questions correctly
- answered the even questions randomly: used Mathematica to generate a uniformly distributed random sequence of numbers between 1 and 5, inclusive, translated to letters: BCEAA ECAAC AAEDD CAD

Results:
- odd questions 19 total: 0 incorrect, 19 correct
- even questions 18 total: 16 incorrect, 2 correct by chance
- scaled score: 43 corresponding to 64%

Notes:
In this experiment, I got harder questions than in Experiment 2, although the setup is exactly the same. Two of the questions were very tough optimization/range problems for which I had to play with plugging in numbers to see what was going on. That toughness is reflected in the final score.

by tutorphd » Tue Jul 10, 2012 9:41 pm
The abysmally low score in Experiment 1 suggested testing the sensitivity of the scoring algorithm to:
(1) answering problems incorrectly at the beginning of the test, when it is still oscillating wildly in problem difficulty, trying to adapt to the test-taker's level
(2) answering several questions in a row incorrectly, thus rejecting the problem difficulty oscillations and biasing the algorithm towards offering later problems of lower difficulty.

It was said on this forum that these were problems with the GRE adaptive algorithm in the '90s, which was later abandoned. Unfortunately, the following experiment shows that the current GMATPrep algorithm still has those problems, which doesn't surprise me at all.

Experiment 4:
--------------
- used "Test 1" in GMATPrep
- answered questions 1,2,3 and 6,7,8 purposely incorrectly
- answered all other questions correctly

Results:
- questions 37 total: incorrect 6 (the intended questions 1,2,3,6,7,8), correct 31
- scaled score: 47 corresponding to 76%

Notes:
I got only one very hard question (one of the very hard questions in Experiment 3). All other questions were either of average difficulty or hilariously easy, easier than in Experiment 3. I have all the questions recorded, so I can show anyone what I mean.

The level after question 8 had dropped so dramatically that in question 9 I was asked if (x^6)(x^4) = x^10. That was expected after answering questions 1,2,3 and 6,7,8 wrong. What I found disappointing was that the scoring algorithm never recovered to offering higher-level problems. It kept giving me easy problems, despite the fact that I solved the next 29 questions correctly in a row without a single error.

The final result reflects that algorithmic flaw. I got only 6 questions out of 37 wrong, yet the algorithm thinks my true level is in the 76th percentile. That means that, according to GMATPrep, every 4th person does better than that: better than getting 31 out of 37 questions correct.

It seems that the algorithm decides what the test-taker's level is from the answers to the first questions, and for the later problems it does not attempt wild oscillations in problem difficulty as it did at the beginning of the test; it just keeps giving problems around the same old level.
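
That hypothesis can be illustrated with a toy model. This is not GMATPrep's algorithm, which is not public; it is only a sketch that assumes each answer moves a running level estimate up or down by a step that shrinks geometrically as the section goes on. The function name, the starting level, the first step, and the decay factor are all made-up parameters.

```python
def final_level(wrong, n_questions=37, start=0.0, first_step=1.0, decay=0.9):
    """Toy level estimate after n_questions answers (hypothetical model).

    wrong: set of 1-based question numbers answered incorrectly.
    Each answer shifts the estimate by the current step (up if correct,
    down if wrong); the step is then multiplied by decay.
    """
    level = start
    step = first_step
    for question in range(1, n_questions + 1):
        level += -step if question in wrong else step
        step *= decay
    return level


early_errors = final_level({1, 2, 3, 6, 7, 8})
late_errors = final_level({30, 31, 32, 35, 36, 37})

if __name__ == "__main__":
    print(f"six early errors: {early_errors:.2f}")
    print(f"six late errors:  {late_errors:.2f}")
```

With any decay factor below 1, the same six wrong answers cost more at the start of the section than at the end, and a long streak of correct answers afterwards cannot fully make up the difference. With decay=1.0 the position of the errors makes no difference at all.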

The problem with that is that the level is decided in the first questions, where most test-takers make the silliest errors before they have warmed up enough. The second problem is that the algorithm doesn't give the test-taker a second chance by drifting towards higher levels later if the test-taker keeps answering correctly: I was relegated to the 76th percentile just because I had answered 6 of the first questions incorrectly.

Experiment 4 clearly suggests that the adaptive scoring of GMATPrep is far from ideal and suffers from the same problems as the GRE adaptive scoring in the past. This may or may not translate to the real test scoring, because the real test has a significantly larger bank of problems and may react more adequately to test-takers who underperform at the beginning of the test.

A good experiment for the future would be to answer questions 1, 2, and 3 purposely incorrectly and see what happens later when the test-taker answers all the other questions correctly.

Legendary Member
Posts: 1239
Joined: Tue Apr 26, 2011 6:25 am
Thanked: 233 times
Followed by:26 members
GMAT Score:680

by sam2304 » Tue Jul 10, 2012 10:13 pm
Getting defeated is just a temporary notion, giving it up is what makes it permanent.
https://gmatandbeyond.blogspot.in/

by tutorphd » Fri Jul 13, 2012 4:34 pm
Experiment 5 (same setup as Experiment 4 to check reproducibility):
----------------------------------------------------------------------
- used "Exam 1" in GMATPrep
- answered questions 1,2,3 and 6,7,8 purposely incorrectly
- answered all other questions correctly

Results:
- questions 37 total: incorrect 6 (the intended questions 1,2,3,6,7,8), correct 31
- scaled score: 42 corresponding to 59%

Notes:
The abysmally low percentile of 59% with only 6 incorrect questions clearly shows the test is sensitive to answering several questions wrong in a row at the beginning, especially if those questions are fundamental and easy. The wrong questions were:

1. Data sufficiency about how the positivity/negativity of a product of numbers depends on the number of negative numbers multiplied. I was surprised the test started with that problem because I would classify it as above average difficulty.
2. Data sufficiency boiling down to the solvability of a system of two equations in two unknowns. Easy one.
3. Solving a system of two equations in two unknowns. Easy one.
6. Geometry problem to compare the areas of two triangles using the coordinates of their vertices. Average level.
7. Easy problem about two machines working together. The rates and the job were given, the problem asked about the time to complete the job. Easy one.
8. Factoring out an easy symbolic expression. Very easy algebra.

Then a string of easy, straightforward problems followed. A few higher-level problems appeared after problem 30, mixed with easy, straightforward ones.

Apparently, the algorithm decided that it was pointless to give me hard problems if I answer wrong problems with simple equations in them and did not give me opportunity later to prove I can do better than that.

by tutorphd » Fri Jul 13, 2012 4:49 pm
Experiment 6 (testing sensitivity to wrong answers at the end of the section):
-------------------------------------------------------------------------------
- used "Exam 1" in GMATPrep
- answered questions 30, 31, 32 and 35, 36, 37 purposely incorrectly
- answered all other questions correctly

Results:
- questions 37 total: incorrect 6 (the intended questions 30, 31, 32, 35, 36, 37), correct 31
- scaled score: 50 corresponding to 92%

Notes:
Compare that percentile to the percentiles of Experiment 4 and Experiment 5. Clearly the algorithm is less sensitive to wrong answers at the end of the section because it has already decided the test-taker level and keeps giving problems at that level.

The questions I purposely answered wrong were:

30. Optimization/range problem involving average and median of a set of numbers. Had to find the maximal possible value of the smallest number in the set. Hard difficulty.
31. Two sets of machines, completing the same job for different number of days. This was above average difficulty work/proportionality problem.
32. Data sufficiency problem about positivity and distances on number line. Above average difficulty.

35. Average difficulty problem about factorizing expression with powers and finding prime factorization of the result.
36. Average difficulty rotational speed problem.
37. Hard problem about distances and order of 4 numbers on the number line.

by tutorphd » Fri Jul 13, 2012 5:07 pm
Experiment 7 (testing sensitivity to fewer errors at the beginning of the section):
-----------------------------------------------------------------------------------------------
- used "Exam 2" in GMATPrep
- answered questions 1, 2 and 5, 6 purposely incorrectly
- answered all other questions correctly

Results:
- questions 37 total: incorrect 4 (the intended questions 1,2,5,6), correct 33
- scaled score: 50 corresponding to 92%

Notes:
Experiment 7 (2+2 wrong questions at the beginning) has the same percentile as the previous Experiment 6 (3+3 wrong questions at the end). That clearly indicates that the score is more sensitive to wrong questions at the beginning than at the end.

The questions answered wrong were:

1. Data sufficiency about isosceles triangle with a trap in it. Above average difficulty because of the trap.
2. Very easy combinatorics problem.

5. Easy problem involving average and solving a linear equation with a single variable.
6. Data sufficiency problem about average speed, involving inequality. Average difficulty.

The problems that followed were average difficulty. Tough problems appeared after problem 25.

Although the 92nd percentile I got at the end is not bad, I find it excessive to penalize by 8% for 4 wrong questions at the beginning. Clearly the scoring algorithm was biased until problem 25, not giving me tough enough problems to prove myself.

by tutorphd » Fri Jul 13, 2012 5:35 pm
SUMMARY OF RESULTS
====================
The experiments so far demonstrated the following behavior of the GMATPrep scoring algorithm (quant section):

(1) On average, the final score is lower for wrong answers at the beginning vs the same number of wrong answers at the end of the section: Experiments 4 and 5 vs Experiment 6.

(2) On average, the final score is lower for a larger number of wrong answers in a row: Experiments 4 and 5 vs Experiment 7.

(3) On average, the final score is lower for wrong answers to basic fundamental questions at the beginning of the section: Experiment 5 vs Experiment 4.

So if you want a high score, make sure you don't make mistakes on several easy questions in a row at the beginning of the section. Once the algorithm decides you are a lower-level test-taker, it won't drift up in question difficulty until the very end of the section, so your overall score will suffer because you are given problems that are below your level.
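
For anyone who wants to add their own runs, the comparisons in (1) and (2) can be tabulated with a few lines of Python. The numbers below are the results of Experiments 4 to 7 as posted in this thread; the record layout and helper name are just one possible choice. With only one or two runs per condition, the averages are indicative at best.

```python
from statistics import mean
from typing import NamedTuple, Tuple


class Run(NamedTuple):
    experiment: int
    wrong: Tuple[int, ...]   # question numbers answered incorrectly on purpose
    scaled: int              # scaled Quant score reported by GMATPrep
    percentile: int          # percentile reported by GMATPrep


RUNS = [
    Run(4, (1, 2, 3, 6, 7, 8), 47, 76),
    Run(5, (1, 2, 3, 6, 7, 8), 42, 59),
    Run(6, (30, 31, 32, 35, 36, 37), 50, 92),
    Run(7, (1, 2, 5, 6), 50, 92),
]


def mean_percentile(experiments):
    """Average reported percentile over the given experiment numbers."""
    return mean(r.percentile for r in RUNS if r.experiment in experiments)


six_early = mean_percentile({4, 5})   # six wrong answers at the start
six_late = mean_percentile({6})       # six wrong answers at the end
four_early = mean_percentile({7})     # four wrong answers at the start

if __name__ == "__main__":
    for r in RUNS:
        print(f"Experiment {r.experiment}: wrong={r.wrong} "
              f"scaled={r.scaled} percentile={r.percentile}")
    print(f"six early: {six_early}, six late: {six_late}, "
          f"four early: {four_early}")
```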

The problem with the adaptive algorithm is clearly that it takes every answer as reflecting the actual test-taker level; it doesn't consider the possibility of random test-taker errors. The second problem is that the algorithm decides the test-taker level early in the test, when most people are prone to random errors because they haven't warmed up and lack focus. Later in the test, the algorithm doesn't continuously re-evaluate the test-taker level but tends to give questions at the same level, except at the very end of the test when it is already too late.

Clearly to me, a non-adaptive test, that separates the subject into subtopics and gives a fixed number of problems from lowest to highest level in each subtopic will be less sensitive to random test-taker errors and will give a more accurate evaluation of the test-taker ability. That is why other tests like SAT, ACT, and GRE stay clear from 'adaptive' scoring.

Junior | Next Rank: 30 Posts
Posts: 14
Joined: Thu Jun 28, 2012 4:19 pm
Thanked: 2 times
Followed by:1 members

by jdciaravino » Fri Jul 13, 2012 9:42 pm
I don't know what everyone else thinks, but shouldn't you be penalized for making careless errors on easy questions? I mean, usually they are 1-level (in comparison to multiple-level) questions that don't have many traps in them. If you're making careless errors here, the likelihood of you making them on 700-level questions is tenfold, since those questions combine concepts and usually place more traps in their answer choices or in the wording of the problem.

by tutorphd » Fri Jul 13, 2012 10:08 pm
You are making the logical mistake of assuming, just like GMAC, that someone's silly errors truly reflect his/her level. More often than not, the wrong answers will be clustered at the beginning (due to lack of focus and anxiety) and at the end of the test (due to high-level questions, time pressure, or running out of time). So yeah, it is VERY probable to make a string of mistakes below your level and then get questions below your maximal level.

One of the basic reasons most high scorers do not get a perfect score is hilariously silly mistakes WAY BELOW THEIR LEVEL, like wrong multiplication of two numbers or not reading the question correctly. They usually know they got it wrong after the test. Should you assume that just because someone got a 10th percentile question wrong, he/she is at that level?

The problem is that once you make a string of mistakes below your level, that is exactly what the algorithm assumes, and it won't give you a second chance. It even generalizes it to other areas: just because somebody answered a simple combinatorics question wrong, does it mean he/she is weak in statistics too? How probable is that? I recently had a student who got 710 with 48 on quant. He was very good at arithmetic and simple algebra but not so good at average and more complicated algebra. If GMAT happened to give him a string of average-level algebra problems, he would have been screwed, because then GMAT would generalize that he is below average level.

Take a GMATPrep test, then analyze the sequence of your mistakes and honestly think how probable it is for someone at your level to make those lower-level mistakes. That will give you quite a fresh perspective on the 'likelihood of making mistakes'.

Of course mistakes have to be penalized, but not in the way GMATPrep does it. Its algorithm adapts to a string of errors below your level, but doesn't adapt later to a string of questions you answer correctly because it has already 'decided' your level, based on silly mistakes. Clearly to me, an algorithm that is more adaptive at the beginning than later is not really objective.

I simply cannot understand why GMAT keeps pushing adaptive testing when a simple non-adaptive test with 40 questions, uniformly covering the topics with questions from lowest to highest levels, is capable of measuring the test-taker level very accurately without any unfounded probability theories about the test-taker's 'likelihood to make a mistake'. If someone makes a mistake on a simple question but solves the harder question in the same area, clearly he/she won't be penalized like on GMAT. Moreover, he/she will still get the chance to attain his/her true level by solving the hard questions in other areas, which simply won't happen on GMAT.

Again, there is a reason other tests stay away from 'adaptive testing'.

GMAT/MBA Expert

GMAT Instructor
Posts: 3380
Joined: Mon Mar 03, 2008 1:20 am
Thanked: 2256 times
Followed by:1535 members
GMAT Score:800

by lunarpower » Sat Jul 14, 2012 12:45 am
tutorphd wrote: Clearly to me, a non-adaptive test, that separates the subject into subtopics and gives a fixed number of problems from lowest to highest level in each subtopic will be less sensitive to random test-taker errors and will give a more accurate evaluation of the test-taker ability. That is why other tests like SAT, ACT, and GRE stay clear from 'adaptive' scoring.
(emphasis mine)

This is an interesting theory, but it suffers from one major flaw: it isn't true.

If a test is non-adaptive, it is much, much more sensitive to random errors, because there are only a couple of questions at each approximate level. This is especially true for high-achieving students, who have to try to get all the problems correct on non-adaptive exams.

If you prefer numbers, consider the fact that missing a single question anywhere on the SAT math section will, on average,** drop your SAT math score all the way from 800 down to 760-770.
Compare that to the GMAT, where you missed the last 6 questions (or however many, the point is that it's way more than just one) at the end of the test, and your score plummeted precipitously off a cliff from 51 to ... 50.

--

In any case, this seems to be a case of "If you think the sun is setting in the east, then the issue is your compass, not the sun."

Specifics aside, look at the conclusion you're reaching -- you are concluding that non-adaptive exams are better than adaptive exams at adapting to the student. something is rotten in the state of denmark.
Ron has been teaching various standardized tests for 20 years.

--

Pueden hacerle preguntas a Ron en castellano
Potete chiedere domande a Ron in italiano
On peut poser des questions à Ron en français
Voit esittää kysymyksiä Ron:lle myös suomeksi

--

Quand on se sent bien dans un vêtement, tout peut arriver. Un bon vêtement, c'est un passeport pour le bonheur.

Yves Saint-Laurent


by tutorphd » Sat Jul 14, 2012 1:47 am
Correction: I missed the last 6 questions and my score plummeted from 60 to 50, and the percentile plummeted from 99% to 92%.

You are comparing bananas to apples here. When the test is very hard like GMAT, getting a few questions wrong will lead to a smaller decrease in percentiles because very few people get those questions right. SAT is way lower difficulty than GMAT so comparing them to prove a paper test is more prone to errors than adaptive test is not very sound.

If both tests contain the same range of problem difficulty, a balanced paper test will always beat the hell out of the GMAT adaptive test in terms of sensitivity to random errors by the test-taker.

The problem I see in the GMAT adaptive scoring is the fact it is less adaptive in the middle and the end of the test, while a paper test is continuously scanning the full difficulty range in every sub-topic, irrespective how the candidate does in another subtopic.

Senior | Next Rank: 100 Posts
Posts: 53
Joined: Tue Aug 03, 2010 3:09 am
Location: Los Angeles
Thanked: 8 times
Followed by:27 members

by LIL » Sat Jul 14, 2012 1:54 am
tutorphd wrote: Correction: I missed the last 6 questions and my score plummeted from 60 to 50, and the percentile plummeted from 99% to 92%.
51 is the highest score you can get on quant, even though it is only the 98th percentile. It is also the highest score on verbal, but there it's the 99th percentile.