H0: Urshi is NOT a sadistic liar

Long post so here is the 5 second summary.
I decided to test if Urshi is lying to us.
H0(Null Hypothesis): Urshi is correctly stating the 60% gem upgrade odds.
TLDR Results: Da girl aint lyin, at least not about the 60% cases +0 and +1.

Experimental Goal: Perform sufficient gem upgrades at 60% to test with reasonable confidence whether the claimed odds are correct. Reasonable confidence in this case will be defined as at least 95%. The alternate hypotheses to be considered are an actual gem upgrade probability of 50% or lower, or of 70% or higher.

Results: I performed 932 upgrades from 233 empowered grift runs which were divided between the two 60% upgrade cases GR=Gem (451 trials) and GR=Gem+1 (481 trials). No other cases were tested in this experiment. The GR=Gem case had 274 successful upgrades out of 451 which gives a 95% confidence that the underlying probability is between 56.2% and 65.3%. The GR=Gem+1 case had 289 successful upgrades out of 481 for an interval of 55.7% to 64.5%. Both cases have 60% inside their respective intervals and both alternate hypothesis values outside, so I accept the null hypothesis that Urshi is being truthful.

Given that the actual mean results for both cases are close to 60% (60.8% and 60.1%) the result will not change greatly even if I increase the required confidence.
90.00%: GR=Gem(57.0% to 64.5%) GR=Gem+1(56.4% to 63.8%)
95.00%: GR=Gem(56.2% to 65.3%) GR=Gem+1(55.7% to 64.5%)
99.00%: GR=Gem(54.8% to 66.7%) GR=Gem+1(54.3% to 65.8%)
99.90%: GR=Gem(53.2% to 68.3%) GR=Gem+1(52.7% to 67.4%)
99.99%: GR=Gem(51.8% to 69.7%) GR=Gem+1(51.4% to 68.8%)

If I consider the two 60% cases combined as a single case the intervals become even narrower. They widen as the required confidence goes up, but the call stays the same until well beyond 1 chance in 10K.
90.00%: Combined(57.8% to 63.0%)
95.00%: Combined(57.3% to 63.5%)
99.00%: Combined(56.3% to 64.5%)
99.90%: Combined(55.1% to 65.7%)
99.99%: Combined(54.2% to 66.6%)
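
For anyone who wants to check the tables above, here is a minimal sketch of the interval calculation. It assumes the simple normal-approximation (Wald) interval, which appears to reproduce the posted numbers; the function name wald_interval is just for illustration and is not from the original spreadsheet.

```python
from statistics import NormalDist

def wald_interval(successes, trials, confidence):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p_hat = successes / trials
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * (p_hat * (1 - p_hat) / trials) ** 0.5
    return p_hat - half_width, p_hat + half_width

# Success and trial counts as reported in the Results paragraph.
cases = {"GR=Gem": (274, 451), "GR=Gem+1": (289, 481), "Combined": (563, 932)}

for confidence in (0.90, 0.95, 0.99, 0.999, 0.9999):
    cells = []
    for name, (wins, n) in cases.items():
        low, high = wald_interval(wins, n, confidence)
        cells.append(f"{name}({low:.1%} to {high:.1%})")
    print(f"{confidence:.2%}: " + " ".join(cells))
```

Other interval methods (Wilson, Clopper-Pearson) would give slightly different endpoints, but with this many trials and a probability near 60% the differences should be small.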

Given the above results it is less than 1 chance in 10000 that Urshi is lying about the tested 60% upgrade cases if that underlying erroneous probability is actually less than 50% or greater than 70%.
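
Another way to look at the same claim is to ask how far the combined result (563 of 932) sits from what a 50% or 70% Urshi would be expected to produce. The sketch below is a different framing than the interval check above and again leans on the normal approximation; tail_probability is a name made up for this example.

```python
from statistics import NormalDist

def tail_probability(successes, trials, p_assumed):
    """Return (z, two-sided normal tail chance) for a result this far from N*P."""
    mean = trials * p_assumed
    sd = (trials * p_assumed * (1 - p_assumed)) ** 0.5
    z = abs(successes - mean) / sd
    return z, 2 * (1 - NormalDist().cdf(z))

# Combined result from the experiment, tested against each candidate probability.
for p_assumed in (0.5, 0.6, 0.7):
    z, prob = tail_probability(563, 932, p_assumed)
    print(f"P={p_assumed:.0%}: z={z:.2f}, two-sided tail={prob:.2g}")
```

Under this approximation the observed count is more than six standard deviations away from both the 50% and the 70% expectation, and well within one standard deviation of the 60% expectation.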

Design Notes:
After reading a number of recent posts (especially http://us.battle.net/d3/en/forum/topic/20743535499) I thought I should try and test this myself. People dumped on doing "only" 100 trials as being insufficient for good confidence but were also talking about needing 1000 or 10000 trials. So the first thing I needed to determine was how many tests would actually be required.

I assume that each gem upgrade is one of N independent trials, where P is the probability of an upgrade and (1-P) is therefore the probability of a failure. A binomial distribution is often used to model an experiment like this where there are a number of trials with two possible outcomes (ie Head/Tail, Yes/No, or Pass/Fail). There are times when this model breaks down, but given that I expect to be doing more than 100 trials and the nominal probability is well away from either 0% or 100% this model should work just fine.

For this model the expected mean is equal to N*P and follows a standard curve so the variance can be modeled as N*P*(1-P). The variance is the expectation of the squared deviation which I can use to estimate an expected range of results from a theoretical test. If the number of trials is low the deviation is large compared to the mean therefore the potential results I would get from 50% 60% and 70% cases all overlap easily. As the number of trials goes up the theoretical outcomes become more clustered near to their expected means and therefore are less likely to overlap. More trials is better.
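
To make the overlap argument concrete, here is a small sketch that computes the expected range of successes (mean plus or minus Z standard deviations) for a given N and P, then checks whether the ranges for two probabilities overlap. The helper names are illustrative only, and Z=1.96 is assumed as the default.

```python
def expected_range(trials, p, z=1.96):
    """Range of success counts within z standard deviations of N*P."""
    mean = trials * p
    sd = (trials * p * (1 - p)) ** 0.5  # sqrt of the variance N*P*(1-P)
    return mean - z * sd, mean + z * sd

def ranges_overlap(trials, p_low, p_high, z=1.96):
    """True if the top of the p_low range reaches the bottom of the p_high range."""
    _, top_of_low = expected_range(trials, p_low, z)
    bottom_of_high, _ = expected_range(trials, p_high, z)
    return top_of_low >= bottom_of_high

for trials in (100, 400, 1000):
    for p_low, p_high in ((0.5, 0.6), (0.6, 0.7)):
        verdict = "overlap" if ranges_overlap(trials, p_low, p_high) else "separate"
        print(f"N={trials}: {p_low:.0%} vs {p_high:.0%} -> {verdict}")
```

At N=100 the ranges still overlap, while at N=400 they have pulled apart, which is the "more trials is better" point in numbers.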

From the experimental goal I know I want to be able to tell the difference between an underlying P of 60% and either 50% or 70% with 95% confidence. On a standard curve I can cover 95% of the area by selecting all possibilities which are within Z=1.96 standard deviations of the mean. So I need to pick a number of trials where the highest expected 50% result is still lower than the lowest expected 60% result. Similarly the highest 60% result should be lower than the lowest 70% result. The worst case first crossover point can be derived by solving for N in the following:
N*P50 + Z*sqrt(N*P50*(1-P50)) = N*P60 - Z*sqrt(N*P60*(1-P60))
which solved for N gives:
N = ( Z * (sqrt(P50*(1-P50)) + sqrt(P60*(1-P60))) / (P60-P50) ) ^ 2
Do a similar calculation for P60/P70 and plug in the numbers and you get:
N >= 376 (for z=1.96 in the P50/P60 case)
N >= 345 (for z=1.96 in the P60/P70 case)
If I wanted to go to 99% confidence then Z would increase to 2.576 giving:
N >= 650 (for z=2.576 in the P50/P60 case)
N >= 597 (for z=2.576 in the P60/P70 case)
If I wanted to go to 99.9% confidence then Z would increase to 3.291 giving:
N >= 1061 (for z=3.291 in the P50/P60 case)
N >= 974 (for z=3.291 in the P60/P70 case)
I rounded up the 95% requirement to give me an initial goal of roughly 400 gem upgrades.
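
The trial counts above can be reproduced with a few lines. This is just the solved-for-N formula typed into a function; required_trials is a name chosen for this sketch, and the results are rounded to the nearest whole trial to match the figures quoted.

```python
def required_trials(p_low, p_high, z):
    """N at which the z-sigma ranges for p_low and p_high just touch."""
    spread = (p_low * (1 - p_low)) ** 0.5 + (p_high * (1 - p_high)) ** 0.5
    return (z * spread / (p_high - p_low)) ** 2

for z in (1.96, 2.576, 3.291):
    n_50_60 = required_trials(0.5, 0.6, z)
    n_60_70 = required_trials(0.6, 0.7, z)
    print(f"z={z}: P50/P60 needs N >= {round(n_50_60)}, "
          f"P60/P70 needs N >= {round(n_60_70)}")
```

The P50/P60 pair always needs more trials because the variance P*(1-P) is largest at 50%.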

While working on the experimental design I thought of an error case that might cause testing issues. Depending on how Blizzard's code was written, I don't know whether all of the 60% gem upgrade cases are actually 60%. Potentially the GR=Gem+1 case could be 60% while an error in the GR=Gem case could make it, say, 30%. If someone was doing an equal mix of the two cases Urshi would graphically show 60% but they would have an actual upgrade average of 45%. Given this potential error case I decided to focus on only two of the 60% cases and gather all my samples from those two cases.

I also expect that most folks who are upgrading at 60% are doing so because they are close to their GR cap. After all if you can do a GR=Gem+10 then you don't need to worry about failure. Therefore the most interesting two of the different 60% cases are the two where the GR and the Gem are nearly the same level.

The gem cases also tied into exactly how I would choose each trial GR level so that I could maximize my data given what level gems I have available for upgrades. Since I had a lot of trials I needed to run and I intended to run empowered GRs, I needed to have 4 gems of the same level to upgrade simultaneously. I chose to start with a group of nine mostly zero level gems and level them up equally for use as armor upgrades. 400 at 60% is 240 upgrades so roughly level 27 for each of the gems with room to go up easily if I wanted to run even more trials.

The initial upgrade algorithm was simple:
Case GR=Gem+1 algorithm: Look at the gems to be upgraded and pick a GR level that is equal to the lowest level gem plus one. Run the empowered grift and make 4 attempts to upgrade the lowest level gems first.

This puts most of my trials into the GR=Gem+1 case with any spillover going into the GR=Gem case. However once I got to GR30 (420 overall trials) I had only 52 trials of the GR=Gem case, which was too few to report any usable results for that case. I decided to change the GR selection algorithm to bias it toward the GR=Gem case and continue taking data.

Case GR=Gem algorithm: If you have 4 or more gems at the lowest level pick GR=lowest level. If you have fewer than 4 of the lowest level gems, pick GR=lowest level + 1. Run the empowered grift and make 4 attempts to upgrade the lowest level gems first.

This case puts most of the trials into the GR=Gem set and only when there are 3 or fewer low gems is the GR level forced to rise. I continued to collect data until I had more than 400 trials for both cases and I ran out of grift keys for that segment.
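
The two GR selection rules described above can be sketched as a single function. This is only an illustration of the rule as written in this post: choose_gr, the bias labels, and the slots parameter (4 upgrade attempts per empowered run) are names invented for the example.

```python
def choose_gr(gem_levels, bias="gem+1", slots=4):
    """Pick the Greater Rift level for the next run from the current gem levels."""
    lowest = min(gem_levels)
    if bias == "gem+1":
        # Initial rule: always run one level above the lowest gem.
        return lowest + 1
    if bias == "gem":
        # Later rule: run at the lowest gem level only if there are enough
        # gems at that level to use every upgrade attempt on them.
        at_lowest = sum(1 for level in gem_levels if level == lowest)
        return lowest if at_lowest >= slots else lowest + 1
    raise ValueError("bias must be 'gem+1' or 'gem'")

# Nine gems, two still at level 29: the GR=Gem rule is forced up to GR30.
print(choose_gr([29, 29, 30, 30, 30, 30, 30, 30, 30], bias="gem"))
# Nine gems all at level 30: the GR=Gem rule stays at GR30.
print(choose_gr([30] * 9, bias="gem"))
```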

I will post the raw gem upgrade data in this thread for anyone who is interested.
Here is the raw log of upgrades I did during this experiment.
GRXX is the Greater Rift run at level XX
GemXX is the level of the gem before an upgrade was attempted.
0= failed to upgrade gem, 1= upgraded gem
My starting gem set was one level 5, two level 4, and six level 0 gems.
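
If you want to tally the log yourself rather than count by hand, here is a rough parser for the line format described above. It is a sketch that assumes every data line looks like "GRXX: GemYY(bits) GemZZ(bits)"; the tally function and its regular expressions are made up for this example. Results are grouped by GR level minus gem level, so 0 is the GR=Gem case and 1 is the GR=Gem+1 case.

```python
import re
from collections import defaultdict

HEADER = re.compile(r"GR(\d+):")
GROUP = re.compile(r"Gem(\d+)\(([01 ]*)\)")

def tally(log_lines):
    """Return {GR minus gem level: [successes, trials]} from raw log lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in log_lines:
        header = HEADER.match(line.strip())
        if not header:
            continue  # skip dates, notes, and blank lines
        gr_level = int(header.group(1))
        for gem_level, bits in GROUP.findall(line):
            outcomes = bits.replace(" ", "")
            offset = gr_level - int(gem_level)
            totals[offset][0] += outcomes.count("1")
            totals[offset][1] += len(outcomes)
    return dict(totals)

# One line copied from the log below.
sample = ["GR29: Gem28(0010 0101 1000 0001 1001 1) Gem29(100)"]
print(tally(sample))
```

Feeding the whole log through tally should give per-case totals that can be compared against the 274 of 451 and 289 of 481 figures reported in the results.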

2016-05-21 (
GR01: Gem00(0100 1100 1101) Gem01()
GR02: Gem01(0110 1010 11) Gem02(01)
GR03: Gem02(0101 0111) Gem03()
GR04: Gem03(0111 1001 1) Gem04(011)
GR05: Gem04(0101 1111) Gem05()
GR06: Gem05(0101 1101 1100 11) Gem06(11)
GR07: Gem06(1100 1110 011) Gem07(1)
GR08: Gem07(1000 0111 0111 1) Gem08(001)
GR09: Gem08(1101 0010 1011 001) Gem09(1)

GR10: Gem09(1011 0100 0101 0101) Gem10(1111)
GR11: Gem10(1101 11) Gem11(00)
GR12: Gem11(1011 1011 0110 1) Gem12(110)
GR13: Gem12(1101 0100 0011 1) Gem13(101)
GR14: Gem13(0110 0110 1101) Gem14()
GR15: Gem14(1011 0010 1101 011) Gem15(0)
GR16: Gem15(1001 1000 0101 0110 11) Gem16(11)

GR17: Gem16(0111 0110 11) Gem17(01)
GR18: Gem17(0111 1110 11) Gem18(10)
GR19: Gem18(1110 1111 01) Gem19(10)
GR20: Gem19(1111 1111) Gem20()
GR21: Gem20(1110 0010 1100 111) Gem21(1)
GR22: Gem21(1001 1101 111) Gem22(1)
GR23: Gem22(0101 1101 111) Gem23(0)
GR24: Gem23(0111 0111 111) Gem24(1)
GR25: Gem24(1110 1010 0010 11) Gem25(00)

2016-05-22 (
GR26: Gem25(1001 0111 1101 1) Gem26(110)
GR27: Gem26(0001 0010 1101 11) Gem27(10)
GR28: Gem27(0111 0100 1110 1) Gem28(001)
GR29: Gem28(0010 0101 1000 0001 1001 1) Gem29(100)
GR30: Gem29(0101 1110 0100 11) Gem30(10)

Note: Changed leveling emphasis to GR==Gem

GR30: Gem29() Gem30(0100 0001 1111)
GR31: Gem30(11) Gem31(01 0011 1111)
GR32: Gem31(11) Gem32(11 1101 1101)
GR33: Gem32(1) Gem33(011 0101 1111)
GR34: Gem33(1) Gem34(111 0111)
GR35: Gem34(0001 0011) Gem35(0011 1101 0100)
GR36: Gem35(0100 101) Gem36(1)

2016-05-23 (
GR36: Gem35() Gem36(1011 0111)
GR37: Gem36(0100 1) Gem37(110 1110 1101)
GR38: Gem37(01) Gem38(00 1100 1011 1100)
GR39: Gem38(011) Gem39(1 1101 0011)
GR40: Gem39(1001 01) Gem40(01 0000 0011 1111)
GR41: Gem40(101) Gem41(1 0101 1001 1111)
GR42: Gem41() Gem42(1100 0110 1110)
GR43: Gem42(0011) Gem43(1001 1111)

2016-05-27 (
GR44: Gem43(0000 0001 11) Gem44(10 1111 1011)
GR45: Gem44(1) Gem45(110 1011 1110)
GR46: Gem45(1) Gem46(110 1101 0110)
GR47: Gem46(1001) Gem47(0000 1111 0010 1111)
GR48: Gem47() Gem48(1111 1011)
GR49: Gem48(101) Gem49(1 0110 1001 0011)

Note: Replaced one maxed gem50 with a new gem49 (leveled separately using 100%)

2016-05-28 (
GR50: Gem49(1100 1) Gem50(111 1110)
GR51: Gem50(0100 011) Gem51(0 0101 0010 1110)
GR52: Gem51(111) Gem52(0 0110 1110 0010)
GR53: Gem52(0010 11) Gem53(11 1111)
GR54: Gem53(111) Gem54(1 0011 0110 1011)
GR55: Gem54(1) Gem55(011 0101 1001)
GR56: Gem55(1001 0001) Gem56(0011 0010 0110 1101)

2016-05-29 (
GR57: Gem56(1) Gem57(011 1011 1111)
GR58: Gem57() Gem58(0011 0111 0100 1010)
GR59: Gem58(01) Gem59(10 0100 0010 1101 1001)
GR60: Gem59(1) Gem60(010 0001 0111 0011)
GR61: Gem60(0100 1) Gem61(000 1001 0011 1101)
GR62: Gem61(101) Gem62(0 1111 0111)
GR63: Gem62(0010 1) Gem63(110 1110 1000 0101)
Allllrighty, then...
The distribution is binomial whether or not you have a high number of trials or a probability far away from 0 and 1. You used the normal distribution to estimate the binomial distribution. That is what requires a middle probability and high number of trials.
I am not sure I understand your comment Mmbah. From the central limit theorem I expect the sum of my N theoretical results to have a normal distribution around the expected mean (N*P in this case) regardless of the underlying distribution from which those results were chosen.
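
For what it's worth, the point being argued here can be checked numerically. The sketch below compares the exact binomial cumulative probability with the normal approximation (using a continuity correction) for one case from this experiment and for a small-N, extreme-P case. The function names are illustrative, and the second case is a made-up example, not data from the thread.

```python
from math import comb
from statistics import NormalDist

def exact_cdf(k, n, p):
    """Exact binomial P(X <= k) for n trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(k, n, p):
    """Normal approximation to P(X <= k), with a continuity correction."""
    mean = n * p
    sd = (n * p * (1 - p)) ** 0.5
    return NormalDist(mean, sd).cdf(k + 0.5)

# (k, n, p): the GR=Gem case at a nominal 60%, then a hypothetical tiny sample.
for k, n, p in ((274, 451, 0.6), (0, 10, 0.05)):
    exact = exact_cdf(k, n, p)
    approx = normal_cdf(k, n, p)
    print(f"n={n}, p={p}: exact={exact:.4f}, normal={approx:.4f}, "
          f"gap={abs(exact - approx):.4f}")
```

With hundreds of trials near 60% the two agree closely, while with 10 trials at 5% the approximation is visibly off, which is consistent with both posts: the data are binomial, and the normal curve is an approximation that happens to work well in this range.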
Uh, whatever helps you enjoy the game, I guess...
06/04/2016 02:35 PM Posted by Nerdicus
Long post so here is the 5 second summary.
I decided to test if Urshi is lying to us.
H0(Null Hypothesis): Urshi is correctly stating the 60% gem upgrade odds.
TLDR Results: Da girl aint lyin, at least not about the 60% cases +0 and +1.

This was a waste of time to research. It's already commonly accepted that Urshi's odds are as stated. The only people who don't believe this are a small subset of people who either slept through statistics class or suffer from some delusion that they are unlucky.
Person fails all 4 upgrades doing an empowered rift, comes on forums and complains chances are broken. Next empowered rift they complete, they are successful on all 4 upgrades, says nothing. ugh
I've been noting my 60% upgrades as well. I'm 48/84, which is about a 57% success rate. Not a huge sample size yet, but not far enough from 60% to be alarming.
Good work OP.

This was a waste of time to research. It's already commonly accepted that Urshi's odds are as stated. The only people who don't believe this are a small subset of people who either slept through statistics class or suffer from some delusion that they are unlucky.

You might be surprised. People notice their bad luck streaks a lot more than their normal or good luck. Combine that with the sample bias of forum posts (people who experience unlikely streaks are far more likely to post about it than people who experience normal streaks) and you find that a lot of posters get easily convinced that they should get more upgrades.
Kudos to you, Nerdicus. I applaud your efforts to do this after reading the previous thread regarding people's doubts that also focused on the 60% probability number. No, folks that didn't read that thread, Nerdicus didn't randomly choose 60%. It was the same number used in the previous thread where another person did statistics that he said showed an exactly opposite result.
Well well. This is a big surprise. Not. Thankfully this useless debate might be put to rest once and for all.
Interesting thing about the good and bad streaks in this data. On the upside I had two 12, one 11, one 10 and two 9 length upgrade success streaks which were a bit on the high side of the expected for 932 trials. On the low side I had one 7, two 6, one 5 and eight 4 length upgrade fail streaks. These check out pretty well with what would be expected for a decent RNG and would seem to indicate that there is no playing around with entropy from the game engine.
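
Streak lengths like these can be pulled out of a string of 0/1 outcomes with a few lines. This sketch assumes the outcomes are supplied in the order the attempts were actually made; the posted log groups attempts by gem level within each GR, so streaks counted straight from it may not match exactly. The function names are just for the example.

```python
from collections import Counter
from itertools import groupby

def streak_counts(outcomes):
    """Count runs by (outcome, length) in a string of '0'/'1' characters."""
    return Counter((bit, len(list(run))) for bit, run in groupby(outcomes))

def longest_streak(outcomes, bit):
    """Length of the longest run of the given outcome ('0' or '1')."""
    lengths = [length for (b, length) in streak_counts(outcomes) if b == bit]
    return max(lengths, default=0)

# The Gem28 attempts from the GR29 line of the log, spaces removed.
sample = "001001011000000110011"
print("longest fail streak:", longest_streak(sample, "0"))
print("longest success streak:", longest_streak(sample, "1"))
```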

Thanks for the kind words everyone. To be honest just about any replication study is somewhat of a waste of time, but then playing this game is kind of a waste of time in its own right so yeah it comes down to how I want to waste my own time. "OP name checks out" I think is probably appropriate.

In this case it's been a few years since I have had to play with statistics without a recipe so just the process of designing the experiment was its own reward. I am a little surprised that someone with better knowledge hasn't chimed in to show me where my design went off the rails. That is frankly why I included all the extra info about the experiment design and the raw data.
Urshi is NOT a sadistic liar, just a compulsive one.
06/04/2016 10:15 PM Posted by Coolfool808
Person fails all 4 upgrades doing an empowered rift, comes on forums and complains chances are broken. Next empowered rift they complete, they are successful on all 4 upgrades, says nothing. ugh

I wish that was the case. I've had six successive failures @ 60% lots of times, but never 6 successes in a row! Whilst I haven't made as many lgem upgrade runs as others on here, my initial thoughts are that the upgrade chances for me have been between 50-55% overall. I'm not a particularly lucky person though - in fact, I'm what many of my personal friends consider "the unluckiest poor bastard they've ever met". I once played a game of 2 up (Aussie gambling game that's only legal to play on ANZAC day) - it consists of flipping a coin and guessing heads or tails. This one particular time, it took me 27 straight goes before I guessed correctly. Let's just say my friends were...dumbfounded. When I used to play D&D with my mates, it was always me rolling the 1 on a D20 and screwing up providing lots of fun for the rest of the party. I remember once playing poker on my old Vista ultimate PC, and having 4 of a kind (4 7's) and got beaten by a...royal flush. What are the odds of that...

I have a personal belief that a small amount of the population is just plain lucky (say, 5%), most are average (90%) and some are just plain unlucky (remaining 5%).

RNG is just plain painful to some people and just plain nice to others. For most, it's just a middle of the road !@#$%^!
Right now I've got 76/132, or ~57.6%. Small sample size, I know, but close enough to 60 that I'm not worried about it.

Longest upgrade streak is 6 (and there are 3 of those)
Longest fail streak is 4 (occurring twice)

I have had much longer streaks in both directions before, naturally.

Not quite streaks, but an example of what can happen when you're just looking at your most recent results (as we do when we're just playing normally knocking some rifts out and not writing this down like dorks)...upgrades 81-100 include 16 upgrades out of 20 attempts. The worst set was 61-80, with just 8 successes out of 20.

I chose to use blocks of 20 as an example because that's how many lines my notebook has so it's just looking at the columns ;)
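
As a rough check on how unusual those two blocks of 20 are, the exact binomial tail can be computed directly. This is a sketch assuming each attempt is an independent 60% chance; the helper names are made up for the example.

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact chance of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def at_least(k, n, p):
    """Chance of k or more successes."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def at_most(k, n, p):
    """Chance of k or fewer successes."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))

print(f"16 or more of 20 at 60%: {at_least(16, 20, 0.6):.3f}")
print(f"8 or fewer of 20 at 60%: {at_most(8, 20, 0.6):.3f}")
```

Each comes out to a bit over 5%, so a notebook with several columns of 20 can be expected to contain a column that looks this lopsided every now and then.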
Yeah, it's all relative and larger samples do help I think. These days I just shrug and get on with the game if I get a crappy run with RNG. It's frustrating sometimes (especially hunting for Sir William lol!!!) but it's the way that the game works. I came across a guy on YouTube who'd spent 1000 hours and never found The Furnace. I've had 4 of them now (1 from a rare to leg upgrade, the other 3 in game drops) after around 400 or so hours. Weird, isn't it?
