I sound like a broken record here, but I want to emphasize that avoiding the score adjustment is not a design goal of this system. The point of the adjustment is to allow entries in the bottom half to be ranked relative to one another while minimizing randomness: scores backed by fewer ratings carry less confidence, so they get scaled down.
Also, keep in mind we’re using the median, not the average, so if a few people go all in and rate a lot of entries, the median will not be affected. The median is a good representation of how participants of the jam are voting.
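To illustrate why the median is the safer statistic here, a quick sketch with entirely made-up per-entry rating counts (none of these numbers come from a real jam):

```python
from statistics import mean, median

# Hypothetical rating counts per entry. Suppose a couple of entries end
# up with far more ratings than the rest of the pack.
rating_counts = [8, 9, 10, 10, 11, 12, 120, 150]

print(mean(rating_counts))    # 41.25 -- dragged way up by the two outliers
print(median(rating_counts))  # 10.5  -- essentially unaffected
```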
Lastly, for higher medians, also understand that increasing the median by 1 has less of a score impact on entries that are around the median, because the ratio of ratings received to the median number of ratings is what’s used (e.g. the shortfall at 99/100 is much smaller than at 9/10).
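As a rough sketch of that effect: the exact scaling curve isn’t spelled out in this thread, so treat the linear `adjusted_score` below as a hypothetical stand-in; only the use of the ratings/median ratio is stated above.

```python
def adjusted_score(raw_score: float, ratings: int, median_ratings: float) -> float:
    """Hypothetical sketch: scale down scores with fewer ratings than
    the median, using the ratings/median ratio. The real curve may differ."""
    if ratings >= median_ratings:
        return raw_score
    return raw_score * (ratings / median_ratings)

# The penalty near the median shrinks as the median grows:
print(round(1 - 9 / 10, 2))    # 0.1  -- a 10% reduction at a median of 10
print(round(1 - 99 / 100, 2))  # 0.01 -- only a 1% reduction at a median of 100
```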
By adjusting the algorithm with a focus on reducing the number of entries that get a score adjustment, you essentially introduce entropy into the rankings (so I don’t agree that your suggestion would stabilise the system). The main goal of the ranking algorithm is to avoid “fluke” situations where a submission that wasn’t seen by many people happened to get a high rating because the people who did see it didn’t care, were biased, or something else. Especially in a jam like GMTK, where public ratings are allowed, this is definitely a concern. The secondary goal is to let entries that were both highly scored and received a large number of ratings rise in the rankings. You may argue this can hide “hidden gem” submissions (seen by few, but actually very good), but since those projects have less confidence in their overall score, I think it’s a necessary sacrifice to accomplish the goal of ranking every entry relative to every other entry. (For example, on itch.io’s browse pages, “hidden gems” are a good thing, so we use a different algorithm to allow that type of content to surface.)
All that said, I wasn’t immediately dismissing your idea. I think it’s an interesting suggestion, and I would need time to run results from existing jams and observe the kind of impact it has. Just thinking about it offhand, I believe it would most likely introduce a higher “fluke” factor for jams that have lower medians. It’s hard to intuitively reason about how `Median * 80%` compares to something like `40th percentile` for the adjustment cutoff without actually running the numbers to see how it performs.
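For anyone who wants to experiment, here’s a minimal sketch for comparing where the two proposed cutoffs would land. The distribution of rating counts is made up for illustration; neither cutoff reflects what the live system does.

```python
from statistics import median, quantiles

# Hypothetical per-entry rating counts for a small jam with a low median.
rating_counts = sorted([2, 3, 3, 4, 4, 5, 5, 6, 8, 12, 20, 35])

med = median(rating_counts)                   # 5.0
cutoff_a = med * 0.8                          # Median * 80% -> 4.0
cutoff_b = quantiles(rating_counts, n=10)[3]  # 40th percentile (4th of 9 decile cut points)

below_a = sum(c < cutoff_a for c in rating_counts)
below_b = sum(c < cutoff_b for c in rating_counts)
print(cutoff_a, cutoff_b)  # how far apart the two cutoffs land
print(below_a, below_b)    # how many entries fall under each cutoff
```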
Regarding your point about communication, I definitely agree that we can do more to inform participants that rating other games boosts their own entry’s visibility. In the case of GMTK, I don’t remember offhand how the host communicated that to the participants, but I believe most people understood that they should be rating games.
Thanks!