update 6:28PM ET 8 May On reddit, someone with a flag suggesting they ought to know tells me I got it wrong and the original algorithm was correct and is used in production. All I can say is I can’t figure out how that could be, I could not get it to work in a non-reddit codebase, I could get it to work with my ammendment.
If I haven’t corrected a typo, then I guess I’ve derived my own variation which works a lot better for me (that is, works at all for me, but also seems to mimic reddit), in my own codebase. Good enough for me. If you are trying to reimplement this algorithm in a non-reddit codebase, I suspect you’ll find my investigation useful.
Now back to the blog post as originally written.
More oddly, this same significant typo is in the public version of reddit’s code released on github.
I’ve found myself finding joy in code-for-code’s-sake like I haven’t since past days of being an undergrad staying up all the night in the CS lab working on the most fun homework ever. And so I found myself staying up into the wee hours last night investigating reddit’s story ranking algorithm (the one used for stories/posts in the default ‘hot’ ranking, that is time-of-posting sensitive. A different algorithm is used for comments).
The wrong algorithm
The (typo’d) algorithm is most nicely described by Amir Salihefendic. He even provides a python implementation. I figured I’d translate it to my preferred comfortable language ruby, and play around with it changing different parameters to get a feel for it, and get a feel of how it might be modified/tuned to behave somewhat differently if one wanted.
My assumption was that this algorithm outputs a number which can be stored in the database, and stories can be ordered purely by this number, to produce the on-page ranking. This seems indeed to be true, although I was doubting it a bit in the middle. (Another nice thing about this particular algorithm, that everyone did catch on to even in the typo’d version, is that a ranking order calculation only needs to be done when a ‘vote’ happens, then it can be stored in the db unchanging forever (until another vote happens)).
So I translated Amir’s python to ruby and starting playing, and the results made no sense. They didn’t match how things work on reddit, and they didn’t result in any kind of useful ranking algorithm.
Users of reddit know that the story listings are mostly chronological. The vote count will change a story’s position somewhat, but not put it dozens of weeks or pages ahead or behind of it’s strictly chronological order. But that’s what this algorithm did. It also gave any story with a net-negative vote a negative score. Which would put all the net-positive vote stories before any of the net-negative voted stories. Which is not how reddit works.
Looking at the math again now, I’m kind of embaressed I didn’t immediately see the problem, it’s not complicated math. But I didn’t, it just made my head swim. I’ll give you the relevant line from Amir’s python version here, maybe you can do better than I did, now that I’ve primed you:
return round(order + sign * seconds / 45000, 7)
Before I give you the answer, I’m going to tell you all the things I did first:
- Went back and from scratch re-ported Amir’s python to ruby, doing as literal a translation as possible. Same nonsensical result.
- Took the more mathematical description on Amir’s page (the one in a giant png? That he took from semoz?), and implemented that in ruby. Same nonsensical results (modulo some new bugs I introduced).
- Googled around for anyone elses description of the algorithm to see if they had a different version, or explained it better, or explained how it fit into the context of the software as a whole (maybe I was wrong that it was supposed to produce the actual bare ordering number?). No dice.
- Found the relevant source in the reddit github open repository (Amir’s link was broken as reddit moved their public source repo. Hooray github for being so easy to navigate on the web). Translated this to ruby. Same results.
- Okay, noted that the reddit source mentions an “equivalent function in
postgres” Okay, found the SQL implementation, translated THAT to ruby, same results.
(Yep, the implementation in github public reddit has the typo, and is wrong!).
At this point, I gave up on understanding the reddit algorithm, I figured there was something I was missing (wrong, only thing I was missing was the typo). But, okay, I dove back into the math, trying to understand it and convert it to something that would work for me.
Take a moment to note lesson learned
Like many programmers, I rather like working from fixed assumptions and constraints, and building on top. This is kind of the nature of abstraction, don’t question the lower levels, take em as assumptions, don’t question em, build upon them.
This is the second time recently that’s led me astray into butting my head against a dead end wall repeatedly, assuming the problem was in my own implementation or understanding, instead of in the framework code I was using, or the published algorithm or explanation I was working from.
Sometimes you’ve got to start questioning the validity of the algorithm you’re working from or the correctness of the library/framework code you’re using sooner rather than later, to save yourself some time. However, do it privately, if you start questioning your dependencies in public without evidence, everyone’s probably (rightly) going to tell you “occam’s razor, the bug is probably in your code, not the dependency.”
WRONG: return round(order + sign * seconds / 45000, 7) RIGHT: return round( (order * sign) + (seconds / 45000), 7)
Is it obvious now that you see it, that the first one makes no sense, but the second one does? Maybe if you see it in context, here’s my ruby implementation of the corrected algorithm.
I feel kind of stupid for not noticing this right away; on the other hand, as far as I can find on google, nobody’s pointed out the typo bug before, and several have commented on the (wrong, typo buggy) algorithm.
The Explanation of the Algorithm
With the typo corrected, it’s much easier to explain the algorithm. The crucial line, from my ruby version, with variables named how I think is clearer:
return (displacement * sign.to_f) + ( epoch_seconds(date) / 45000 )
It plots each story on a fixed timeline by post, and then displaces a story on the timeline by it’s votes. It uses only the vote difference between up and down for displacement, the total number of votes is irrelevant. First:
... + ( epoch_seconds(date) / 45000 )
This just plots each story on a fixed timeline, with distance between two stories always exactly proportional to difference between absolute posting time.
The `/45000` fixes the units of the timeline as “12 hour periods” (45000 seconds in 12 hours), rather than seconds. This reduces the order of magnitude of the units by 4.5ish, making them conveniently less likely to overflow wherever you’re keeping them. But more importantly, choosing the units matters for how much displacement the actual votes will cause, making sure they match appropriately. Then:
(displacement * sign.to_f) + ....
Here’s our displacement. `displacement` is the based on the vote difference (up – down), but on a logarithmic scale. The way the logarithmic scale is calculated, it loses the sign, so it just has to be added back in to net-down-votes will displace the story to be older on the timeline, and net-up-votes will displace the story to be newer on the timeline.
Why is a logarithmic scale used? Other explainers have said something like “to weight the first votes higher than the rest.” While it might have this effect because of reddit voter’s behavior, this is a misleading explanation. The algorithm pays no attention to which votes were made first, either in absolute chronological time or in sequence. It’s just vote-difference. ”10 up, 1 down” has exactly the same effect as “100 up, 91 down” or “1000 up, 991 down”. And it doesn’t matter what order the ups and downs were placed.
The logarithmic scale is in fact used to prevent the displacement-from-voting from displacing the display order too much. Reddit doesn’t want a very high or low voted story to be months ahead or behind, the reddit ‘hot’ order is mostly chronological, with just some displacement from votes.
I dont’ do this kind of mathematical analysis much, and don’t know how to get, say, R, to make you a pretty plot (it ought to be an actual function plot not a bar graph, for explanatory power). So I’ll just give you some samples of how much displacement a given vote-diff can get. Again vote-diff is just ups minus downs, doens’t matter total number of votes. I’ve converted from the “12-hour units” the displacement is actually expressed in to more comprehensible ‘in hours’ units.
|vote-diff||displacement in hours|
As you see, even something with an absurdly high 10000 vote-diff only gets put 50 hours ahead of it’s usual place in the timeline. Likewise, if it had a -10000 vote-diff (10k more downvotes than upvotes), it would be only 50 hours behind it’s usual place in the timeline.
Keeps votes from changing the position of a story too much, keeping it at the top forever, or moving it so many pages in that nobody ever sees it. That’s what the log scale does.
And that scale pretty well matches what we reddit users actually observe on reddit, I went and checked it against some popular reddits; reddit only displays approximate posting time of a story as far as I can tell (“1 day ago” could mean 28 hours or 32 hours or whatever), so can’t check completely, but the actual ordering could be explained by the corrected algorithm.
Wrong in the public source on github?
update 6:55pm ET 8 May 2012. reddit assures me that this code is what reddit runs live, and I have made some really stupid mistake. Fair enough. Struck out this section.
This means that there’s pretty much no way actual reddit.com is using the code they’ve posted publicly on github. At the very least, they’ve fixed this bug in the implementation they’re actually running. It probably means nobody else is using the reddit github source either, cause it wouldn’t work right with ‘hot’ ranking. (Or someone else is using it, and fixed the bug in their source but didn’t send it back). How did this bug end up in the publicly shared reddit source? Not fixed yet? I’m kind of curious, and curious as to what relationship this publicly shared source has to what reddit actually runs.
Considering tweaking the algorithm
Now that we understand the basic “timeline + displacement” algorithm, we can consider tweaks/modifications/tuning of the algorithm to behave differently in different environments, which curiosity was my original motivation for looking into this in the first place.
You might want vote displacement to have more of an effect, or have the effect trail off faster or slower . You’d still want to use a log-scale (or a mathematical function with similar properties) to keep very high vote-diffs from displacing a story too much, you still want a trail-off effect.
You could change the log from base-10 to base-something else to effect the velocity of the trail-off effect. You could also introduce a factor into the operand of the log, take `log( factor * vote-diff)` instead of just `log(vote-diff)`. You potentially could change the units from 12-hour units to something else (the 45000 number), but that could get confusing quick, you might need to add another factor on the left hand of the sum to compensate. So actually, instead, you can just add a factor in the left-hand side, `factor * log(votediff)` instead of just `log (votediff)`
I’m not enough of a math guy to predict exactly what all those things would do, I’d want to actually plot the function in R (or something else) and see what it looks like when you change the various factors, and I don’t know enough R (or anything else) to do it. I think plotting vote-diff vs hours-of-displacement is the right thing to plot though to give you the right feedback.
You could also try to introduce something to the equation to take account of total number of votes, so “10 up, 1 down” and “100 up, 91 down” don’t have exactly the same effect. You’d want to base this on the Wilson score confidence interval used by reddit for default comment ranking somehow, that’s the right way to take account of total number of votes, but it’s not immediately clear to me where you’d introduce that into the equation how (Did I mention I’m not a math guy?). That would make it a bit harder to see what it does by plotting it, since it’ll be a multi-variate function now, doh.
And you might not want to trust that the algorithm found in reddit’s public github source for Wilson score confidence interval is actually bug free. Last year someone said they found a bug in at least one published implementation; I think I saw someone say it had been fixed on reddit.com, but I don’t know if it’s been fixed in the github public source.
You might also want to make up votes worth more or less than downvotes, instead of equivalent. Not quite sure how you’d do that. You could make net-negative votes worth more or less than the same absolute value net-positive, just by using a factor in `factor * log(diff)` that depends on diff being positive or negative.