We left Part 2 with a dataset including a set of Redditors and the Subreddits they comment on. We also defined which similarity index we are going to use to measure the similarity among each Subreddit.
So, let’s continue!
Step 3. Calculating Subreddit similarity
Let’s refresh how we want the Subreddit similarity table look like
sub1 | sub2 | similarity
funny | aww | similarity(funnny-aww)
funny | Iama | similarity(funny-Iama)
...
My first implementation of the Similarity calculation algorithm was done in Python. However, this task is very heavy in terms of processing, so I ended up writing another implementation in Go (Go language performs better than Python in general).
The algorithm goes like this:
0. Sort all the Subreddits alphabetically
1. for each Subreddit, sub1
:
2. For each other subreddit alphabetically after sub1, sub2
:
3. Calculate the number of users who wrote comments in sub1 AND sub2, Union1-2
.
4. Calculate the number of users who wrote comments in sub1 OR sub2, Intersection1-2
.
5. Calculate the Jaccard Similarity Index of sub1 and sub2 as similarity1-2 = Union1-2 / Intersection1-2
.
6. Write on the similarity
table the row sub1,sub2,similarity1-2
.
Here is the code to implement the algorithm:
package main
import (
"database/sql"
"log"
"sync"
"utils"
"github.com/coopernurse/gorp"
_ "github.com/lib/pq"
)
func main() {
dbmap := InitDb()
var subs []string
n_threads := 200 //This will depend on the computing power of your machine
_, err := dbmap.Select(&subs, "select distinct(sub) from redditors_comments order by sub asc")
utils.CheckErr(err, "Error selecting redditors")
calculateSimilarity(subs, n_threads)
}
func createSplitSlice(s []string, n int) [][]string {
lenSlice := int(len(s) / n)
splitSlice := make([][]string, n)
i := 0
for _, e := range s {
if len(splitSlice) > lenSlice {
i = i + 1
}
splitSlice[i] = append(splitSlice[i], e)
}
return splitSlice
}
type SimilarityComments struct {
Sub1 string `db:"sub1"`
Sub2 string `db:"sub2"`
Similarity float64 `db:"similarity"`
}
func InitDb() *gorp.DbMap {
user := <DB_USER>
password := <DB_PWD>
host := <DB_HOST>
port := <DB_PORT>
dbname := <DB_NAME>
dbString := "host=" + host +
" port=" + port +
" user=" + user +
" password=" + password +
" dbname=" + dbname
db, err := sql.Open("postgres", dbString)
dbmap := &gorp.DbMap{Db: db, Dialect: gorp.PostgresDialect{}}
utils.CheckErr(err, "sql.Open failed")
//dbmap.TraceOn("[gorp]", log.New(os.Stdout, "reddit:", log.Lmicroseconds)) //uncomment to log the sql queries
dbmap.AddTableWithName(RedditorSubsComments{}, "redditors_comments")
dbmap.AddTableWithName(SimilarityComments{}, "similarity_comments")
return dbmap
}
func calculateSimilarity(subs []string, n_operators int) {
lenSlice := int(len(subs) / n_operators)
var start int
var end int
var wg sync.WaitGroup
for op := 0; op < n_operators; op++ {
start = op * lenSlice
end = (op + 1) * lenSlice
wg.Add(1)
go calculateSliceSimilarity(subs, start, end, &wg)
}
wg.Wait()
}
func calculateSliceSimilarity(subs []string, start int, end int, wg *sync.WaitGroup) {
/* sort subs
n processors
each procesor process lenSplice = len(subs) / n_processors
each processor process subs[i*lenSplice:(i+1)*lenSplice]
for each sub:
calculate similarity of that sub and those after it
*/
dbmap := InitDb()
defer dbmap.Db.Close()
lenSubs := len(subs)
for i := start; i < end; i++ {
for j := i + 1; j < lenSubs; j++ {
calculateSubsSimilarity(subs[i], subs[j], dbmap)
}
}
wg.Done()
}
func calculateSubsSimilarity(sub1 string, sub2 string, dbmap *gorp.DbMap) {
var count_union []int
var count_intersection []int
var union int
var intersection int
var similarity float64
query := "Select count(distinct(redditor)) from redditors_comments where sub ='" + sub1 +
"' or sub ='" + sub2 + "';"
_, err := dbmap.Select(&count_intersection, query)
utils.CheckErr(err, "Error on subsimilarity union")
intersection = count_intersection[0]
query = "select count(distinct(redditor)) from redditors_comments where sub='" +
sub1 + "' and redditor in (select distinct(redditor) from redditors_comments where sub='" + sub2 + "');"
_, err = dbmap.Select(&count_union, query)
utils.CheckErr(err, "Error on subsimilarity intersection")
union = count_union[0]
similarity = float64(union) / float64(intersection)
s := &SimilarityComments{sub1, sub2, similarity}
err = dbmap.Insert(s)
utils.CheckErr(err, "Error inserting similarity")
}
This script will take several hours to finish (on my 2 years old laptop it took 3 complete days).
If you are using Postgresql like me, you might need to modify your posgres.conf file to increase the maximum number of connections.
Once the similarity calculation is done, we will know how similar each one of Reddit top subreddits to each other.
Good thing of this implementation is, you can continue running it while your application is in production, so the algorithm gets better with every new user information that gets to the database.
Here is the list of the top pairs of subreddits in terms of Similarity:
sub1 | sub2 | similarity |
---|---|---|
iOSthemes | jailbreak | 0.142020815264527 |
asktransgender | transgender | 0.112781954887218 |
aviation | flying | 0.101759227319765 |
ukpolitics | unitedkingdom | 0.101730688054031 |
Liberal | progressive | 0.0998427672955975 |
keto | ketorecipes | 0.0974065138721351 |
Pokemongiveaway | pokemontrades | 0.0894308943089431 |
ImaginaryCharacters | ImaginaryMonsters | 0.088479809976247 |
frugalmalefashion | malefashionadvice | 0.0846720846720847 |
GiftofGames | RandomActsOfGaming | 0.0812588069516205 |
By observing the table with the most similar subreddits, we can see that the similarity calculations have some relevancy, all of the pairs of subreddits are very very similar to each other.
Another interesting analysis we could do would be to display which subreddits are more similar to a specific one.
For example, here are the most similar subreddits to the subreddit /r/beer:
sub | similarity |
---|---|
beerporn | 0.0723035554227667 |
Homebrewing | 0.0376598768356229 |
cocktails | 0.0122950819672131 |
wine | 0.0110444935310824 |
bourbon | 0.0105506099571382 |
newjersey | 0.0103806228373702 |
drunk | 0.0100190839694656 |
showerbeer | 0.00997719498289624 |
Cooking | 0.00884573894282632 |
Coffee | 0.008186852877438 |
Pretty decent results, we see how the most similar subreddits to the /r/beer sub are alcoholic/food related, and for some reason, /r/newjersey.
Now, the last step to figure out is, how can we recommend Subreddits to a Redditor?.
Very easy. Given a Redditor r
with a set of subreddits he/she likes [sub1,sub2,...]
,
we can sum the similarity of each subreddit on the dataset with each subreddit that redditor likes.
so on this example, r1
likes [sub1,sub2,...]
, so we compute the most similar subreddits to [sub1,sub2,...]
:
For sub1:
sub1 | sub2 | sim1-3 |
sub1 | sub4 | sim1-4 |
sub1 | sub5 | sim1-5 |
sub1 | … | … |
For sub2:
sub2 | sub3 | sim2-3 |
sub2 | sub4 | sim2-4 |
sub2 | sub5 | sim2-5 |
sub2 | … | … |
we continue calculating this subtables for each Subreddit that r1
likes.
Note that we don’t calculate sim1-2
because sub1 and sub2 are already liked by r1
.
Now, we calculate the similarity of each Subreddit with the Redditor r1
:
sim3-r1 = sim1-3 + sim2-3 ...
sim4-r1 = sim1-4 + sim2-4 ...
sim5-r1 = sim1-5 + sim2-5 ...
...
We now have the similarity of each Subreddit with Redditor r1
.
Last step it’s just returning to the user the top N subreddits sorted by this similarities.
So now we can say we have built a similarity engine. Oh yeah!
On the next (and final) part I will talk about how to use the similarity engine on a live Web Application.