Reddit is a content aggregation site where users submit pictures, links, videos, and just about anything else. Within Reddit are communities called subreddits, each of which sets its own topic of discussion and content. Every NBA team has its own subreddit, where fans (or anyone, but mostly fans) post content relating to their team. Here's a look at the top posts of the day for the Boston Celtics subreddit, reddit.com/r/bostonceltics.
The Celtics played the Nets last night. Here's a picture of the Nets' subreddit, reddit.com/r/gonets, after the game. I don't think I need to say who won.
These two pictures illustrate my idea: you can tell who won a game from the mood of the fans on each team's subreddit.
Before doing anything else, we need the winners and losers. Since we're building a training set, each team gets labeled as a winner or a loser so we can look for a pattern once enough data has been collected. I looked into an API or two for the game data, but I settled on scraping it myself from basketball-reference.com because the APIs couldn't do exactly what I wanted. I hope to eventually replace this code with an API, since basketball-reference.com can change its layout at any time.
def get_days_teams():
    # scrape yesterday's box scores for the winning and losing teams
    import datetime
    from bs4 import BeautifulSoup
    import requests
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9",
    }
    # use a timedelta so month boundaries are handled correctly
    yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
    url = "https://www.basketball-reference.com/boxscores/?month={0}&day={1}&year={2}".format(yesterday.month, yesterday.day, yesterday.year)
    result = requests.get(url, headers=headers)
    soup = BeautifulSoup(result.content, 'html.parser')
    content = soup.findAll(class_="teams")
    winners = []
    losers = []
    for teams in content:
        # the first td in each winner/loser row holds the team name
        winner = teams.find(class_="winner")
        winners.append(winner.find('td').text)
        loser = teams.find(class_="loser")
        losers.append(loser.find('td').text)
    return winners, losers
> ['Boston', 'Detroit', 'Indiana', 'Golden State', 'Minnesota', 'Cleveland', 'Sacramento', 'Milwaukee']
> ['Brooklyn', 'Houston', 'Chicago', 'LA Clippers', 'New Orleans', 'Orlando', 'Denver', 'Washington']
Once we have the winners and losers, we can map those to their respective subreddit and grab the top posts from the day.
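One piece not shown in the code below is the subreddits dict used later in compute_winners and compute_losers: it maps the team names as basketball-reference prints them to subreddit paths. Only the Celtics and Nets paths appear earlier in this post, so treat this as a sketch of the shape rather than the full 30-team table:

# maps team names as basketball-reference prints them to subreddit paths
# only the two paths shown earlier in this post are confirmed here
subreddits = {
    'Boston': '/r/bostonceltics',
    'Brooklyn': '/r/gonets',
    # ... one entry per NBA team
}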
def get_posts(subreddit):
    from bs4 import BeautifulSoup
    import requests
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9",
    }
    result = requests.get('https://www.reddit.com' + subreddit, headers=headers)
    soup = BeautifulSoup(result.content, 'html.parser')
    content = soup.findAll(class_="title")
    # cut out the non-post elements at the start and end, and skip duplicate title matches
    titles = content[2:-4:2]
    sentences = [sen.text for sen in titles]
    paragraph = ' '.join(sen.strip() for sen in sentences)
    return paragraph, sentences
This code returns both the individual posts, as sentences, and a single paragraph of every title joined together. It's still too early to say which makes the better metric, so I'm keeping both.
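As a quick sanity check (the slice indices above depend on reddit's current layout, so your counts may vary), calling it looks like this:

# actual titles change daily; this just shows the output shapes
paragraph, sentences = get_posts('/r/bostonceltics')
print(len(sentences))    # number of post titles scraped
print(sentences[0])      # an individual post title
print(paragraph[:80])    # the start of the joined-together text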
Now we can analyze the posts individually and the posts as a whole.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')
# nltk.download('punkt')

def analyze_paragraph(paragraph):
    sid = SentimentIntensityAnalyzer()
    ss = sid.polarity_scores(paragraph)
    return ss['pos'], ss['neu'], ss['neg'], ss['compound']

def analyze_sentences(sentences):
    sid = SentimentIntensityAnalyzer()
    pos = neu = neg = com = 0
    for sentence in sentences:
        ss = sid.polarity_scores(sentence)
        pos += ss['pos']
        neu += ss['neu']
        neg += ss['neg']
        com += ss['compound']
    return pos, neu, neg, com
NLTK, the Natural Language Toolkit, includes VADER, a sentiment analyzer built for short social-media text. Running polarity_scores on some text gives four scores.
> pos: 0.876, neu: 0.124, neg: 0.0, compound: 0.7925
Positive, neutral, and negative are the proportions of the text that read as each sentiment. Compound measures the overall intensity of positivity or negativity and is bounded between -1 and 1: close to 1 means the text is very positive, close to -1 very negative. There's a great, very in-depth Stack Overflow post explaining exactly how the compound measure works. I intend to use the compound score, but I'm keeping all the other scores as well. Space isn't an issue, so there's no sense throwing anything away before I build the prediction model.
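For a concrete, standalone example (the sentence is made up, and the exact numbers will differ):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
# any short text works; this sentence is purely illustrative
print(sid.polarity_scores("What an amazing comeback win!"))
# prints something like {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}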
analyze_paragraph takes in the full joined-together text for each subreddit and returns its four scores. analyze_sentences takes in the list of posts and sums the scores across them. Each team has the same number of posts, so we don't need to worry about normalizing right now. Perhaps we'll need to once we start building a model with more data, but it's fine for now.
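If the post counts ever stop matching (say, reddit changes its layout), normalizing would be a small tweak. A sketch of an averaged variant, not part of the pipeline yet:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentences_normalized(sentences):
    # hypothetical averaged version of analyze_sentences
    sid = SentimentIntensityAnalyzer()
    scores = [sid.polarity_scores(s) for s in sentences]
    n = max(len(scores), 1)  # guard against an empty page
    return (sum(s['pos'] for s in scores) / n,
            sum(s['neu'] for s in scores) / n,
            sum(s['neg'] for s in scores) / n,
            sum(s['compound'] for s in scores) / n)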
Finally, we can put all these pieces together and find the scores for each team.
def compute_all():
    winners, losers = get_days_teams()
    winner_data = compute_winners(winners)
    loser_data = compute_losers(losers)
    data = winner_data + loser_data
    return data

def compute_winners(winners):
    data = []
    for team in winners:
        # subreddits maps a team name to its subreddit path, e.g. 'Boston' -> '/r/bostonceltics'
        paragraph, sentences = get_posts(subreddits[team])
        sen_pos, sen_neu, sen_neg, sen_com = analyze_sentences(sentences)
        par_pos, par_neu, par_neg, par_com = analyze_paragraph(paragraph)
        data.append([team, sen_pos, sen_neu, sen_neg, sen_com, par_pos, par_neu, par_neg, par_com, True])
    return data

def compute_losers(losers):
    data = []
    for team in losers:
        paragraph, sentences = get_posts(subreddits[team])
        sen_pos, sen_neu, sen_neg, sen_com = analyze_sentences(sentences)
        par_pos, par_neu, par_neg, par_com = analyze_paragraph(paragraph)
        data.append([team, sen_pos, sen_neu, sen_neg, sen_com, par_pos, par_neu, par_neg, par_com, False])
    return data
Note the format of the output for each team. I'm storing each game for each team as a row with the team name, the summed per-post scores (the sen_ columns), the whole-paragraph scores (the par_ columns), and a boolean indicating whether they won. This makes it simple to append the data to a csv and work with it in Pandas later for analysis. I've found it's extremely important to really think about what data you want to store and how you want to store it. If space isn't an issue, it's better to keep more than you think you need, because you can't go back and get it later ¯\_(ツ)_/¯
Now we can store the data in pandas, view the results, and append to an existing csv file.
# store results
import pandas as pd
import datetime
import os

data = compute_all()
df = pd.DataFrame(data, columns=['team', 'sen_pos', 'sen_neu', 'sen_neg', 'sen_com', 'par_pos', 'par_neu', 'par_neg', 'par_com', 'won'])
df['date'] = pd.to_datetime(datetime.datetime.now())
df = df.set_index('date')
# only write the header the first time the file is created
file_exists = os.path.isfile('nba_sentiment.csv')
df.to_csv('nba_sentiment.csv', mode='a', header=not file_exists)
Which gives the results
date | team | sen_pos | sen_neu | sen_neg | sen_com | par_pos | par_neu | par_neg | par_com | won |
---|---|---|---|---|---|---|---|---|---|---|
2018-01-07 | Boston | 1.508 | 24.014 | 1.478 | 1.0604 | 0.080 | 0.874 | 0.046 | 0.9163 | True |
2018-01-07 | Detroit | 0.638 | 25.185 | 0.177 | 1.2514 | 0.054 | 0.931 | 0.015 | 0.8574 | True |
2018-01-07 | Indiana | 3.831 | 21.794 | 1.375 | 4.4474 | 0.113 | 0.844 | 0.043 | 0.9365 | True |
2018-01-07 | Golden State | 2.746 | 23.576 | 0.677 | 4.3177 | 0.131 | 0.836 | 0.033 | 0.9906 | True |
2018-01-07 | Minnesota | 2.113 | 22.509 | 1.378 | 1.6908 | 0.091 | 0.862 | 0.047 | 0.9365 | True |
2018-01-07 | Cleveland | 2.913 | 23.112 | 0.975 | 3.8053 | 0.141 | 0.821 | 0.037 | 0.9888 | True |
2018-01-07 | Sacramento | 2.314 | 21.636 | 1.050 | 1.9895 | 0.096 | 0.865 | 0.039 | 0.9307 | True |
2018-01-07 | Milwaukee | 2.728 | 21.862 | 2.410 | 0.5261 | 0.112 | 0.786 | 0.102 | 0.7825 | True |
2018-01-07 | Brooklyn | 2.462 | 20.851 | 2.687 | 0.1842 | 0.094 | 0.822 | 0.084 | 0.6326 | False |
2018-01-07 | Houston | 1.780 | 22.713 | 1.507 | 0.0988 | 0.091 | 0.847 | 0.062 | 0.9347 | False |
2018-01-07 | Chicago | 2.207 | 22.777 | 1.016 | 4.1347 | 0.097 | 0.867 | 0.035 | 0.9806 | False |
2018-01-07 | LA Clippers | 1.214 | 24.353 | 1.433 | 0.8443 | 0.050 | 0.914 | 0.036 | 0.5914 | False |
2018-01-07 | New Orleans | 2.023 | 20.676 | 2.301 | -0.0109 | 0.083 | 0.853 | 0.064 | 0.5754 | False |
2018-01-07 | Orlando | 1.822 | 23.606 | 0.572 | 1.7442 | 0.093 | 0.888 | 0.019 | 0.9718 | False |
2018-01-07 | Denver | 1.724 | 23.939 | 1.337 | 1.1355 | 0.085 | 0.866 | 0.049 | 0.9214 | False |
2018-01-07 | Washington | 2.808 | 21.159 | 2.033 | 1.9738 | 0.129 | 0.787 | 0.083 | 0.9325 | False |
With the pipeline all set up, I just have to run this code every night after the day's games to keep building up the dataset. The next step is to figure out which sentiment measures actually correlate with a win.
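As a teaser for that analysis, once a few weeks of rows accumulate, a first look could be as simple as comparing average compound scores between winners and losers. A sketch, assuming the csv produced above:

import pandas as pd

# read the accumulated rows back, restoring the date index written above
df = pd.read_csv('nba_sentiment.csv', parse_dates=['date'], index_col='date')
# compare average sentiment between winners and losers
print(df.groupby('won')[['sen_com', 'par_com']].mean())

Thanks for reading and stay tuned for pt. 2!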