Doing a Sentiment Analysis on Tweets (Part 1)

INTRO

So… this post is my first foray into the twitteR package for R, and it assumes you already have that package installed. I show how to get tweets from Twitter in preparation for some sentiment analysis; my next post will cover the actual sentiment analysis.

For this example, I am grabbing tweets related to “Comcast email.” My goal for this exercise is to see how people feel about the product I support.

STEP 1: GETTING AUTHENTICATED TO TWITTER

First, you’ll need to create an application at Twitter. I used this blog post to get rolling; it does a good job of walking you through the steps.

Once your app is created, this is the code I used to create and save my authentication credentials. After you’ve done this once, you only need to load the saved credentials to authenticate with Twitter.


library(twitteR) ##CRAN package that does some of the Twitter API heavy lifting
library(ROAuth) ##provides OAuthFactory

consumerKey <- "INSERT YOUR KEY HERE"
consumerSecret <- "INSERT YOUR SECRET HERE"
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)

twitCred$handshake() ##walks you through authorizing the app with Twitter

save(twitCred, file="credentials.RData") ##saves the twitCred object that Step 2 loads and registers
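
If you’d rather not paste your key and secret into a script, one option is to read them from environment variables. This is only a minimal sketch: getSetting() is a hypothetical helper, and the variable names TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET are assumptions, not something twitteR requires.

```r
## Hypothetical helper: read a required setting from an environment variable
## and fail loudly if it is missing or empty
getSetting <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || value == "") {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

## Assumed variable names; uncomment once they are set on your machine
## consumerKey <- getSetting("TWITTER_CONSUMER_KEY")
## consumerSecret <- getSetting("TWITTER_CONSUMER_SECRET")
```

With those variables set (in ~/.Renviron, for example), the two commented lines could stand in for the hard-coded strings.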

STEP 2: GETTING THE TWEETS

Once you have your authentication credentials set, you can use them to grab tweets from Twitter.

The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub.


##Authentication
load("credentials.RData") ##has my secret keys and shiz
registerTwitterOAuth(twitCred) ##logs me in

##Get the tweets to work with
tweetList <- searchTwitter("comcast email", n = 1000) ##Searches twitter for anything with comcast and email in it

tweetList <- twListToDF(tweetList) ##converts that data we got into a data frame

As you can see, I used the twitteR package to authenticate and search Twitter. After getting the tweets, I converted the results to a data frame to make them easier to analyze.
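
To try the later steps without calling the Twitter API, a small stand-in data frame can help. The sketch below is an assumption-laden toy: the rows are made up (not real tweets), sampleTweets is a hypothetical name, and it carries only four of the columns that the search results contain.

```r
## Made-up sample rows (not real tweets) using four of the column names
## found in the data frame of search results
sampleTweets <- data.frame(
  text = c("Comcast please fix my email",
           "Selling a bike, email me at someone@comcast.net",
           "I have no email again today"),
  created = as.POSIXct(c("2014-12-23 09:00:00",
                         "2014-12-23 10:00:00",
                         "2014-12-23 11:00:00"), tz = "UTC"),
  statusSource = rep("<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", 3),
  screenName = c("user_one", "user_two", "user_three"),
  stringsAsFactors = FALSE
)

str(sampleTweets)   ## shows the column names and types
nrow(sampleTweets)  ## number of rows (tweets)
```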

STEP 3: GETTING RID OF THE JUNK

Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at myemailaddress@comcast.net”

The tweet above includes the words email and comcast, but it has nothing to do with Comcast Email or how the user feels about it, other than that they use it for their business.

So… based on some initial manual analysis of the tweets, I’ve decided to pull the tweets that match any of these rules:

  • “fix” AND “email” in the text (in that order)
  • “Comcast” AND “email” in the text (in that order)
  • “no email” in the text
  • “comcast” in the handle of the account that sent the tweet
  • “Customer Service” AND “email” in the text (in either order)


This is done with this code:


##finds the rows that have the phrase "fix ... email" in them
fixemail <- grep("(fix.*email)", tweetList$text)
##finds the rows that have the phrase "comcast ... email" in them
comcastemail <- grep("[Cc]omcast.*email", tweetList$text)
##finds the rows that have the phrase "no email" in them
noemail <- grep("no email", tweetList$text)
##finds the rows that originated from a Comcast twitter handle
comcasttweet <- grep("[Cc]omcast", tweetList$screenName)
##finds the rows related to email and customer service
custserv <- grep("[Cc]ustomer [Ss]ervice.*email|email.*[Cc]ustomer [Ss]ervice", tweetList$text)
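
To see what grep() hands back for patterns like these, here is a toy sketch on made-up text (not real tweets). It shows that the result is a vector of positions, that "fix.*email" cares about word order, and that matching is case sensitive unless you ask otherwise; ignore.case and value are standard grep() arguments.

```r
## Made-up example text, not real tweets
txt <- c("Please fix my email",     ## "fix" then "email": matches
         "My email needs a fix",    ## "fix" comes after "email": no match
         "FIX MY EMAIL NOW",        ## upper case: only matches with ignore.case
         "nothing relevant here")

grep("fix.*email", txt)                      ## 1
grep("fix.*email", txt, ignore.case = TRUE)  ## 1 3
grep("fix.*email", txt, value = TRUE)        ## returns the matching strings instead of positions
```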

After pulling out the duplicates (some tweets may fall into multiple scenarios from above) and ensuring they are in order (as returned initially), I assign the relevant tweets to a new variable with only some of the returned columns.

The returned columns are:

text favorited favoriteCount replyToSN
created truncated replyToSID id
replyToUID statusSource screenName retweetCount
isRetweet retweeted longitude latitude

All I care about are:

text created statusSource screenName

This is handled through this tidbit of code:


##combine the row numbers of the "good" tweets that we grepped out above, drop the duplicates, and sort them
combined <- c(fixemail, comcastemail, noemail, comcasttweet, custserv)
uvals <- unique(combined)
sorted <- sort(uvals)

##pull the rows that we want, with only the columns that are important to us (tweet text, time of tweet, source, and username)
paredTweetList <- tweetList[sorted, c("text", "created", "statusSource", "screenName")]
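
Here is a toy sketch of the combine, de-duplicate, and sort step on made-up row numbers, so you can see what each stage produces. The index vectors and the small data frame df are invented for illustration.

```r
## Made-up row numbers standing in for the grep() results
fixemail     <- c(2, 7)
comcastemail <- c(7, 3)
noemail      <- c(3, 9)

combined <- c(fixemail, comcastemail, noemail)  ## 2 7 7 3 3 9
keep <- sort(unique(combined))                  ## 2 3 7 9

## Invented data frame with ten rows, used to show the row and column selection
df <- data.frame(text = letters[1:10],
                 favorited = FALSE,
                 created = 1:10,
                 stringsAsFactors = FALSE)

pared <- df[keep, c("text", "created")]  ## rows 2, 3, 7, 9 and two columns
pared
```

A row that matches more than one rule (rows 7 and 3 here) shows up only once in the result.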

STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS

Lastly, for this first script, I make the sources look nice, add titles, and return the final list (only a sample set of tweets shown):


##make the device source look nicer
paredTweetList$statusSource <- sub("<.*\">", "", paredTweetList$statusSource)
paredTweetList$statusSource <- sub("</a>", "", paredTweetList$statusSource)

##name the columns
names(paredTweetList) <- c("Tweet", "Created", "Source", "ScreenName")

paredTweetList

Tweet | created | statusSource | screenName
Dear Mark I am having problems login into my acct REDACTED@comcast.net I get no email w codes to reset my password for eddygil HELP HELP | 2014-12-23 15:44:27 | Twitter Web Client | riocauto
@msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW | 2014-12-23 14:52:50 | Twitter Web Client | Monty_H_Mathis
Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast | 2014-12-23 09:20:14 | Twitter Web Client | mathercesul
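
To show what the two sub() calls in Step 4 do to a source value, here is a minimal sketch on a single string. The sample value is my assumption about the HTML form of statusSource (a link wrapped around the client name), and the one-call gsub() version is an alternative I’m suggesting, not what the script uses.

```r
## Assumed example of a raw statusSource value
src <- "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"

step1 <- sub("<.*\">", "", src)  ## strips the opening tag: "Twitter Web Client</a>"
step2 <- sub("</a>", "", step1)  ## strips the closing tag: "Twitter Web Client"

## Alternative: remove every tag in one call
oneStep <- gsub("<[^>]+>", "", src)  ## "Twitter Web Client"
```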

CHALLENGES OBSERVED

As you can see from the output, some “junk” still gets in. Something I’d like to keep working on is a more reliable algorithm for identifying relevant tweets. I’m also worried that my choice of phrases is biasing the sentiment.
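
One possible direction for the junk problem, offered only as a rough sketch and not something the script does today: flag tweets where “comcast” shows up only inside an @comcast.net address, like the selling example in Step 3. The text below is made up, and the address pattern is a simplifying assumption that will not catch every valid email address.

```r
## Made-up example text, not real tweets
txt <- c("Selling a couch, email me at someone@comcast.net",
         "Comcast email has been down all day",
         "Hey Comcast, write me at someone@comcast.net about my email outage")

## Strip anything that looks like a comcast.net address (simplified pattern)
withoutAddress <- gsub("[[:alnum:]._-]+@comcast\\.net", "", txt, ignore.case = TRUE)

## TRUE when "comcast" appears in the tweet but only as part of an address
addressOnly <- grepl("comcast", txt, ignore.case = TRUE) &
  !grepl("comcast", withoutAddress, ignore.case = TRUE)

txt[!addressOnly]  ## keeps the second and third tweets
```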
