So… this post is my first foray into the R twitteR package, and it assumes that you already have the package installed. I show here how to get tweets from Twitter in preparation for some sentiment analysis. My next post will cover the actual sentiment analysis.
For this example, I am grabbing tweets related to “Comcast email.” My goal for this exercise is to see how people feel about the product I support.
STEP 1: GETTING AUTHENTICATED TO TWITTER
First, you’ll need to create an application at Twitter. I used this blog post to get rolling; it does a good job of walking you through the steps.
Once your app is created, this is the code I used to create and save my authentication credentials. After you’ve done this once, you only need to load your saved credentials to authenticate with Twitter in the future.
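The original code for this step is not reproduced here, so what follows is only a minimal sketch of one way to authenticate, not necessarily the route I took. It assumes a recent release of twitteR that provides setup_twitter_oauth(), and it assumes the four credentials from your Twitter app are stored in environment variables. The variable names and the authenticate_twitter() wrapper are hypothetical.

```r
# Sketch only: needs the twitteR package and credentials from your Twitter app.
# Assumption: the credentials live in environment variables (hypothetical
# names), so they never appear in the script itself.
authenticate_twitter <- function() {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("The twitteR package must be installed first.")
  }
  twitteR::setup_twitter_oauth(
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )
}

# Call once per R session before searching:
# authenticate_twitter()
```

Unlike the save-and-load approach described above, this sketch re-registers the credentials in each session instead of storing a credentials object on disk.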
STEP 2: GETTING THE TWEETS
Once you have your authentication credentials set, you can use them to grab tweets from Twitter.
The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub.
I used the twitteR package to authenticate and search Twitter. After getting the tweets, I converted the results to a data frame to make them easier to analyze.
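As a rough sketch of the search-and-convert step (not the exact code from scraping_twitter.R), the twitteR functions searchTwitter() and twListToDF() can be combined as shown. The fetch_tweets() wrapper and the value of n are my own illustrative choices, and the call only works after authenticating.

```r
# Sketch only: search Twitter and return the results as a data frame.
# fetch_tweets() is a hypothetical helper, not part of twitteR.
fetch_tweets <- function(query, n = 100) {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("The twitteR package must be installed first.")
  }
  # searchTwitter() returns a list of status objects
  tweets <- twitteR::searchTwitter(query, n = n)
  # twListToDF() flattens that list into one row per tweet
  twitteR::twListToDF(tweets)
}

# Example call (requires a registered OAuth session):
# tweets_df <- fetch_tweets("comcast email", n = 500)
```

This sketch does not handle the case where the search returns no tweets.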
STEP 3: GETTING RID OF THE JUNK
Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at email@example.com”
The tweet above includes the words email and comcast, but has nothing to do with Comcast Email or how the user feels about it, other than that they use it for their business.
So… based on some initial manual analysis of the tweets, I’ve decided to keep only those tweets that contain the phrases:
This is done with this code:
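Here is a minimal sketch of phrase filtering on a toy data frame. The phrases and the sample tweets are hypothetical placeholders for illustration, not my actual list or real data.

```r
# Toy data standing in for the data frame of search results.
tweets_df <- data.frame(
  text = c(
    "My Comcast email has been down all morning",
    "Selling a bike. I'm on Comcast, so email me for details",
    "Can't log in to Comcast email again"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical phrases, for illustration only.
phrases <- c("comcast email", "log in")

# For each phrase, find the row numbers of tweets that contain it.
# Lower-casing both sides gives a case-insensitive literal match.
matches <- lapply(phrases, function(p) {
  grep(tolower(p), tolower(tweets_df$text), fixed = TRUE)
})
names(matches) <- phrases
```

The second toy tweet mentions both "Comcast" and "email" but matches neither phrase, which is the kind of junk this step is meant to drop.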
After removing the duplicates (some tweets may match more than one of the phrases above) and ensuring the tweets are in the order in which they were initially returned, I assign the relevant tweets to a new variable with only some of the returned columns.
The returned columns are:
All I care about are:
This is handled through this tidbit of code:
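A sketch of the de-duplication, ordering, and column selection, again on toy data. It assumes the columns of interest are the four that appear in the sample output (text, created, statusSource, screenName); the matched row numbers are hypothetical.

```r
# Toy stand-in for the search results; real results have more columns.
tweets_df <- data.frame(
  text         = c("tweet a", "tweet b", "tweet c", "tweet d"),
  created      = c("2014-12-23 15:44:27", "2014-12-23 14:52:50",
                   "2014-12-23 09:20:14", "2014-12-22 18:01:02"),
  statusSource = rep("Twitter Web Client", 4),
  screenName   = c("user_a", "user_b", "user_c", "user_d"),
  retweetCount = c(0, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical row numbers matched by two phrases; row 3 matched both.
matches <- list(c(3L, 1L), c(3L, 4L))

# Combine, drop duplicates, and sort so rows keep their original order.
keep <- sort(unique(unlist(matches)))

# Assumption: only these four columns are needed downstream.
relevant <- tweets_df[keep, c("text", "created", "statusSource", "screenName")]
```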
STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS
Lastly, for this first script, I tidy up the source values so they look nice, add column titles, and return the final list (only a sample set of tweets shown):
|Dear Mark I am having problems login into my acct REDACTED@comcast.net I get no email w codes to reset my password for eddygil HELP HELP||2014-12-23 15:44:27||Twitter Web Client||riocauto|
|@msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW||2014-12-23 14:52:50||Twitter Web Client||Monty_H_Mathis|
|Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast||2014-12-23 09:20:14||Twitter Web Client||mathercesul|
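The clean-up behind a table like the one above can be sketched as follows. It assumes the source column arrives as an HTML link wrapping the client name, and the column titles are hypothetical, since the ones I used are not shown here.

```r
# Toy row standing in for the filtered tweets.
relevant <- data.frame(
  text         = "Example tweet",
  created      = "2014-12-23 15:44:27",
  statusSource = "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
  screenName   = "example_user",
  stringsAsFactors = FALSE
)

# Strip the HTML tags so only the client name remains.
relevant$statusSource <- gsub("<[^>]+>", "", relevant$statusSource)

# Hypothetical column titles, for illustration only.
names(relevant) <- c("Tweet", "Created", "Source", "Screen Name")
```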
As the sample tweets show, some “junk” still gets in. Something I’d like to continue working on is a more reliable algorithm for identifying appropriate tweets. I am also worried that my choice of subjects is biasing the sentiment.