Hello Readers,
Welcome back to my blog. Today we will analyze the term-document matrix that we created in the last post of the Text Mining Series.
We will perform frequent-term searches and term associations, with visualizations. Then we will finish the post by creating a word cloud to display the content of the terms in the tweets from @nbastats. Read Part 5 here.
Start R and let us begin programming!
Plotting Word Frequencies
Here we continue from where we left off last time. Begin by loading the twitteR, tm, and ggplot2 packages in R. Then use the findFreqTerms() function from the tm package to find the most frequent terms. We can specify the lower and upper bounds of the frequency values using the lowfreq and highfreq arguments. Here we return terms with 20 or more occurrences.
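A quick sketch of that call (assuming the term-document matrix from the previous post is stored in a variable named tdm, a name used here for illustration):

    library(twitteR)
    library(tm)
    library(ggplot2)

    # Terms appearing 20 or more times across all tweets;
    # highfreq defaults to Inf, so no upper bound is applied.
    freq.terms <- findFreqTerms(tdm, lowfreq = 20)
    freq.terms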
[Image: High Frequency Terms]
Next we take the 17 terms and create a bar graph of their frequencies using ggplot2. We can obtain the term counts with rowSums(), and we subset the sums to keep values of 20 or greater. Then we can plot the graph using qplot(): the geom = "bar" argument creates the bar graph, and coord_flip() swaps the x and y axes.
[Image: Bar Graph Code]
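A sketch of that code, again assuming the matrix is named tdm. Note that recent versions of ggplot2 dropped qplot()'s stat argument; ggplot() with geom_col() is the modern equivalent.

    # Term counts across all tweets, subset to 20 or more occurrences.
    term.freq <- rowSums(as.matrix(tdm))
    term.freq <- subset(term.freq, term.freq >= 20)

    # Flipped bar graph; stat = "identity" plots the counts as given
    # rather than tallying rows.
    qplot(names(term.freq), term.freq, geom = "bar", stat = "identity",
          xlab = "Terms", ylab = "Count") + coord_flip()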
The neat result is shown below:
[Image: Term Frequencies]
We can see that "games", "last", and "amp" are the top three terms by frequency.
Finding Word Associations
Using word associations, we can find how strongly certain terms correlate with one another across the tweets (documents). We can compute word associations with the findAssocs() function. Let us find the word associations for "ppg" (points per game) and return the terms with correlations higher than 0.25.
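A one-line sketch, with tdm again standing in for the term-document matrix:

    # Terms whose occurrence correlates with "ppg" at 0.25 or above.
    findAssocs(tdm, "ppg", 0.25)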
[Image: Associated Terms for "ppg"]
We see that "ppg" has a high 0.65 correlation with "rpg", or rebounds per game. This makes sense, as a tweet which contains statistics about points per game would also include other statistics, like rebounds, as well as "apg" (assists per game) and "fg" (field goals).
What about a team, say the Heat, which has LeBron James? We can find the word associations for "heat":
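The same call works here; the cutoff is assumed to be the same 0.25:

    # Terms whose occurrence correlates with "heat" at 0.25 or above.
    findAssocs(tdm, "heat", 0.25)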
[Image: Word Associations for "heat"]
The top correlated terms are "adjusts", "fts" (free throws), "rockets", "thunder", and "value". Both the Houston Rockets and the Oklahoma City Thunder are top teams, so it makes sense that they would be mentioned in the same tweet, especially if they play against each other. LeBron is having a record year in player efficiency, which might be why "trueshooting" (true shooting percentage) is an associated term with a 0.49 correlation.
We can plot the word associations for "heat". The code, similar to the previous plot, is shown below.
[Image: Plotting "heat" Word Associations]
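A sketch along those lines. Recent versions of tm return findAssocs() results as a list keyed by term, hence the [["heat"]]:

    # Association scores for "heat" as a named numeric vector.
    heat.assocs <- findAssocs(tdm, "heat", 0.25)[["heat"]]

    # Same flipped bar graph as before, with correlations instead of counts.
    qplot(names(heat.assocs), heat.assocs, geom = "bar", stat = "identity",
          xlab = "Terms", ylab = "Correlation") + coord_flip()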
Which yields:
"heat" Word Associations |
And there we have the word associations for the term "heat". I think that is a nice-looking visual.
Creating a Word Cloud
We are going to continue the visual creation spree, and this time we will create a word cloud. Load the wordcloud package in R and convert the tweet term-document matrix into a regular matrix.
[Image: wordcloud, Matrix Conversion, and Sorted Word Frequencies]
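A sketch of the conversion and sorting step:

    library(wordcloud)

    # Convert the term-document matrix to a plain matrix, then count
    # each term across all tweets and sort in descending order.
    m <- as.matrix(tdm)
    word.freq <- sort(rowSums(m), decreasing = TRUE)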
After we have created a word count of all the terms and sorted them in descending order, we can proceed to making the word cloud. We will set a seed (1234) so that the work is reproducible. We also need to create a gradient of colors (this time in gray), ranging from 0 to 1, for the cloud, based on the frequency of each word. More frequent terms will appear in a darker font in the word cloud.
With the wordcloud() function, we can create the word cloud. We need to specify the words, their frequencies, a minimum frequency for a term to be included, and the colors of the words in the cloud.
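A sketch of the final step, assuming a minimum frequency of 3 for inclusion. The gray() scaling here is one common way to map frequency to darkness; the +10 offset keeps the rarest terms from washing out entirely.

    # Fix the layout so the cloud is reproducible.
    set.seed(1234)

    # Gray levels between 0 (black) and 1 (white). Because word.freq is
    # sorted in decreasing order, the darkest values sit at the end of
    # this vector, and wordcloud() assigns colors from the end of the
    # vector to the most frequent words.
    gray.levels <- gray((word.freq + 10) / (max(word.freq) + 10))

    # Words, frequencies, inclusion cutoff, and the gray gradient.
    wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3,
              random.order = FALSE, colors = gray.levels)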