Predicting Fraudulent Transactions in R: Part 5. Normalized Distance to Typical Price
Hello Readers,
Here we take a quick look at a quality metric for the rankings of unlabeled reports. Previously we relied on labeled reports, which carry the true fraud/non-fraud label, for evaluation. Once we rank all the reports, including those in the training set, there will undoubtedly be unlabeled reports ranked near the top. Should those unlabeled reports be at the top in terms of likelihood of being fraudulent? That is where we use our predictive model to determine the classification of those unlabeled reports based on the labeled reports.
One method compares the unit price of a report with the typical (median) unit price of transactions of that particular product. If the difference between the unit price of that transaction and the typical unit price for that product is large, then that report likely belongs among the top possible fraudulent transactions predicted by the model. Here we provide a method for evaluating the model performance on that basis.
(This is a series from Luis Torgo's Data Mining with R book.)
Normalized Distance to Typical Price
To calculate the normalized distance to typical price (NDTP) of a transaction (report), we take the absolute difference between the unit price of that transaction and the overall median unit price of its product, and divide it by the inter-quartile range of that product's prices.
Below is the code for calculating the NDTP. Starting on line 2, we define the 'avgNDTP' function to accept the arguments 'toInsp', 'train', and 'stats'. 'toInsp' contains the transaction data you want to inspect; 'train' contains the raw transactions from which the typical unit prices are computed; 'stats' contains the pre-calculated typical unit prices (median and IQR) for each product. Only one of 'train' and 'stats' is required. Because we will be calling the 'avgNDTP' function repeatedly, passing the pre-calculated 'stats' is more efficient than recalculating them on every call.
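As a quick illustration of the definition, here is a minimal sketch for a single product and a single transaction. The prices are made-up values, and the median and spread are taken from 'boxplot.stats', so the "IQR" here is the distance between the lower and upper hinges.

```r
# Hypothetical unit prices for one product (illustrative values only)
prices <- c(10, 12, 11, 13, 12, 50)

# boxplot.stats()$stats holds: lower whisker, lower hinge, median,
# upper hinge, upper whisker
bp <- boxplot.stats(prices)$stats
typical <- bp[3]          # median unit price: 12
spread  <- bp[4] - bp[2]  # hinge spread: 13 - 11 = 2

# NDTP of the transaction priced at 50
ndtp <- abs(50 - typical) / spread
ndtp  # 19
```

A transaction priced at the median would score 0, so larger values mean the price is further from what is typical for that product.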
Normalized Distance to Typical Price Code:
 1  # normalized distances of typical prices ####
 2  avgNDTP <- function(toInsp, train, stats) {
 3    if (missing(train) && missing(stats))
 4      stop('Provide either the training data or the product stats')
 5    if (missing(stats)) {
 6      notF <- which(train$Insp != 'fraud')
 7      stats <- tapply(train$Uprice[notF],
 8                      list(Prod = train$Prod[notF]),
 9                      function(x) {
10                        bp <- boxplot.stats(x)$stats
11                        c(median = bp[3], iqr = bp[4] - bp[2])
12                      })
13      stats <- matrix(unlist(stats),
14                      length(stats), 2, byrow = TRUE,
15                      dimnames = list(names(stats), c('median', 'iqr')))
16      stats[which(stats[, 'iqr'] == 0), 'iqr'] <-
17        stats[which(stats[, 'iqr'] == 0), 'median']
18    }
19
20    mdtp <- mean(abs(toInsp$Uprice - stats[toInsp$Prod, 'median']) /
21                 stats[toInsp$Prod, 'iqr'])
22    return(mdtp)
23  }
The if statements check whether sufficient arguments are present and whether the 'stats' input needs to be calculated. If it does, line 6 selects every transaction that is not labeled fraudulent (this includes uninspected reports), and the typical unit price for each product is calculated from those. From there, we calculate the median and IQR starting on line 9. After unlisting the results and reshaping them into the 'stats' matrix on line 13, we replace any IQR of 0 with that product's median value, because we cannot divide by zero.
On line 20, we create the variable 'mdtp' to hold the mean of the normalized distances to the typical price, using the unit prices from 'toInsp' and the median and IQR from either the 'stats' provided or the 'stats' generated from the 'train' argument.
We will use this NDTP metric to evaluate and compare future predictive models, and to see how well they identify fraudulent transactions given the few inspected transactions we have in our dataset.
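To see what the 'stats' matrix ends up looking like, here is a small sketch that runs the same steps on a toy data frame. The column names follow the function; the rows and the label values ('ok', 'unkn', 'fraud') are invented for illustration, and product 'p2' is deliberately given identical prices so that its IQR is zero.

```r
# Toy training data (made-up values)
train <- data.frame(
  Prod   = factor(c("p1", "p1", "p1", "p1", "p2", "p2", "p2", "p1")),
  Uprice = c(10, 12, 14, 16, 5, 5, 5, 100),
  Insp   = c("ok", "unkn", "ok", "unkn", "ok", "ok", "unkn", "fraud")
)

# Keep everything that is not labeled as fraud
notF <- which(train$Insp != "fraud")

# One (median, iqr) pair per product
perProd <- tapply(train$Uprice[notF], list(Prod = train$Prod[notF]),
                  function(x) {
                    bp <- boxplot.stats(x)$stats
                    c(median = bp[3], iqr = bp[4] - bp[2])
                  })

# Reshape the list of pairs into a products-by-2 matrix
stats <- matrix(unlist(perProd), length(perProd), 2, byrow = TRUE,
                dimnames = list(names(perProd), c("median", "iqr")))

# Products with a zero IQR fall back to their median
zero <- which(stats[, "iqr"] == 0)
stats[zero, "iqr"] <- stats[zero, "median"]

stats
# p1: median 13, iqr 4
# p2: median 5,  iqr 5 (was 0 before the replacement)
```

The fraudulent 'p1' report priced at 100 is excluded, so it does not distort the typical price for that product.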
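The averaging step can be sketched on its own with a hand-made 'stats' matrix. All values below are hypothetical. One deliberate difference from the function: the product column is converted with 'as.character' before indexing, as a precaution, since indexing with a factor uses its integer codes rather than its labels.

```r
# Pre-computed typical prices (hypothetical values), one row per product
stats <- matrix(c(13, 4,
                   5, 5),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("p1", "p2"), c("median", "iqr")))

# Hypothetical reports to inspect
toInsp <- data.frame(Prod   = c("p1", "p2", "p1"),
                     Uprice = c(21, 10, 13))

# Look up each report's product row by name
prod <- as.character(toInsp$Prod)
ndtp <- abs(toInsp$Uprice - stats[prod, "median"]) / stats[prod, "iqr"]

# Per-report distances are 2, 1 and 0; their mean is the score
avg <- mean(ndtp)
avg  # 1
```

When comparing models, a higher average over a model's top-ranked reports suggests that it is surfacing transactions whose prices are further from typical.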
So stay tuned for the upcoming predictive model posts!
As always, thanks for reading,