Thursday, 6 August 2015

Running R Scripts from within Rapidminer

One interesting aspect of Rapidminer is that it can connect to R and allow users to run R scripts from within Rapidminer. I am using an older version of Rapidminer (v5.3.1) so that I get most of the operators I require in my work. Moreover, there is no restriction on RAM, and I can use almost 6 GB of my 8 GB RAM under Linux. Configuring R for Rapidminer does involve some pain, but once the configuration is done, R works quite well from within Rapidminer. Once the R plugin is installed and Rapidminer is restarted, it gives detailed instructions on how to configure R for Rapidminer. Be patient and follow the instructions to connect Rapidminer properly with R.

In Linux, when R is updated, the older files are removed; this is not the case in Windows. In Windows, therefore, multiple versions of R can coexist, and that can create a problem when a newer version of R is installed but is not reflected in the system environment (the PATH variable). From here on, I am assuming that you have configured R properly for Rapidminer and you are good to go.
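If you are unsure which R installation the PATH points to, a quick check from an R console can help (plain base R, nothing Rapidminer-specific):

# Where is this R installed, and which R binary does the shell see?
R.home()        # installation directory of the running R
Sys.which("R")  # path of the R executable found on the PATH, "" if none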

This post shows a few simple data analysis steps in which R and Rapidminer work in a collaborative way. The following things are done in this process flow diagram:

1. Data input in Rapidminer
2. Data modification using R
3. Data modeling on modified data (through R scripts) using Rapidminer (PCA and Clustering)
4. Joining the cluster membership obtained from cluster analysis with the dataset produced by PCA
5. Data visualization using Rapidminer
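
For orientation, steps 2-4 can be approximated in plain R as the minimal sketch below (assuming the wholesale data is loaded as wholesale; the actual process uses Rapidminer operators for PCA and clustering, so this is only an analogue, not the process itself):

library(dplyr)
# Step 2: drop the categorical columns and standardize (as the embedded R script does)
WH     = select(wholesale, -Region, -Channel)
WH.std = as.data.frame(scale(WH))
# Step 3: PCA on the spending variables and k-means (k = 3) on the standardized values
# (nstart = 40 loosely mirrors the max_runs = 40 used in the Rapidminer operator)
pca = prcomp(WH)
km  = kmeans(WH.std, centers = 3, nstart = 40)
# Step 4: join cluster membership with the first two principal-component scores
joined = data.frame(id = seq_len(nrow(WH)), pca$x[, 1:2], cluster = km$cluster)
head(joined)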

The process flow diagram is shown below.


I have used the same wholesale data in this demonstration. As can be seen, the process gives three outputs:

1. Cluster analysis output
2. Aggregated data output (w.r.t. Region) from the R script
3. Output for data visualization

The final plot is shown below. The data points look nicely clustered. The green-coloured data points are mostly outliers. A few blue-coloured data points are also outliers, but those are not captured in this plot. Detecting outliers is discussed separately in another post, which you can visit later.


Region-wise mean values are shown below.


The entire XML file for generating this process is given below.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="/path/Wholesale customers data.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="UTF-8"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Channel.true.integer.attribute"/>
          <parameter key="1" value="Region.true.integer.attribute"/>
          <parameter key="2" value="Fresh.true.integer.attribute"/>
          <parameter key="3" value="Milk.true.integer.attribute"/>
          <parameter key="4" value="Grocery.true.integer.attribute"/>
          <parameter key="5" value="Frozen.true.integer.attribute"/>
          <parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
          <parameter key="7" value="Delicassen.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="r:execute_script_r" compatibility="5.3.000" expanded="true" height="112" name="Execute Script (R)" width="90" x="179" y="30">
        <parameter key="script" value="library(dplyr)&#10;# Select all variables except Region and Channel&#10;WH=select(Wholesale,-Region,-Channel)&#10;# Standardized values&#10;WH.std=as.data.frame(scale(WH))&#10;WH.aggre=aggregate(cbind(Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen)~Region,data=Wholesale,mean)&#10;&#10;&#10;"/>
        <enumeration key="inputs">
          <parameter key="name_of_variable" value="Wholesale"/>
        </enumeration>
        <list key="results">
          <parameter key="WH" value="Data Table"/>
          <parameter key="WH.std" value="Data Table"/>
          <parameter key="WH.aggre" value="Data Table"/>
        </list>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.3.015" expanded="true" height="76" name="Clustering" width="90" x="313" y="165">
        <parameter key="k" value="3"/>
        <parameter key="max_runs" value="40"/>
        <parameter key="determine_good_start_values" value="true"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|id|cluster"/>
      </operator>
      <operator activated="true" class="principal_component_analysis" compatibility="5.3.015" expanded="true" height="94" name="PCA" width="90" x="313" y="30">
        <parameter key="variance_threshold" value="0.8"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
      <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join" width="90" x="581" y="120">
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="id" value="id"/>
        </list>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Execute Script (R)" to_port="input 1"/>
      <connect from_op="Execute Script (R)" from_port="output 1" to_op="PCA" to_port="example set input"/>
      <connect from_op="Execute Script (R)" from_port="output 2" to_op="Clustering" to_port="example set"/>
      <connect from_op="Execute Script (R)" from_port="output 3" to_port="result 2"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="PCA" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
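
For readability, the R code embedded in the Execute Script (R) operator above (decoded from the XML) is:

library(dplyr)
# Select all variables except Region and Channel
WH = select(Wholesale, -Region, -Channel)
# Standardized values
WH.std = as.data.frame(scale(WH))
# Region-wise means of the six spending variables
WH.aggre = aggregate(cbind(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) ~ Region,
                     data = Wholesale, mean)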



This analysis is shown for demonstration purposes only. For proper storytelling about the data, a few more analyses are needed. Readers are encouraged to run the above process and experiment further to harness the power of both R and Rapidminer.

Thank you for visiting the site.

Tuesday, 4 August 2015

Detecting outliers using a clustering algorithm

The existence of outliers in a dataset poses many difficulties in data modeling. For example, in regression analysis, a group of outlying data points can seriously distort the slope of the regression line and produce a bad fit. Similarly, in k-means clustering, outliers can drag a centroid away from its natural position to a point that does not represent the cluster characteristics properly. Almost all statistical data analysis techniques are affected by such outlying points. In group comparison using a t-test, the presence of a single outlier can cause a type-II error, which can be even more critical than a type-I error. Thus, effective detection of outliers is an important part of data analysis. In some cases outliers need special treatment, and in other cases they can be safely removed. In this post, I am only going to discuss the detection of outliers, because treatment varies from case to case.
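To make the regression example above concrete, here is a small simulated illustration (a toy sketch, unrelated to the wholesale data):

# One extreme point distorting a regression slope
set.seed(1)
x = 1:20
y = 2 * x + rnorm(20)
y[20] = -40                 # inject a single extreme point
coef(lm(y ~ x))             # slope pulled well below the true value of 2
coef(lm(y[-20] ~ x[-20]))   # slope close to 2 once the outlier is removed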

Outliers can be univariate or multivariate. The box-and-whisker plot is particularly useful for detecting univariate outliers, whereas for multivariate cases other techniques are needed. One such method uses the Mahalanobis distance. Computing the Local Outlier Factor is another way of detecting multivariate outliers; it works on the basis of the nearest-neighborhood concept. A density-based clustering algorithm (DBSCAN) can also be used for detecting such outliers. DBSCAN is a special clustering algorithm which can detect irregularly shaped clusters and is quite robust to noise and outliers. The algorithm detects clusters on the basis of the density-reachability concept, which resembles the nearest-neighborhood concept. DBSCAN needs two parameters to be declared at the start, i.e. MinPts and epsilon, and deciding their optimum values is a problem which needs to be solved first. In this post, I am going to use the same wholesale dataset that I used in my earlier post on cluster analysis. The dataset can be downloaded from here.
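As a quick aside, the Mahalanobis-distance approach mentioned above can be sketched in a few lines of base R (assuming the dataset has been read in as wholesale; the 0.975 chi-square cutoff is one common choice, not a rule):

# Mahalanobis distances on the six spending variables
X = wholesale[,-c(1,2)]
md = mahalanobis(X, center = colMeans(X), cov = cov(X))
cutoff = qchisq(0.975, df = ncol(X))   # assumed chi-square cutoff
head(which(md > cutoff))               # candidate multivariate outliers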

The dataset contains outliers and has 6 variables. Hence, visualizing the data requires a dimension-reduction technique. A simple PCA (Principal Component Analysis) is good enough for this job. The code is given below.

> pca=princomp(wholesale[,-c(1,2)],scores = T)

> pca
Call:
princomp(x = wholesale[, -c(1, 2)], scores = T)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6 
12830.468 12046.640  5008.277  3970.892  2319.592  1482.779 

 6  variables and  440 observations.

> summary(pca)
Importance of components:
                             Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6
Standard deviation     1.283047e+04 1.204664e+04 5.008277e+03 3.970892e+03 2.319592e+03 1.482779e+03
Proportion of Variance 4.596136e-01 4.051723e-01 7.003008e-02 4.402344e-02 1.502212e-02 6.138475e-03
Cumulative Proportion  4.596136e-01 8.647859e-01 9.348160e-01 9.788394e-01 9.938615e-01 1.000000e+00
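
As a quick cross-check, the cumulative proportions reported by summary() can be reproduced directly from the component standard deviations:

cumsum(pca$sdev^2) / sum(pca$sdev^2)  # should match the Cumulative Proportion row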


The first two variables are not taken into consideration as they are categorical in nature. As per the output, the first two components explain 86.4% of the total variance and the first three components explain 93.5%. Thus, for visualization purposes, if the first two components are taken, the plot will give a good approximation of the data distribution. In the above code, scores = T is set so that component scores are made available for further processing. Based on the scores, a 2D scatter plot can now be drawn easily. The code is given below.

library(dplyr)
library(ggplot2)
# Component scores as a data frame
pca.scores=as.data.frame(pca$scores)
# Scatter plot of the first two principal components
pca.scores %>% select(.,Comp.1,Comp.2) %>%
  qplot(Comp.1,Comp.2,data=.)

As per the plot given below, there are quite a large number of outliers.


These outliers need to be identified and, if necessary, removed. We will see how the DBSCAN clustering technique can be used to identify them.

DBSCAN with R

DBSCAN is made available in R via the fpc package. As I mentioned earlier, the two important parameters in DBSCAN are MinPts and epsilon. The theory behind DBSCAN will not be discussed here, as detailed information can be found on many other sites. I shall, rather, try to find out how to decide optimum values of MinPts and epsilon. To do this, we need a distance matrix. The following code generates a few plots which are helpful in deciding (approximate) optimum values of these two parameters.

# Scaled distance matrix
dist.Mat=dist(scale(wholesale[,-c(1,2)]),method = "euclidean")
# Histogram of the pairwise distances (its $breaks and $counts are inspected below)
hist=hist(dist.Mat,breaks = 100,col = "red",main = "Histogram of Distance Matrix",xlab="Distance")

> head(hist$breaks,n=15)
 [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8

> head(hist$counts,n=15)
 [1]   95 1115 2920 4617 5583 6810 7371 7300 7041 6147 5566 4998 4782 4033 3614


The histogram of these distance values is shown below.

Due to the presence of outliers, the histogram is skewed. The peak occurs at a distance of around 1.5. It means that, as the inter-point distance is increased, more and more data points become neighbors until the distance reaches about 1.5, after which the density starts decreasing. In this case we essentially see a single mode, but in other cases this diagram may look multi-modal; in those situations, the first maximum can be chosen as the epsilon value. Here, as shown in the output above, we can choose an epsilon value of 1.5 or 1.4. The next thing is to decide the MinPts value. To get an approximate MinPts value, the following code is run.

# Convert the dist object to a full matrix and reshape to long format
mat=as.matrix(dist.Mat)
mat.long=reshape2::melt(mat)
# Flag pairs whose distance lies within the chosen epsilon (1.5)
mat.long$flag=ifelse(mat.long$value>0.0 & mat.long$value<=1.5,1,0)
library(dplyr)
# For each point, count the number of other points within 1.5
DF=summarise(group_by(mat.long,Var1),count=sum(flag))
hist(DF$count,col="red",main="Histogram of # of points within 1.5",
     xlab="# of points within 1.5")



The histogram is shown below.


As per the histogram, the frequencies show a kind of steady rise up to around 150 and decrease thereafter. Thus, in this case, the MinPts value can be fixed at 150.
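As a cross-check on the epsilon choice, the k-distance plot is another common heuristic: sort every point's distance to its k-th nearest neighbor and look for a knee in the curve. A rough sketch using the dist.Mat computed above (k = 4 is just an illustrative value; k is often tied to MinPts):

# k-distance plot: one common heuristic for choosing epsilon
d.full = as.matrix(dist.Mat)
k = 4
kth.dist = apply(d.full, 1, function(row) sort(row)[k + 1])  # +1 skips the zero self-distance
plot(sort(kth.dist), type = "l",
     xlab = "Points sorted by k-NN distance", ylab = "Distance to k-th nearest neighbor")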

Once we have the approximate values of MinPts and epsilon, we can run DBSCAN with them. Clustering the wholesale data using DBSCAN is quite easy, as shown in the code below.


library(fpc)
# DBSCAN on the raw data; scale = T standardizes it internally
dbscan.fit=dbscan(wholesale[,-c(1,2)],eps = 1.5,MinPts = 150,scale = T,method = "raw")
# Colour the PCA scatter plot by DBSCAN cluster membership
pca.scores %>% select(.,Comp.1,Comp.2) %>%
  qplot(Comp.1,Comp.2,data=.)+geom_point(col=dbscan.fit$cluster+2)


The red points are outliers in the plot given below.


As can be seen, most of the outliers are correctly identified. A few red data points lie within the green data points, and that is due to the third component, which is not shown. Those red points are outliers in the vertical direction, and hence their projections lie within the green points.
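One quick way to verify this claim, using the same objects as above, is to plot the first component against the third with the same colouring; the "vertical" outliers should then separate out:

# Same plot, but against the third component to expose the vertical outliers
pca.scores %>% select(.,Comp.1,Comp.3) %>%
  qplot(Comp.1,Comp.3,data=.)+geom_point(col=dbscan.fit$cluster+2)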

Thus, density-based clustering can be effectively used for outlier detection.
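
For downstream treatment, the indices of the flagged points can be pulled out directly (in fpc, noise points get the cluster label 0):

# Indices of the points DBSCAN labels as noise (cluster 0)
outlier.idx = which(dbscan.fit$cluster == 0)
length(outlier.idx)  # number of points flagged as outliers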

This is just one of the many methods.

Comments are welcome..!!

