Thursday, 6 August 2015

Running R Scripts from within RapidMiner

An interesting aspect of RapidMiner is that it can connect to R, so users can run R scripts from within RapidMiner. I am using an older version of RapidMiner (v5.3.1) because it has most of the operators I need for my work. Moreover, it places no restriction on RAM, and I can use almost 6 GB of my 8 GB of RAM under Linux. Configuring R for RapidMiner does take some effort, but once the configuration is done, R works quite well from within RapidMiner. After the R plugin is installed and RapidMiner is restarted, it gives detailed instructions on how to configure R for RapidMiner. Users need to be patient and follow these instructions to connect RapidMiner to R properly. On Linux, updating R removes the older files, which is not the case on Windows. As a result, several versions of R can coexist on Windows, and that can cause a problem when a newer version of R is installed but not added to the system environment (the PATH variable). From here on, I assume that you have configured R properly for RapidMiner and are good to go.
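Before working through the configuration instructions, it can help to confirm which R installation is actually being picked up, especially on Windows where several versions may coexist. The following is a minimal sketch to run in an R console. It uses base R only. The check for the rJava package reflects my assumption that the extension talks to R through rJava; the instructions shown by RapidMiner remain the authoritative source.

```r
# Version and installation directory of the R session that is running now.
r_version <- R.version.string
r_home <- R.home()

# Directories listed in PATH that contain an R executable ("R" or "R.exe").
path_dirs <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
has_r <- file.exists(file.path(path_dirs, "R")) |
  file.exists(file.path(path_dirs, "R.exe"))
r_on_path <- path_dirs[has_r]

# Assumption: the extension needs the rJava package.
# system.file() returns an empty string when a package is not installed.
rjava_installed <- nzchar(system.file(package = "rJava"))

print(r_version)
print(r_home)
print(r_on_path)
print(rjava_installed)
```

If `r_on_path` lists a directory belonging to an older R version than `r_home`, the PATH variable probably still points at the old installation.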

This post shows a few simple data analysis steps carried out by R and RapidMiner working together. The process flow does the following:

1. Data input in RapidMiner
2. Data modification using R
3. Data modeling in RapidMiner (PCA and clustering) on the data modified by the R script
4. Joining the cluster membership obtained from the cluster analysis with the dataset obtained from PCA
5. Data visualization using RapidMiner
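For readers who want to see what steps 3 and 4 amount to, here is a rough sketch done entirely in base R instead of RapidMiner operators. It is not the process used in this post: `prcomp`, `kmeans` and `merge` stand in for the PCA, Clustering and Join operators, the data frame `WH` is filled with hypothetical random numbers rather than the wholesale data, and the rule for keeping components (the first ones that reach 80% cumulative variance) is my assumption about how the variance threshold behaves.

```r
set.seed(42)

# Hypothetical stand-in for the six numeric wholesale columns (random values).
n <- 60
WH <- data.frame(
  Fresh = rpois(n, 12000), Milk = rpois(n, 5800), Grocery = rpois(n, 7900),
  Frozen = rpois(n, 3000), Detergents_Paper = rpois(n, 2900),
  Delicassen = rpois(n, 1500)
)
WH.std <- as.data.frame(scale(WH))

# Step 3a: PCA on WH, keeping the first components that reach 80% of the
# total variance (assumed reading of variance_threshold = 0.8).
pca <- prcomp(WH)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cum_var >= 0.8)[1]
scores <- data.frame(id = seq_len(n), pca$x[, seq_len(n_comp), drop = FALSE])

# Step 3b: k-means with k = 3 on the standardized data, 40 random starts.
km <- kmeans(WH.std, centers = 3, nstart = 40)
membership <- data.frame(id = seq_len(n), cluster = km$cluster)

# Step 4: join the cluster membership with the PCA scores on the id column.
joined <- merge(membership, scores, by = "id")
head(joined)
```

The `id` column plays the same role as the Generate ID operator: it gives both tables a common key so that each row of PCA scores is matched with its cluster label.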

The process flow diagram is shown below


I have used the same wholesale data in this demonstration. The process gives three outputs:

1. Cluster analysis output
2. Aggregated data output (mean values by Region) from the R script
3. Output for data visualization

The final plot is shown below. The data points look nicely clustered. The green data points are mostly outliers. A few of the blue data points are also outliers, but this plot does not capture them. Capturing outliers is discussed in a separate post, which you can visit later.


Region-wise mean values are shown below.


The entire XML file for generating this process is given below.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="/path/Wholesale customers data.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="UTF-8"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Channel.true.integer.attribute"/>
          <parameter key="1" value="Region.true.integer.attribute"/>
          <parameter key="2" value="Fresh.true.integer.attribute"/>
          <parameter key="3" value="Milk.true.integer.attribute"/>
          <parameter key="4" value="Grocery.true.integer.attribute"/>
          <parameter key="5" value="Frozen.true.integer.attribute"/>
          <parameter key="6" value="Detergents_Paper.true.integer.attribute"/>
          <parameter key="7" value="Delicassen.true.integer.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="r:execute_script_r" compatibility="5.3.000" expanded="true" height="112" name="Execute Script (R)" width="90" x="179" y="30">
        <parameter key="script" value="library(dplyr)&#10;# Select all variables except Region and Channel&#10;WH=select(Wholesale,-Region,-Channel)&#10;# Standardized values&#10;WH.std=as.data.frame(scale(WH))&#10;WH.aggre=aggregate(cbind(Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen)~Region,data=Wholesale,mean)&#10;&#10;&#10;"/>
        <enumeration key="inputs">
          <parameter key="name_of_variable" value="Wholesale"/>
        </enumeration>
        <list key="results">
          <parameter key="WH" value="Data Table"/>
          <parameter key="WH.std" value="Data Table"/>
          <parameter key="WH.aggre" value="Data Table"/>
        </list>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.3.015" expanded="true" height="76" name="Clustering" width="90" x="313" y="165">
        <parameter key="k" value="3"/>
        <parameter key="max_runs" value="40"/>
        <parameter key="determine_good_start_values" value="true"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|id|cluster"/>
      </operator>
      <operator activated="true" class="principal_component_analysis" compatibility="5.3.015" expanded="true" height="94" name="PCA" width="90" x="313" y="30">
        <parameter key="variance_threshold" value="0.8"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
      <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join" width="90" x="581" y="120">
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="id" value="id"/>
        </list>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Execute Script (R)" to_port="input 1"/>
      <connect from_op="Execute Script (R)" from_port="output 1" to_op="PCA" to_port="example set input"/>
      <connect from_op="Execute Script (R)" from_port="output 2" to_op="Clustering" to_port="example set"/>
      <connect from_op="Execute Script (R)" from_port="output 3" to_port="result 2"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
      <connect from_op="PCA" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
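
The `script` parameter of the Execute Script (R) operator stores the R code on a single line, with `&#10;` standing for line breaks. Unescaped, the script reads roughly as below. The one addition is a guarded stand-in for `Wholesale`, filled with hypothetical random values, so that the snippet can also be tried outside RapidMiner; inside the process, the operator supplies `Wholesale` from its input. The dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical stand-in, used only when no Wholesale object exists
# (inside RapidMiner the operator provides it from the input port).
if (!exists("Wholesale")) {
  set.seed(1)
  n <- 30
  Wholesale <- data.frame(
    Channel = sample(1:2, n, replace = TRUE),
    Region = rep(1:3, length.out = n),
    Fresh = rpois(n, 12000), Milk = rpois(n, 5800), Grocery = rpois(n, 7900),
    Frozen = rpois(n, 3000), Detergents_Paper = rpois(n, 2900),
    Delicassen = rpois(n, 1500)
  )
}

# Select all variables except Region and Channel
WH <- select(Wholesale, -Region, -Channel)

# Standardized values
WH.std <- as.data.frame(scale(WH))

# Mean of each product category per Region
WH.aggre <- aggregate(
  cbind(Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) ~ Region,
  data = Wholesale, mean
)
```

The three objects `WH`, `WH.std` and `WH.aggre` are the ones listed under `results` in the XML, each returned to RapidMiner as a data table.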



This analysis is shown for demonstration purposes only. Telling a proper story about the data requires a few more analyses. Readers are encouraged to run the process above and experiment further to harness the power of both R and RapidMiner.

Thank you for visiting the site.

