Friday, 22 February 2019

Working with Python and R from within Rapid Miner

Rapid Miner is a very powerful data mining software having quite a lot of operators which can perform varied operations on data. Two quite interesting operators are 'Execute R' and 'Execute Python' operators. If an analyst is aware of both the scripting languages, he/she can easily call these two operators from within Rapid Miner and perform tasks which are not so easily accomplished by Rapid Miner. In this blog I am going to demonstrate how anyone can read a data set using R, perform missing data imputation using EM algorithm (using R), Generalized Linear Model (using Rapid Miner) and MICE (using Python) and then perform 5 fold cross validation with k-NN (using Rapid Miner).

A few critical packages must be available in the native R and Python installation. Rapid Miner tries to automatically detect the presence of R and Python but at times it might be required to connect Rapid Miner with proper R and Python executable. For this purpose, it is important to go to Settings-->Preferences and from there check the path of R and Python. If everything is okay, R and Python scripts should run without problems. People who are working with Windows, may require to install a few additional packages for proper working of python in Windows. If you are working with Linux, you should not face too much of trouble. I work with Ubuntu 18.04 LTS. My Rapid Miner is having version 9.2. My R and Python versions are 3.4.4 and 3.6 respectively. The following libraries would be required (with all dependencies) in R and Python:

RKEEL, mclust in R
fancyimpute (requires TensorFlow) in Python

Latest version of fancyimpute is not having MICE, rather it has 'IterativeImputer'. Look at this for the proper use of this imputer.

For the task in hand, the data set has been taken from KEEL data set repository and the marketing data set 'marketing.dat' dataset has been chosen. The data set has almost 9K instances with 23% missing values. The data set has 14 variables which are all categorical in nature. However, the categories are all encoded to numbers. As the article is meant to showcase the power of Rapid Miner in combining the power of R and Python, no detailed analysis of the data set will be performed here. Since Rapid Miner is not having an operator to read a data set in KEEL format, the data is read using R by calling the 'RKEEL' library. Rapid Miner has easy to use cross-validation operator which is used to run 5 fold cross validation of a k-NN model having k=27. The process flow diagram is shown below:



In this analysis, the objective is to predict income class after imputing the missing values. This is a 9 class classification problem and it is difficult to attain good accuracy with single model. Powerful ensemble models (GBM, Random Forest, XgBoost etc.) would be required to improve the accuracy. But that would be covered in a separate post. The XML codes of the entire process in embedded below so that anyone can recreate the entire process.

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="free_memory" compatibility="9.2.000" expanded="true" height="68" name="Free Memory" width="90" x="45" y="340"/>
      <operator activated="true" class="r_scripting:execute_r" compatibility="9.1.000" expanded="true" height="82" name="Read Keel Dataset" width="90" x="45" y="136">
        <parameter key="script" value="# rm_main is a mandatory function, 
# the number of arguments has to be the number of input ports (can be none)
rm_main = function()
{
    library('RKEEL')
    d = read.keel(file = '/home/subhasis/Dataset/KEEL Dataset/marketing.dat')
    is.na(d)=d=="NA"
    d1 = as.data.frame(sapply(d,as.numeric))
    return(list(d1))
}
"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="124" name="Multiply" width="90" x="179" y="136"/>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="103" name="MICE Imputation" width="90" x="313" y="289">
        <parameter key="script" value="import pandas as pd
from fancyimpute import MICE

# rm_main is a mandatory function, 
# the number of arguments has to be the number of input ports (can be none)
def rm_main(data):
    mice = MICE()
    d = mice.complete(data.iloc[:,:13])
    d = pd.DataFrame(d)
    d['Income']=data['Income']
    d.columns = data.columns
    return d"/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <operator activated="true" class="r_scripting:execute_r" compatibility="9.1.000" expanded="true" height="103" name="EM based Imputation" width="90" x="313" y="34">
        <parameter key="script" value="# rm_main is a mandatory function, 
# the number of arguments has to be the number of input ports (can be none)
rm_main = function(data)
{
    library(mclust)
    d2 = imputeData(data[,1:13], seed=12345)
    d2 = as.data.frame(d2)
    d2$Income = data$Income
    return(list(d2))
}
"/>
      </operator>
      <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Income"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role (2)" width="90" x="581" y="34">
        <parameter key="attribute_name" value="Income"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="715" y="34">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="5"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="9.2.000" expanded="true" height="82" name="k-NN" width="90" x="45" y="34">
            <parameter key="k" value="27"/>
            <parameter key="weighted_vote" value="true"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
          </operator>
          <connect from_port="training set" to_op="k-NN" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="9.2.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="true"/>
            <parameter key="kappa" value="true"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="cross-entropy" value="false"/>
            <parameter key="margin" value="false"/>
            <parameter key="soft_margin_loss" value="false"/>
            <parameter key="logistic_loss" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal (2)" width="90" x="447" y="289">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Income"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role (3)" width="90" x="581" y="289">
        <parameter key="attribute_name" value="Income"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.2.000" expanded="true" height="145" name="Cross Validation (3)" width="90" x="715" y="340">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="5"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="9.2.000" expanded="true" height="82" name="k-NN (2)" width="90" x="112" y="34">
            <parameter key="k" value="27"/>
            <parameter key="weighted_vote" value="false"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
          </operator>
          <connect from_port="training set" to_op="k-NN (2)" to_port="training set"/>
          <connect from_op="k-NN (2)" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="9.2.000" expanded="true" height="82" name="Performance (3)" width="90" x="179" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="true"/>
            <parameter key="kappa" value="true"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="cross-entropy" value="false"/>
            <parameter key="margin" value="false"/>
            <parameter key="soft_margin_loss" value="false"/>
            <parameter key="logistic_loss" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
          <connect from_op="Performance (3)" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="numerical_to_polynominal" compatibility="9.2.000" expanded="true" height="82" name="Numerical to Polynominal (3)" width="90" x="313" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Income"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="447" y="136">
        <parameter key="attribute_name" value="Income"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="impute_missing_values" compatibility="9.2.000" expanded="true" height="68" name="GLM based Imputation" width="90" x="581" y="136">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="iterate" value="true"/>
        <parameter key="learn_on_complete_cases" value="true"/>
        <parameter key="order" value="chronological"/>
        <parameter key="sort" value="ascending"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
          <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="45" y="34">
            <parameter key="family" value="AUTO"/>
            <parameter key="link" value="family_default"/>
            <parameter key="solver" value="AUTO"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_regularization" value="true"/>
            <parameter key="lambda_search" value="false"/>
            <parameter key="number_of_lambdas" value="0"/>
            <parameter key="lambda_min_ratio" value="0.0"/>
            <parameter key="early_stopping" value="true"/>
            <parameter key="stopping_rounds" value="3"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="standardize" value="true"/>
            <parameter key="non-negative_coefficients" value="false"/>
            <parameter key="add_intercept" value="true"/>
            <parameter key="compute_p-values" value="false"/>
            <parameter key="remove_collinear_columns" value="false"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_iterations" value="0"/>
            <parameter key="specify_beta_constraints" value="false"/>
            <list key="beta_constraints"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <connect from_port="example set source" to_op="Generalized Linear Model" to_port="training set"/>
          <connect from_op="Generalized Linear Model" from_port="model" to_port="model sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_model sink" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.2.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="715" y="187">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="5"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="k_nn" compatibility="9.2.000" expanded="true" height="82" name="k-NN (3)" width="90" x="112" y="34">
            <parameter key="k" value="27"/>
            <parameter key="weighted_vote" value="false"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
          </operator>
          <connect from_port="training set" to_op="k-NN (3)" to_port="training set"/>
          <connect from_op="k-NN (3)" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="45" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="9.2.000" expanded="true" height="82" name="Performance (2)" width="90" x="179" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="true"/>
            <parameter key="kappa" value="true"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="cross-entropy" value="false"/>
            <parameter key="margin" value="false"/>
            <parameter key="soft_margin_loss" value="false"/>
            <parameter key="logistic_loss" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Keel Dataset" from_port="output 1" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="EM based Imputation" to_port="input 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Numerical to Polynominal (3)" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 3" to_op="MICE Imputation" to_port="input 1"/>
      <connect from_op="MICE Imputation" from_port="output 1" to_op="Numerical to Polynominal (2)" to_port="example set input"/>
      <connect from_op="EM based Imputation" from_port="output 1" to_op="Numerical to Polynominal" to_port="example set input"/>
      <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
      <connect from_op="Numerical to Polynominal (2)" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Cross Validation (3)" to_port="example set"/>
      <connect from_op="Cross Validation (3)" from_port="performance 1" to_port="result 3"/>
      <connect from_op="Numerical to Polynominal (3)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="GLM based Imputation" to_port="example set in"/>
      <connect from_op="GLM based Imputation" from_port="example set out" to_op="Cross Validation (2)" to_port="example set"/>
      <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>


I hope you would find this post interesting and helpful.

EM Algorithm and its usage (Part 2) EM algorithm is discussed in the previous post related to the tossing of coins. The same algorithm is q...