
MIP-588 linear regression cross validation

Created by: jassak

  • Implementation of a class KFold for splitting an experiment's dataset into train/test sets according to k-fold cross-validation.
  • Implementation of CV for Linear Regression. Closes MIP-588.
    • Some fixes were needed for this: the computation of n_obs is now split into n_obs_train and n_obs_test.
    • The CV implementation is currently inefficient for two reasons:
      1. The implementation of KFold makes many calls to run_udf_on_local_nodes because of limitations in the current UDF generator.
      2. Cross-validation is fully parallelizable and should run asynchronously, but it currently runs synchronously.
  • UDF generator new feature: when a UDF takes a relational table as input, the row_id is passed as well and it is used as the index of the corresponding dataframe. Closes MIP-536.
  • UDF generator new feature: a new constant DEFERRED can now be passed instead of an output relation's schema. This allows the user to defer the declaration of the schema until runtime. The desired schema must then be passed as an extra argument to run_udf_on_local_nodes or run_udf_on_global_node. Closes MIP-610.
  • UDF generator enhancement: remove the variable name prefix in column names of relations/dataframes.
  • Test case generator enhancement: when a generated input makes no sense, the implementation of compute_expected_output can skip it by simply returning None.
  • An important issue arose during testing, see MIP-634.
  • In the GitHub Actions job run_tests_in_five_nodes, the number of tests was reduced to 10 per algorithm because of limited resources.
  • Various small fixes
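
For reviewers unfamiliar with the splitting scheme, here is a minimal single-machine sketch of k-fold cross-validation for a one-variable linear regression. It is not the code in this merge request: the names kfold_indices, fit_simple_ols and cross_validate are hypothetical, no UDFs or federated nodes are involved, and the folds are contiguous and unshuffled, which is an assumption. It only illustrates why each fold reports its own n_obs_train and n_obs_test.

```python
from statistics import mean


def kfold_indices(n_rows, n_splits):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper, not the KFold class from this merge request.
    """
    if not 2 <= n_splits <= n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cross_validate(x, y, n_splits):
    """Return one dict per fold with n_obs_train, n_obs_test and test RMSE."""
    results = []
    for train_idx, test_idx in kfold_indices(len(x), n_splits):
        # Fit on the training rows only.
        intercept, slope = fit_simple_ols(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        # Evaluate on the held-out rows of this fold.
        sq_errors = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx]
        results.append({
            "n_obs_train": len(train_idx),
            "n_obs_test": len(test_idx),
            "rmse": mean(sq_errors) ** 0.5,
        })
    return results
```

Calling cross_validate(x, y, n_splits=3) on 10 rows gives test folds of 4, 3 and 3 rows. The loop over folds in this sketch is sequential, and the folds do not depend on each other, so they could in principle be evaluated concurrently.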
