MIP-588 linear regression cross validation
Created by: jassak
- Implementation of a class `KFold` for splitting an experiment's dataset into train/test sets according to k-fold cross-validation.
- Implementation of CV for Linear Regression. Closes MIP-588.
  - Some fixes were needed for this: the computation of `n_obs` is now split into `n_obs_train` and `n_obs_test`.
  - The CV implementation is currently inefficient for two reasons:
    - The implementation of `KFold` makes many calls to `run_udf_on_local_nodes` due to limitations in the current UDF generator.
    - Cross-validation is completely parallelizable and should be done asynchronously, but it is currently done synchronously.
- UDF generator new feature: when a UDF takes a relational table as input, the `row_id` is passed as well and is used as the index of the corresponding dataframe. Closes MIP-536.
- UDF generator new feature: a new constant `DEFERRED` can now be passed instead of an output relation's schema. This allows the user to defer the declaration of the schema until runtime. The desired schema must then be passed as an extra argument to `run_udf_on_local_nodes` or `run_udf_on_global_node`. Closes MIP-610.
- UDF generator enhancement: remove the variable name prefix in column names of relations/dataframes.
- Test case generator enhancement: when a generated input makes no sense, the implementer of `compute_expected_output` can skip it by simply returning `None`.
- An important issue arose during testing, see MIP-634.
- In the GitHub Actions job `run_tests_in_five_nodes`, tests were decreased to 10 per algorithm due to a lack of resources.
- Various small fixes
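
For reviewers less familiar with the splitting scheme, here is a minimal single-machine sketch of k-fold index splitting. It is not the `KFold` class from this PR, which works on node-local tables through UDFs; the function `kfold_indices` and its parameters are hypothetical names used only for illustration. It is meant to show why the observation count has to be tracked per fold as `n_obs_train` and `n_obs_test` rather than as a single `n_obs`.

```python
from typing import List, Tuple


def kfold_indices(n_obs: int, n_splits: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) row-index pairs for plain, unshuffled k-fold CV.

    Hypothetical helper for illustration only; not part of the PR's code.
    """
    if not 2 <= n_splits <= n_obs:
        raise ValueError("n_splits must be between 2 and n_obs")

    # Spread the remainder over the first folds so that fold sizes
    # differ by at most one row.
    base, extra = divmod(n_obs, n_splits)

    folds = []
    start = 0
    for i in range(n_splits):
        stop = start + base + (1 if i < extra else 0)
        test = list(range(start, stop))
        # Every row outside the current test fold is used for training.
        train = list(range(0, start)) + list(range(stop, n_obs))
        folds.append((train, test))
        start = stop
    return folds


if __name__ == "__main__":
    for train, test in kfold_indices(n_obs=10, n_splits=3):
        # The two counts differ per fold, hence the n_obs split.
        n_obs_train, n_obs_test = len(train), len(test)
        print(n_obs_train, n_obs_test)
```

Each fold is independent of the others, which is why the per-fold fit/predict steps could in principle run concurrently, as noted in the inefficiency item above.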