HowTo: Use easyPheno as a pip package

In this Jupyter notebook, we show how you can use easyPheno as a pip package and guide you through the steps that easyPheno performs when triggering an optimization run.

Please clone the whole GitHub repository if you want to run this tutorial on your own, as we need the tutorial data it contains and want to make sure that all paths we define are correct: git clone https://github.com/grimmlab/easyPheno.git

Then, start a Jupyter notebook server on your machine and open this Jupyter notebook, which is placed at docs/source/tutorials in the repository.

However, you could also download the individual files and define the paths yourself:
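
For instance, the paths could be defined similar to the following sketch; these locations are hypothetical and need to be adjusted to wherever you stored the files:

[ ]:
# Sketch with hypothetical paths - adjust them to your setup
import pathlib

data_dir = pathlib.Path.home().joinpath('easypheno_tutorial', 'tutorial_data')  # downloaded tutorial data
save_dir = pathlib.Path.home().joinpath('easypheno_tutorial')  # base directory for the results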

Installation, imports and paths

First, we may need to install easyPheno (uncomment the line in the cell below if it is not already installed). Then, we import easyPheno as well as further libraries that we need in this tutorial. Finally, we define some paths and filenames that we will use repeatedly throughout this tutorial. We will save the results in the same directory in which this repository is placed.

[1]:
# !pip3 install easypheno
[ ]:
import easypheno
import pathlib
import pandas as pd
import datetime
import pprint
[7]:
# Definition of paths and filenames
cwd = pathlib.Path.cwd()  # docs/source/tutorials within the cloned repository
data_dir = cwd.joinpath('tutorial_data')  # tutorial data shipped with the repository
save_dir = cwd.parents[3]  # directory in which the repository is placed
genotype_matrix = 'x_matrix.csv'
phenotype_matrix = 'y_matrix.csv'
phenotype = 'continuous_values'

Run whole optimization pipeline at once

As shown for the Docker workflow, easyPheno offers a function optim_pipeline.run() that triggers the whole optimization run.

In the definition of optim_pipeline.run(), we set several default values. In order to run it on our tutorial data, we just need to define the data and directories we want to use as well as the models we want to optimize. Furthermore, we set datasplit and n_trials to limit the waiting time for the results.

When calling the function, we first see some information regarding the data preprocessing and the configuration of our optimization run, e.g. the data that is used. Then, the current progress of the Optuna optimization with the results of the individual trials is shown. Finally, a summary of the whole optimization run is printed.

[8]:
easypheno.optim_pipeline.run(
    data_dir=data_dir, genotype_matrix=genotype_matrix, phenotype_matrix=phenotype_matrix, phenotype=phenotype,
    save_dir=save_dir, models=['xgboost'], n_trials=10, datasplit='cv-test'
)
Check if all data files have the required format
Genotype file not in required format. Will load genotype matrix and save as .h5 file. Will also create required index file.
Load genotype file /home/fhaselbeck/PycharmProjects/easyPheno/docs/source/tutorials/tutorial_data/x_matrix.csv
Save unified genotype file /home/fhaselbeck/PycharmProjects/easyPheno/docs/source/tutorials/tutorial_data/x_matrix.h5
Have genotype matrix. Load phenotype continuous_values from /home/fhaselbeck/PycharmProjects/easyPheno/docs/source/tutorials/tutorial_data/y_matrix.csv
Have phenotype vector. Start matching genotype and phenotype.
Done matching genotype and phenotype. Create index file now.
Done checking data files. All required datasets are available.
----- Starting dataset preparation -----
Load and match raw data
Apply MAF filter
Filter duplicate SNPs
Check if final snp_ids already exist in index_file for used encoding and maf percentage. Save them if necessary.
Load datasplit file
Checked datasplit for all folds.
+++++++++++ CONFIG INFORMATION +++++++++++
Genotype Matrix: x_matrix.csv
Phenotype Matrix: y_matrix.csv
Phenotype: continuous_values
Encoding: 012
Models: xgboost
Optuna Trials: 10
Datasplit: cv-test (5-20)
MAF: 0
Dataset Infos
- Task detected: regression
- No. of samples: 286, No. of features: 590
- Encoding: 012
- Target variable statistics:
count    286.000000
mean      84.542832
std       20.141244
min       53.000000
25%       69.250000
50%       78.750000
75%       97.750000
max      157.500000
Name: 0, dtype: float64
++++++++++++++++++++++++++++++++++++++++++++
### Starting Optuna Optimization for xgboost ###
[I 2022-10-04 16:07:07,575] A new study created in RDB with name: 2022-10-04_16-07-04_x_matrix-y_matrix-continuous_values-MAF0-SPLITcv-test5-20-MODELxgboost-TRIALS10
Params for Trial 0
{'n_estimators': 2500, 'learning_rate': 0.07500000000000001, 'max_depth': 3, 'gamma': 300, 'subsample': 0.45, 'colsample_bytree': 0.35000000000000003, 'reg_alpha': 290.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:07:15,020] Trial 0 finished with value: 369.22728277444065 and parameters: {'n_estimators': 2500, 'learning_rate': 0.07500000000000001, 'max_depth': 3, 'gamma': 300, 'subsample': 0.45, 'colsample_bytree': 0.35000000000000003, 'reg_alpha': 290.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 1
{'n_estimators': 3000, 'learning_rate': 0.3, 'max_depth': 9, 'gamma': 300, 'subsample': 0.1, 'colsample_bytree': 0.55, 'reg_alpha': 440.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:07:25,321] Trial 1 finished with value: 454.8753047090302 and parameters: {'n_estimators': 3000, 'learning_rate': 0.3, 'max_depth': 9, 'gamma': 300, 'subsample': 0.1, 'colsample_bytree': 0.55, 'reg_alpha': 440.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 2
{'n_estimators': 2250, 'learning_rate': 0.2, 'max_depth': 10, 'gamma': 80, 'subsample': 0.2, 'colsample_bytree': 0.05, 'reg_alpha': 320.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:07:32,860] Trial 2 finished with value: 398.20160514328586 and parameters: {'n_estimators': 2250, 'learning_rate': 0.2, 'max_depth': 10, 'gamma': 80, 'subsample': 0.2, 'colsample_bytree': 0.05, 'reg_alpha': 320.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 3
{'n_estimators': 2000, 'learning_rate': 0.225, 'max_depth': 8, 'gamma': 770, 'subsample': 0.1, 'colsample_bytree': 0.3, 'reg_alpha': 110.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:07:41,691] Trial 3 finished with value: 369.6724074334373 and parameters: {'n_estimators': 2000, 'learning_rate': 0.225, 'max_depth': 8, 'gamma': 770, 'subsample': 0.1, 'colsample_bytree': 0.3, 'reg_alpha': 110.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 4
{'n_estimators': 1750, 'learning_rate': 0.25, 'max_depth': 6, 'gamma': 520, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.05, 'reg_alpha': 100.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:07:49,667] Trial 4 finished with value: 364.574017351775 and parameters: {'n_estimators': 1750, 'learning_rate': 0.25, 'max_depth': 6, 'gamma': 520, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.05, 'reg_alpha': 100.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 5
{'n_estimators': 2750, 'learning_rate': 0.2, 'max_depth': 9, 'gamma': 810, 'subsample': 0.15000000000000002, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 540.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:08:01,195] Trial 5 finished with value: 452.7427816744316 and parameters: {'n_estimators': 2750, 'learning_rate': 0.2, 'max_depth': 9, 'gamma': 810, 'subsample': 0.15000000000000002, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 540.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 6
{'n_estimators': 100, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 520, 'subsample': 0.6000000000000001, 'colsample_bytree': 0.3, 'reg_alpha': 980.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:08:06,452] Trial 6 finished with value: 447.1631958152293 and parameters: {'n_estimators': 100, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 520, 'subsample': 0.6000000000000001, 'colsample_bytree': 0.3, 'reg_alpha': 980.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 7
{'n_estimators': 50, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 670, 'subsample': 0.6500000000000001, 'colsample_bytree': 0.2, 'reg_alpha': 730.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:08:10,461] Trial 7 finished with value: 420.049507545245 and parameters: {'n_estimators': 50, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 670, 'subsample': 0.6500000000000001, 'colsample_bytree': 0.2, 'reg_alpha': 730.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 8
{'n_estimators': 1000, 'learning_rate': 0.2, 'max_depth': 3, 'gamma': 690, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 130.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:08:19,589] Trial 8 finished with value: 373.9128450040709 and parameters: {'n_estimators': 1000, 'learning_rate': 0.2, 'max_depth': 3, 'gamma': 690, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 130.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 9
{'n_estimators': 250, 'learning_rate': 0.125, 'max_depth': 5, 'gamma': 730, 'subsample': 0.7500000000000001, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 780.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:08:28,648] Trial 9 finished with value: 419.9258576836199 and parameters: {'n_estimators': 250, 'learning_rate': 0.125, 'max_depth': 5, 'gamma': 730, 'subsample': 0.7500000000000001, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 780.0}. Best is trial 4 with value: 364.574017351775.
## Optuna Study finished ##
Study statistics:
  Finished trials:  10
  Pruned trials:  0
  Completed trials:  10
  Best Trial:  4
  Value:  364.574017351775
  Params:
    colsample_bytree: 0.05
    gamma: 520
    learning_rate: 0.25
    max_depth: 6
    n_estimators: 1750
    reg_alpha: 100.0
    subsample: 0.35000000000000003
## Retrain best model and test ##
## Results on test set ##
{'test_mse': 393.8179197745938, 'test_rmse': 19.84484617664228, 'test_r2_score': 0.05719242798046553, 'test_explained_variance': 0.05724940832753167}
### Finished Optuna Optimization for xgboost ###
# Optimization runs done for models ['xgboost']
Results overview on the test set(s)
{'xgboost': {'Test': {'best_params': {'colsample_bytree': 0.05,
                                      'gamma': 520,
                                      'learning_rate': 0.25,
                                      'max_depth': 6,
                                      'n_estimators': 1750,
                                      'reg_alpha': 100.0,
                                      'subsample': 0.35000000000000003},
                      'eval_metrics': {'test_explained_variance': 0.05724940832753167,
                                       'test_mse': 393.8179197745938,
                                       'test_r2_score': 0.05719242798046553,
                                       'test_rmse': 19.84484617664228},
                      'runtime_metrics': {'process_time_max': 36.719140110000005,
                                          'process_time_mean': 20.092465339700002,
                                          'process_time_min': 1.2973512780000078,
                                          'process_time_std': 13.062248523467469,
                                          'real_time_max': 10.839427947998049,
                                          'real_time_mean': 7.2125649690628055,
                                          'real_time_min': 3.837096929550171,
                                          'real_time_std': 2.0072883008924216}}}}

Within the defined save_dir, a results folder will be created.

Then, easyPheno’s default folder structure follows: name_of_genotype_matrix/name_of_phenotype_matrix/phenotype/. Consequently, all phenotype matrices assigned to the same genotype matrix are gathered in the same subdirectory (name_of_genotype_matrix/). The same applies to all phenotypes assigned to the same phenotype matrix (name_of_genotype_matrix/name_of_phenotype_matrix/).

We can see this structure below with all optimization results for the defined phenotype.

[9]:
result_folders = list(save_dir.joinpath('results', genotype_matrix.split('.')[0], phenotype_matrix.split('.')[0], phenotype).glob('*'))
for results_dir in result_folders:
    print(results_dir)
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04

These result folder names contain information on the datasplit, i.e. its type and parameters: in case of cv-test, the part 5-20 means 5 folds for the cross-validation and a test set consisting of 20 percent of the data. Furthermore, we see the MAF filter that was applied (MAF0), the models that were optimized and a time stamp.
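
As a hypothetical illustration (such a helper is not part of easyPheno), a folder name of a single-model run could be decomposed as follows:

[ ]:
# Hypothetical sketch: decompose a result folder name into its components
folder_name = 'cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04'
datasplit_type, split_params, maf, model, date, time = folder_name.split('_')
print(datasplit_type)  # cv-test
print(split_params)    # 5-20, i.e. 5 folds and a 20 percent test set
print(maf)             # MAF0, i.e. a MAF filter of 0 percent was applied
print(model)           # xgboost (multiple models are joined with '+')
print(date, time)      # time stamp of the optimization run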

In the example below, we can see that each result folder contains a Results_overview_*.csv as well as detailed results for each of the optimized models. In case of nested-cv, these are preceded by a subfolder for each of the outer folds.

[10]:
result_elements = list(result_folders[0].glob('*'))
for result_element in result_elements:
    print(result_element)
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/xgboost
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/Results_overview_xgboost.csv

The Results_overview_*.csv file contains the best parameters as well as evaluation and runtime metrics for each of the optimized models, as we can see in the example below.

[11]:
results_overview_file = [overview_file for overview_file in result_elements if 'Results_overview' in str(overview_file)][0]
pd.read_csv(results_overview_file)
[11]:
Unnamed: 0 xgboost___best_params xgboost___eval_metrics xgboost___runtime_metrics
0 Test [{'colsample_bytree': 0.05, 'gamma': 520, 'lea... [{'test_mse': 393.8179197745938, 'test_rmse': ... [{'process_time_mean': 20.092465339700002, 'pr...
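
As the metric columns are stored as strings, here is a small hedged sketch for parsing them back into Python objects, assuming the cells contain Python literals as displayed above:

[ ]:
# Sketch, assuming the columns hold Python literal strings as displayed above
import ast

overview = pd.read_csv(results_overview_file)
eval_metrics = ast.literal_eval(overview['xgboost___eval_metrics'][0])
print(eval_metrics)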

Beyond that, we see below that the detailed results for each optimized model contain validation and test results, saved prediction models, an Optuna database, a runtime overview with information for each trial (useful for debugging, as pruning reasons are also documented) and, for some prediction models, feature importances.

[12]:
for subdir in [overview_file for overview_file in result_elements if 'Results_overview' not in str(overview_file)][0].rglob('*'):
    print(subdir)
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/xgboost/Optuna_DB.db
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/xgboost/validation_results_trial4.csv
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/xgboost/xgboost_runtime_overview.csv
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/xgboost/unfitted_model_trial4
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/xgboost/final_model_test_results.csv
/home/fhaselbeck/PycharmProjects/results/x_matrix/y_matrix/continuous_values/cv-test_5-20_MAF0_xgboost_2022-10-04_16-07-04/xgboost/final_model_feature_importances.csv
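
Beyond the CSV files, the Optuna database can be reloaded for further analyses. A minimal sketch using optuna.load_study(), with the study name copied from the log output at the beginning of this optimization run:

[ ]:
# Sketch: reload the persisted Optuna study from its SQLite database
import optuna

storage = 'sqlite:///' + str(result_folders[0].joinpath('xgboost', 'Optuna_DB.db'))
study = optuna.load_study(
    study_name='2022-10-04_16-07-04_x_matrix-y_matrix-continuous_values-MAF0-SPLITcv-test5-20-MODELxgboost-TRIALS10',
    storage=storage
)
print(study.best_trial.params)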

Single elements of the optimization pipeline

For a better understanding of the whole optimization pipeline, we subsequently show some of the individual elements that are called within optim_pipeline.run().

First, optim_pipeline.run() contains some functions to check the specified arguments, which we will skip for this tutorial. However, we need to define some of the default values and create pathlib.Path objects.

[13]:
data_dir = pathlib.Path(data_dir)  # ensure we have pathlib.Path objects
save_dir = pathlib.Path(save_dir)
datasplit = 'cv-test'
n_innerfolds = 5  # 5 folds for the cross-validation
test_set_size_percentage = 20  # 20 percent of the data as test set
maf_percentage = 0  # no MAF filtering
models = ['xgboost']
n_trials = 10  # number of Optuna trials

The first step of the optimization pipeline is the preparation of the raw data files using easypheno.preprocess.raw_data_functions.prepare_data_files(). If their format matches our Data Guide, the raw data files are preprocessed as follows.

If a genotype matrix is used for the first time, it is converted to a unified .h5 file, which is saved with the same name as the raw file.

The phenotype matrix is only checked for a valid format, but not saved in a different format.

An index file containing indices for filtering the data (e.g. MAF and duplicate filtering) and for creating the data splits is saved, or updated in case it already exists and a datasplit that is currently not present in the file is requested. This ensures reproducibility of the preprocessing and the data splits.

[14]:
easypheno.preprocess.raw_data_functions.prepare_data_files(
    data_dir=data_dir, genotype_matrix_name=genotype_matrix, phenotype_matrix_name=phenotype_matrix,
    phenotype=phenotype, datasplit=datasplit, n_outerfolds=5, n_innerfolds=n_innerfolds,
    test_set_size_percentage=test_set_size_percentage, val_set_size_percentage=20,
    models=models, user_encoding=None, maf_percentage=maf_percentage
)
Check if all data files have the required format
Found same file name with ending .h5
Assuming that the raw file was already prepared using our pipepline. Will continue with the .h5 file.
Genotype file available in required format, check index file now.
Index file x_matrix-y_matrix-continuous_values.h5 already exists. Will append required filters and data splits now.
Done checking data files. All required datasets are available.
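
As a hedged side note, the contents of the index file mentioned in the output could be inspected with h5py (assuming it is installed and that the file is stored in data_dir, as the log output above suggests):

[ ]:
# Sketch: list the names of all groups and datasets stored in the index file
import h5py

with h5py.File(data_dir.joinpath('x_matrix-y_matrix-continuous_values.h5'), 'r') as index_file:
    index_file.visit(print)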

After setting all seeds for reproducibility using easypheno.utils.helper_functions.set_all_seeds(), the model for the current optimization run is selected. Its standard_encoding is then retrieved in case the user did not define an encoding.

With this information, the easypheno.preprocess.base_dataset.Dataset object is initialized. We also print some information regarding the current progress, as loading the data might take some time for bigger datasets. When running the optimization for multiple models, these are sorted according to their encoding, and the dataset is only reloaded if the encoding changes between models, as sketched below.
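
A minimal sketch of this idea, using only helper functions that appear in this tutorial (not the actual easyPheno implementation):

[ ]:
# Sketch: sort the models by their standard encoding so the dataset only
# needs to be reloaded when the encoding changes between consecutive models
mapping = easypheno.utils.helper_functions.get_mapping_name_to_class()
models_sorted = sorted(models, key=lambda name: mapping[name].standard_encoding)
print(models_sorted)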
[15]:
easypheno.utils.helper_functions.set_all_seeds()
current_model_name = models[0]
encoding = easypheno.utils.helper_functions.get_mapping_name_to_class()[current_model_name].standard_encoding

dataset = easypheno.preprocess.base_dataset.Dataset(
    data_dir=data_dir, genotype_matrix_name=genotype_matrix, phenotype_matrix_name=phenotype_matrix,
    phenotype=phenotype, datasplit=datasplit, n_outerfolds=5, n_innerfolds=n_innerfolds,
    test_set_size_percentage=test_set_size_percentage, val_set_size_percentage=20,
    encoding=encoding, maf_percentage=maf_percentage
)
Load and match raw data
Apply MAF filter
Filter duplicate SNPs
Check if final snp_ids already exist in index_file for used encoding and maf percentage. Save them if necessary.
Load datasplit file
Checked datasplit for all folds.
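
As a quick hedged check, we can inspect the loaded phenotype vector; y_full is an attribute of the Dataset object that we will use again below:

[ ]:
# Sketch: the number of entries should match the 286 samples reported above
print(dataset.y_full.shape)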

After retrieving the type of ML task using easypheno.utils.helper_functions.test_likely_categorical() as well as the time stamp for saving the results, we create an easypheno.optimization.optuna_optim.OptunaOptim object. For this purpose, we hand over all the information needed for the hyperparameter search.

[16]:
task = 'classification' if easypheno.utils.helper_functions.test_likely_categorical(dataset.y_full) else 'regression'
models_start_time = '+'.join(models) + '_' + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

optim_run = easypheno.optimization.optuna_optim.OptunaOptim(
    save_dir=save_dir, genotype_matrix_name=genotype_matrix, phenotype_matrix_name=phenotype_matrix,
    phenotype=phenotype, n_outerfolds=5, n_innerfolds=n_innerfolds, val_set_size_percentage=20,
    test_set_size_percentage=test_set_size_percentage, maf_percentage=maf_percentage, n_trials=n_trials,
    save_final_model=False, batch_size=None, n_epochs=10000, task=task,
    models_start_time=models_start_time, current_model_name=current_model_name, dataset=dataset
)

Finally, we just need to call the method run_optuna_optimization() of our easypheno.optimization.optuna_optim.OptunaOptim object to start the Bayesian hyperparameter search, which will print the current progress and return a dictionary with summary results.

[17]:
summary_results = optim_run.run_optuna_optimization()
pprint.PrettyPrinter(depth=4).pprint(summary_results)
[I 2022-10-04 16:08:39,294] A new study created in RDB with name: 2022-10-04_16-08-33_x_matrix-y_matrix-continuous_values-MAF0-SPLITcv-test5-20-MODELxgboost-TRIALS10
Params for Trial 0
{'n_estimators': 2500, 'learning_rate': 0.07500000000000001, 'max_depth': 3, 'gamma': 300, 'subsample': 0.45, 'colsample_bytree': 0.35000000000000003, 'reg_alpha': 290.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:08:46,932] Trial 0 finished with value: 369.22728277444065 and parameters: {'n_estimators': 2500, 'learning_rate': 0.07500000000000001, 'max_depth': 3, 'gamma': 300, 'subsample': 0.45, 'colsample_bytree': 0.35000000000000003, 'reg_alpha': 290.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 1
{'n_estimators': 3000, 'learning_rate': 0.3, 'max_depth': 9, 'gamma': 300, 'subsample': 0.1, 'colsample_bytree': 0.55, 'reg_alpha': 440.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:08:59,323] Trial 1 finished with value: 454.8753047090302 and parameters: {'n_estimators': 3000, 'learning_rate': 0.3, 'max_depth': 9, 'gamma': 300, 'subsample': 0.1, 'colsample_bytree': 0.55, 'reg_alpha': 440.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 2
{'n_estimators': 2250, 'learning_rate': 0.2, 'max_depth': 10, 'gamma': 80, 'subsample': 0.2, 'colsample_bytree': 0.05, 'reg_alpha': 320.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:09:11,364] Trial 2 finished with value: 398.20160514328586 and parameters: {'n_estimators': 2250, 'learning_rate': 0.2, 'max_depth': 10, 'gamma': 80, 'subsample': 0.2, 'colsample_bytree': 0.05, 'reg_alpha': 320.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 3
{'n_estimators': 2000, 'learning_rate': 0.225, 'max_depth': 8, 'gamma': 770, 'subsample': 0.1, 'colsample_bytree': 0.3, 'reg_alpha': 110.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:09:24,604] Trial 3 finished with value: 369.6724074334373 and parameters: {'n_estimators': 2000, 'learning_rate': 0.225, 'max_depth': 8, 'gamma': 770, 'subsample': 0.1, 'colsample_bytree': 0.3, 'reg_alpha': 110.0}. Best is trial 0 with value: 369.22728277444065.
Params for Trial 4
{'n_estimators': 1750, 'learning_rate': 0.25, 'max_depth': 6, 'gamma': 520, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.05, 'reg_alpha': 100.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:09:31,163] Trial 4 finished with value: 364.574017351775 and parameters: {'n_estimators': 1750, 'learning_rate': 0.25, 'max_depth': 6, 'gamma': 520, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.05, 'reg_alpha': 100.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 5
{'n_estimators': 2750, 'learning_rate': 0.2, 'max_depth': 9, 'gamma': 810, 'subsample': 0.15000000000000002, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 540.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:09:40,337] Trial 5 finished with value: 452.7427816744316 and parameters: {'n_estimators': 2750, 'learning_rate': 0.2, 'max_depth': 9, 'gamma': 810, 'subsample': 0.15000000000000002, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 540.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 6
{'n_estimators': 100, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 520, 'subsample': 0.6000000000000001, 'colsample_bytree': 0.3, 'reg_alpha': 980.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:09:46,464] Trial 6 finished with value: 447.1631958152293 and parameters: {'n_estimators': 100, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 520, 'subsample': 0.6000000000000001, 'colsample_bytree': 0.3, 'reg_alpha': 980.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 7
{'n_estimators': 50, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 670, 'subsample': 0.6500000000000001, 'colsample_bytree': 0.2, 'reg_alpha': 730.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:09:49,729] Trial 7 finished with value: 420.049507545245 and parameters: {'n_estimators': 50, 'learning_rate': 0.3, 'max_depth': 4, 'gamma': 670, 'subsample': 0.6500000000000001, 'colsample_bytree': 0.2, 'reg_alpha': 730.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 8
{'n_estimators': 1000, 'learning_rate': 0.2, 'max_depth': 3, 'gamma': 690, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 130.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:09:56,208] Trial 8 finished with value: 373.9128450040709 and parameters: {'n_estimators': 1000, 'learning_rate': 0.2, 'max_depth': 3, 'gamma': 690, 'subsample': 0.35000000000000003, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 130.0}. Best is trial 4 with value: 364.574017351775.
Params for Trial 9
{'n_estimators': 250, 'learning_rate': 0.125, 'max_depth': 5, 'gamma': 730, 'subsample': 0.7500000000000001, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 780.0}
# Processing innerfold_0 #
# Processing innerfold_1 #
# Processing innerfold_2 #
# Processing innerfold_3 #
# Processing innerfold_4 #
[I 2022-10-04 16:10:00,982] Trial 9 finished with value: 419.9258576836199 and parameters: {'n_estimators': 250, 'learning_rate': 0.125, 'max_depth': 5, 'gamma': 730, 'subsample': 0.7500000000000001, 'colsample_bytree': 0.7500000000000001, 'reg_alpha': 780.0}. Best is trial 4 with value: 364.574017351775.
## Optuna Study finished ##
Study statistics:
  Finished trials:  10
  Pruned trials:  0
  Completed trials:  10
  Best Trial:  4
  Value:  364.574017351775
  Params:
    colsample_bytree: 0.05
    gamma: 520
    learning_rate: 0.25
    max_depth: 6
    n_estimators: 1750
    reg_alpha: 100.0
    subsample: 0.35000000000000003
## Retrain best model and test ##
## Results on test set ##
{'test_mse': 393.8179197745938, 'test_rmse': 19.84484617664228, 'test_r2_score': 0.05719242798046553, 'test_explained_variance': 0.05724940832753167}
{'Test': {'best_params': {'colsample_bytree': 0.05,
                          'gamma': 520,
                          'learning_rate': 0.25,
                          'max_depth': 6,
                          'n_estimators': 1750,
                          'reg_alpha': 100.0,
                          'subsample': 0.35000000000000003},
          'eval_metrics': {'test_explained_variance': 0.05724940832753167,
                           'test_mse': 393.8179197745938,
                           'test_r2_score': 0.05719242798046553,
                           'test_rmse': 19.84484617664228},
          'runtime_metrics': {'process_time_max': 36.19791426699999,
                              'process_time_mean': 20.51563152270001,
                              'process_time_min': 1.4952742340000214,
                              'process_time_std': 13.181602909370394,
                              'real_time_max': 11.987335681915283,
                              'real_time_mean': 7.454001760482788,
                              'real_time_min': 3.102874517440796,
                              'real_time_std': 3.14674639916681}}}

Beyond that, easypheno.optimization.optuna_optim.OptunaOptim creates and saves the Results_overview_*.csv files, which we showed above in this tutorial.

Further information

This notebook shows how to use the easyPheno pip package to run an optimization. Furthermore, we give an overview of the individual steps within optim_pipeline.run().

For more information on specific topics, see the following links: