Synthetic data

In this tutorial we will show you how to use easyPheno to create synthetic phenotypes for real genotypes.

Besides the written tutorial, we recorded a Video tutorial: Synthetic data generation, which is embedded below.

Additive model

To create synthetic phenotypes, easyPheno uses an additive model

\[\mathbf{y} = \mathbf{X \beta} + \mathbf{Z \gamma} + \mathbf{\epsilon}\]

where the phenotype \(\mathbf{y}\) is given as the sum of one or more causal markers \(\mathbf{X}\) with effect sizes \(\mathbf{\beta}\); random effects \(\mathbf{Z}\) with small effect sizes \(\mathbf{\gamma}\) drawn from a Gaussian distribution, which simulate the polygenic background; and some noise \(\mathbf{\epsilon}\).

The noise can either follow a Gaussian distribution or, for skewed phenotypes, a gamma distribution. Additionally, the number of causal markers and of markers used to simulate the polygenic background, as well as the number of samples used for the simulation are adjustable. Further, the heritability, i.e. the amount of variance that can be explained by the polygenic background, and the variance explained by the causal markers can both be altered by the user.

Create synthetic data in easyPheno

To create a synthetic phenotype using the command line, all you need is the path to the folder where your data is stored (data_dir) and the name of your genotype matrix (name_of_genotype_matrix). Please read our Data Guide for more information on the data structure of the genotype matrix.

python3 -m easypheno.simulate.run_synthetic_phenotypes --data_dir data_dir --genotype_matrix name_of_genotype_matrix

This will create a subfolder name_of_genotype_matrix within the data_dir and save two files, where each simulation gets a unique number or ID (sim_id) to distinguish them from each other:

Simulation_{sim_id}.csv

Contains the sample IDs corresponding to the genotype matrix, a column for the simulated phenotype (e.g. sim1) and one column with the same phenotype but shifted to get rid of negative values (sim1_shift)

Simulations_Overview.csv

Contains the sim_id and additional information such as number of samples, number of causal SNPs, etc. for each simulation

And within another subfolder sim_configs three files containing additional information:

simulation_config_{sim_id}.csv

Contains detailed information of the phenotype such as the SNP ID and effect size of causal markers.

background_{sim_id}.csv

Contains all SNP IDs of the used background markers

betas_background_{sim_id}.csv

Contains the effect size for each background marker in the same order as the background SNPs

Per default easyPheno creates synthetic phenotypes with 1000 samples, and 1000 markers to simulate the polygenic background with a heritability of 70%, i.e. such that the background accounts for 70% of the phenotypic variance. To change that you can specify the number of samples (--number_of_samples), number of background markers (--number_background_snps) and heritability (--heritability). For example

python3 -m easypheno.simulate.run_synthetic_phenotypes --data_dir data_dir --genotype_matrix name_of_genotype_matrix --number_of_samples 100 --number_background_snps 200 --heritability 50

will create a phenotype with 100 samples and use 200 markers to simulate the background with a heritability of 50%.

easyPheno will use one causal marker for the synthetic phenotypes that explains 30% of the total variance. You can adjust that by specifying the number of causal markers (--number_causal_snps) and the explained variance (--explained_variance). For example

python3 -m easypheno.simulate.run_synthetic_phenotypes --data_dir data_dir --genotype_matrix name_of_genotype_matrix --number_causal_snps 5 --explained_variance 20

will create a phenotype with 5 causal markers that together explain around 20% of the total phenotypic variance.

It is also possible to simulate phenotypes with a skewed distribution by using the flag --distribution 'gamma'. If you use a gamma distribution you can additionally adjust the shape parameter with -shape.

If you want to create several phenotypes with the same specifications at once, you can specify the number of simulations with --number_of_simulations. Then the corresponding sim_id will contain the number of the first and last simulation, e.g. ‘10-15’ for the six simulations ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’.

To get an overview over the other options you can adjust when creating synthetic phenotypes with easyPheno, just use:

python3 -m easypheno.simulate.run_synthetic_phenotypes --help

Video tutorial: Synthetic data generation