Projects

Development of methods for bias correction of gene expression measurements

Grant NCN: Sonata 2016/23/D/ST7/03665
Total budget [PLN]: 536 900
Start date: 25/08/2017; End date: 25/06/2022
Status: completed

Introduction

Modern medicine increasingly draws on the information stored in patients' genomes and associates the expression levels of certain genes with specific diseases and treatment plans. However, both in basic research and in diagnostics one has to take into account that measurements of gene expression levels depend significantly on specific structural features of the RNA fragments studied and on the specifics of the measurement method used [1]. For this reason, comparing signal levels obtained for different RNA fragments, or for identical fragments in different samples, can lead to false conclusions. Differences in the structure of nucleic acids significantly reduce the accuracy of experiments in molecular biology and may reduce the sensitivity of detecting differentially expressed genes [2]. They can also lead to false positives when detecting overrepresented sequence fragments in ChIP-on-chip experiments [3], among others.

Factors that affect signal levels in gene expression profiling studies mainly include differences in the nucleotide composition and length of the tested RNA fragments, as well as differences in the degradation rate of the studied molecules. All of these factors affect both cDNA synthesis and amplification of the tested genetic material [1]. Factors associated with specific research methods are also of crucial importance, including differences in the nucleotide composition of microarray probes, which affect the level of non-specific hybridization, and the presence of certain motifs within the probe sequences. This group of factors also includes the use of specific reagents, for example oligo-dT primers for cDNA synthesis, which introduce a bias of their own [4].
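
The sequence-level factors named above (nucleotide composition and fragment length) are straightforward to quantify. The following is a minimal sketch, assuming sequences are available as plain strings; the function name `sequence_features` and the example fragments are hypothetical and are not part of the project's software.

```python
def sequence_features(seq):
    """Return (length, GC fraction) for an RNA or DNA sequence string."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    # Count guanine and cytosine bases; works for both RNA and DNA alphabets.
    gc = sum(1 for base in seq if base in "GC")
    return length, gc / length


# Hypothetical example fragments, for illustration only.
fragments = {"frag_a": "AUGGCCGCGG", "frag_b": "AUUUAAUAUA"}
features = {name: sequence_features(s) for name, s in fragments.items()}
```

Features of this kind could then serve as covariates when examining how signal levels vary with sequence properties.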

Existing methods of bias correction often focus on a single factor and are specific to a particular method and research platform [3, 5-8]. Because the correction algorithms differ, the various methods cannot be combined. In addition, most of the factors that affect signal levels are currently too poorly described to allow effective signal correction. Our preliminary studies show that the influence of the described factors should not be overlooked [4], which also applies to experiments involving a large number of replicates, either technical or biological. Additionally, we were able to demonstrate that, in the detection of differentially expressed genes, even simple statistical methods that take into account the structure of the probes and of the examined RNA fragments yield much better results [2].
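
To give a sense of what a simple, single-factor statistical correction can look like, the sketch below shifts log-scale signals so that every GC-content bin has the same median. This is a generic bin-median adjustment chosen for illustration, not the method used in the cited studies; the function name `gc_bin_correction`, the bin count, and the example values are all assumptions.

```python
from statistics import median


def gc_bin_correction(log_signals, gc_fractions, n_bins=5):
    """Shift each GC bin so that its median matches the global median."""
    if len(log_signals) != len(gc_fractions):
        raise ValueError("inputs must have equal length")
    global_median = median(log_signals)
    # Assign each probe to an equal-width GC bin covering [0, 1].
    bins = [min(int(gc * n_bins), n_bins - 1) for gc in gc_fractions]
    bin_medians = {
        b: median(s for s, sb in zip(log_signals, bins) if sb == b)
        for b in set(bins)
    }
    return [s - bin_medians[b] + global_median
            for s, b in zip(log_signals, bins)]


# Hypothetical log2 signals and probe GC fractions (illustration only).
signals = [5.0, 5.2, 7.0, 7.4]
gc = [0.20, 0.25, 0.80, 0.85]
corrected = gc_bin_correction(signals, gc, n_bins=2)
```

A correction of this kind addresses only one factor and would also remove any genuine biological signal that happens to correlate with GC content, which illustrates why single-factor approaches are limited.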

In this project we plan to develop a comprehensive method for reducing bias in gene expression level estimates. The method will be based on mathematical models of the associated biochemical processes and will be implemented as publicly available software. We are planning to create a set of input-output models, one for each stage of the measurement procedure, with the possibility to estimate or experimentally determine certain parameter values. These models will be used to create correction curves for subsequent processing steps. Finally, a comprehensive mathematical model will be built, based on the input-output models obtained for each step.

The result of the project will be a new method of data processing applicable to experiments in which nucleic acid concentrations are estimated, as well as a methodology for assessing the impact of technical factors on the obtained measurements. Implementing these methods as publicly available software can help to improve the accuracy of such measurements, increasing their usefulness in basic research and diagnostic studies.

  1. Jaksik, R., et al., Microarray experiments and factors which affect their reliability. Biology Direct, 2015. 10.
  2. Jaksik, R., W. Bensz, and J. Smieja, Nucleotide Composition Based Measurement Bias in High Throughput Gene Expression Studies, in Man–Machine Interactions 4, A. Gruca, et al., Editors. 2015, Springer International Publishing. p. 205-214.
  3. Royce, T.E., J.S. Rozowsky, and M.B. Gerstein, Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinformatics, 2007. 23(8): p. 988-97.
  4. Jaksik, R., et al., Sources of High Variance between Probe Signals in Affymetrix Short Oligonucleotide Microarrays. Sensors, 2014. 14(1): p. 532-548.
  5. Hulsman, M., et al., Delineation of amplification, hybridization and location effects in microarray data yields better-quality normalization. BMC Bioinformatics, 2010. 11: p. 156.
  6. Fasold, M. and H. Binder, AffyRNADegradation: control and correction of RNA quality effects in GeneChip expression data. Bioinformatics, 2013. 29(1): p. 129-31.
  7. Benjamini, Y. and T.P. Speed, Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res, 2012. 40(10): p. e72.
  8. Gao, L., et al., Length bias correction for RNA-seq data in gene set analyses. Bioinformatics, 2011. 27(5): p. 662-9.
Goal

The primary goal of this project is to develop a comprehensive method for reducing bias in gene expression level estimates. The method will be based on mathematical models of the associated biochemical processes and will be implemented as publicly available software. We are planning to create a set of input-output models, one for each stage of the measurement procedure, with the possibility to estimate or experimentally determine certain parameter values. These models will be used to create correction curves for subsequent processing steps. Finally, a comprehensive mathematical model will be built, based on the individual input-output models obtained for each step.
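
The idea of chaining per-stage input-output models and deriving a correction from the combined model can be sketched as follows. This is only an illustration under stated assumptions: the three stage functions and all their parameter values are made up, each stage is assumed to be strictly increasing so that the combined model can be inverted numerically, and none of the names correspond to the project's actual software.

```python
from functools import reduce


def compose(stages):
    """Chain per-stage input-output models into one overall model."""
    return lambda x: reduce(lambda value, stage: stage(value), stages, x)


def invert_monotone(model, y, lo, hi, tol=1e-9):
    """Find x in [lo, hi] with model(x) close to y by bisection.

    Assumes model is strictly increasing on [lo, hi].
    """
    if not model(lo) <= y <= model(hi):
        raise ValueError("target outside the model's output range")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if model(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical stage models with made-up parameters (illustration only).
def cdna_synthesis(x):
    return 0.8 * x  # fixed conversion efficiency


def amplification(x):
    return 50.0 * x  # fixed gain


def hybridization(x):
    return 1000.0 * x / (x + 200.0)  # saturating response


measurement_model = compose([cdna_synthesis, amplification, hybridization])


def correct(signal):
    """Map an observed signal back to an estimated input concentration."""
    return invert_monotone(measurement_model, signal, lo=0.0, hi=1e6)
```

In this toy setup, `correct(measurement_model(x))` recovers `x` up to the bisection tolerance. In practice the stage parameters would depend on sequence features and would have to be estimated from data or determined experimentally, as described above.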

Tasks
  1. Identification of factors that affect signal levels in gene expression studies
  2. Development of a mathematical model that describes the influence of individual factors on the signal levels
  3. Estimation of the model parameters based on data obtained using custom oligonucleotide microarrays and RT-qPCR experiments
  4. Development of a data correction algorithm based on the created mathematical model
  5. Validation of the method using publicly available test data
     
Contractors

Project manager

Roman Jaksik

Contractors

Anna Lalik, Krzysztof Puszynski

Results

The main achievements of the project include:

  • Identification of various sources of bias and their impact on gene expression measurements using an oligonucleotide microarray designed for this purpose
  • Development of a new algorithm for correcting bias in gene expression measurements that results from specific features of the RNA nucleotide sequence
  • Development of a methodology for designing microarray probes aimed at identifying sources of data bias, and creation of software for initial data processing and visualization of results obtained using the designed microarray
  • Development of algorithms for automatic data collection and processing from RNA-seq and microarray experiments based on the ArrayExpress database, in order to study the impact of specific factors based on many different data sets.
  • Development of new methods and visualization tools for quality control of data from RNA-seq experiments
  • Determination of the bias in RNA-seq data that results from the protocol used to prepare libraries for sequencing
  • Development of new recommendations regarding the fragmentation reaction conditions in the protocol for the preparation of libraries for RNA sequencing, maximizing the usefulness of the obtained data in terms of the detection of structural changes and short variants
  • Application of the experience gained in the project, including the new methods of data quality control, to select the methodology for analyzing RNA-seq data in order to characterize the mucosal field effect in bladder cancer and the impact of chronic infection on age-related clonal hematopoiesis
Articles