Identification of Disease Causing Genes Using Microarray Data Mining and Gene Ontology

English abstract:

Identification of Disease Causing Genes Using Microarray Data Mining and Gene Ontology

Azadeh Mohammadi
amohammadi@ec.iut.ac.ir
Date of Submission: April 6, 2009
Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran
Degree: M.Sc.
Language: Farsi
Mohammad Hossein Saraee
saraee@cc.iut.ac.ir

Abstract

Genetic information has recently attracted significant attention in the diagnosis and classification of diseases such as cancer. One of the best and most accurate methods in this context is monitoring gene expression values using microarray technology. One shortcoming of microarray data is that it provides a small number of samples relative to the number of genes. This problem reduces classification accuracy and increases computational and laboratory costs. In fact, many of the collected genes have little or nothing to do with the disease under study. Identifying and selecting disease-causing genes therefore not only increases accuracy and reduces costs, but is also biologically important and can provide useful information on the cause of a disease and possible ways to cure it. Identifying and selecting the set of genes related to a disease from among the thousands of genes studied in microarray experiments is called gene selection.

In this thesis, after investigating different approaches to gene selection, a novel framework for gene selection is proposed that uses the advantageous features of conventional methods and covers their weak points. In addition to gene expression values, the proposed method uses gene ontology, which is a reliable source of information on genes. Using gene ontology alongside gene expression data can compensate in part for the limitations of microarrays, including the small number of samples and erroneous measurement results. In the proposed framework, a significant number of irrelevant genes are first omitted using the Fisher filtering method. Since filtering methods do not take into account the correlation among genes, the remaining genes still have a large amount of redundancy. To reduce this redundancy, a greedy approach is proposed for removing similar genes. This approach calculates the similarity between genes using a hybrid criterion that considers gene ontology information as well as gene expression data, and then removes redundant genes according to this criterion. Finally, the genes that remain after this stage are processed more accurately by the SVM-RFE method to derive the disease marker genes. The proposed method has been applied to the DLBCL and colon cancer datasets. It is observed that the proposed method improves classification performance. Moreover, comparison with conventional gene selection methods demonstrates better performance for the same number of genes used.

Microarray datasets often contain missing values for different reasons, including scratches or dust on the slide, experimental error, image corruption, and insufficient resolution. In this thesis, a novel method is proposed that integrates CST clustering and gene ontology to estimate missing values at the preprocessing stage. The performance of the proposed method has been studied on the DLBCL datasets with different percentages of missing values. Comparing the results of the proposed method with other existing estimation methods shows that the proposed method can estimate missing values with higher accuracy.

Key Words: Gene Selection, Gene Ontology, Gene Expression, Microarray, Missing Value
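The first two stages of the framework described in the abstract (Fisher filtering followed by greedy removal of similar genes) can be illustrated with a minimal sketch. The abstract does not define the hybrid criterion, so everything specific here is an assumption made for illustration: the similarity is taken as a weighted sum of absolute Pearson correlation and Jaccard overlap of annotation term sets, and the names `alpha`, `threshold`, `top_k`, the toy genes `g1` to `g4`, and the term labels `T1` to `T4` are hypothetical. The final SVM-RFE stage is not shown.

```python
from math import sqrt
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score: (mean0 - mean1)^2 / (var0 + var1)."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    spread = pvariance(class0) + pvariance(class1)
    gap = (mean(class0) - mean(class1)) ** 2
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def jaccard(terms_a, terms_b):
    """Overlap of two annotation term sets (assumed stand-in for GO similarity)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def hybrid_similarity(g, h, expr, go_terms, alpha):
    """Assumed hybrid criterion: weighted mix of expression and ontology similarity."""
    return (alpha * abs(pearson(expr[g], expr[h]))
            + (1 - alpha) * jaccard(go_terms[g], go_terms[h]))


def select_genes(expr, labels, go_terms, top_k, alpha=0.5, threshold=0.8):
    # Stage 1: keep the top_k genes ranked by Fisher score.
    ranked = sorted(expr, key=lambda g: fisher_score(expr[g], labels), reverse=True)
    ranked = ranked[:top_k]
    # Stage 2: greedily keep a gene only if it is not too similar to any kept gene.
    kept = []
    for g in ranked:
        if all(hybrid_similarity(g, h, expr, go_terms, alpha) < threshold for h in kept):
            kept.append(g)
    return kept


# Toy data: 6 samples (3 per class), 4 hypothetical genes.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # strongly discriminative
    "g2": [1.1, 1.4, 1.0, 3.0, 3.2, 2.9],  # near-copy of g1 (redundant)
    "g3": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # uninformative
    "g4": [5.0, 4.0, 4.5, 1.0, 2.0, 1.5],  # discriminative, different annotation
}
go_terms = {
    "g1": {"T1", "T2"},
    "g2": {"T1", "T2"},
    "g3": {"T3"},
    "g4": {"T4"},
}

selected = select_genes(expr, labels, go_terms, top_k=3)
print(selected)
```

In this toy run the filter discards the uninformative gene, and the greedy pass drops the gene that duplicates a higher-ranked one in both expression and annotation. A gene that is correlated in expression but annotated differently survives, which is the kind of case where ontology information changes the outcome relative to an expression-only criterion.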

Mohammadi, Azadeh

Missing value, gene expression, biological data mining, DLBCL cancer