Visual Question Answering Using Graph-Based Deep Learning

Document number :

17432

Call number :

15259

Author :

Sajjadinia, Hossein

Title :

Visual Question Answering Using Graph-Based Deep Learning

Degree level :

Master's

Specialization :

Computer Architecture

Place of study :

Isfahan : Isfahan University of Technology

Year of defense :

1400

Pagination :

xii, 82 pp.: illustrations, tables, charts

Supervisor :

Mehran Safayani

Advisor :

Abdolreza Mirzaei

Descriptors :

Visual question answering, Computer vision, Neural network, Deep learning, Graph neural network, Paraphrase learning

Examiners :

Jalal Zahabi, Mohammad Ali Khosravifard

Record entry date :

1401/01/27

Bibliography :

Includes bibliography

Field of study :

Computer Engineering

Faculty :

Electrical and Computer Engineering

Record modification date :

1401/01/27

IranDoc code :

2821366

Persian abstract :

Visual question answering is a problem in computer vision, and deep neural networks are among the most common ways of solving it. In this problem, an image is given to the model together with a textual question about that image, and the model returns a correct textual answer as output. Solving this problem with deep learning has always faced challenges, whose main roots are the simultaneous use of different deep learning blocks in the model architecture, the complex and unclear relationships between the components of the image and the question, and the need for a large amount of training data. The most common approach is to extract a feature vector from the image using image processing tools, extract a feature vector from the question text, and train a deep neural network on these two vectors as input to reach the answer through a supervised end-to-end mechanism. The fundamental problem with these models is that they are weak at understanding the relationships among the objects in the image and the relationships between those objects and the question text. In this research, we introduce and implement a deep learning architecture based on graph models for visual question answering. Because graph models are highly capable of capturing complex relationships among the members of a system, the question text (its words) and the image (the objects in it) are each modeled as a graph. Since the image objects and the relationships between them play a key role in reaching the final answer, the words of the question are used in a learning process to obtain the image graph. The proposed model is obtained by learning both graphs with a graph neural network and combining them in an end-to-end deep learning architecture. This model achieves higher accuracy than competing models on the VQA2 dataset. In addition, through paraphrase learning, the model's ability to give the same correct answer to questions that differ lexically and grammatically but are semantically identical has improved.

English abstract :

Visual Question Answering is a problem in computer vision, and using a deep neural network is one of the most common ways to solve it. In this problem, an image is given to the model as input along with a question asked about the image, and the model returns a correct textual answer to the question as output. Solving this problem through deep learning has always faced challenges, whose main causes are the simultaneous use of different deep learning blocks in the model architecture, the existence of complex and unclear relationships between the image and the question, and the need for a large amount of training data. The most common deep learning approach is to extract feature vectors from the image using image processing tools, extract feature vectors from the text of the question, and train a deep neural network on these two vectors as input to reach the answer through a supervised end-to-end mechanism. The main problem with these models is that they are weak at understanding the relationships of the image objects to each other as well as the relationships of the image objects to the text of the question. In this research, we introduce and implement a deep learning architecture based on graph models to solve the problem of Visual Question Answering. Since graph models have a high ability to capture the complex relationships between members of a system, in this study the text of the question (its words) and the image (the objects within it) are each modeled as a graph. Since the objects of the image and the relationships between them play a key role in reaching the final answer to the question, the words of the question are used in a learning process to obtain the image graph. The proposed model is obtained by learning both graph models using a graph neural network and combining them in an end-to-end deep learning architecture. This model achieves higher accuracy than competing models on the VQA2 data set. Also, the model's ability to give the same correct answer to questions that are lexically and grammatically different but semantically equivalent has increased through paraphrase learning.
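The pipeline outlined in the abstract (question words and image objects as graph nodes, an image graph built with help from the question, message passing, then fusion) can be illustrated with a minimal sketch. This is not the thesis's implementation: the record gives no architectural details, so the elementwise gating used to condition edges on the question, the mean pooling, the elementwise-product fusion, the toy feature values, and every function name below are assumptions made for illustration. Learned weights are omitted entirely.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(nodes):
    """Average a list of equal-length node vectors into one vector."""
    dim = len(nodes[0])
    return [sum(n[d] for n in nodes) / len(nodes) for d in range(dim)]


def question_conditioned_adjacency(objects, question_vec):
    """Row-normalized adjacency over image objects.

    Assumption: each object feature is gated elementwise by the question
    vector, and edge scores are dot products of the gated features.
    """
    gated = [[o * q for o, q in zip(obj, question_vec)] for obj in objects]
    return [softmax([dot(gi, gj) for gj in gated]) for gi in gated]


def message_pass(nodes, adjacency):
    """One aggregation round: each node becomes the adjacency-weighted
    average of all node vectors (no learned weight matrix here)."""
    dim = len(nodes[0])
    return [
        [sum(w * node[d] for w, node in zip(row, nodes)) for d in range(dim)]
        for row in adjacency
    ]


# Toy features (hypothetical): 2 question words and 3 detected objects, 3-dim each.
word_nodes = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
object_nodes = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.4, 0.4, 0.4]]

question_vec = mean_pool(word_nodes)
adjacency = question_conditioned_adjacency(object_nodes, question_vec)
updated_objects = message_pass(object_nodes, adjacency)
image_vec = mean_pool(updated_objects)

# Assumed fusion: elementwise product of the two pooled representations.
fused = [q * v for q, v in zip(question_vec, image_vec)]
print(fused)
```

In a trained model, the fused vector would feed an answer classifier, and the edge scoring and aggregation steps would carry learned parameters optimized end to end.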

Link to this record :

https://library.iut.ac.ir/dL/search/default.aspx?Term=17432&Field=0&DTC=107
