مهره كش، شهاب

Title

Video Question Answering Using Graph Neural Networks and Transformers

Degree level

Master's

Specialization

Artificial Intelligence and Robotics

Place of study

Isfahan: Isfahan University of Technology

Year of defense

1403

Pagination

xi, 70 p.: illustrations, tables, charts

Descriptors

Video question answering, graph neural networks, transformer, language model, caption generation

Date of data entry

1403/06/26

Bibliography

Field of study

Computer Engineering

Faculty

Electrical and Computer Engineering

Date of last data edit

1403/06/27

IranDoc code

23064758

Persian abstract

Video question answering is a relatively new problem shared between natural language processing and computer vision. In this problem, a video and a question about it are given to the system, and the goal is to obtain a correct and acceptable answer from the system. On the visual side, the important challenges include the dynamics and changes of objects in a video over time, in contrast to static images, and a correct understanding of the semantic, spatial, and temporal relationships between objects. On the text side, properly understanding the question and the relationship between the question and the video are among the key challenges of this problem. Today, machine learning and deep learning, along with newer approaches such as deep neural networks and graph neural networks, have attracted attention for video question answering. Using transformers can also help address the challenges mentioned above. In this research, using different graphs extracted from the video, such as spatial and temporal graphs, and combining them has helped improve the features of objects in the video and the various relationships between them. The use of a transformer has also led to a better understanding of the features of objects and the complex relationships between them. For a better understanding of the questions and of the relationship between the question and the video, using powerful language models has been very effective. In addition, generating a caption for each video has led to a better grasp of the relationship between the video and the question. The results obtained on two datasets, NExT-QA and TGIF-QA, show that the proposed model performs better than competing models and can be used as an effective model.

English abstract

Video Question Answering (Video QA) is an emerging, interdisciplinary task at the intersection of Natural Language Processing (NLP) and Computer Vision. In this task, a video and a related question are provided to the system, and the goal is to obtain a correct and acceptable answer. On the visual side, the main challenges are the dynamics and changes of objects over time, which static images do not exhibit, and a proper understanding of the semantic, spatial, and temporal relationships between objects. On the text side, the main challenges are understanding the question properly and relating it to the video. Recent machine learning and deep learning approaches, such as Deep Neural Networks (DNNs) and Graph Neural Networks (GNNs), have attracted attention for Video Question Answering. Moreover, the use of transformers can help address some of these challenges. In this research, using various graphs extracted from the video, such as spatial and temporal graphs, and combining them has helped improve the representation of objects in the video and of the relationships between them. Using transformers has also enhanced the understanding of the features of objects and the complex relationships between them. Using powerful language models has proven highly effective for better understanding the questions and the relationship between the posed question and the video. In addition, generating a caption for each video has led to a better comprehension of the relationship between the video and the question. The results obtained on two datasets, NExT-QA and TGIF-QA, demonstrate that the proposed model outperforms competing models, indicating its potential as an effective solution.

Supervisor

مهران صفاياني

Advisor

عبدالرضا ميرزايي

Examiners

نادر كريمي , حميدرضا حكيم داودي

Link to this record

https://library.iut.ac.ir/dl/search/default.aspx?Term=19678&Field=0&DTC=107