Persian Abstract:
An autoencoder is a deep learning architecture that uses artificial neural networks to learn a compact, encoded representation of data. Its main goal is dimensionality reduction and data compression without loss of critical information.
A graph autoencoder is a special kind of autoencoder in which structured data is first converted into a graph; the dataset, together with its graph, is then fed into the autoencoder for training. Detecting outliers, especially around dense clusters, is one of the main challenges in many applications. To address this challenge, this thesis proposes a novel method based on a graph autoencoder. First, the data is partitioned into meaningful clusters using three efficient clustering algorithms: DBSCAN, OPTICS, and HDBSCAN. Then, using two different strategies, one distance-based and one density-based, probable outliers are identified and removed. The cleaned data, together with its adjacency matrix, is presented to the graph autoencoder network so that it can be trained on approximately clean data and extract the main features and patterns of the data. Two different methods are used to determine the dimension of the bottleneck layer, which helps optimize network performance. After training, the entire dataset is fed into the network, and outliers are detected by computing the reconstruction error or by inspecting the encoded representations. The results of this research demonstrate the high effectiveness of the proposed methods in identifying and removing outliers and in extracting key features from the datasets. This approach can serve as an efficient tool for outlier detection and data cleaning.
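To make the cluster-then-filter stage above concrete, the following is a minimal sketch in Python; the exact scoring rules, the choice of DBSCAN as the clusterer, and the parameters eps, min_samples, k, and top_n are illustrative assumptions, not the settings used in the thesis.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    def filter_probable_outliers(X, eps=0.5, min_samples=5, k=10, top_n=10):
        """Cluster X, score members by distance and density, drop the most suspicious."""
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        scores = np.zeros(len(X))
        for c in set(labels) - {-1}:
            idx = np.where(labels == c)[0]
            members = X[idx]
            # Distance strategy (assumed): distance to the cluster centroid.
            dist = np.linalg.norm(members - members.mean(axis=0), axis=1)
            # Density strategy (assumed): mean distance to the k nearest
            # neighbours inside the same cluster.
            nbrs = NearestNeighbors(n_neighbors=min(k + 1, len(members))).fit(members)
            d, _ = nbrs.kneighbors(members)
            dens = d[:, 1:].mean(axis=1)
            scores[idx] = dist / (dist.max() + 1e-12) + dens / (dens.max() + 1e-12)
        # DBSCAN noise points are probable outliers outright; add the Top-N scored points.
        drop = set(np.where(labels == -1)[0]) | set(np.argsort(scores)[-top_n:])
        keep = np.array(sorted(set(range(len(X))) - drop))
        return X[keep], keep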
English Abstract:
Autoencoders have gained significant attention in unsupervised learning due to their capability to learn meaningful data representations and serve as effective dimensionality reduction techniques. An autoencoder is a neural network designed to reconstruct its input at the output. The architecture of the network can be divided into two main parts: an encoder function h = f(x), which compresses the input data x into a lower-dimensional representation h, and a decoder function r = g(h), which reconstructs the input from h.
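As a concrete illustration of the encoder/decoder split above, here is a minimal fully connected autoencoder in PyTorch; the layer widths and the bottleneck size are arbitrary placeholders, not the architecture used in this work.

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        """r = g(f(x)): encoder f compresses x, decoder g reconstructs it."""
        def __init__(self, in_dim=30, bottleneck=4):
            super().__init__()
            self.encoder = nn.Sequential(          # h = f(x)
                nn.Linear(in_dim, 16), nn.ReLU(),
                nn.Linear(16, bottleneck),
            )
            self.decoder = nn.Sequential(          # r = g(h)
                nn.Linear(bottleneck, 16), nn.ReLU(),
                nn.Linear(16, in_dim),
            )

        def forward(self, x):
            h = self.encoder(x)
            return self.decoder(h)

    model = Autoencoder()
    x = torch.randn(8, 30)
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction error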
The graph autoencoder is a novel architecture specifically designed for unsupervised tasks such as outlier detection in graph-based datasets. In a Graph Autoencoder (GAE), data is first transformed into a graph, and both the original data and the graph are fed into the network. The model's output is the reconstruction of the graph, allowing it to learn graph-based representations of the data. Outliers are data points that exhibit abnormal or inconsistent behavior compared to the rest of the dataset. They can distort statistical analyses and degrade machine learning model performance by introducing noise into the data; as a result, their presence can reduce the precision of predictive models. Detecting outliers is nevertheless important, because these instances may represent rare but critical events or anomalies, and they may also signal errors in data collection or entry. Therefore, identifying and managing outliers is essential in data analysis, particularly when working with large datasets.
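The abstract does not specify how the graph is built from tabular data; one common construction (an assumption here, not necessarily the thesis's) is a symmetric k-nearest-neighbor adjacency matrix, degree-normalized and used in GCN-style propagation H = ReLU(Â X W):

    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    def knn_adjacency(X, k=10):
        """Symmetric kNN adjacency with self-loops, normalized as D^-1/2 (A+I) D^-1/2."""
        A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
        A = np.maximum(A, A.T) + np.eye(len(X))   # symmetrize, add self-loops
        d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
        return d_inv_sqrt @ A @ d_inv_sqrt

    # One graph-convolution layer: H = ReLU(A_hat X W)
    X = np.random.randn(100, 30)
    A_hat = knn_adjacency(X)
    W = np.random.randn(30, 16) * 0.1             # random weights, for illustration
    H = np.maximum(A_hat @ X @ W, 0.0)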
In this thesis, we present a new approach for outlier detection in graph-based datasets using a Graph Autoencoder. We examine how various factors, such as clustering methods, loss functions, the number of neurons in the bottleneck layer, and the number of adjacency matrix multiplications in network layers, influence the performance of outlier detection algorithms. Specifically, we evaluate three different clustering techniques and select the most effective one for this task.
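A sketch of how the three clustering candidates might be fitted and compared is shown below; the parameter values and the use of the silhouette score as the selection criterion are assumptions, not the thesis's evaluation protocol (scikit-learn >= 1.3 ships HDBSCAN in sklearn.cluster):

    from sklearn.cluster import DBSCAN, OPTICS, HDBSCAN  # HDBSCAN needs scikit-learn >= 1.3
    from sklearn.metrics import silhouette_score

    def compare_clusterers(X):
        """Fit the three density-based clusterers and report a simple quality score."""
        clusterers = {
            "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
            "OPTICS": OPTICS(min_samples=5),
            "HDBSCAN": HDBSCAN(min_cluster_size=5),
        }
        results = {}
        for name, c in clusterers.items():
            labels = c.fit_predict(X)
            mask = labels != -1                    # ignore noise points when scoring
            if len(set(labels[mask])) > 1:
                results[name] = silhouette_score(X[mask], labels[mask])
            else:
                results[name] = float("nan")       # too few clusters to score
        return results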
To enhance the model's ability to identify outliers, we weight graph edges by the distance between data points rather than by the traditional cosine similarity. Additionally, we limit the number of adjacency matrix multiplications in the network layers to improve computational efficiency and overall performance. Furthermore, we apply two well-known methods to estimate the intrinsic dimensionality of the data and thus the optimal size of the embedding layer of the autoencoder. Outlier instances are identified based on the reconstruction error, and the information captured in the bottleneck layer can also aid detection. The outlier score of each point within a cluster is computed, and the Top-N points with the highest scores are selected as probable outliers. After removing these probable outliers, the cleaned data is fed into the graph autoencoder, and the model is trained to detect outliers in the entire dataset. Finally, the optimized graph autoencoder is used to identify outlier instances, and the encoder's output is further analyzed to confirm them. Experimental results demonstrate that the proposed model outperforms existing approaches such as SOM, DPC, LOF, INFLO, and LDOF. Although the execution time of our method is longer than that of the other approaches, it achieves the highest AUC on ten different datasets. The quality of the identified outliers depends on the effectiveness of the clustering technique and the loss function used, and the dimension of the bottleneck layer has a significant impact on the model's ability to detect outlier instances; we therefore use standard methods to estimate the intrinsic dimension of the dataset and set it as the bottleneck dimension.
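As a minimal sketch of the reconstruction-error scoring and Top-N selection described above (assuming a plain autoencoder interface like the earlier sketch; the GAE variant would also take the adjacency matrix as input):

    import numpy as np
    import torch

    def top_n_outliers(model, X, top_n=20):
        """Score every point by its reconstruction error and return the Top-N indices."""
        model.eval()
        with torch.no_grad():
            x = torch.as_tensor(X, dtype=torch.float32)
            errors = ((model(x) - x) ** 2).mean(dim=1).numpy()  # per-point MSE
        order = np.argsort(errors)
        return order[-top_n:], errors   # highest-error points are probable outliers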