Persian Abstract:
An autoencoder is a deep learning architecture that uses artificial neural networks to learn a compact, encoded representation of data. Its main goal is dimensionality reduction and data compression without loss of critical information.
A graph autoencoder is a special kind of autoencoder in which structured data is first converted into a graph; the dataset, together with its graph, is then fed into the autoencoder for training. Detecting outliers, especially around dense clusters, is one of the main challenges in many applications. To address this challenge, this thesis proposes a novel method based on a graph autoencoder. First, the data is partitioned into meaningful clusters using three efficient clustering algorithms: DBSCAN, OPTICS, and HDBSCAN. Then, using two different strategies, one distance-based and one density-based, probable outliers are identified and removed. The cleaned data, together with its adjacency matrix, is presented to the graph autoencoder network so that it can be trained on approximately clean data and extract the main features and patterns of the data. Two different methods are used to determine the dimension of the bottleneck layer, which helps optimize network performance. After training, the entire dataset is fed into the network, and outliers are detected by computing the reconstruction error or by inspecting the encoded representations. The results of this research demonstrate the high effectiveness of the proposed methods in identifying and removing outliers and in extracting key features from the datasets. This approach can serve as an efficient tool for outlier detection and data cleaning.
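To make the cluster-then-filter stage above concrete, the following is a minimal sketch in Python; the exact scoring rules, the choice of DBSCAN as the clusterer, and the parameters eps, min_samples, k, and top_n are illustrative assumptions, not the settings used in the thesis.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    def filter_probable_outliers(X, eps=0.5, min_samples=5, k=10, top_n=10):
        """Cluster X, score members by distance and density, drop the most suspicious."""
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        scores = np.zeros(len(X))
        for c in set(labels) - {-1}:
            idx = np.where(labels == c)[0]
            members = X[idx]
            # Distance strategy (assumed): distance to the cluster centroid.
            dist = np.linalg.norm(members - members.mean(axis=0), axis=1)
            # Density strategy (assumed): mean distance to the k nearest
            # neighbours inside the same cluster.
            nbrs = NearestNeighbors(n_neighbors=min(k + 1, len(members))).fit(members)
            d, _ = nbrs.kneighbors(members)
            dens = d[:, 1:].mean(axis=1)
            scores[idx] = dist / (dist.max() + 1e-12) + dens / (dens.max() + 1e-12)
        # DBSCAN noise points are probable outliers outright; add the Top-N scored points.
        drop = set(np.where(labels == -1)[0]) | set(np.argsort(scores)[-top_n:])
        keep = np.array(sorted(set(range(len(X))) - drop))
        return X[keep], keep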
English Abstract:
Autoencoders have gained significant attention in unsupervised learning due to their capability to learn meaningful data representations and serve as effective dimensionality reduction techniques. An autoencoder is a neural network designed to reconstruct its input at the output. The architecture of the network can be divided into two main parts: an encoder function h = f(x), which compresses the input data x into a lower-dimensional representation h, and a decoder function r = g(h), which reconstructs the input from h.
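As a concrete illustration of the encoder/decoder split above, here is a minimal fully connected autoencoder in PyTorch; the layer widths and the bottleneck size are arbitrary placeholders, not the architecture used in this work.

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        """r = g(f(x)): encoder f compresses x, decoder g reconstructs it."""
        def __init__(self, in_dim=30, bottleneck=4):
            super().__init__()
            self.encoder = nn.Sequential(          # h = f(x)
                nn.Linear(in_dim, 16), nn.ReLU(),
                nn.Linear(16, bottleneck),
            )
            self.decoder = nn.Sequential(          # r = g(h)
                nn.Linear(bottleneck, 16), nn.ReLU(),
                nn.Linear(16, in_dim),
            )

        def forward(self, x):
            h = self.encoder(x)
            return self.decoder(h)

    model = Autoencoder()
    x = torch.randn(8, 30)
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction error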
The graph autoencoder is a novel architecture specifically designed for unsupervised tasks such as outlier detection in graph-based datasets. In a Graph Autoencoder (GAE), data is first transformed into a graph, and both the original data and the graph are fed into the network. The model's output is the reconstruction of the graph, allowing it to learn graph-based representations of the data. Outliers are data points that exhibit abnormal or inconsistent behavior compared to the rest of the dataset. They can distort statistical analyses and degrade machine learning model performance by introducing noise into the data; as a result, their presence can reduce the precision of predictive models. Detecting outliers is nevertheless important, because these instances may represent rare but critical events or anomalies, and they may also signal errors in data collection or entry. Therefore, identifying and managing outliers is essential in data analysis, particularly when working with large datasets.
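The abstract does not specify how the graph is built from tabular data; one common construction (an assumption here, not necessarily the thesis's) is a symmetric k-nearest-neighbor adjacency matrix, degree-normalized and used in GCN-style propagation H = ReLU(Â X W):

    import numpy as np
    from sklearn.neighbors import kneighbors_graph

    def knn_adjacency(X, k=10):
        """Symmetric kNN adjacency with self-loops, normalized as D^-1/2 (A+I) D^-1/2."""
        A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
        A = np.maximum(A, A.T) + np.eye(len(X))   # symmetrize, add self-loops
        d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
        return d_inv_sqrt @ A @ d_inv_sqrt

    # One graph-convolution layer: H = ReLU(A_hat X W)
    X = np.random.randn(100, 30)
    A_hat = knn_adjacency(X)
    W = np.random.randn(30, 16) * 0.1             # random weights, for illustration
    H = np.maximum(A_hat @ X @ W, 0.0)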
In this thesis, we present a new approach for outlier detection in graph-based datasets using a Graph Autoencoder. We examine how various factors, such as clustering methods, loss functions, the number of neurons in the bottleneck layer, and the number of adjacency matrix multiplications in network layers, influence the performance of outlier detection algorithms. Specifically, we evaluate three different clustering techniques and select the most effective one for this task.
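A sketch of how the three clustering candidates might be fitted and compared is shown below; the parameter values and the use of the silhouette score as the selection criterion are assumptions, not the thesis's evaluation protocol (scikit-learn >= 1.3 ships HDBSCAN in sklearn.cluster):

    from sklearn.cluster import DBSCAN, OPTICS, HDBSCAN  # HDBSCAN needs scikit-learn >= 1.3
    from sklearn.metrics import silhouette_score

    def compare_clusterers(X):
        """Fit the three density-based clusterers and report a simple quality score."""
        clusterers = {
            "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
            "OPTICS": OPTICS(min_samples=5),
            "HDBSCAN": HDBSCAN(min_cluster_size=5),
        }
        results = {}
        for name, c in clusterers.items():
            labels = c.fit_predict(X)
            mask = labels != -1                    # ignore noise points when scoring
            if len(set(labels[mask])) > 1:
                results[name] = silhouette_score(X[mask], labels[mask])
            else:
                results[name] = float("nan")       # too few clusters to score
        return results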
To enhance the model's ability to identify outliers, we weight graph edges by the distance between data points rather than by the traditional cosine similarity. Additionally, we limit the number of adjacency matrix multiplications in the network layers to improve computational efficiency and overall performance. Furthermore, we apply two well-known methods to estimate the intrinsic dimensionality of the data and thus the optimal size of the embedding layer of the autoencoder. Outlier instances are identified based on the reconstruction error, and the information captured in the bottleneck layer can also aid detection. The outlier score of each point within a cluster is computed, and the Top-N points with the highest scores are selected as probable outliers. After removing these probable outliers, the cleaned data is fed into the graph autoencoder, and the model is trained to detect outliers in the entire dataset. Finally, the optimized graph autoencoder is used to identify outlier instances, and the encoder's output is further analyzed to confirm them. Experimental results demonstrate that the proposed model outperforms existing approaches such as SOM, DPC, LOF, INFLO, and LDOF. Although the execution time of our method is longer than that of the other approaches, it achieves the highest AUC on ten different datasets. The quality of the identified outliers depends on the effectiveness of the clustering technique and the loss function used, and the dimension of the bottleneck layer has a significant impact on the model's ability to detect outlier instances; we therefore use standard methods to estimate the intrinsic dimension of the dataset and set it as the bottleneck dimension.
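As a minimal sketch of the reconstruction-error scoring and Top-N selection described above (assuming a plain autoencoder interface like the earlier sketch; the GAE variant would also take the adjacency matrix as input):

    import numpy as np
    import torch

    def top_n_outliers(model, X, top_n=20):
        """Score every point by its reconstruction error and return the Top-N indices."""
        model.eval()
        with torch.no_grad():
            x = torch.as_tensor(X, dtype=torch.float32)
            errors = ((model(x) - x) ** 2).mean(dim=1).numpy()  # per-point MSE
        order = np.argsort(errors)
        return order[-top_n:], errors   # highest-error points are probable outliers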