Search Results

Found 124003 documents matching the query

Gregorius Vidy Prasetyo

Metode easy ensemble dengan random forest untuk mengatasi masalah klasifikasi pada kelas data tidak seimbang = Easy ensemble with random forest to handle imbalanced data in classification

"ABSTRAK

Pada permasalahan seperti kesehatan atau dunia retail banyak dijumpai data-data yang memiliki kategori yang tidak seimbang. Sebagai contoh jumlah penderita penyakit tertentu relatif langka pada suatu studi atau jumlah transaksi yang terkadang merupakan transaksi palsu (fraud) jumlahnya secara signifikan lebih sedikit ketimbang transaksi normal. Kondisi ini biasa disebut sebagai kondisi data tidak seimbang dan menyebabkan permasalahan pada performa model, terutama pada kelas minoritas. Beberapa metode telah dikembangkan untuk mengatasi permasalahan data tidak seimbang, salah satu metode terkini untuk menanganinya adalah Easy Ensemble. Easy Ensemble diklaim dapat mengatasi efek negatif dari pendekatan konvensional seperti random-under sampling dan mampu meningkatkan performa model dalam memprediksi kelas minoritas. Skripsi ini membahas metode Easy Ensemble dan penerapannya dengan model Random Forest dalam mengatasi masalah data tidak seimbang. Dua buah studi empiris dilakukan berdasarkan kasus nyata dari situs kompetisi hacks.id dan kaggle.com. Proporsi kategori antara kelas mayoritas dan minoritas pada dua data di kasus ini adalah 70:30 dan 94:6. Hasil penelitian menunjukkan bahwa metode Easy Ensemble, dapat meningkatkan performa model klasifikasi Random Forest terhadap kelas minoritas dengan signifikan. Sebelum dilakukan resampling pada data (nhacks.id), nilairecall minority hanya sebesar 0.47, sedangkan setelah dilakukan resampling, nilainya naik menjadi 0.82. Begitu pula pada data kedua (kaggle.com), sebelum resampling nilai recall minority hanya sebesar 0.14, sedangkan setelah dilakukan resampling, nilai naik secara signifikan menjadi 0.71.

ABSTRACT

Imbalanced data are common in real-world problems. In medicine, for example, patients suffering from cancer are far fewer than healthy patients. This condition can cause issues at the problem-definition level, the algorithm level, and the data level. Several methods have been developed to overcome these issues; one state-of-the-art method is Easy Ensemble. Easy Ensemble is claimed to improve model performance in classifying the minority class and to overcome the deficiency of random under-sampling. This thesis discusses the implementation of Easy Ensemble with Random Forest classifiers to handle the imbalance problem in a credit scoring case. The combined method is applied to two datasets taken from the data science competition websites nhacks.id and kaggle.com, with majority-to-minority class proportions of 70:30 and 94:6, respectively. The results show that resampling with Easy Ensemble improves the performance of the Random Forest classifier on the minority class, as shown by a significant increase in minority-class recall after resampling. On the first dataset (nhacks.id), minority-class recall is only 0.49 before resampling and rises to 0.82 afterwards. Likewise, on the second dataset (kaggle.com), minority-class recall is only 0.14 before resampling and rises significantly to 0.71 afterwards."
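As a rough illustration of the resampling idea in this abstract, the sketch below builds several balanced training subsets by keeping every minority row and drawing an equally sized random sample of majority rows, then shows how minority-class recall is computed. The function names, the toy labels, and the number of subsets are assumptions made for illustration; the abstract does not describe the thesis's actual code, and training one Random Forest per subset is left out.

```python
import random

def easy_ensemble_subsets(labels, minority_label, n_subsets, seed=0):
    """Return n_subsets lists of row indices, each balanced by random under-sampling.

    Every subset keeps all minority rows and adds an equally sized random
    sample (without replacement) of majority rows.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(sorted(minority + sampled_majority))
    return subsets

def minority_recall(y_true, y_pred, minority_label):
    """Recall on the minority class: correctly predicted minority rows / all minority rows."""
    predictions_for_minority = [p for t, p in zip(y_true, y_pred) if t == minority_label]
    hits = sum(p == minority_label for p in predictions_for_minority)
    return hits / len(predictions_for_minority)

# Toy labels with a 94:6 class ratio, mirroring the proportion of the second dataset.
labels = [0] * 94 + [1] * 6
subsets = easy_ensemble_subsets(labels, minority_label=1, n_subsets=4)
```

In a full pipeline, one classifier would be trained on each subset and their predictions combined, so that different parts of the majority class are seen across the ensemble.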

2019

S-Pdf

UI - Skripsi Membership Universitas Indonesia Library

Esti Latifah

Random forest untuk mengatasi masalah overfitting pada klasifikasi = Random forest to overcome overfitting problem in classification

"ABSTRAK

Klasifikasi merupakan proses pengelompokan suatu himpunan data ke kelas-kelas yang sudah ada sebelumnya. Pada umumnya, himpunan data dibagi menjadi dua bagian, yaitu training data dan testing data. Dibutuhkan suatu metode klasifikasi yang dapat mengelompokkan training data dan testing data ke dalam suatu kelas dengan tepat. Sering kali metode klasifikasi hanya dapat mengelompokkan training data dengan tepat saja, namun tidak demikian untuk testing data. Artinya, model yang terbentuk tidak cukup stabil atau model tersebut mengalami overfitting. Secara umum, overfitting merupakan kondisi saat akurasi yang dihasilkan pada training data cukup tinggi, namun cenderung tidak mampu memprediksi testing data. Penentuan metode klasifikasi yang rentan terhadap overfitting perlu dipertimbangkan. Random forest merupakan salah satu metode klasifikasi yang rentan terhadap masalah overfitting. Hal tersebut sekaligus menjadi salah satu kelebihan dari metode random forest. Oleh karena itu, pada tugas akhir ini akan dibahas metode random forest serta mengaplikasikannya pada data penderita penyakit Parkinson yang dibagi berdasarkan 2 sub-tipe, yaitu tremor dominant TD dan postural instability gait difficulty PIGD dominant. Selanjutnya, dari data tersebut diperoleh hasil akurasi model yang dihasilkan dalam mengklasifikasi training data, yaitu sekitar 94,25 . Sementara itu, akurasi metode ini dalam melakukan klasifikasi pada data yang tidak terkandung dalam membentuk model sebesar 94,26.

ABSTRACT

Classification is the process of grouping a set of data into pre-existing classes. In general, the data set is divided into two parts: training data and testing data. A classification method is needed that can assign both training data and testing data to their classes correctly. However, some classification methods fit only the training data and do not carry over to the testing data. This means that the model is unstable, or that it overfits. In general, overfitting is the condition in which a model fits the training data too closely but is unable to predict the testing data; in other words, accuracy on the testing data decreases. Therefore, whether a classification method is vulnerable to overfitting needs to be considered. Random forest is one of the classification methods that is resistant to overfitting, which is also one of its advantages. This final project therefore discusses the random forest method and applies it to data on Parkinson's disease patients divided into two sub-types: tremor dominant (TD) and postural instability gait difficulty (PIGD) dominant. On these data, the accuracy of the model in classifying the training data is about 94.25%, while its accuracy in classifying data not used to build the model is about 94.26%."
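The overfitting check described in this abstract comes down to comparing accuracy on the training data with accuracy on held-out data. The sketch below is a minimal illustration of that comparison; the helper names are hypothetical, and the two figures are taken from the abstract on the assumption that they are percentages.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def generalization_gap(train_accuracy, test_accuracy):
    """Training accuracy minus testing accuracy.

    A large positive gap suggests overfitting; a gap near zero suggests
    the model generalizes about as well as it fits.
    """
    return train_accuracy - test_accuracy

# Accuracies reported in the abstract, written as fractions (assumed to be percentages).
gap = generalization_gap(0.9425, 0.9426)
```

With the reported figures the gap is essentially zero, which is consistent with the abstract's claim that the model does not overfit.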

2018

S-Pdf

UI - Skripsi Membership Universitas Indonesia Library

Muhammad Ilham Randi

Analisis Perbandingan Metode-metode Rebalancing Dalam Menangani Imbalanced Data Pada Klasifikasi Tingkat Keparahan Covid-19 Dengan Metode Random Forest = Comparative Analysis of Rebalancing Methods in Handling Imbalanced Data on COVID-19 Severity Classification with Random Forest

"Dalam melakukan klasifikasi, tidak jarang terdapat data dengan jumlah anggota kategori yang tidak seimbang. Khususnya dalam dunia kesehatan dimana kategori yang diamati umumnya lebih jarang terjadi. Jika ketidakseimbangan ini tidak ditangani terlebih dahulu maka dapat memberikan hasil klasifikasi yang bias dan kurang akurat. Terdapat beberapa metode rebalancing konvensional untuk menanganinya seperti random oversampling dan random undersampling, namun keduanya diklaim memiliki beberapa kelemahan sehingga beberapa metode yang lebih kompleks dikembangkan. Namun jumlah metode yang dapat digunakan untuk menangani data kategorik selain metode konvensional tersebut masih minim. Salah satu metode yang dapat menangani data kategorik adalah synthetic minority over sampling-technique nominal continuous atau SMOTE-NC yang merupakan ekstensi dari SMOTE yang dikembangkan untuk menangani dataset dengan variabel campuran. Skripsi ini membahas perbandingan dari metode random oversampling dan SMOTE-NC juga metode gabungannya dengan undersampling yaitu random oversampling + undersampling dan SMOTE-NC + undersampling untuk menangani ketidakseimbangan data. Masing-masing metode tersebut akan diterapkan untuk klasifikasi tingkat keparahan COVID-19 berdasarkan urgensi perawatan rumah sakit dengan menggunakan metode random forest dimana selanjutnya dapat dilihat kombinasi metode yang menghasilkan performa terbaik. Penelitian ini juga bertujuan untuk melihat faktor-faktor manakah yang paling penting dalam memprediksi tingkat keparahan COVID-19 berdasarkan urgensi rumah sakit. Digunakan metode Leave-One-Out Cross-Validation untuk mengukur konsistensi model. Diperoleh hasil bahwa metode SMOTE-NC dengan undersampling memberikan performa terbaik dengan komorbid paru-paru, kadar c-reactive protein dan prokalsitonin merupakan variabel terpenting dalam model. Selain itu diperoleh kesimpulan bahwa pemilihan metode rebalancing yang tepat bergantung pada karakteristik data yang dimiliki.

In classification, it is not uncommon to encounter data in which the number of members per category is unbalanced. This is especially true in health care, where the categories being observed generally occur less often. If the imbalance is not handled first, it can produce biased and less accurate classification results. Several conventional rebalancing methods exist to handle it, such as random oversampling and random undersampling, but both are claimed to have weaknesses, so more complex methods have been developed. However, apart from the conventional methods, few methods can handle categorical data. One that can is the synthetic minority over-sampling technique for nominal and continuous features (SMOTE-NC), an extension of SMOTE developed for datasets with mixed variables. This thesis compares random oversampling and SMOTE-NC, as well as their combinations with undersampling (random oversampling + undersampling and SMOTE-NC + undersampling), for handling data imbalance. Each method is applied to the classification of COVID-19 severity based on the urgency of hospital care using the random forest method, in order to identify the combination of methods that performs best. The study also aims to determine which factors are most important in predicting COVID-19 severity based on hospital urgency. Leave-One-Out Cross-Validation is used to measure the consistency of the model. The results show that SMOTE-NC with undersampling gives the best performance, with lung comorbidities, C-reactive protein level, and procalcitonin level as the most important variables in the model. In addition, it is concluded that the choice of the right rebalancing method depends on the characteristics of the data at hand."
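Leave-One-Out Cross-Validation, mentioned in this abstract, holds out one sample per fold and trains on the rest. The sketch below shows the splitting logic only; `fit` and `predict` are hypothetical caller-supplied callables, and the majority-label classifier is a stand-in for the demo, not the random forest used in the thesis.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, held_out_index) pairs, one fold per sample."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out

def loocv_accuracy(X, y, fit, predict):
    """Average accuracy over all leave-one-out folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    """
    correct = 0
    for train, held_out in leave_one_out_splits(len(X)):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += predict(model, X[held_out]) == y[held_out]
    return correct / len(X)

# Stand-in classifier for the demo: always predicts the most frequent training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y = [0, 0, 0, 0, 1]
score = loocv_accuracy(X, y, fit_majority, predict_majority)  # 4 of 5 folds correct
```

When a rebalancing step is combined with cross-validation, it is usually applied to the training indices of each fold only, so that the held-out sample stays untouched.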

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2021

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Fiftitah Repfian Aszhari

"Klasifikasi Data Stroke Menggunakan Random Forest dengan Recursive Feature Elimination" = "Classification of Stroke Data Using Random Forest with Recursive Feature Elimination"

Stroke merupakan salah satu penyakit dengan risiko kematian dan kecacatan yang tinggi. Secara umum, stroke diklasifikasikan menjadi dua jenis, yaitu stroke iskemik dan stroke hemoragik. Klasifikasi jenis stroke secara cepat dan tepat diperlukan untuk menentukan jenis pengobatan dan tindakan yang tepat guna mencegah terjadinya dampak yang lebih fatal pada pasien stroke. Pada penelitian ini, klasifikasi stroke dilakukan menggunakan pendekatan machine learning. Adapun data penelitian yang digunakan adalah data stroke yang terdiri atas pemeriksaan laboratorium. Pada data penelitian tersebut, terdapat berbagai komponen pemeriksaan laboratorium yang dicatat serta memungkinkan adanya suatu pemeriksaan yang kurang relevan atau informatif dalam mengklasifikasi stroke. Apabila data tersebut tidak ditangani, akan mempengaruhi kinerja serta waktu komputasi model dalam mengklasifikasi stroke. Oleh karena itu, pada penelitian ini, Random Forest (RF) dengan seleksi fitur Recursive Feature Elimination (RFE) digunakan dalam mengklasifikasi data stroke. Dengan menerapkan metode tersebut, diperoleh kinerja model yang lebih baik saat melakukan klasifikasi menggunakan sejumlah fitur yang diperoleh dari hasil seleksi fitur, dibandingkan menggunakan keseluruhan fitur dalam data stroke. Selain itu, pada penerapan metode tersebut, diperoleh kinerja model yang baik dalam mengklasifikasi data kelas stroke iskemik, akan tetapi tidak cukup baik dalam mengklasifikasi data kelas stroke hemoragik. Hal ini dikarenakan proporsi jumlah data pada kelas stroke iskemik lebih banyak dibandingkan stroke hemoragik. Dalam hal ini dibutuhkan suatu metode penanganan agar kinerja model tetap optimal dalam mengklasifikasi data kelas stroke iskemik dan stroke hemoragik. Pada penelitian ini, Synthetic Minority Oversampling Technique (SMOTE) digunakan untuk menyeimbangkan kedua kelas data stroke guna memperoleh kinerja model yang optimal dalam mengklasifikasi kedua kelas data stroke. 
Berdasarkan penerapan metode RF dengan RFE serta SMOTE dalam mengklasifikasi data stroke, diperoleh kinerja model yang lebih baik dibandingkan melakukan klasifikasi pada data stroke yang tidak diseimbangkan dengan SMOTE.

Stroke is a disease with a high risk of death and disability. Stroke is generally classified into two types, namely ischemic stroke and hemorrhagic stroke. Quick and accurate classification of the stroke type is needed to determine the right treatment and prevent more severe effects on stroke patients. In this study, stroke classification is carried out using a machine learning approach. The data used consist of laboratory examinations of stroke patients. Because the data record many laboratory examination components, some components may be less relevant or less informative for classifying stroke. If these are not handled, they can affect the performance and computation time of the model. Therefore, in this study, Random Forest (RF) with Recursive Feature Elimination (RFE) feature selection is used to classify the stroke data. The results show that classification using the subset of features obtained from feature selection performs better than classification using all features in the stroke data. Moreover, the model performs well in classifying ischemic stroke data but not well enough in classifying hemorrhagic stroke data. This result may occur because the ischemic stroke class contains more data than the hemorrhagic stroke class. A handling method is therefore needed so that the model performs well on both classes. In this study, the Synthetic Minority Oversampling Technique (SMOTE) is applied to balance the two classes of stroke data so that optimal classification performance can be obtained on both.
Applying RF with RFE together with SMOTE to the stroke data yields better model performance than classifying stroke data that have not been balanced with SMOTE.
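The balancing step named in this abstract, SMOTE, creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation for numeric features; the function name, the value of `k`, and the toy rows are assumptions for illustration, not the thesis's data or implementation.

```python
import math
import random

def smote_sketch(minority_rows, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # The k nearest minority neighbours of the base row, excluding the row itself.
        others = [row for row in minority_rows if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority-class rows with two numeric features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_rows = smote_sketch(minority, n_new=5)
```

Each synthetic row lies on a line segment between two existing minority rows, so it stays inside the region the minority class already occupies.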

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2020

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Ryan Fathurrachman

Model Ensemble Random Forest dan Support Vector Machine untuk Mendeteksi Penyakit Pneumonia Menggunakan Data Sekuens Protein = Ensemble Model of Random Forest and Support Vector Machine for Detecting Pneumonia Using Protein Sequence Data

"ISPA atau infeksi saluran pernapasan akut adalah infeksi yang menyerang saluran pernapasan, baik saluran pernapasan atas maupun bawah. Salah satu penyakit yang termasuk dalam ISPA adalah pneumonia. Pneumonia merupakan infeksi paru-paru yang dapat memengaruhi kesehatan manusia secara serius. Pneumonia memengaruhi paru-paru bagian bawah dan menjadi penyebab area tersebut dipenuhi cairan lendir atau nanah. Pneumonia dikarenakan oleh berbagai agen patogen seperti virus, bakteri, dan jamur. Bakteri yang paling sering menyebabkan pneumonia adalah Streptococcus pneumoniae. Selain itu, Mycobacterium tuberculosis juga merupakan bakteri penyebab pneumonia di beberapa negara Asia. Berdasarkan hasil radiologi, pneumonia mirip dengan pneumonia tuberkulosis. Diagnosis dini sangat berperan penting dalam pengelolaan dan pengobatan efektif untuk penyakit ini. Dengan adanya kemajuan di bidang bioinformatika, sekuens protein menjadi salah satu pendekatan yang potensial untuk mendeteksi pneumonia secara cepat dan akurat. Oleh karena itu, penelitian ini adalah pendeteksian penyakit pneumonia dengan sekuens protein. Ekstraksi fitur untuk menjadi data numerik dibutuhkan pada penelitian ini dengan metode discere Penelitian ini menggunakan metode ensemble dari model Random Forest dan Support Vector Machine (SVM) dengan weighted majority algorithm (WMA) untuk mendeteksi penyakit pneumonia menggunakan sekuens protein Streptococcus pneumoniae dan Mycobacterium tuberculosis sebagai pembanding yang didapatkan melalui situs UniProt. Hasil penelitian ini menunjukkan bahwa metode ensemble model Random Forest dan model SVM dengan metode WMA memiliki kinerja terbaik dengan perbandingan data training dan data testing sebesar 80:20 didapat nilai akurasi sebesar 99,17%, nilai sensitivitas sebesar 99,65%, nilai spesifisitas sebesar 97,56%, dan nilai ROC-AUC sebesar 98,61%.

Acute respiratory infection (ARI) is an infection that attacks the respiratory tract, affecting both the upper and lower respiratory tracts. One of the diseases included in ARI is pneumonia. Pneumonia is a lung infection that can seriously affect human health. It affects the lower part of the lungs and causes the area to fill with mucus or pus. Pneumonia can be caused by various pathogens such as viruses, bacteria, and fungi. The bacterium that most commonly causes pneumonia is Streptococcus pneumoniae. In addition, Mycobacterium tuberculosis is a bacterial cause of pneumonia in several Asian countries. On radiological results, pneumonia resembles tuberculosis pneumonia. Early diagnosis is crucial for the management and effective treatment of this disease. With advances in bioinformatics, protein sequences have become a promising approach for detecting pneumonia rapidly and accurately. Therefore, this research focuses on detecting pneumonia using protein sequences. Feature extraction with the discere method is required to convert the data into numerical form. This research uses an ensemble of Random Forest and Support Vector Machine (SVM) models combined with the weighted majority algorithm (WMA) to detect pneumonia using protein sequences of Streptococcus pneumoniae, with Mycobacterium tuberculosis sequences for comparison. The protein sequences were obtained from the UniProt website. The results indicate that the ensemble of Random Forest and SVM with WMA achieved the best performance at a training-to-testing data ratio of 80:20, with 99.17% accuracy, 99.65% sensitivity, 97.56% specificity, and a ROC-AUC score of 98.61%."
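The weighted majority algorithm mentioned in this abstract combines several predictors by weighted vote and shrinks the weight of any predictor that turns out to be wrong. The sketch below shows the textbook form of the algorithm for binary labels together with sensitivity and specificity; the penalty factor `beta`, the toy predictions, and the way the two models are treated as "experts" are assumptions, since the abstract does not say how the thesis applies WMA.

```python
def weighted_majority(expert_predictions, y_true, beta=0.5):
    """Run the weighted majority algorithm over a sequence of binary labels.

    expert_predictions holds one list of 0/1 predictions per expert.
    Returns the combined predictions and the final expert weights.
    """
    weights = [1.0] * len(expert_predictions)
    combined = []
    for t, truth in enumerate(y_true):
        votes = [preds[t] for preds in expert_predictions]
        weight_for_1 = sum(w for w, v in zip(weights, votes) if v == 1)
        weight_for_0 = sum(w for w, v in zip(weights, votes) if v == 0)
        combined.append(1 if weight_for_1 >= weight_for_0 else 0)
        # Multiply the weight of every expert that was wrong on this sample by beta.
        weights = [w * beta if v != truth else w for w, v in zip(weights, votes)]
    return combined, weights

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical predictions from two models on six samples.
y_true = [1, 0, 1, 1, 0, 1]
rf_pred = [1, 0, 1, 1, 0, 1]
svm_pred = [0, 1, 1, 1, 1, 1]
combined, weights = weighted_majority([rf_pred, svm_pred], y_true)
sensitivity, specificity = sensitivity_specificity(y_true, combined)
```

In this toy run the second expert is wrong three times, so its weight drops from 1.0 to 0.125 and the combined vote follows the first expert.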

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2024

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Vabiyana Safira Desdhanty

Klasifikasi Data Kanker Hati Menggunakan Metode Improved Random Forest-based Rule Extraction = Liver Cancer Classification Using Improved Random Forest-Based Rule Extraction

"Kanker adalah salah satu penyebab kematian utama di dunia,dengan jumlah kematian sekitar sepuluh juta kematian setiap tahun. Kanker hati menempati peringkat keenam untuk jenis kanker yang umum terjadi pada pria dan wanita. Menurut penelitian, pendeteksian dini penting untuk mencegah penyebaran kanker ke organ lain. Hal ini menyebabkan penggunaan machine learning di bidang medis untuk mengklasifikasikan data kanker agar manghasilkan diagnosis yang tepat. Namun ada kalanya dibutuhkan lebih dari satu algoritma untuk meningkatkan akurasi. Maka dari itu, penelitian ini bertujuan untuk menganalisis pengaruh Genetic Algorithm sebagai penyetelan hyperparameter untuk nilai akurasinya, Penggunaan Random Forest dengan Genetic Algorithm sebagai penyetel hyperparameter memberikan akurasi sebesar 85% dengan data testing 90%. Sementara untuk Random Forest saja, hasil akurasi tertinggi adalah 73% dengan data testing sebesar 40%.

Cancer is one of the leading causes of mortality worldwide, with approximately ten million deaths each year. Liver cancer is the sixth most common type of cancer in both men and women. According to scientific studies, early detection is important to prevent the spread of the disease to other organs. This has led to the use of machine learning in the medical field to classify cancer data and produce an accurate diagnosis. However, there are times when a single machine learning algorithm does not give a good accuracy score. Therefore, this study aims to analyze the effect of using a Genetic Algorithm for hyperparameter tuning on accuracy. Random Forest with a Genetic Algorithm as the hyperparameter tuning algorithm gives an accuracy of 85% with 90% testing data. Meanwhile, Random Forest alone gives a highest accuracy of 73% with 40% testing data."
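To make the idea of genetic-algorithm hyperparameter tuning concrete, here is a small elitist genetic search over integer hyperparameters. Everything in it is an assumption for illustration: the abstract does not state which hyperparameters were tuned, how fitness was measured, or which selection, crossover, and mutation operators were used, and `toy_fitness` merely stands in for a validation accuracy score.

```python
import random

def genetic_search(fitness, bounds, pop_size=12, generations=15, seed=0):
    """Search integer hyperparameters with a tiny elitist genetic algorithm."""
    rng = random.Random(seed)
    names = list(bounds)

    def random_individual():
        return {n: rng.randint(*bounds[n]) for n in names}

    def crossover(a, b):
        # Each gene is copied from one of the two parents at random.
        return {n: rng.choice((a[n], b[n])) for n in names}

    def mutate(individual, rate=0.2):
        # With probability `rate`, replace a gene by a fresh random value.
        return {n: (rng.randint(*bounds[n]) if rng.random() < rate else v)
                for n, v in individual.items()}

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # selection: keep the better half
        children = [mutate(crossover(*rng.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children  # elitism: parents survive unchanged
    return max(population, key=fitness)

# Hypothetical fitness standing in for validation accuracy of a random forest.
def toy_fitness(params):
    return -abs(params["n_estimators"] - 300) - 10 * abs(params["max_depth"] - 8)

bounds = {"n_estimators": (50, 500), "max_depth": (2, 20)}
best = genetic_search(toy_fitness, bounds)
```

In practice the fitness function would train a model with the candidate hyperparameters and return its score on validation data, which makes each generation far more expensive than in this toy.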

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2021

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Yoel Zabarro

Analisis Kinerja Metode Random Forest untuk Klasifikasi Multikelas Credit Scoring = Performance Analysis of the Random Forest Method for Credit Scoring Multiclass Classification

"Credit scoring adalah suatu proses dalam mengevaluasi kelayakan kredit dari suatu individu. Credit Scoring perlu dilakukan perusahaan keuangan untuk meminimalisir risiko kredit, karena credit scoring dapat menentukan kelayakan debitur. Salah satu perusahaan keuangan yang menyediakan jasa pinjaman berbasis P2P (Peer-to-Peer) yang menerapkan credit scoring dalam evaluasi debitur adalah LendingClub. Pada skripsi ini dilakukan klasifikasi multikelas credit scoring berdasarkan status pinjaman (loan status) yang terdiri dari 3 kelas, yaitu default, fully paid, dan late. Klasifikasi multikelas credit scoring dapat dilakukan dengan salah satu pendekatan machine learning, yaitu supervised learning. Metode supervised learning yang digunakan yaitu random forest. Random forest adalah suatu metode pencarian informasi berbasis tree dengan setiap tree memuat kumpulan variabel acak. Implementasi model random forest dilakukan dengan menggunakan tiga skenario strategy sampling SMOTE yang berbeda. Implementasi model pada tiap skenario dilakuan sebanyak 5 kali percobaan dan dievaluasi menggunakan precision, recall, f1-score, accuracy, dan AUC one vs all. Rata-rata accuracy terbaik adalah sebesar 0,78; dan rata-rata AUC one vs all terbaik adalah sebesar 0,679179. Sedangkan untuk hasil evaluasi berdasarkan tiap kelas, pada kelas default, precision terbaik adalah sebesar 0,39; recall terbaik adalah sebesar 0,27; dan f1-score terbaik adalah sebesar 0,28. Pada kelas fully paid, precision terbaik adalah sebesar 0,82; recall terbaik adalah sebesar 0,95; dan f1-score terbaik adalah sebesar 0,88. Pada kelas late, precision terbaik adalah sebesar 0,02; recall terbaik adalah sebesar 0,02; dan f1-score terbaik adalah sebesar 0,02. Secara keseluruhan, hasil evaluasi model pada ketiga skenario hanya baik dalam memprediksi kelas 1 (fully paid), tetapi kurang baik dalam memprediksi kelas 0 (default) dan kelas 2 (late). 
Hal tersebut diduga terjadi akibat dataset yang terdapat imbalance data dan class overlap.

Credit scoring is the process of evaluating the creditworthiness of an individual. Financial companies need credit scoring to minimize credit risk, because it determines the eligibility of debtors. One financial company that provides P2P (Peer-to-Peer) loan services and applies credit scoring in debtor evaluation is LendingClub. In this thesis, multiclass classification of credit scoring is carried out based on loan status, which consists of three classes: default, fully paid, and late. Multiclass classification of credit scoring can be done with one of the machine learning approaches, namely supervised learning. The supervised learning method used is random forest, a tree-based method in which each tree is built on a random set of variables. The random forest model was implemented under three different SMOTE sampling strategy scenarios. Each scenario was run 5 times and evaluated using precision, recall, F1-score, accuracy, and one-vs-all AUC. The best average accuracy is 0.78, and the best average one-vs-all AUC is 0.679179. As for the per-class results, in the default class the best precision is 0.39, the best recall is 0.27, and the best F1-score is 0.28. In the fully paid class, the best precision is 0.82, the best recall is 0.95, and the best F1-score is 0.88. In the late class, the best precision is 0.02, the best recall is 0.02, and the best F1-score is 0.02. Overall, the models in all three scenarios are good only at predicting class 1 (fully paid) and poor at predicting class 0 (default) and class 2 (late). This is thought to be due to data imbalance and class overlap in the dataset."
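The per-class figures in this abstract are one-vs-rest precision, recall, and F1-score. The sketch below shows how such a report can be computed from true and predicted labels; the function name and the toy labels are hypothetical and only mimic the pattern the abstract describes, where the dominant class is predicted well and the rare classes poorly.

```python
def per_class_report(y_true, y_pred, classes):
    """Return precision, recall and F1-score for each class (one-vs-rest)."""
    report = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Toy labels using the abstract's encoding: 0 = default, 1 = fully paid, 2 = late.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 2, 2]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 1, 1]
report = per_class_report(y_true, y_pred, classes=[0, 1, 2])
```

Overall accuracy in this toy example is 0.6, yet the late class has zero recall, which is why per-class metrics matter when classes are imbalanced.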

Depok: Fakultas Matematika Dan Ilmu Pengetahuan Alam Universitas Indonesia, 2024

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Nathanael Matthew

Metode Robust untuk Mendeteksi Pothole Menggunakan Model Klasifikasi Random Forest = Robust Random Forest to Detect Potholes

"Smartphone telah dikembangkan sebagai alat deteksi pothole oleh berbagai penelitian karena potensinya dalam memberikan manfaat pengumpulan data secara crowdsourcing tanpa memerlukan suatu infrastruktur khusus dan mahal. Namun, metode deteksi pothole berbasis smartphone memiliki tantangan dalam menghadapi berbagai ketidakpastian intrinsik dalam mengukur sinyal yang dihasilkan oleh perangkat smartphone berbeda. Ketangguhan metode dalam menghadapi ketidakpastian intrinsik tersebut diperlukan agar potensi pengumpulan data secara crowdsourcing dapat tercapai. Meskipun telah banyak penelitian yang menghasilkan kinerja deteksi yang memuaskan, berbagai macam faktor ketidakpastian masih mencegah ketangguhan penuh dari metode deteksi pothole tersebut. Penelitian menanggapi faktor-faktor ketidakpastian potensial sebagai faktor prediktor dalam mengembangkan model deteksi berbasis algoritma Random Forest dengan memanfaatan sudut Euler untuk menyelaraskan percepatan akselerometer terhadap percepatan vektor gravitasi; menerapan profil matriks untuk mengurangi kesalahan pelabelan pothole dan memberikan apriori untuk klasifikasi secara efisien; dan diskritisasi temporal pada data sensor dengan penghalusan data tersegmentasi berdasarkan jarak roda platform deteksi (Zona Deteksi). Ketangguhan metode dibuktikan dengan eksperimen faktorial bertingkat dengan variasi spesifikasi perangkat sensor, variasi rute dan tingkatan pothole, serta variasi ketersediaan sensor. Eksperimen membuktikan bahwa faktor-faktor ketidakpastian memiliki efek signifikan secara statistik, namun tidak mempengaruhi kinerja model-model yang dihasilkan. Selain tangguh, kinerja model klasifikasi yang dihasilkan menunjukkan hasil serupa atau bahkan lebih baik dari metode lain yang ada saat ini.

Smartphones have been developed as a pothole detection tool in various studies because they enable crowdsourced data collection without special and expensive infrastructure. However, a reliable smartphone-based pothole detection method is difficult to develop because of various intrinsic uncertainties in measuring the signals generated by different smartphone devices. A method must be robust to these uncertainties for the potential of crowdsourced data collection to be realized. Although many studies have reported satisfactory detection performance, various uncertainty factors still prevent existing pothole detection methods from being fully robust. This study addresses the potential uncertainty factors as predictors in developing a pothole detection model based on the Random Forest algorithm. It does so by using Euler angles to align accelerometer readings with the gravitational acceleration vector; applying the matrix profile to reduce pothole labeling errors and provide a prior for efficient classification; and applying temporal discretization to the sensor data, with segment smoothing based on the wheelbase of the detection platform (Detection Zone). The robustness of the proposed method is demonstrated with a multilevel factorial experiment that varies sensor device specifications, routes and pothole levels, and sensor availability. The experiment shows that the uncertainty factors have statistically significant effects but do not affect the performance of the resulting models. Besides being robust, the resulting classification models perform comparably to or better than other currently available smartphone-based pothole detection methods."

Depok: Fakultas Teknik Universitas Indonesia, 2022

T-pdf

UI - Tesis Membership Universitas Indonesia Library

Nadia Hartini Kusumawijaya

Komparasi Kinerja Metode Random Forest Regression dengan Metode Support Vector Regression untuk Memprediksi Usia Biologis pada Data Pemeriksaan Medis = Comparison of the Performance of the Random Forest Regression Method with the Support Vector Regression Method for Predicting Biological Age on Medical Examination Data

"Penuaan adalah salah satu faktor utama resiko terjadinya penyakit dan kematian. Laju penuaan individu dengan usia kronologis yang sama terbukti bervariasi. Maka dari itu, muncul kebutuhan untuk alat pengukuran penuaan yang lebih akurat, robust, dan dapat diandalkan dibandingkan usia kronologis, yakni usia biologis. Pada penelitian ini, penulis membangun model menggunakan Metode Random Forest Regression (RF) dan Metode Support Vector Regression (SVR) untuk memprediksi umur biologis pada data pemeriksaan medis, menilai dan mengevaluasi hasil kinerjanya, serta melakukan komparasi kinerja kedua metode. Terkait metode yang digunakan, Metode RF adalah metode yang mengaplikasikan Teknik Ensemble Learning dengan cara menggabungkan beberapa decision tree untuk menghasilkan prediksi. Sedangkan, Metode SVR adalah metode yang berkerja dengan cara membangun hyperplane atau kumpulan hyperplane dalam ruang berdimensi tinggi yang dapat digunakan untuk regresi linier atau nonlinier. Dataset yang digunakan adalah data medis yang berasal dari Kementrian Kesehatan Republik Indonesia. Pada dataset dilakukan data preprocessing, yakni data diproses pada aspek missing values handling, encoding, dan outliers detection and outliers handling. Kemudian, dilakukan feature selection menggunakan Spearman’s Rank Correlation Coefficient. Setelah itu, dilakukan pembangunan model dengan Metode RF dan model dengan Metode SVR secara terpisah untuk masing - masing jenis kelamin. Terakhir, performa model dievaluasi dan dibandingkan kinerjanya menggunakan metrik evaluasi Root Mean Square Error (RMSE), Coefficient of Determination (R2), Adjusted R2, dan running time. Metode RF menggunakan hyperparameter terbaik {'max_depth': 15, 'n_estimators': 1150} untuk dataset pria, dan {'max_depth': 15, 'n_estimators': 1250} untuk dataset wanita. Sedangkan, Metode SVR menggunakan hyperparameter terbaik {'C': 2, 'epsilon': 0.2, 'gamma': 'scale', 'kernel': 'rbf', 'tol': 0.005} untuk dataset pria, dan {'C': 3, 'epsilon': 0.2, 'gamma': 'scale', 'kernel': 'rbf', 'tol': 0.005} untuk dataset wanita. Metode RF memiliki kinerja yang cukup baik, dengan nilai RMSE = 7,532; R2 = 0,403; Adjusted R2 = 0,351; running time = 0,154 untuk pria dan RMSE = 6,889; R2 = 0,340; Adjusted R2 = 0,264; running time = 0,179 untuk wanita. Selain itu, SVR juga memiliki performa yang cenderung sama namun sedikit lebih buruk, dengan nilai RMSE = 7,692; R2 = 0,376; Adjusted R2 = 0,321; running time = 0,035 untuk pria dan RMSE = 6,905; R2 = 0,337; Adjusted R2 = 0,306; running time = 0,080 untuk wanita. Berdasarkan analisis kinerja model yang dilakukan pada penelitian ini model yang dibangun dengan Metode Random Forest Regression lebih unggul dalam memprediksi usia biologis dibandingkan dengan Metode Support Vector Regression.

Aging is one of the main risk factors for disease and death. The aging rate of individuals of the same chronological age has been shown to vary. A need therefore arises for an aging measure that is more accurate, robust, and reliable than chronological age, namely biological age. In this research, the author builds models using the Random Forest Regression (RF) method and the Support Vector Regression (SVR) method to predict biological age from patient clinical data, evaluates their performance, and compares the two methods. The RF method applies ensemble learning by combining several decision trees to produce predictions. The SVR method works by constructing a hyperplane, or a set of hyperplanes, in a high-dimensional space, which can be used for linear or nonlinear regression. The dataset is medical data from the Ministry of Health of the Republic of Indonesia. Preprocessing covers missing-value handling, encoding, and outlier detection and handling. Feature selection is then carried out using Spearman's rank correlation coefficient. After that, an RF model and an SVR model are built separately for each gender. Finally, the models are evaluated and compared using Root Mean Square Error (RMSE), the coefficient of determination (R2), Adjusted R2, and running time. The RF method used the best hyperparameters {'max_depth': 15, 'n_estimators': 1150} for the male dataset and {'max_depth': 15, 'n_estimators': 1250} for the female dataset. The SVR method used the best hyperparameters {'C': 2, 'epsilon': 0.2, 'gamma': 'scale', 'kernel': 'rbf', 'tol': 0.005} for the male dataset and {'C': 3, 'epsilon': 0.2, 'gamma': 'scale', 'kernel': 'rbf', 'tol': 0.005} for the female dataset. The RF models perform reasonably well, with RMSE = 7.532; R2 = 0.403; Adjusted R2 = 0.351; running time = 0.154 for men and RMSE = 6.889; R2 = 0.340; Adjusted R2 = 0.264; running time = 0.179 for women. The SVR models perform similarly but slightly worse, with RMSE = 7.692; R2 = 0.376; Adjusted R2 = 0.321; running time = 0.035 for men and RMSE = 6.905; R2 = 0.337; Adjusted R2 = 0.306; running time = 0.080 for women. Based on the model performance analysis carried out in this research, the model built with the Random Forest Regression method is superior to the Support Vector Regression method in predicting biological age."
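The evaluation metrics named in the abstract above can be illustrated with a minimal sketch. This is not the thesis code: the function names and the toy ages below are hypothetical, and the formulas are the standard textbook definitions of RMSE, R2, and Adjusted R2, assumed here to match what the author used.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: R2 penalized for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical toy data (ages in years), not taken from the thesis.
ages_true = [30.0, 40.0, 50.0, 60.0]
ages_pred = [32.0, 38.0, 53.0, 59.0]

r2 = r2_score(ages_true, ages_pred)
print(rmse(ages_true, ages_pred))
print(r2)
print(adjusted_r2(r2, n_samples=len(ages_true), n_features=1))
```

Adjusted R2 is never larger than R2 whenever at least one predictor is used, which is consistent with the Adjusted R2 values in the abstract being lower than the corresponding R2 values.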

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2024

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Aziz Setia Aji

Predictive Maintenance pada Sistem Integrasi Data Magnet Berbasis Machine Learning Menggunakan Metode Random Forest Regression = Predictive Maintenance on Magnet Data Integration System Based Machine Learning Using Random Forest Method.

"ABSTRAK

Badan Meteorologi Klimatologi dan Geofisika (BMKG) memiliki tugas pengamatan terhadap magnet bumi yang tersebar di Indonesia. Sensor magnetik bumi BMKG menghasilkan output data real-time. Penelitian ini berfokus pada model predictive maintenance pada sensor magnetik bumi berdasarkan output data sensor. Output data yang dihasilkan adalah dalam bentuk format delimited-space sehingga mudah untuk diproses. Komponen magnetik yang digunakan dalam penelitian ini adalah data komponen total magnet bumi (F) dari sensor. Pemrosesan data menggunakan bahasa pemrograman Python dan algoritma yang digunakan adalah metode random forest regression dengan membandingkan perbedaan nilai yang dihasilkan dengan data Indonesian Geomagnetic Maps for Epoch 2015.0 untuk kemudian dibuatkan model prediksi terhadap waktu. Proses tersebut digunakan untuk mengetahui apakah data yang dihasilkan masih dalam toleransi atau tidak. Tahapan dalam penelitian ini mulai dari pengumpulan data, pre-processing data, pembuatan model, hingga pengujian model dan validasi terhadap model. Penelitian ini menghasilkan estimasi waktu pemeliharaan sebesar 14 hari pada data baseline nilai F dan sebesar 3 hari pada data delta F (ΔF).

ABSTRACT

The Meteorological, Climatological, and Geophysical Agency (BMKG) is tasked with observing the Earth's magnetic field across Indonesia. BMKG's geomagnetic sensors deliver real-time data output. This study focuses on a predictive maintenance model for the geomagnetic sensors based on their output data. The output is in space-delimited format, so it is easy to process. The magnetic component used in this study is the total geomagnetic field component (F) from the sensor. The data are processed in Python using random forest regression: the difference between the sensor values and the Indonesian Geomagnetic Maps for Epoch 2015.0 data is computed and then used to build a predictive model against time. This process determines whether the resulting data are still within tolerance. The stages of the study run from data collection and data preprocessing through model building to model testing and validation. The study yields a maintenance time estimate of 14 days for the baseline F data and 3 days for the delta F (ΔF) data."

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2020

T-Pdf

UI - Tesis Membership Universitas Indonesia Library
