Implementation of Naïve Bayes and K-NN Algorithms in Diagnosing Stunting in Children

Indonesia faces a high risk of stunting: according to 2022 data from the Indonesian Nutrition Status Analysis, the stunting rate reached 24.22% across 514 districts/cities. Early detection can help prevent stunting in children. This research compares the performance of two algorithms, Naive Bayes and K-NN, in predicting stunting cases in children. Understanding how classification algorithms predict stunting cases with better accuracy and responsiveness requires comparison experiments of several algorithms on a specific dataset, so that an optimal classification model can be developed. Performance was measured by accuracy, precision, recall, and f1-score on the 30% testing data. The Naive Bayes method obtained an accuracy of 71%, precision of 71%, recall of 76%, and f1-score of 73%. The K-NN method using Euclidean distance obtained the best performance at k = 3, with an accuracy of 97%, precision of 98%, recall of 96%, and f1-score of 97%. The comparison shows that K-NN is the better classification method on the stunting dataset because it outperforms Naive Bayes on all four metrics.


Introduction
Stunting refers to impaired growth in children aged two years or less that arises during the first thousand days, counted from conception, and can affect how long children live. Stunting can lead to being underweight, risk of obesity, poor reproductive health, and reduced productive capacity. Income, education, maternal knowledge, and family size are among the factors that can lead to stunting [1]. Indonesia faces a high risk of stunting, as revealed in the Indonesian Nutrition Status Analysis. According to 2022 data, the stunting rate reached 24.22% in 514 districts/cities across Indonesia. Although the rate is expected to decrease significantly, stunting is still considered a serious problem in Indonesia because the prevalence rate has consistently been above 20% [2].
The prevalence of stunting is the percentage of children under five in a group whose physical growth is stunted. This indicator is used to evaluate the nutritional problems of children under the age of five in a region or country. Higher values indicate that the problem is more serious and requires greater effort to solve. The World Health Organization set a target of a stunting prevalence rate below 20% as part of the fulfillment of the Sustainable Development Goals (SDGs). In Indonesia, where there are many cases of inadequate nutrition in children under five, parents need to understand their child's nutritional condition carefully. Assessing a child's physical condition alone is not sufficient for assessing their overall health. Therefore, the role of parents, especially mothers, is crucial in providing adequate food intake for their children, as healthy consumption patterns have a significant influence on children's growth and development [3].
Early detection can help prevent stunting in children. Prevention can also involve training for parents of toddlers, posyandu health teams, and related medical personnel. Such training is expected to give parents a better understanding of how to prevent stunting and to enrich the insight of health cadres in identifying children under five who may experience stunting [4]. In the current era, where technology plays a central role in various activities, building a system that can identify stunting is a relevant solution to this problem. A diagnosis system is a computer system that functions like a decision-making expert in a particular field [5].
To solve the problem of evaluating stunting in children, more creative evaluation methods, such as machine learning (ML), are needed to predict the likelihood of a child being diagnosed with stunting and to enable preventive measures to be taken proactively and appropriately. Many ML algorithms are available for classification, and developing an optimal classification model requires a better understanding of their specifics. This research shows how classification algorithms predict stunting cases with better accuracy and responsiveness, reports comparison experiments of various algorithms on datasets that further the understanding of how these algorithms work in stunting prediction, and emphasizes the importance of using ML in the diagnosis of stunting in children [6].
To improve these activities, a system can safely store data, simplify data management, and identify stunting symptoms in young children by utilizing the Bayes theorem approach. The Bayes theorem method draws a conclusion and makes a decision based on a probability value [7]. Another study discusses the application of the Naive Bayes Classifier (NBC) to the nutritional status classification of stunting toddlers using K-Fold Cross Validation testing, and finds that the nutritional status of stunting toddlers can be recognized and classified properly at each iteration. The classification performance of the NBC algorithm validated with 10-Fold Cross Validation shows that the 8th iteration has the highest accuracy of 95.14%, and the 3rd iteration has the lowest accuracy of 81.73%. Overall, the average accuracy across iterations is 88.53%. The results of the classification can be used as a model to make predictions on data with the same characteristics or variables. If there is data with the same variables, GUI-R can classify the data more quickly and effectively [8].
A website-based application of the K-NN method as a classification system for growth and development disorders in infants can assist posyandu officers in monitoring the development of toddlers. In classifying toddlers' stunting status, the system achieved an accuracy of 83% with an error rate of 0.167 [9]. Another study produces a classification system for stunting status in toddlers that can help determine their stunting status. The performance of the Naive Bayes classification model it builds is evaluated using the confusion matrix. The results of the confusion matrix test indicate that the accuracy, precision, and recall scores are 58%, 68%, and 58%, respectively. The trials used 30% training data and 70% testing data [10].
The implementation of the Naïve Bayes algorithm at the Kalitengah Village Posyandu for classifying the nutritional status of toddlers indicates that the Naive Bayes method can successfully identify the nutritional status of toddlers, with an accuracy of 87.33% on fourteen test data [11]. In research on the use of the K-NN algorithm for determining stunting conditions in children, testing with 114 data indicates that the K-NN algorithm is able to categorize children's stunting status by considering anthropometric attributes such as age (U), body weight (BW), height (TB), and head circumference (LK). Tests were conducted for k values of 3, 5, 7, and 9 to find the highest accuracy. The results showed that the highest accuracy and lowest error rate were found at k = 3, reaching 83% with an error rate of 0.142 [12].
The evaluation results in [6] indicated that Random Forest produced the highest accuracy of 87.75%, followed by K-NN at 84.8% and Naive Bayes at 83.2%. Although Random Forest produced good accuracy, K-NN stood out for capturing most of the actual stunting cases. However, Random Forest had a good balance between accuracy and recall. Because it combined good accuracy with a better ability to find stunting cases than the other models in that evaluation, it can be a good choice for diagnosing stunting [6].
The Naive Bayes Classifier calculates a set of probabilities by counting the frequencies and value combinations in a given dataset. The algorithm is based on Bayes' theorem and assumes that the attributes are independent of one another given the value of the class variable. K-Nearest Neighbor (K-NN) is an instance-based learning method and belongs to the lazy learning techniques. The K-NN process involves finding the K objects in the training data that have the closest distance or highest similarity to an object in the new data or testing data [13].
By comparing the performance of two machine learning algorithms, Naive Bayes and K-Nearest Neighbor, in predicting child stunting cases, this study aims to determine the best classification model. The results are expected to provide a better understanding of how classification algorithms work, which will enable the development of better stunting diagnosis models.

Methodology
The data used in this study is the Stunting Dataset, sourced from the Kaggle site. After the data is obtained, the preprocessing stage is carried out, and the data is then classified with the Naive Bayes and K-NN models. Classification is followed by model evaluation using the Confusion Matrix testing method. The two algorithms are then compared on accuracy, precision, recall, and f1-score to find out which method performs the best classification. The following figure illustrates the process design flow.

Data Retrieval
This research uses the Stunting Dataset, sourced from the Kaggle site. The dataset contains 6500 records with 8 attributes: Sex, Age, Birth Weight, Birth Length, Body Weight, Body Length, Exclusive Breastfeeding, and Stunting. Of these attributes, 7 are independent variables and 1, Stunting, is the dependent variable. This data is classified using the Naive Bayes and K-NN algorithms.

Preprocessing
Data is processed at this stage so that it can be used in the classification procedure. The phases are data cleaning, data transformation, and data normalisation. The dataset is split into two parts: 70% is used for training and 30% is used for testing. The total of 6500 records is therefore divided into 4550 training data and 1950 testing data.
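The 70/30 split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the fixed random seed, and the choice of min-max scaling for the normalisation step are all assumptions, since the text does not specify them.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists.

    The seed and the shuffling strategy are assumptions made for
    illustration; the paper does not state how the split was drawn.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def min_max_normalize(values):
    """Rescale numeric values to [0, 1] (one possible normalisation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Stand-in for the 6500 dataset records (only the count matters here).
records = list(range(6500))
train, test = train_test_split(records)
print(len(train), len(test))
```

With 6500 records and a test fraction of 0.3, the sketch reproduces the 4550 and 1950 counts stated in the text.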

Naive Bayes
Naïve Bayes estimates the probability of each class under the assumption that the attributes are independent of one another given the class. In this method, all attributes are considered independent and contribute to decision-making with equal weight. As a decision-making tool, Naive Bayes is also used to update the level of confidence in information [14]. The method is based on probability calculations. Its basic concept is Bayes' theorem, which the Naive Bayes classifier applies to calculate the class probability for each group from the attribute values and to select the most probable class [4]. The general form of the Naive Bayes equation is shown in Equation 1.
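For reference, Bayes' theorem in its standard form, together with the naive independence assumption, can be written as below. The symbols (H for the class hypothesis, X for the observed attributes x_1, ..., x_n) are notation chosen here and may differ from the authors' Equation 1.

```latex
% Bayes' theorem: posterior probability of class H given attributes X
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}

% Naive independence assumption: the likelihood factorises over attributes
P(H \mid x_1, \dots, x_n) \propto P(H) \prod_{i=1}^{n} P(x_i \mid H)
```

The predicted class is the H that maximises the right-hand side. P(X) is the same for every class, so it can be dropped when only the ranking of classes is needed.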

K-Nearest Neighbor
K-NN has several advantages, including being able to handle large datasets and being easy to use. The K-NN technique proceeds as follows: input the training data, the training data labels, k, and the testing data; calculate the distance from each testing data point to every training data point; select the k training data points closest to the testing data point; check the labels of those k points; determine the label with the highest frequency; and assign the testing data point to that class [15]. Equation 2 gives the Euclidean distance between two points, specifically a testing data point x and a training data point y, each with n attributes:
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (2)
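The K-NN steps listed above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the two-feature toy points, and the "yes"/"no" labels are hypothetical and not taken from the stunting dataset.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Distance from the query to every training point, paired with its label.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-normalised toy data (not from the paper).
train_X = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (5.0, 5.0), (6.0, 5.5), (5.5, 6.0)]
train_y = ["yes", "yes", "yes", "no", "no", "no"]

print(knn_predict(train_X, train_y, (1.2, 1.1), k=3))
print(knn_predict(train_X, train_y, (5.4, 5.4), k=3))
```

Because K-NN stores the training data and defers all computation to prediction time, each query costs one distance calculation per training point.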

Evaluation
A Confusion Matrix is useful for understanding how a classifier distinguishes between classes. It is used in the evaluation stage to measure the performance of the Naive Bayes and K-Nearest Neighbor algorithms. The Confusion Matrix table sets the predicted values against the actual values and shows the number of correctly and incorrectly classified data. True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) are the four values in the confusion matrix table [10]. The following figure shows an illustration of the confusion matrix table.
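As a small illustration of how the four confusion matrix values are obtained, the sketch below counts them from paired actual and predicted labels. The function name, the label strings, and the six example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, FN and TN for a binary classification task."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels for six children (not data from the paper).
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))
```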

Figure 2. Confusion Matrix Table Illustration
Accuracy, standard deviation, precision, F1-score, recall, and specificity are among the evaluation criteria that can be considered. This study evaluates accuracy, precision, recall, and F1-score [8]. The general forms of these measures are given in the following equations.
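In terms of the confusion matrix values TP, FP, FN, and TN, the four measures have the standard definitions below. These are the textbook forms, and the layout and numbering may differ from the authors' original equations.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-Score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```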

Results and Discussion
This research uses a stunting dataset obtained from Kaggle. At this stage, the data is divided into 70% training data and 30% testing data. The manual calculations for the Naive Bayes and K-NN algorithms use 11 data samples: 10 training data and 1 testing data. Tables 1 and 2 show the 11 data samples.

Naive Bayes Implementation
The manual calculation uses 10 training data, shown in Table 1, and 1 testing data, shown in Table 2. The Naive Bayes method is then implemented using Equation 1. Before the calculation, the data is transformed into a categorical format that facilitates the stunting prediction process. The categories used and the transformed data can be seen in Tables 3 and 4.
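A manual Naive Bayes calculation on categorical data of this kind can be sketched as follows. The attribute values, the eight training rows, and the function names are hypothetical stand-ins, since the contents of Tables 1 to 4 are not reproduced here. The sketch uses plain relative frequencies without smoothing, which is an assumption about how the manual calculation was done.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # (attribute index, class label) -> Counter of attribute values
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict_naive_bayes(model, row):
    """Return (best label, unnormalised score per label) for one row."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # likelihood P(attribute value | class) from relative frequency
            score *= value_counts[(i, label)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical categorical rows: (sex, exclusive breastfeeding, height category).
rows = [
    ("male", "no", "short"), ("female", "no", "short"),
    ("male", "yes", "short"), ("female", "no", "normal"),
    ("male", "yes", "normal"), ("female", "yes", "normal"),
    ("male", "yes", "normal"), ("female", "yes", "short"),
]
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

model = fit_naive_bayes(rows, labels)
label, scores = predict_naive_bayes(model, ("male", "no", "short"))
print(label, scores)
```

In this toy data, the "no" class never occurs with breastfeeding value "no", so its score collapses to zero. Laplace smoothing is a common remedy for such zero frequencies, but it is not used in this sketch.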

Performance Testing Evaluation
The tests of the Naive Bayes and K-Nearest Neighbor algorithms are reported as confusion matrices, and performance testing consists of calculating the accuracy, precision, recall, and f1-score values using Equations 3, 4, 5, and 6. The test results on the 1950 testing data, produced with a Python program on Google Colab, are presented as confusion matrix tables: Table 7 for the Naive Bayes method and Table 8 for the K-NN method. The Naive Bayes method obtained an accuracy of 71%, precision of 71%, recall of 76%, and f1-score of 73%. The K-NN method using Euclidean distance with k = 3 obtained the best performance, with an accuracy of 97%, precision of 98%, recall of 96%, and f1-score of 97%. Comparing the accuracy, precision, recall, and F1-Score of the two algorithms leads to the conclusion that K-NN is the best algorithm for this classification. The K-NN classification results are implemented in an interface for predicting stunting so that it can be used by the wider community.
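The reported scores can be cross-checked for internal consistency, because the f1-score is determined by precision and recall. The short sketch below recomputes it from the precision and recall values stated above. The function name is an assumption, and the inputs are the rounded percentages from the text, so agreement is only expected to two decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) as stated in the text.
reported = {
    "Naive Bayes": (0.71, 0.76, 0.73),
    "K-NN (k = 3)": (0.98, 0.96, 0.97),
}

for name, (precision, recall, reported_f1) in reported.items():
    recomputed = round(f1_score(precision, recall), 2)
    print(f"{name}: recomputed f1 = {recomputed}, reported f1 = {reported_f1}")
```

For both methods the recomputed value matches the reported f1-score after rounding.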

Conclusion
This research compares the Naive Bayes and K-NN algorithms on a stunting dataset, evaluating both on accuracy, precision, recall, and f1-score. On the 30% testing data, the Naive Bayes method obtained an accuracy of 71%, precision of 71%, recall of 76%, and f1-score of 73%. On the same testing data, the K-NN method using Euclidean distance obtained the best performance at k = 3, with an accuracy of 97%, precision of 98%, recall of 96%, and f1-score of 97%. The comparison shows that K-NN is the better classification method on the stunting dataset because it outperforms Naive Bayes on all four metrics. Further research is needed to expand the scope of algorithms and techniques used in stunting diagnosis. Such research could add variables to the dataset to obtain better accuracy and could also develop the website display to be more attractive and complete.

Table 1. Sample Dataset: Training Data

Table 3. Category Data from Data Transformation

Table 4. After Data Transformation

Table 7. Confusion Matrix of Naive Bayes Method Testing Results

Table 8. Confusion Matrix of K-NN Method Testing Results