
Mango Leaf Disease Detective: Machine Learning & Deep Learning Approaches



 

OVERVIEW


In this project, I use a Kaggle dataset to build a classifier for mango leaf diseases and compare the performance of various machine learning and deep learning algorithms. I load the dataset, prepare the images, and split the data into training and validation sets.


I employ the following algorithms:


Naive Bayes

Support Vector Machine (SVM)

K-Nearest Neighbors (KNN)

Random Forest

Convolutional Neural Network (CNN)


For each algorithm, I train a model on the training set and evaluate it on the validation set. Each algorithm's accuracy score is recorded in a dictionary, and the results are displayed as a bar plot.
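As a rough illustration, the sketch below shows how this setup could look in code; the placeholder arrays stand in for the real Kaggle images (with reduced dimensionality for the sketch), and the names X_train_std, y_train_classes, and results match those used throughout this write-up.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Placeholder data standing in for the flattened leaf images and
    # their integer class labels (0-7); the real data comes from Kaggle.
    X = np.random.rand(200, 100)
    y_classes = np.random.randint(0, 8, 200)

    # Split into training and validation sets.
    X_train, X_val, y_train_classes, y_val_classes = train_test_split(
        X, y_classes, test_size=0.2, random_state=42, stratify=y_classes)

    # Standardize the features for the classical ML models.
    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)
    X_val_std = scaler.transform(X_val)

    # Each algorithm's validation accuracy is collected here for the bar plot.
    results = {}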

 



DESCRIPTION

Type of data: 240x320 mango leaf images.
Data format: JPG.
Number of images: 4,000. Of these, around 1,800 are of distinct leaves; the rest were prepared by zooming and rotating where deemed necessary.
Diseases considered: seven diseases, namely Anthracnose, Bacterial Canker, Cutting Weevil, Die Back, Gall Midge, Powdery Mildew, and Sooty Mould.
Number of classes: eight (including the healthy category).
Distribution of instances: each of the eight categories contains 500 images.



 

ALGORITHM 1 NAIVE BAYES CLASSIFIER



Naive Bayes: I utilized the Gaussian Naive Bayes algorithm from the sklearn library for classification. Naive Bayes is a probabilistic classifier based on Bayes' theorem, which assumes that features are conditionally independent given the class label. Despite its simplicity, it performed relatively poorly on this dataset compared to the other methods, possibly due to the violation of the independence assumption among image features.


Here I use Gaussian Naive Bayes, which assumes that the features within each class follow a Gaussian (normal) distribution. The model is trained on the standardized training set (X_train_std) with the corresponding class labels (y_train_classes) and then used to make predictions on the validation set (X_val_std). The predictions (gnb_pred) are compared against the true labels (y_val_classes) to compute the model's accuracy, which is stored in a dictionary called results.
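In code, this step amounts to a few lines (a sketch reusing the variable names from the preceding paragraph):

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    # Fit Gaussian Naive Bayes on the standardized training data.
    gnb = GaussianNB()
    gnb.fit(X_train_std, y_train_classes)

    # Predict on the validation set and record the accuracy.
    gnb_pred = gnb.predict(X_val_std)
    results['Naive Bayes'] = accuracy_score(y_val_classes, gnb_pred)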



A confusion matrix is plotted to better understand how the model performed. This matrix displays the number of samples in each class that were correctly and incorrectly classified. Additionally, the model parameters are printed to provide more detail about the specific Gaussian configuration.
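One common way to produce such a plot with scikit-learn (a sketch, not necessarily the exact plotting code used); the same pattern is reused for every model in this comparison:

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    # Confusion matrix of the validation predictions.
    ConfusionMatrixDisplay.from_predictions(y_val_classes, gnb_pred)
    plt.title('Naive Bayes confusion matrix')
    plt.show()

    # Print the model parameters for the specific Gaussian configuration.
    print(gnb.get_params())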


Accuracy Naive Bayes: 0.53875

 

ALGORITHM 2 SUPPORT VECTOR MACHINE



The Support Vector Machine (SVM) is a powerful and well-known machine learning technique for classification and regression tasks. It operates by identifying the best decision boundary, also known as a hyperplane, that separates the classes in the feature space. The main goal of SVM is to maximize the margin between the classes, that is, the distance between the decision boundary and the closest samples from each class. These closest samples, which support the decision boundary, are called support vectors.


When working with nonlinearly separable data, SVMs frequently use radial basis function (RBF) kernels, which is what I use here. The kernel function implicitly maps the data into a higher-dimensional space in which a linear decision boundary can be found.





The model is trained on the standardized training set (X_train_std) with the corresponding class labels (y_train_classes) and then used to make predictions on the validation set (X_val_std). The predictions (svm_pred) are compared against the true labels (y_val_classes) to compute the accuracy, which is stored in the results dictionary.
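A sketch of this step; the C and gamma values come from the experiments section below, while the random state of 42 is an assumption carried over from the Random Forest setup:

    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    # RBF-kernel SVM with C=1 and gamma='scale' (see the experiments section).
    svm = SVC(kernel='rbf', C=1, gamma='scale', random_state=42)
    svm.fit(X_train_std, y_train_classes)

    svm_pred = svm.predict(X_val_std)
    results['SVM'] = accuracy_score(y_val_classes, svm_pred)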





A confusion matrix is plotted to better understand how the model performed, showing the number of samples in each class that were correctly and incorrectly classified. Additionally, the model parameters, such as the RBF kernel, the regularization parameter C, the gamma value, and the random state, are printed to document the specific configuration of the SVM model.


Accuracy SVM: 0.82125

 

ALGORITHM 3 K-NEAREST NEIGHBORS



This study uses the k-Nearest Neighbors (kNN) model, a non-parametric, instance-based technique popular for both classification and regression problems. The fundamental principle of kNN is to take the input attributes of a new instance, find the k most similar instances in the training dataset, and predict the output label by a majority vote of those neighbors' labels.






With k set to 3 in this particular implementation, the kNN classifier takes into account the three nearest neighbors when determining the majority vote. The model is used to make predictions on the validation dataset (X_val_std) after being trained on the standardized dataset (X_train_std) with the matching class labels (y_train_classes).


The model's accuracy is determined by comparing its predictions (knn_pred) with the actual labels in the validation dataset (y_val_classes), and the score is stored in the results dictionary.
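As a sketch, with k=3 as described:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # kNN with the three nearest neighbors deciding the majority vote.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train_std, y_train_classes)

    knn_pred = knn.predict(X_val_std)
    results['KNN'] = accuracy_score(y_val_classes, knn_pred)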







Accuracy KNN: 0.675

 

ALGORITHM 4 RANDOM FOREST





Random Forest is an ensemble learning method for classification that is widely used in machine learning. During training, a large number of decision trees are built, and the output class is the mode of the classes predicted by the individual trees. The model is created with the RandomForestClassifier class from the Scikit-learn library.





Since I've set the number of estimators (trees) in this particular implementation to 100, the Random Forest model will include 100 decision trees. For repeatability, the random_state parameter is set to 42, ensuring that the same outcomes are obtained each time the model is used with the same input data.


After being trained on the standardized training data, the model is used to make predictions on the validation data. The accuracy is determined by comparing the predicted class labels with the actual labels from the validation dataset.
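A sketch with the stated settings:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # 100 trees, fixed random state for repeatability.
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train_std, y_train_classes)

    rf_pred = rf.predict(X_val_std)
    results['Random Forest'] = accuracy_score(y_val_classes, rf_pred)

    # Inspect the hyperparameters used by this instance.
    print(rf.get_params())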





For the purpose of visualizing the model's performance, a confusion matrix has been displayed. By displaying the true positives, true negatives, false positives, and false negatives, this matrix enables you to assess how successfully the model categorizes each class.


The get_params() method is used to print out the model parameters, which show the hyperparameters and their values utilized in this particular instance of the RandomForestClassifier.


Accuracy Random Forest: 0.875

 

ALGORITHM 5 CONVOLUTIONAL NEURAL NETWORK (CNN)


Convolutional Neural Network (CNN): In this project, I implemented a CNN using TensorFlow and Keras. CNNs are particularly effective for image classification tasks due to their ability to automatically learn hierarchical feature representations from the input images. My CNN architecture consists of multiple convolutional and max-pooling layers, followed by dropout and fully connected layers. I used the Adam optimizer and categorical cross-entropy loss during training. The model achieved the highest accuracy among all the algorithms I used, making it the most suitable choice for this classification task.





This model is a Convolutional Neural Network (CNN), a kind of model designed to spot patterns in pictures, such as a disease in a photograph of a mango leaf. It operates by focusing on small regions of the image and picking out key elements, like colors and shapes. As information passes through the model's layers, it learns to recognize increasingly complex patterns.





First, there are a few layers that examine small regions of the image and look for fundamental elements like lines and edges; these are the Conv2D layers. After each Conv2D layer there is a MaxPooling2D layer, which reduces the image's size so that the model can concentrate on its most important components.


After all the Conv2D and MaxPooling2D layers, the Flatten layer takes the data from the previous layers and converts it into one long list of numbers. This helps the model reason about the image as a whole.


To ensure that the model doesn't rely too heavily on any one part of the image, I then add a Dropout layer that randomly discards some of the information during training. This makes the model more robust and better at generalizing.


A Dense layer with 512 units helps the model learn more complex patterns. Finally, there is a Dense layer with as many units as there are classes for the model to identify; this final layer outputs the disease the model believes is present in the image.
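Putting the pieces together, here is a sketch of an architecture matching this description in Keras. The filter counts and the number of convolutional blocks are illustrative assumptions; the 512-unit Dense layer, the Dropout layer, the Adam optimizer, and the categorical cross-entropy loss come from the description above.

    from tensorflow.keras import layers, models

    num_classes = 8  # seven diseases plus the healthy category

    model = models.Sequential([
        # Conv2D layers pick out local features; MaxPooling2D shrinks the image.
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(240, 320, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        # Flatten the feature maps into one long vector.
        layers.Flatten(),
        # Dropout randomly discards activations during training.
        layers.Dropout(0.5),
        # Dense layer with 512 units for the more complex patterns.
        layers.Dense(512, activation='relu'),
        # One output unit per class.
        layers.Dense(num_classes, activation='softmax'),
    ])

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])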





I train the model by showing it many images of mango leaves with various diseases along with the correct answers, and it improves over time by learning from its mistakes. To prevent the model from memorizing any one image and to ensure that it generalizes well to new images it hasn't seen before, I also employ strategies such as early stopping and model checkpoints.
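A sketch of the training call with these safeguards; the patience value, the checkpoint path, and the image/label variable names (X_train_img, y_train_onehot, and so on) are illustrative, not the project's exact ones:

    from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

    callbacks = [
        # Stop once validation loss stops improving; restore the best weights.
        EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
        # Keep the best-performing model on disk.
        ModelCheckpoint('best_model.keras', monitor='val_loss',
                        save_best_only=True),
    ]

    history = model.fit(X_train_img, y_train_onehot,
                        validation_data=(X_val_img, y_val_onehot),
                        epochs=10, callbacks=callbacks)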


RESULTS FOR CNN



Training accuracy after 10 epochs: 97%

Validation accuracy after 10 epochs: 92%


 

COMPARISON & RESULTS

Accuracy Scores:
Naive Bayes: 0.53875
SVM: 0.82125
KNN: 0.675
Random Forest: 0.875
CNN: 0.92
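The bar plot mentioned in the overview can be produced directly from the results dictionary (a sketch; the CNN score is added by hand here because it comes from the Keras training run rather than a scikit-learn accuracy_score call):

    import matplotlib.pyplot as plt

    results['CNN'] = 0.92  # validation accuracy from the CNN training run

    plt.bar(list(results.keys()), list(results.values()))
    plt.ylabel('Validation accuracy')
    plt.title('Accuracy comparison of the five algorithms')
    plt.xticks(rotation=30)
    plt.show()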


Based on the accuracy scores above, the five algorithms compare as follows:




  1. The CNN model achieves the highest accuracy (0.92) due to its ability to learn image features automatically, making it the best choice for this task.

  2. The Random Forest classifier, an ensemble method, performs well with an accuracy of 0.875 and can be further fine-tuned.

  3. The SVM model has a respectable accuracy of 0.82125, with its performance dependent on kernel function and hyperparameters.

  4. The KNN algorithm's lower accuracy (0.675) is influenced by distance metric and neighbor count.

  5. The Naive Bayes algorithm has the lowest accuracy (0.53875), a result of its assumption of feature independence.


In conclusion, the CNN model is the most effective for this task, but it's crucial to consider complexity, training time, and interpretability when selecting the best algorithm for a specific problem.

 

CHALLENGES


  1. ENORMOUS DATASET: Due to the size of the dataset, it was almost impossible to run this project in Google Colab, so I implemented it directly on Kaggle, where the data is already available.

  2. DATA PREPROCESSING: I used one-hot encoding for the CNN model's labels and the original integer labels for the other models (see the sketch after this list).

  3. OVERFITTING IN CNN: I added dropout layers to the CNN model to reduce overfitting.

  4. STANDARDIZATION: I used the StandardScaler from sklearn to standardize the image data.
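A sketch of the label-encoding step from item 2 (num_classes=8 follows from the dataset description):

    from tensorflow.keras.utils import to_categorical

    # One-hot encode the integer labels for the CNN,
    # e.g. class 3 -> [0, 0, 0, 1, 0, 0, 0, 0].
    y_train_onehot = to_categorical(y_train_classes, num_classes=8)
    y_val_onehot = to_categorical(y_val_classes, num_classes=8)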

 

MY CONTRIBUTIONS BEYOND THE EXISTING REFERENCES:


  1. Using multiple algorithms for classification and comparing their performance.

  2. Tuning hyperparameters to improve the performance of the models.

  3. Adding dropout layers in the CNN model to reduce overfitting.

  4. Standardizing the data for non-deep learning algorithms.

 

EXPERIMENT AND FINDINGS:


  1. Tuning hyperparameters for SVM: I found that using the 'rbf' kernel, C=1, and gamma='scale' provided the best performance (a sketch of this kind of search follows this list).

  2. Choosing the optimal number of neighbors for KNN: I found that using 3 neighbors provided the best performance.

  3. Choosing the optimal Conv2D layers and activation functions for the CNN used in this project.
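A sketch of how the SVM search in item 1 could be run with scikit-learn's GridSearchCV; the exact search procedure and grid are assumptions, but the winning values match those reported above:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Candidate grid around the values that ended up performing best.
    param_grid = {'kernel': ['rbf'], 'C': [0.1, 1, 10],
                  'gamma': ['scale', 'auto']}

    search = GridSearchCV(SVC(), param_grid, cv=3)
    search.fit(X_train_std, y_train_classes)
    print(search.best_params_)  # e.g. {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}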

 

REFERENCES

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org/

  2. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830. https://scikit-learn.org/stable/

  3. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. https://www.tensorflow.org/

  4. Chollet, F., & others. (2015). Keras. https://keras.io/

  5. Brownlee, J. (n.d.). Machine Learning Mastery. https://machinelearningmastery.com/

 










