Scikit-learn is one of the most widely used machine learning libraries for Python. Its popularity can be attributed to its simple, consistent API, which is friendly to beginners. It is also well supported and flexible enough to integrate third-party functionality, which makes the library robust and suitable for production. It provides many machine learning models for classification, regression, and clustering. In this tutorial, we will explore multiclass classification with several algorithms. Let’s dive right in and build our scikit-learn models.
pip install scikit-learn
We will use the “Wine” dataset available in the datasets module of scikit-learn. The dataset consists of 178 samples across 3 classes. It is already pre-processed into numeric feature vectors, so we can use it directly to train our models.
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
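Before training, it can help to confirm what the dataset actually contains. The following is a minimal sketch that loads the full `Bunch` object instead of only the arrays; the variable name `wine` is introduced here for illustration and is not part of the tutorial's code.

```python
import numpy as np
from sklearn.datasets import load_wine

# Load the full Bunch object to get metadata alongside the arrays.
wine = load_wine()
X, y = wine.data, wine.target

print("Feature matrix shape:", X.shape)  # (n_samples, n_features)
print("Class names:", list(wine.target_names))
print("Samples per class:", np.bincount(y))
print("First three features:", wine.feature_names[:3])
```

The per-class counts show whether the classes are balanced, which matters when reading accuracy figures later on.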
We will keep 67% of the data for training and the remaining 33% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
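To see what the split produced, the sketch below prints the shapes and class counts of both parts. It also shows an optional variation with `stratify=y`, which is an assumption added here for comparison and is not used elsewhere in this tutorial; the names `X_tr_s`, `X_te_s`, `y_tr_s`, and `y_te_s` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))

# Optional variation (an assumption, not used in this tutorial):
# stratify=y keeps the class proportions similar in both splits.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print("Stratified test class counts:", np.bincount(y_te_s))
```

With a dataset this small, a stratified split can make test results less sensitive to how the classes happen to fall.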
Now, we will experiment with five models of differing complexity and evaluate each of them on our dataset.
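# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)

# Note: scikit-learn documents the argument order as (y_true, y_pred).
# Accuracy is unaffected, but with the order used here the precision and
# recall columns of the report are swapped, and "support" counts the
# predicted labels rather than the true ones.
print("Accuracy Score: ", accuracy_score(y_pred_lr, y_test))
print(classification_report(y_pred_lr, y_test))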
Output
Accuracy Score: 0.9830508474576272
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        21
           1       0.96      1.00      0.98        23
           2       1.00      1.00      1.00        15

    accuracy                           0.98        59
   macro avg       0.99      0.98      0.98        59
weighted avg       0.98      0.98      0.98        59
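The metric calls in this tutorial pass the predictions first and the true labels second, while scikit-learn documents the order as `(y_true, y_pred)`. The sketch below illustrates the effect: per-class precision computed in the documented order equals per-class recall computed in the swapped order. `max_iter=10000` is an assumption made here only to reduce convergence warnings, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# max_iter is raised only to reduce convergence warnings (an assumption;
# the tutorial uses the default settings).
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Documented order: true labels first, predictions second.
precision_true_first = precision_score(y_test, y_pred, average=None)
# Swapped order: the same numbers now appear under "recall".
recall_swapped = recall_score(y_pred, y_test, average=None)

print("Precision (y_test, y_pred):", precision_true_first)
print("Recall    (y_pred, y_test):", recall_swapped)
print(classification_report(y_test, y_pred))
```

Accuracy and F1 are symmetric in the two arguments, so only the precision, recall, and support columns change.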
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

model_knn = KNeighborsClassifier(n_neighbors=1)
model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_test)

print("Accuracy Score:", accuracy_score(y_pred_knn, y_test))
print(classification_report(y_pred_knn, y_test))
Output
Accuracy Score: 0.7796610169491526
              precision    recall  f1-score   support

           0       0.90      0.78      0.84        23
           1       0.75      0.82      0.78        22
           2       0.67      0.71      0.69        14

    accuracy                           0.78        59
   macro avg       0.77      0.77      0.77        59
weighted avg       0.79      0.78      0.78        59
When we change the parameter to n_neighbors=2, accuracy drops further, as the output below shows. On this split, a single nearest neighbor classifies the test samples better than two.
Output
Accuracy Score: 0.6949152542372882
              precision    recall  f1-score   support

           0       0.90      0.72      0.80        25
           1       0.75      0.69      0.72        26
           2       0.33      0.62      0.43         8

    accuracy                           0.69        59
   macro avg       0.66      0.68      0.65        59
weighted avg       0.76      0.69      0.72        59
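One possible reason for the weak KNN results is that the Wine features are on very different scales, so the distance computation is dominated by the largest-valued features. The sketch below tests that idea by standardizing the features inside a pipeline. This is an assumed variation and not part of the tutorial's own experiments; the names `knn_raw` and `knn_scaled` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Baseline from the tutorial: KNN on the raw features.
knn_raw = KNeighborsClassifier(n_neighbors=1)
knn_raw.fit(X_train, y_train)
acc_raw = accuracy_score(y_test, knn_raw.predict(X_test))

# Assumed variation: standardize each feature before computing distances.
# The scaler is fitted on the training data only, inside the pipeline.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
knn_scaled.fit(X_train, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test))

print("KNN accuracy, raw features:   ", acc_raw)
print("KNN accuracy, scaled features:", acc_scaled)
```

Putting the scaler in the pipeline keeps test-set statistics out of the preprocessing step, so the comparison stays fair.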
# Naive Bayes
from sklearn.naive_bayes import GaussianNB

model_nb = GaussianNB()
model_nb.fit(X_train, y_train)
y_pred_nb = model_nb.predict(X_test)

print("Accuracy Score:", accuracy_score(y_pred_nb, y_test))
print(classification_report(y_pred_nb, y_test))
Output
Accuracy Score: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        24
           2       1.00      1.00      1.00        15

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

# No random_state is set, so results can vary slightly between runs.
model_dtclassifier = DecisionTreeClassifier()
model_dtclassifier.fit(X_train, y_train)
y_pred_dtclassifier = model_dtclassifier.predict(X_test)

print("Accuracy Score:", accuracy_score(y_pred_dtclassifier, y_test))
print(classification_report(y_pred_dtclassifier, y_test))
Output
Accuracy Score: 0.9661016949152542
              precision    recall  f1-score   support

           0       0.95      0.95      0.95        20
           1       1.00      0.96      0.98        25
           2       0.93      1.00      0.97        14

    accuracy                           0.97        59
   macro avg       0.96      0.97      0.97        59
weighted avg       0.97      0.97      0.97        59
# Random Forest with randomized hyperparameter search
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def get_best_parameters():
    params = {
        "n_estimators": [10, 50, 100],
        # "auto" (an alias of "sqrt" for classifiers) was removed in
        # scikit-learn 1.3, so it is dropped here; the search may therefore
        # sample different candidates than the run shown in the output.
        "max_features": ["sqrt", "log2"],
        "max_depth": [5, 10, 20, 50],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [2, 4, 6],
        "bootstrap": [True, False],
    }
    model_rfclassifier = RandomForestClassifier(random_state=42)
    rf_randomsearch = RandomizedSearchCV(
        estimator=model_rfclassifier,
        param_distributions=params,
        n_iter=5,
        cv=3,
        verbose=2,
        random_state=42,
    )
    rf_randomsearch.fit(X_train, y_train)
    best_parameters = rf_randomsearch.best_params_
    print("Best Parameters:", best_parameters)
    return best_parameters


# Retrain a fresh classifier with the best parameters found by the search.
parameters_rfclassifier = get_best_parameters()
model_rfclassifier = RandomForestClassifier(
    **parameters_rfclassifier, random_state=42
)
model_rfclassifier.fit(X_train, y_train)
y_pred_rfclassifier = model_rfclassifier.predict(X_test)

print("Accuracy Score:", accuracy_score(y_pred_rfclassifier, y_test))
print(classification_report(y_pred_rfclassifier, y_test))
Output
Best Parameters: {'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 5, 'bootstrap': True}
Accuracy Score: 0.9830508474576272
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        21
           1       0.96      1.00      0.98        23
           2       1.00      1.00      1.00        15

    accuracy                           0.98        59
   macro avg       0.99      0.98      0.98        59
weighted avg       0.98      0.98      0.98        59
For this model, we performed some hyperparameter tuning to improve accuracy. We defined a parameter grid listing candidate values for each hyperparameter, then used RandomizedSearchCV to sample 5 combinations from that grid and score each one with 3-fold cross-validation on the training data. Finally, we passed the best parameters found to a new classifier and trained it.
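To get a feel for how little of the grid a randomized search visits, the sketch below counts all combinations with `ParameterGrid` and draws five candidates with `ParameterSampler`. The grid mirrors the tutorial's values, except that `"auto"` is left out of `max_features` because recent scikit-learn releases no longer accept it. The drawn candidates are illustrative and need not match the run shown above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Candidate values mirroring the tutorial's grid, minus "auto" for
# max_features (an adjustment made here for recent scikit-learn releases).
params = {
    "n_estimators": [10, 50, 100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, 20, 50],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [2, 4, 6],
    "bootstrap": [True, False],
}

# An exhaustive grid search would try every combination.
n_total = len(ParameterGrid(params))

# A randomized search evaluates only n_iter sampled combinations.
sampled = list(ParameterSampler(params, n_iter=5, random_state=42))

print("Total combinations in the grid:", n_total)
print("Combinations sampled:", len(sampled))
for candidate in sampled:
    print(candidate)
```

Sampling keeps the search cheap, at the cost of possibly missing the best combination in the grid.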
Models | Accuracy | Observations |
Logistic Regression | 98.31% | Achieves high accuracy and generalizes well to the test set. |
K-Nearest Neighbors | 77.96% | The weakest result; with these settings the algorithm does not separate the classes well. |
Naive Bayes | 100% | A simple model that classifies every test sample correctly on this split. With only 59 test samples, a perfect score should not be read as proof that it is the best model in general. |
Decision Tree Classifier | 96.61% | Achieves decent accuracy. |
Random Forest Classifier | 98.31% | As an ensemble of trees, it outperforms the single decision tree. With hyperparameter tuning, it matches the accuracy of logistic regression. |
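The accuracies in the table come from a single 59-sample test split, so small differences between models may be due to chance. A minimal sketch of a more stable comparison using 5-fold cross-validation follows. The model settings (`max_iter=10000`, `random_state=42`, default forest parameters) and the `cv_results` dictionary are assumptions chosen for illustration, not the tutorial's tuned configurations.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Illustrative settings; these are assumptions, not the tuned models above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=1),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy over the whole dataset.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of how much each accuracy figure depends on the particular split.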
In this tutorial, we learned how to get started building and training machine learning models with scikit-learn. We implemented and evaluated a few algorithms to get a basic idea of their performance. Feature engineering, hyperparameter tuning, and other training strategies can improve performance further. To read more about what scikit-learn offers, head over to the official documentation: Introduction to machine learning with scikit-learn, Machine Learning in Python with scikit-learn.
Yesha Shastri is a passionate AI developer and writer pursuing a Master’s in Machine Learning at Université de Montréal. Yesha is interested in exploring responsible AI techniques to solve challenges that benefit society and in sharing her learnings with the community.
- Source: https://www.kdnuggets.com/getting-started-with-scikit-learn-for-classification-in-machine-learning.html?utm_source=rss&utm_medium=rss&utm_campaign=getting-started-with-scikit-learn-for-classification-in-machine-learning