LightGBM Binary Classification, Multi-Class Classification, Regression using Python

Nitin
Apr 22, 2020


LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and more efficient than other boosting implementations. A natural point of comparison is XGBoost, another boosting method that performs exceptionally well against most algorithms.

However, XGBoost works well on datasets of fewer than about 10,000 rows and is not recommended for very large datasets. LightGBM, on the other hand, handles large amounts of data with lower memory usage, supports parallel and GPU learning, and delivers good accuracy with faster training. So what makes LightGBM different? For one, it grows trees leaf-wise, while most other algorithms grow them level-wise.

Figure: LightGBM grows trees leaf-wise, whereas other algorithms grow them level-wise.
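Leaf-wise growth is controlled mainly by the num_leaves parameter, with max_depth available as an optional cap. As a minimal sketch (the values shown are simply LightGBM's defaults, not settings from this article):

#num_leaves is the main complexity control for leaf-wise trees
#max_depth=-1 means no depth limit (the LightGBM default)
params={}
params['boosting_type']='gbdt'
params['num_leaves']=31
params['max_depth']=-1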

I am not going into the internal details of LightGBM, as this article is focused on getting you started with the algorithm in Python. If you want to understand more about how it works, please refer to the links at the end.

Installation of LightGBM:

If you are using Anaconda, run: conda install -c conda-forge lightgbm. For any other installation method, refer to the LightGBM documentation (see Sources below).
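As a quick sanity check (my addition, not part of the original walkthrough), you can verify the installation from Python:

#either install command works, pick one:
#  conda install -c conda-forge lightgbm
#  pip install lightgbm
import lightgbm as lgb
print(lgb.__version__) #quick check that the install succeeded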

Types of Operation supported by LightGBM:

  1. Regression
  2. Binary Classification
  3. Multi-Class Classification
  4. Cross-Entropy
  5. LambdaRank

In this article, I will show you how to perform binary classification, multi-class classification and regression. Before we get started, keep in mind that the datasets we will use are toy datasets (few records), so LightGBM is prone to overfitting on them. To limit overfitting we can play with the max_depth value. You might wonder whether max_depth only applies to level-wise growth; rest assured, the tree still grows leaf-wise even when max_depth is specified.
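For reference, here is a rough sketch of settings commonly used to limit overfitting on small data; the values below are my own illustrations, not taken from this article:

#illustrative values only, tune them for your own data
params={}
params['objective']='binary'
params['max_depth']=10 #caps tree depth even though growth stays leaf-wise
params['min_data_in_leaf']=20 #each leaf must cover a minimum number of rows
params['feature_fraction']=0.8 #use a random subset of features for each tree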

#importing libraries
import numpy as np
import pandas as pd
import lightgbm as lgb
#note: load_boston has been removed in recent scikit-learn releases (1.2+);
#an older scikit-learn version is needed for the regression example below
from sklearn.datasets import load_breast_cancer, load_boston, load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, roc_auc_score, precision_score
pd.options.display.max_columns = 999
1. Binary Classification using the Breast Cancer Dataset (link):
#loading the breast cancer dataset
data=load_breast_cancer()
df=pd.DataFrame(data.data,columns=data.feature_names)
Y=data.target
#scaling the features using StandardScaler
sc=StandardScaler()
X=pd.DataFrame(sc.fit_transform(df),columns=df.columns)
#train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=0)
#converting the dataset into the LightGBM Dataset format
d_train=lgb.Dataset(X_train, label=y_train)
#specifying the parameters
params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt' #Gradient Boosting Decision Tree
params['objective']='binary' #binary target feature
params['metric']='binary_logloss' #metric for binary classification
params['max_depth']=10
#train the model for 100 boosting rounds
clf=lgb.train(params,d_train,100)
#prediction on the test set (returns probabilities for the positive class)
y_pred=clf.predict(X_test)

We have created, trained and tested the model. Now we will check how it performed using the roc_auc_score metric from sklearn. The predictions stored in y_pred are probabilities, looking something like [0.04558262, 0.89328757, 0.97349586, 0.97226278, 0.950874], so we first convert them into class labels:

if probability >= 0.5 ---> 1, else ---> 0
#rounding the probabilities to 0/1
y_pred=y_pred.round(0)
#converting from float to integer
y_pred=y_pred.astype(int)
#roc_auc_score metric (true labels first, predictions second)
roc_auc_score(y_test,y_pred)
#0.9672167056074766
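As a side note (my addition, not part of the original walkthrough), LightGBM also ships a scikit-learn style wrapper, lgb.LGBMClassifier, which handles the Dataset conversion and thresholding for you. A minimal sketch using the same split as above:

#the same model through LightGBM's scikit-learn style wrapper
sk_clf=lgb.LGBMClassifier(learning_rate=0.03,max_depth=10,n_estimators=100)
sk_clf.fit(X_train,y_train)
proba=sk_clf.predict_proba(X_test)[:,1] #probability of the positive class
labels=sk_clf.predict(X_test) #thresholded class labels
roc_auc_score(y_test,proba)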

2. Multi-Class Classification using the Wine Dataset (link):

#loading the wine dataset
X1=load_wine()
df_1=pd.DataFrame(X1.data,columns=X1.feature_names)
Y_1=X1.target
#scaling the features using StandardScaler
sc_1=StandardScaler()
X_1=pd.DataFrame(sc_1.fit_transform(df_1),columns=df_1.columns)
#train-test split
X_train,X_test,y_train,y_test=train_test_split(X_1,Y_1,test_size=0.3,random_state=0)
#converting the dataset into the LightGBM Dataset format
d_train=lgb.Dataset(X_train, label=y_train)
#setting up the parameters
params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt' #Gradient Boosting Decision Tree
params['objective']='multiclass' #multi-class target feature
params['metric']='multi_logloss' #metric for multi-class classification
params['max_depth']=10
params['num_class']=3 #number of classes in the target
#train the model for 100 boosting rounds
clf=lgb.train(params,d_train,100)
#prediction on the test dataset (one probability per class, per row)
y_pred_1=clf.predict(X_test)
#printing the predictions
y_pred_1
[0.95819859, 0.02205037, 0.01975104],
[0.05465546, 0.09575231, 0.84959223],
[0.20955298, 0.69498737, 0.09545964],
[0.95852959, 0.02096561, 0.02050481],
[0.04243184, 0.92053949, 0.03702867],....

In a multi-class problem, the model produces num_class (here 3) probabilities per row, as shown in the output above. We can use numpy.argmax() to pick the class with the highest predicted probability.

#argmax() picks the index of the highest probability in each row
y_pred_1 = [np.argmax(line) for line in y_pred_1]
#printing the predictions
[0,2,1,0,1,0,0,2,...]
#using precision as the evaluation metric (true labels first, predictions second)
precision_score(y_test,y_pred_1,average=None).mean()
# 0.9545454545454546
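If you want a per-class view rather than a single averaged number, scikit-learn's confusion_matrix and classification_report work directly on these labels. A small sketch I am adding here, assuming y_test and the argmax-ed y_pred_1 from above:

#per-class breakdown of the multi-class predictions
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,y_pred_1))
print(classification_report(y_test,y_pred_1,target_names=X1.target_names))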

3. Regression using the Boston Dataset (link):

#loading the Boston dataset
data=load_boston()
df=pd.DataFrame(data.data,columns=data.feature_names)
Y=data.target
#scaling the features using StandardScaler
sc=StandardScaler()
X=pd.DataFrame(sc.fit_transform(df),columns=df.columns)
#train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=0)
#converting the data into the LightGBM Dataset format
d_train=lgb.Dataset(X_train, label=y_train)
#declaring the parameters
params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt' #Gradient Boosting Decision Tree
params['objective']='regression' #regression task
params['max_depth']=10
#model creation and training for 100 boosting rounds
clf=lgb.train(params,d_train,100)
#model prediction on X_test
y_pred=clf.predict(X_test)
#mean squared error on the test set (true values first, predictions second)
mean_squared_error(y_test,y_pred)
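If you prefer RMSE (root mean squared error) as the reported metric, just take the square root of the MSE; a quick sketch (my addition):

#RMSE is the square root of the MSE computed above
mse=mean_squared_error(y_test,y_pred)
rmse=np.sqrt(mse)
print(mse,rmse)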

I have covered the basics of model creation using LightGBM. I would recommend that you practice on the same datasets, tweak the parameters and try to improve the predictions. I also recommend getting familiar with the model's different parameters, so you can move more freely through the APIs. Once you get to know LightGBM, I am sure it will become your go-to algorithm for many tasks, as it is fast, light and deadly accurate.

In the next article, I will try to explain some of the more advanced features of LightGBM, such as feature_importance and early stopping. Thank you for reading! If you found this article helpful, please leave an upvote, as it motivates me to write more frequently. If you find any mistakes or errors, please let me know in the comments; your opinions are most welcome.

Sources:

Sklearn Dataset: https://scikit-learn.org/stable/datasets/index.html#toy-datasets

LightGBM Docs: https://lightgbm.readthedocs.io/en/latest/Features.html

LightGBM Python API: https://lightgbm.readthedocs.io/en/latest/Python-API.html

An article which I found helpful: https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
