We discussed classification metrics such as ROC-AUC and the Kolmogorov-Smirnov (KS) Statistic in earlier blogs.
In this blog, we’ll explore another important classification metric called the Gini Coefficient.
Why do we have multiple classification metrics?
Each classification metric shows model performance from a different angle. ROC-AUC gives the overall ranking ability of a model, while the KS Statistic shows where the maximum gap between the two groups occurs.
The Gini Coefficient tells us how much better our model is than random guessing at ranking the positives higher than the negatives.
First, let’s see how the Gini Coefficient is calculated.
For this, we again use the German Credit Dataset.
Let’s use the same sample data that we used to understand the calculation of the Kolmogorov-Smirnov (KS) Statistic.
This sample data was obtained by applying logistic regression to the German Credit dataset.
Since the model outputs probabilities, we selected a sample of 10 points from these probabilities to demonstrate the calculation of the Gini coefficient.
Calculation
Step 1: Sort the data by predicted probabilities.
The sample data is already sorted in descending order of predicted probability.
Step 2: Compute Cumulative Population and Cumulative Positives.
Cumulative Population: The cumulative number of records considered up to that row.
Cumulative Population (%): The percentage of the total population covered so far.
Cumulative Positives: How many actual positives (class 2) we’ve seen up to that row.
Cumulative Positives (%): The percentage of positives captured so far.
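The four columns defined in Step 2 can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes the same 10-row sample used for the ROC-AUC check later in this post (actual classes and predicted probabilities for class 2). The variable names are illustrative, not part of the original calculation.

```python
# Sample data, already sorted by predicted probability (descending).
actual = [2, 2, 2, 1, 2, 1, 1, 1, 1, 1]  # class 2 = positive
pred_prob = [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01]

n = len(actual)
total_pos = sum(1 for a in actual if a == 2)

rows = []
cum_pos = 0
for i, a in enumerate(actual, start=1):
    cum_pos += int(a == 2)  # count positives seen so far
    rows.append((i, i / n, cum_pos, cum_pos / total_pos))

print("CumPop  CumPop%  CumPos  CumPos%")
for cum_pop, cum_pop_pct, cp, cp_pct in rows:
    print(f"{cum_pop:6d}  {cum_pop_pct:7.0%}  {cp:6d}  {cp_pct:7.0%}")

# Prepend the origin so the curve starts at (0, 0).
X = [0.0] + [r[1] for r in rows]
Y = [0.0] + [r[3] for r in rows]
```

The resulting `X` and `Y` lists are the percentages expressed as fractions between 0 and 1.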

Step 3: Plot X and Y values
X = Cumulative Population (%)
Y = Cumulative Positives (%)
Here, let’s use Python to plot these X and Y values.
Code:
import matplotlib.pyplot as plt

# X: cumulative share of the population, Y: cumulative share of positives captured
X = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0.0, 0.25, 0.50, 0.75, 0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
# Plot curve
plt.figure(figsize=(6, 6))
plt.plot(X, Y, marker='o', color="cornflowerblue", label="Model Lorenz Curve")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Random Model (Diagonal)")
plt.title("Lorenz Curve from Sample Data", fontsize=14)
plt.xlabel("Cumulative Population % (X)", fontsize=12)
plt.ylabel("Cumulative Positives % (Y)", fontsize=12)
plt.legend()
plt.grid(True)
plt.show()
Plot:

The curve we get when we plot Cumulative Population (%) against Cumulative Positives (%) is called the Lorenz curve.
Step 4: Calculate the area under the Lorenz curve.
When we discussed ROC-AUC, we found the area under the curve using the trapezoid formula.
Each region between two points was treated as a trapezoid, its area was calculated, and then all areas were added together to get the final value.
The same method is applied here to calculate the area under the Lorenz curve.
Area under the Lorenz curve
Area of a trapezoid:
$$
\text{Area} = \frac{1}{2} \times (y_1 + y_2) \times (x_2 - x_1)
$$
From (0.0, 0.0) to (0.1, 0.25):
\[
A_1 = \frac{1}{2}(0+0.25)(0.1-0.0) = 0.0125
\]
From (0.1, 0.25) to (0.2, 0.50):
\[
A_2 = \frac{1}{2}(0.25+0.50)(0.2-0.1) = 0.0375
\]
From (0.2, 0.50) to (0.3, 0.75):
\[
A_3 = \frac{1}{2}(0.50+0.75)(0.3-0.2) = 0.0625
\]
From (0.3, 0.75) to (0.4, 0.75):
\[
A_4 = \frac{1}{2}(0.75+0.75)(0.4-0.3) = 0.075
\]
From (0.4, 0.75) to (0.5, 1.00):
\[
A_5 = \frac{1}{2}(0.75+1.00)(0.5-0.4) = 0.0875
\]
From (0.5, 1.00) to (0.6, 1.00):
\[
A_6 = \frac{1}{2}(1.00+1.00)(0.6-0.5) = 0.100
\]
From (0.6, 1.00) to (0.7, 1.00):
\[
A_7 = \frac{1}{2}(1.00+1.00)(0.7-0.6) = 0.100
\]
From (0.7, 1.00) to (0.8, 1.00):
\[
A_8 = \frac{1}{2}(1.00+1.00)(0.8-0.7) = 0.100
\]
From (0.8, 1.00) to (0.9, 1.00):
\[
A_9 = \frac{1}{2}(1.00+1.00)(0.9-0.8) = 0.100
\]
From (0.9, 1.00) to (1.0, 1.00):
\[
A_{10} = \frac{1}{2}(1.00+1.00)(1.0-0.9) = 0.100
\]
Total area under the Lorenz curve:
\[
A = 0.0125+0.0375+0.0625+0.075+0.0875+0.100+0.100+0.100+0.100+0.100 = 0.775
\]
We calculated the area under the Lorenz curve, which is 0.775.
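The ten trapezoid areas do not have to be added by hand. As a minimal sketch, the helper below (the name `trapezoid_area` is our own, not a library function) applies the trapezoid formula to each pair of consecutive points and sums the results.

```python
# Lorenz curve points from the sample data (fractions, not percentages).
X = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0.0, 0.25, 0.50, 0.75, 0.75, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]


def trapezoid_area(xs, ys):
    """Sum the trapezoid areas between consecutive (x, y) points."""
    return sum(
        0.5 * (y1 + y2) * (x2 - x1)
        for x1, x2, y1, y2 in zip(xs, xs[1:], ys, ys[1:])
    )


area_model = trapezoid_area(X, Y)
print(round(area_model, 4))  # 0.775
```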
Because we plotted Cumulative Population (%) against Cumulative Positives (%), the area under this curve shows how quickly the positives (class 2) are captured as we move down the sorted list.
In our sample dataset, we have 4 positives (class 2) and 6 negatives (class 1).
A perfect model captures 100% of the positives by the time we reach 40% of the population.
The curve looks like this for a perfect model.

Area under the Lorenz curve for the perfect model:
\[
\begin{aligned}
\text{Perfect Area} &= \text{Triangle (0,0 to 0.4,1)} + \text{Rectangle (0.4,1 to 1,1)} \\[6pt]
&= \frac{1}{2} \times 0.4 \times 1 \;+\; 0.6 \times 1 \\[6pt]
&= 0.2 + 0.6 \\[6pt]
&= 0.8
\end{aligned}
\]
There is also another way to calculate the area under the curve for the perfect model.
\[
\text{Let } \pi \text{ be the proportion of positives in the dataset.}
\]
\[
\text{Perfect Area} = \frac{1}{2}\pi \cdot 1 + (1-\pi)\cdot 1
\]
\[
= \frac{\pi}{2} + (1-\pi)
\]
\[
= 1 - \frac{\pi}{2}
\]
For our dataset:
Here, we have 4 positives out of 10 records, so: π = 4/10 = 0.4.
\[
\text{Perfect Area} = 1 - \frac{0.4}{2} = 1 - 0.2 = 0.8
\]
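Both routes to the perfect-model area can be checked numerically. The sketch below builds the perfect ordering by placing every positive ahead of every negative, measures the area with the trapezoid formula, and compares it with 1 − π/2. The helper name `trapezoid_area` and the 0/1 label encoding are assumptions made for illustration.

```python
def trapezoid_area(xs, ys):
    """Sum the trapezoid areas between consecutive (x, y) points."""
    return sum(
        0.5 * (y1 + y2) * (x2 - x1)
        for x1, x2, y1, y2 in zip(xs, xs[1:], ys, ys[1:])
    )


labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # 1 = positive (class 2), 0 = negative
n = len(labels)
n_pos = sum(labels)

# A perfect model ranks every positive ahead of every negative.
perfect_order = sorted(labels, reverse=True)

X = [i / n for i in range(n + 1)]
Y = [sum(perfect_order[:i]) / n_pos for i in range(n + 1)]

area_perfect = trapezoid_area(X, Y)
pos_rate = n_pos / n  # this is pi in the formula above

print(round(area_perfect, 4), round(1 - pos_rate / 2, 4))  # 0.8 0.8
```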
We calculated the area under the Lorenz curve for our sample dataset and also for the perfect model with the same number of positives and negatives.
Now, if we go through the dataset without sorting, the positives are, on average, evenly spread out. This means the rate at which we collect positives is the same as the rate at which we move through the population.
This is the random model: its curve is the diagonal, and the area under it is always 0.5.
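One way to check the 0.5 figure is to average the area over every possible ordering of 4 positives among 10 records. This is only an illustrative sketch; `trapezoid_area` is our own helper, and any single random ordering can land above or below 0.5. Only the average sits exactly on the diagonal.

```python
from itertools import combinations


def trapezoid_area(xs, ys):
    """Sum the trapezoid areas between consecutive (x, y) points."""
    return sum(
        0.5 * (y1 + y2) * (x2 - x1)
        for x1, x2, y1, y2 in zip(xs, xs[1:], ys, ys[1:])
    )


n, n_pos = 10, 4
X = [i / n for i in range(n + 1)]

areas = []
# Each combination is one way to place the 4 positives in the 10 ranked rows.
for pos_idx in combinations(range(n), n_pos):
    positions = set(pos_idx)
    cum, Y = 0, [0.0]
    for i in range(n):
        cum += int(i in positions)
        Y.append(cum / n_pos)
    areas.append(trapezoid_area(X, Y))

mean_area = sum(areas) / len(areas)
print(len(areas), round(mean_area, 4))  # 210 0.5
```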

Step 5: Calculate the Gini Coefficient
\[
A_{\text{model}} = 0.775
\]
\[
A_{\text{random}} = 0.5
\]
\[
A_{\text{perfect}} = 0.8
\]
\[
\text{Gini} = \frac{A_{\text{model}} - A_{\text{random}}}{A_{\text{perfect}} - A_{\text{random}}}
\]
\[
= \frac{0.775 - 0.5}{0.8 - 0.5}
\]
\[
= \frac{0.275}{0.3}
\]
\[
\approx 0.92
\]
We got Gini = 0.92, which means almost all the positives are concentrated at the top of the sorted list. This shows that the model does a very good job of separating positives from negatives, coming close to perfect.
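The same arithmetic in Python, as a short sketch using the three areas computed in the previous steps (the variable names are illustrative):

```python
area_model = 0.775   # area under the model's Lorenz curve
area_random = 0.5    # area under the diagonal (random model)
area_perfect = 0.8   # area under the perfect model's curve

gini = (area_model - area_random) / (area_perfect - area_random)
print(round(gini, 4))  # 0.9167
```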
Now that we have seen how the Gini Coefficient is calculated, let’s look at what we actually did during the calculation.
We considered a sample of 10 points consisting of output probabilities from logistic regression.
We sorted the probabilities in descending order.
Next, we calculated Cumulative Population (%) and Cumulative Positives (%) and then plotted them.
We got a curve called the Lorenz curve, and we calculated the area under it, which is 0.775.
Now let’s understand what 0.775 means.
Our sample consists of 4 positives (class 2) and 6 negatives (class 1).
The output probabilities are for class 2, which means the higher the probability, the more likely the customer belongs to class 2.
In our sample data, all the positives are captured within the first 50% of the population, which means they are ranked near the top.
If the model were perfect, the positives would be captured within the first 4 rows, i.e., within the first 40% of the population, and the area under the curve for the perfect model is 0.8.
We got an area of 0.775, which is close to the perfect value.
Here, we are trying to measure the efficiency of the model. If more positives are concentrated at the top, it means the model is good at separating positives from negatives.
Next, we calculated the Gini Coefficient, which is 0.92.
\[
\text{Gini} = \frac{A_{\text{model}} - A_{\text{random}}}{A_{\text{perfect}} - A_{\text{random}}}
\]
The numerator tells us how much better our model is than random guessing.
The denominator tells us the maximum possible improvement over random.
The ratio puts these two together, so for any model that ranks at least as well as random, the Gini coefficient falls between 0 (random) and 1 (perfect).
Gini measures how close the model is to being perfect at separating the positive and negative classes.
But we may wonder why we calculated Gini instead of stopping at 0.775.
0.775 is the area under the Lorenz curve for our model. On its own, it doesn’t tell us how close the model is to being perfect; for that we have to compare it to 0.8, the area for the perfect model.
So, we calculate Gini to standardize the area so that it falls between 0 and 1, which makes it easy to compare models.
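To see why the raw area is hard to compare, here is a small sketch with a hypothetical second dataset in which only 20% of records are positive. The function name `gini_from_area` and the 20% figure are assumptions for illustration. The same raw area of 0.775 then corresponds to a noticeably weaker model, because the perfect area rises to 0.9.

```python
def gini_from_area(area_model, pos_rate):
    """Normalize a Lorenz-curve area using the perfect area 1 - pos_rate / 2."""
    area_random = 0.5
    area_perfect = 1 - pos_rate / 2
    return (area_model - area_random) / (area_perfect - area_random)


# Our sample: 40% positives, perfect area = 0.8
print(round(gini_from_area(0.775, 0.4), 4))  # 0.9167

# Hypothetical dataset: 20% positives, perfect area = 0.9
print(round(gini_from_area(0.775, 0.2), 4))  # 0.6875
```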
Banks also use the Gini Coefficient to evaluate credit risk models alongside ROC-AUC and the KS Statistic. Together, these measures give a complete picture of model performance.
Now, let’s calculate ROC-AUC for our sample data.
import pandas as pd
from sklearn.metrics import roc_auc_score

# Sample data
data = {
    "Actual": [2, 2, 2, 1, 2, 1, 1, 1, 1, 1],
    "Pred_Prob_Class2": [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01]
}
df = pd.DataFrame(data)
# Convert Actual: class 2 -> 1 (positive), class 1 -> 0 (negative)
y_true = (df["Actual"] == 2).astype(int)
y_score = df["Pred_Prob_Class2"]
# Calculate ROC-AUC
roc_auc = roc_auc_score(y_true, y_score)
roc_auc
We got AUC = 0.9583.
Now, Gini = (2 * AUC) - 1 = (2 * 0.9583) - 1 = 0.92
This is the relationship between Gini and ROC-AUC.
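The relationship can be verified without any library by counting pairs. As a sketch, ROC-AUC is computed here as the share of (positive, negative) pairs in which the positive record gets the higher score, with ties counted as half. On this sample there are no ties, and the result matches the Gini obtained from the Lorenz curve areas.

```python
actual = [2, 2, 2, 1, 2, 1, 1, 1, 1, 1]  # class 2 = positive
pred_prob = [0.92, 0.63, 0.51, 0.39, 0.29, 0.20, 0.13, 0.10, 0.05, 0.01]

pos = [p for a, p in zip(actual, pred_prob) if a == 2]
neg = [p for a, p in zip(actual, pred_prob) if a == 1]

# Count pairs where the positive outranks the negative (ties count as half).
wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
auc = wins / (len(pos) * len(neg))

gini_from_auc = 2 * auc - 1
gini_from_lorenz = (0.775 - 0.5) / (0.8 - 0.5)

print(round(auc, 4), round(gini_from_auc, 4), round(gini_from_lorenz, 4))
# 0.9583 0.9167 0.9167
```

Only one of the 24 pairs is out of order (the negative scored 0.39 outranks the positive scored 0.29), which is why AUC is 23/24.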
Now let’s calculate the Gini Coefficient on the full dataset.
Code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
file_path = "C:/german.data"
data = pd.read_csv(file_path, sep=" ", header=None)
# Rename columns
columns = [f"col_{i}" for i in range(1, 21)] + ["target"]
data.columns = columns
# Features and target
X = pd.get_dummies(data.drop(columns=["target"]), drop_first=True)
y = data["target"]
# Convert target to binary: 1 = bad credit (class 2), 0 = good credit (class 1)
y = (y == 2).astype(int)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Train logistic regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predicted probabilities for the positive class (class 2)
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC-AUC
auc = roc_auc_score(y_test, y_pred_proba)
# Calculate Gini
gini = 2 * auc - 1
auc, gini
We got Gini = 0.60.
Interpretation (rough rules of thumb):
Gini > 0.5: acceptable.
Gini = 0.6–0.7: good model.
Gini = 0.8+: excellent, rarely achieved.
Dataset
The dataset used in this blog is the German Credit dataset, which is publicly available on the UCI Machine Learning Repository. It is provided under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. This means it can be freely used and shared with proper attribution.
I hope you found this blog useful.
If you enjoyed reading, consider sharing it with your network, and feel free to share your thoughts.
If you haven’t read my earlier blogs on ROC-AUC and the Kolmogorov-Smirnov Statistic, you can check them out here.
Thanks for reading!