Building Predictive Models: Logistic Regression in Python - KDnuggets

Republished by Plato

Image by Author

When you are getting started with machine learning, logistic regression is one of the first algorithms you’ll add to your toolbox. It’s a simple and robust algorithm, commonly used for binary classification tasks.

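Consider a binary classification problem with classes 0 and 1. Logistic regression fits a logistic (or sigmoid) function to the input data and predicts the probability that a query data point belongs to class 1. Interesting, yes?

In this tutorial, we'll learn about logistic regression from the ground up, covering:

The logistic (or sigmoid) function
How we move from linear to logistic regression
How logistic regression works

Finally, we'll build a simple logistic regression model to classify radar returns from the ionosphere.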

Before we learn more about logistic regression, let's review how the logistic function works. The logistic (or sigmoid) function is given by:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

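When you plot the sigmoid function, it looks like this:

[Figure: plot of the sigmoid function]

From the plot, we see that:

When x = 0, σ(x) takes a value of 0.5.
When x approaches +∞, σ(x) approaches 1.
When x approaches -∞, σ(x) approaches 0.

So for all real inputs, the sigmoid function squishes them to take values in the range [0, 1].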
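The three properties above are easy to check numerically. The following is a minimal sketch using NumPy; the function name `sigmoid` and the sample inputs are illustrative choices, not part of the original tutorial.

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: 1 / (1 + exp(-x))."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

# Evaluate at a large negative input, zero, and a large positive input
values = sigmoid([-10.0, 0.0, 10.0])
print(values)  # close to 0, exactly 0.5, close to 1
```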

Let’s first discuss why we cannot use linear regression for a binary classification problem.

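In a binary classification problem, the output is a categorical label (0 or 1). Because linear regression predicts continuous-valued outputs that can be less than 0 or greater than 1, it does not make sense for the problem at hand.

Also, a straight line may not be the best fit when the output labels belong to one of two categories.

Image by Author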
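To make the first point concrete, here is a small sketch that fits a straight line to binary labels and then predicts for inputs outside the training range. The data is made up for illustration, and `np.polyfit` stands in for a full linear regression fit.

```python
import numpy as np

# Toy data: a single feature with binary labels (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Least-squares straight line: y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Predict for inputs well outside the training range
x_new = np.array([-5.0, 20.0])
preds = slope * x_new + intercept
print(preds)  # one value below 0 and one above 1, so neither is a valid probability
```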

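So how do we go from linear regression to logistic regression? In linear regression, the predicted output is given by:

$$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

where the βs are the coefficients and the X_i are the predictors (or features).

Without loss of generality, we assume X_0 = 1:

$$\hat{y} = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

So we can have a more concise expression:

$$\hat{y} = \sum_{i=0}^{n} \beta_i X_i = \beta^T X$$

In logistic regression, we need a predicted probability p in the [0, 1] interval. We know that the logistic function squishes its inputs so that they take values in the [0, 1] interval.

So plugging this expression into the logistic function, we have the predicted probability as:

$$p = \sigma(\beta^T X) = \frac{1}{1 + e^{-\beta^T X}}$$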
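As a rough sketch of this step, the snippet below prepends a column of ones (the X_0 = 1 convention), computes the linear combination of coefficients and features, and passes it through the logistic function. The helper name `predict_proba` and the coefficient values are hypothetical, chosen only for illustration.

```python
import numpy as np

def predict_proba(X, beta):
    """Predicted probability of class 1 for each row of X."""
    X = np.asarray(X, dtype=float)
    # Prepend a column of ones so that beta[0] acts as the intercept (X_0 = 1)
    X_aug = np.column_stack([np.ones(X.shape[0]), X])
    z = X_aug @ beta                  # linear combination beta^T X
    return 1.0 / (1.0 + np.exp(-z))   # squish into the (0, 1) range

# Hypothetical coefficients: intercept plus two feature weights
beta = np.array([-1.0, 2.0, 0.5])
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [0.5, 0.0]])

probs = predict_proba(X, beta)
print(probs)
```

The third row gives a linear combination of exactly 0, so its predicted probability is 0.5.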

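So how do we find the best-fit logistic curve for a given dataset? To answer this, let's understand maximum likelihood estimation.

Maximum likelihood estimation (MLE) is used to estimate the parameters of the logistic regression model by maximizing the likelihood function. Let's break down the process of MLE in logistic regression and how the cost function is formulated for optimization using gradient descent.

Breaking Down Maximum Likelihood Estimation

As discussed, we model the probability that a binary outcome occurs as a function of one or more predictor variables (or features):

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}$$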

Here, the βs are the model parameters or coefficients. X_1, X_2,…, X_n are the predictor variables.

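MLE aims to find the values of β that maximize the likelihood of the observed data. The likelihood function, denoted L(β), is the probability of observing the given outcomes for the given predictor values under the logistic regression model.

Formulating the Log-Likelihood Function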

To simplify the optimization process, it's common to work with the log-likelihood function, because it transforms products of probabilities into sums of log probabilities.
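Besides being easier to differentiate, sums of logs are also better behaved numerically. The sketch below uses randomly generated probabilities (not real model outputs) to show that a product of many probabilities underflows to zero in floating point, while the sum of their logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2000 made-up per-example probabilities, each between 0.1 and 0.9
probs = rng.uniform(0.1, 0.9, size=2000)

likelihood = np.prod(probs)              # product of many numbers below 1
log_likelihood = np.sum(np.log(probs))   # equivalent quantity on the log scale

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # a finite negative number
```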

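The log-likelihood function for logistic regression is given by:

$$\ell(\beta) = \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

where m is the number of data points, y_i is the label (0 or 1) of the i-th data point, and p_i is its predicted probability of belonging to class 1.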
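A minimal sketch of this computation follows, assuming the standard Bernoulli form: each data point contributes log(p_i) when its label is 1 and log(1 - p_i) when its label is 0. The function name, the clipping constant `eps`, and the sample values are illustrative assumptions.

```python
import numpy as np

def log_likelihood(y, p, eps=1e-12):
    """Sum over data points of y*log(p) + (1 - y)*log(1 - p)."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities away from 0 and 1 to avoid log(0)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
good_probs = np.array([0.9, 0.1, 0.8, 0.7])  # mostly agrees with the labels
bad_probs = np.array([0.2, 0.8, 0.3, 0.4])   # mostly disagrees with the labels

print(log_likelihood(y, good_probs))
print(log_likelihood(y, bad_probs))
```

Predictions that agree with the labels yield a higher (less negative) log-likelihood, which is what MLE tries to achieve by adjusting β.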

Now that we know the essence of log-likelihood, let's formulate the cost function for logistic regression and then use gradient descent to find the best model parameters.

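Cost Function for Logistic Regression

To optimize the logistic regression model, we need to maximize the log-likelihood. So we can use the negative log-likelihood as the cost function to minimize during training. The negative log-likelihood, often referred to as the logistic loss, is defined as:

$$J(\beta) = -\sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Here y_i is the label and p_i the predicted probability for the i-th of m data points.

The goal of the learning algorithm, therefore, is to find the values of β that minimize this cost function. Gradient descent is a commonly used optimization algorithm for finding the minimum of this cost function.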

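Gradient Descent in Logistic Regression

Gradient descent is an iterative optimization algorithm that updates the model parameters β in the direction opposite to the gradient of the cost function with respect to β. The update rule at step t+1 for logistic regression using gradient descent is as follows:

$$\beta^{(t+1)} = \beta^{(t)} - \alpha \, \frac{\partial J(\beta^{(t)})}{\partial \beta}$$

where α is the learning rate.

The partial derivatives can be computed using the chain rule. Gradient descent iteratively updates the parameters until convergence, aiming to minimize the logistic loss. When it converges, it finds the optimal values of β that maximize the likelihood of the observed data.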
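The procedure can be sketched from scratch in NumPy. This is a simplified illustration, not what scikit-learn does internally: it assumes the gradient of the logistic loss is X^T(p - y), which follows from the chain rule, and averages it over the data points so the learning rate does not depend on the dataset size. The function names, learning rate, iteration count, and synthetic data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(X_aug, y, beta):
    """Average negative log-likelihood."""
    p = np.clip(sigmoid(X_aug @ beta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_gd(X, y, alpha=0.1, n_iters=2000):
    """Fit logistic regression coefficients with plain gradient descent."""
    X_aug = np.column_stack([np.ones(X.shape[0]), X])  # X_0 = 1 for the intercept
    beta = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X_aug @ beta)
        grad = X_aug.T @ (p - y) / len(y)  # averaged gradient of the logistic loss
        beta -= alpha * grad               # step against the gradient
    return beta, X_aug

# Synthetic one-feature data: larger x makes class 1 more likely
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)

beta, X_aug = fit_logistic_gd(X, y)
initial_loss = logistic_loss(X_aug, y, np.zeros(2))
final_loss = logistic_loss(X_aug, y, beta)
accuracy = np.mean((sigmoid(X_aug @ beta) >= 0.5) == y)
print(beta, initial_loss, final_loss, accuracy)
```

In practice you would rely on a library solver, as the rest of this tutorial does, but the loop above shows the update rule in action.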

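Now that you know how logistic regression works, let's build a predictive model using the scikit-learn library.

We'll use the Ionosphere dataset from the UCI Machine Learning Repository for this tutorial. The dataset comprises 34 numerical features. The output is binary, one of 'good' or 'bad' (denoted by 'g' or 'b'). The output label 'good' refers to radar returns that have detected some structure in the ionosphere.

Step 1 – Loading the Dataset

First, download the dataset and read it into a pandas dataframe: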

import pandas as pd
import urllib.request

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data"
data = urllib.request.urlopen(url)
df = pd.read_csv(data, header=None)

Step 2 – Exploring the Dataset

Let’s take a look at the first few rows of the dataframe:

# Display the first few rows of the DataFrame
df.head()

Truncated output of df.head()

Let’s get some information about the dataset: the number of non-null values and the data types of each of the columns:

# Get information about the dataset
df.info()

Truncated output of df.info()

Because we have all numeric features, we can also get some descriptive statistics using the describe() method on the dataframe:

# Get descriptive statistics of the dataset
print(df.describe())

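Truncated output of df.describe()

The column names are currently 0 through 34, including the label. The dataset does not provide descriptive names for the columns; it only refers to them as attribute_1 to attribute_34. If you would like, you can rename the columns of the dataframe as shown: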

# Name the 34 feature columns attribute_1 through attribute_34, followed by the label column
column_names = [f"attribute_{i}" for i in range(1, 35)] + ["class_label"]
df.columns = column_names

Note: This step is completely optional. You can proceed with the default column names if you prefer.

# Display the first few rows of the DataFrame
df.head()

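Truncated output of df.head() [after renaming columns]

Step 3 – Renaming Class Labels and Visualizing Class Distribution

Because the output class labels are 'g' and 'b', we need to map them to 1 and 0, respectively. You can do so using map() or replace():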

# Convert the class labels from 'g' and 'b' to 1 and 0, respectively
df["class_label"] = df["class_label"].replace({'g': 1, 'b': 0})
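For comparison, here is how the map() alternative mentioned above might look. The small Series below is made up for illustration and stands in for the class_label column.

```python
import pandas as pd

# A made-up stand-in for the class_label column
labels = pd.Series(['g', 'b', 'g', 'g', 'b'], name='class_label')

# map() looks each value up in the dictionary
encoded = labels.map({'g': 1, 'b': 0})
print(encoded.tolist())  # [1, 0, 1, 1, 0]
```

One difference worth knowing: map() turns any value missing from the dictionary into NaN, whereas replace() leaves such values unchanged.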

Let's also visualize the distribution of the class labels:

import matplotlib.pyplot as plt

# Count the number of data points in each class
class_counts = df['class_label'].value_counts()

# Create a bar plot to visualize the class distribution
plt.bar(class_counts.index, class_counts.values)
plt.xlabel('Class Label')
plt.ylabel('Count')
plt.xticks(class_counts.index)
plt.title('Class Distribution')
plt.show()

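Distribution of class labels

We see that the distribution is imbalanced: there are more records belonging to class 1 than to class 0. We'll handle this class imbalance when building the logistic regression model.

Step 5 – Preprocessing the Dataset

Let's collect the features and output labels like so: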

X = df.drop('class_label', axis=1)  # Input features
y = df['class_label']               # Target variable

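After splitting the dataset into train and test sets, we need to preprocess it.

When there are many numeric features, each on a potentially different scale, we need to preprocess them. A common method is to transform them so that they follow a distribution with zero mean and unit variance.

The StandardScaler from scikit-learn's preprocessing module helps us achieve this.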

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only and standardize the training features
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)

# Standardize the test set using the statistics learned from the training set
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
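To see what the scaler is doing, the sketch below standardizes a tiny made-up array by hand, subtracting the training mean and dividing by the training standard deviation, and checks the result against StandardScaler. The arrays are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0],
                    [4.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

# Statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std  # reuse the training statistics

scaler = StandardScaler().fit(X_train)
print(np.allclose(manual_train, scaler.transform(X_train)))
print(np.allclose(manual_test, scaler.transform(X_test)))
```

Reusing the training statistics for the test set keeps information about the test data from leaking into preprocessing.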

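Step 6 – Building a Logistic Regression Model

Now we can instantiate a logistic regression classifier. The LogisticRegression class is part of scikit-learn's linear_model module.

Notice that we have set the class_weight parameter to 'balanced'. This helps us account for the class imbalance by assigning each class a weight that is inversely proportional to the number of records in that class.

After instantiating the class, we can fit the model to the training dataset: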

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
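As an illustration of what 'balanced' means, scikit-learn documents the weights as n_samples / (n_classes * np.bincount(y)). The sketch below computes them by hand for a made-up label array and compares the result with the compute_class_weight utility.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 3 records of class 0 and 7 of class 1
y = np.array([0] * 3 + [1] * 7)

# Manual computation: n_samples / (n_classes * count per class)
manual_weights = len(y) / (2 * np.bincount(y))

# The same computation using scikit-learn's utility
sk_weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

print(manual_weights)
print(sk_weights)
```

The minority class (0) receives the larger weight, so misclassifying its records costs more during training.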

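Step 7 – Evaluating the Logistic Regression Model

You can call the predict() method to get the model's predictions.

In addition to the accuracy score, we can also get a classification report with metrics like precision, recall, and F1-score.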

from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

classification_rep = classification_report(y_test, y_pred)
print("Classification Report:\n", classification_rep)

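Since logistic regression predicts probabilities, it can also be useful to look at predict_proba() and a confusion matrix alongside the metrics above. The sketch below is self-contained and uses a synthetic dataset from make_classification as a stand-in for the Ionosphere data, so its numbers do not reflect the tutorial's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (a stand-in, not the Ionosphere dataset)
X, y = make_classification(n_samples=300, n_features=5, weights=[0.35, 0.65], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Each row holds [P(class 0), P(class 1)] for one test record
proba = model.predict_proba(X_test)

# predict() returns the class with the higher probability
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(proba[:5])
print(cm)
```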