Dealing with Noisy Labels in Text Data

Republished by Plato


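With rising interest in natural language processing, more and more practitioners are hitting a wall not because they can't build or fine-tune LLMs, but because their data is messy!

We will show simple, yet very effective coding procedures for fixing noisy labels in text data. We will deal with 2 common scenarios in real-world text data:

1. One category contains mixed examples from several other categories. I like to call this kind of category a meta category.
2. Two or more categories should be merged into 1 category, because the texts belonging to them refer to the same topic.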

We will use an ITSM (IT Service Management) dataset created for this tutorial (CC0 license). It is available on Kaggle at the link below:

https://www.kaggle.com/datasets/nikolagreb/small-itsm-dataset

It's time to import all the libraries we need and take a first look at the data. Brace yourself, code is coming!

import pandas as pd
import numpy as np
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics

df = pd.read_excel("ITSM_data.xlsx")
df.info()


RangeIndex: 118 entries, 0 to 117
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID_request             118 non-null    int64
 1   Text                   117 non-null    object
 2   Category               115 non-null    object
 3   Solution               115 non-null    object
 4   Date_request_recieved  118 non-null    datetime64[ns]
 5   Date_request_solved    118 non-null    datetime64[ns]
 6   ID_agent               118 non-null    int64
dtypes: datetime64[ns](2), int64(2), object(3)
memory usage: 6.6+ KB

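Each row represents one entry in the ITSM database. We will try to predict the category of a ticket based on the ticket text written by the user. Let's take a closer look at the fields that matter most for this business use case.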

for text, category in zip(df.Text.sample(3, random_state=2), df.Category.sample(3, random_state=2)):
    print("TEXT:")
    print(text)
    print("CATEGORY:")
    print(category)
    print("-"*100)

TEXT:
I just want to talk to an agent, there are too many problems on my pc to be explained in one ticket. Please call me when you see this, whoever you are. (talk to agent)
CATEGORY:
Asana
----------------------------------------------------------------------------------------------------
TEXT:
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
CATEGORY:
Help Needed
----------------------------------------------------------------------------------------------------
TEXT:
My mail stopped to work after I updated Windows.
CATEGORY:
Outlook
----------------------------------------------------------------------------------------------------

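If we look at the first two tickets, although one of them is in German, we can see that the described problems refer to the same software, Asana, yet they carry different labels. This is the starting distribution of our categories: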

df.Category.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"

Outlook 19.1%
Discord 13.9%
CRM 12.2%
Internet Browser 10.4%
Mail 9.6%
Keyboard 9.6%
Asana 8.7%
Mouse 8.7%
Help Needed 7.8%
Name: Category, dtype: object

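Help Needed looks suspicious, like a category that could contain tickets from several other categories. Also, the categories Outlook and Mail sound similar; maybe they should be merged into one category. Before diving deeper into these categories, we will drop the rows with missing values in the columns we care about.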

important_columns = ["Text", "Category"]
for cat in important_columns:
    df.drop(df[df[cat].isna()].index, inplace=True)
df.reset_index(inplace=True, drop=True)
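
The same cleanup can also be written with pandas' built-in `dropna`. The following is a minimal sketch on a made-up three-row frame; `toy` and its contents are assumptions for illustration, not the ITSM data:

```python
import pandas as pd

# Hypothetical toy data standing in for the ITSM tickets.
toy = pd.DataFrame({
    "Text": ["Outlook does not start", None, "Mouse is broken"],
    "Category": ["Outlook", "Mail", None],
    "Solution": ["Reinstalled", "n/a", "Replaced"],
})

important_columns = ["Text", "Category"]

# Drop every row with a missing value in any of the important columns,
# then renumber the index from 0.
toy_clean = toy.dropna(subset=important_columns).reset_index(drop=True)

print(toy_clean)
```

Only the first row survives, because the other two are missing either the text or the label.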

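There is no valid substitute for examining the data with your own eyes. The fancy function for doing so in pandas is .sample(), so we will use it once more, this time on the suspicious category: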

meta = df[df.Category == "Help Needed"]

for text in meta.Text.sample(5, random_state=2):
    print(text)
    print("-"*100)

Discord emojis aren't available to me, I would like to have this option enabled like other team members have.
---------------------------------------------------------------------------
Bitte reparieren Sie mein Hubspot CRM. Seit gestern funktioniert es nicht mehr
---------------------------------------------------------------------------
My headphones aren't working. I would like to order new.
---------------------------------------------------------------------------
Bundled problems with Office since restart:
Messages not sent
Outlook does not connect, mails do not arrive
Error 0x8004deb0 appears when Connection attempt, see attachment
The company account is affected: AB123
Access via Office.com seems to be possible.
---------------------------------------------------------------------------
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
---------------------------------------------------------------------------

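Obviously, we have tickets talking about Discord, Asana, and CRM. So the category of these tickets should be changed from "Help Needed" to the existing, more specific categories. As the first step of the reassignment process, we will create a new column "Keywords" that records whether the ticket's "Text" column contains a word from the list of categories.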

words_categories = np.unique([word.strip().lower() for word in df.Category])  # list of categories

def keywords(row):
    list_w = []
    # strip punctuation, lowercase, and keep only words that are category names
    for word in row.translate(str.maketrans("", "", string.punctuation)).lower().split():
        if word in words_categories:
            list_w.append(word)
    return list_w

df["Keywords"] = df.Text.apply(keywords)

# since our output is in the list, this function will give us better looking final output.
def clean_row(row):
    row = str(row)
    row = row.replace("[", "")
    row = row.replace("]", "")
    row = row.replace("'", "")
    row = string.capwords(row)
    return row

df["Keywords"] = df.Keywords.apply(clean_row)

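Also, note that using "if word in str(words_categories)" instead of "if word in words_categories" would catch words from categories with more than 1 word (Internet Browser in our case), but it would also require more data preprocessing. To keep things simple and to the point, we will go with the code that handles only single-word categories. This is how our dataset looks now: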

df.head(2)

[The output was shown as an image in the original post.]
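
As a side note on the "if word in str(words_categories)" variant mentioned above, the small sketch below shows why it needs extra preprocessing. It uses a plain Python list named `categories` as a stand-in (an assumption; the tutorial builds its list with np.unique), but the substring behavior is the same idea:

```python
# Hypothetical category list; the tutorial builds its own with np.unique.
categories = ["asana", "crm", "internet browser"]

word = "internet"

# Exact membership: the word must equal a whole category name.
exact_match = word in categories

# Substring search in the printed list: matches parts of names as well.
substring_match = word in str(categories)

# The substring check also fires on fragments that are not real keywords.
false_positive = "row" in str(categories)  # "row" is inside "browser"

print(exact_match, substring_match, false_positive)
```

The substring check finds "internet", which the exact check misses, but it also accepts fragments such as "row", so the tokens would need filtering before this variant is safe to use.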

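After extracting the Keywords column, we will make assumptions about the quality of the tickets. Our hypotheses:

1. A ticket with just 1 keyword in the text field, where that keyword matches the ticket's category, should be easy to classify.
2. A ticket with multiple keywords in the text field, where at least one keyword matches the ticket's category, should be easy to classify in most cases.
3. A ticket that has keywords, none of which matches the name of the ticket's category, is probably a noisy label case.
4. All other tickets are neutral with respect to keywords.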

cl_list = []

for category, keywords in zip(df.Category, df.Keywords):
    if category.lower() == keywords.lower() and keywords != "":
        cl_list.append("easy_classification")
    elif category.lower() in keywords.lower():  # to deal with multiple keywords in the ticket
        cl_list.append("probably_easy_classification")
    elif category.lower() != keywords.lower() and keywords != "":
        cl_list.append("potential_problem")
    else:
        cl_list.append("neutral")

df["Ease_classification"] = cl_list
df.Ease_classification.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"

neutral 45.6%
easy_classification 37.7%
potential_problem 9.6%
probably_easy_classification 7.0%
Name: Ease_classification, dtype: object

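We have our new distribution, and now it is time to examine the tickets classified as a potential problem. In practice, this step would require much more sampling and eyeballing of larger chunks of data, but the rationale is the same. You should find the problematic tickets and decide whether their quality can be improved or whether they should be dropped from the dataset. When you face a large dataset, stay calm, and don't forget that data examination and data preparation usually take much more time than building ML algorithms!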

pp = df[df.Ease_classification == "potential_problem"]

# sample whole rows once so that each text stays paired with its own category
pp_sample = pp.sample(3, random_state=2)

for text, category in zip(pp_sample.Text, pp_sample.Category):
    print("TEXT:")
    print(text)
    print("CATEGORY:")
    print(category)
    print("-"*100)

TEXT:
outlook issue , I did an update Windows and I have no more outlook on my notebook ? Please help !
 
Outlook
CATEGORY:
Mail
--------------------------------------------------------------------
TEXT:
Please relase blocked attachements from the mail I got from name.surname@company.com. These are data needed for social media marketing campaing.
CATEGORY:
Outlook
--------------------------------------------------------------------
TEXT:
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
CATEGORY:
Help Needed
--------------------------------------------------------------------
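
Alongside reading samples, it can help to count which keywords turn up under which labels among the flagged tickets. The sketch below uses `pd.crosstab` on a made-up frame; `flagged` and its rows are assumptions for illustration, not results from the ITSM data:

```python
import pandas as pd

# Hypothetical tickets flagged as potential problems:
# the assigned label versus the keyword found in the text.
flagged = pd.DataFrame({
    "Category": ["Mail", "Mail", "Outlook", "Help Needed", "Help Needed"],
    "Keywords": ["Outlook", "Outlook", "Mail", "Asana", "Discord"],
})

# Count how often each assigned label co-occurs with each keyword.
pairs = pd.crosstab(flagged["Category"], flagged["Keywords"])
print(pairs)
```

In this toy table, a pair that appears in both directions (Mail with Outlook, Outlook with Mail) hints at categories worth merging, while one label spread across many keywords hints at a meta category.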

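We learned that tickets from the Outlook and Mail categories relate to the same problem, so we will merge these 2 categories, which should improve the results of our future ML classification algorithm.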

mail_categories_to_merge = ["Outlook", "Mail"]

sum_mail_cluster = 0
for x in mail_categories_to_merge:
    sum_mail_cluster += len(df[df["Category"] == x])

print("Number of categories to be merged into new cluster: ", len(mail_categories_to_merge))
print("Expected number of tickets in the new cluster: ", sum_mail_cluster)

def rename_to_mail_cluster(category):
    if category in mail_categories_to_merge:
        category = "Mail_CLUSTER"
    else:
        category = category
    return category

df["Category"] = df["Category"].apply(rename_to_mail_cluster)

df.Category.value_counts()

Number of categories to be merged into new cluster: 2
Expected number of tickets in the new cluster: 33
Mail_CLUSTER 33
Discord 15
CRM 14
Internet Browser 12
Keyboard 11
Asana 10
Mouse 10
Help Needed 9
Name: Category, dtype: int64
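
A more compact way to express the same merge is a mapping passed to `Series.replace`. This is a sketch on a made-up label series; `labels` and `merge_map` are hypothetical names, not part of the tutorial's code:

```python
import pandas as pd

# Hypothetical labels standing in for df["Category"].
labels = pd.Series(["Outlook", "Mail", "Discord", "Outlook", "CRM"])

# Map each category to be merged onto the shared cluster name.
merge_map = {"Outlook": "Mail_CLUSTER", "Mail": "Mail_CLUSTER"}
merged = labels.replace(merge_map)

print(merged.value_counts())
```

Labels that are not keys in the mapping pass through unchanged, so new merges can be added by extending the dictionary.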

Last but not least, we want to relabel some of the tickets in the meta category "Help Needed" to their proper category.

# strip punctuation, then flag tickets whose text contains at least one category name
has_keyword = (
    df["Text"].str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.split()
    .apply(lambda words: bool(set(words).intersection(words_categories)))
)
df.loc[(df["Category"] == "Help Needed") & has_keyword, "Category"] = "Change"

def cat_name_change(cat, keywords):
    if cat == "Change":
        cat = keywords
    else:
        cat = cat
    return cat

df["Category"] = df.apply(lambda x: cat_name_change(x.Category, x.Keywords), axis=1)
df["Category"] = df["Category"].replace({"Crm": "CRM"})

df.Category.value_counts(dropna=False)

Mail_CLUSTER 33
Discord 16
CRM 15
Internet Browser 12
Asana 11
Keyboard 11
Mouse 10
Help Needed 6
Name: Category, dtype: int64
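
The relabeling step can be easier to follow on a tiny example. The sketch below assumes each relabeled ticket has exactly one keyword and uses a made-up frame named `toy`; it assigns the keyword directly rather than going through a temporary "Change" label:

```python
import pandas as pd

# Hypothetical tickets; Keywords is "" when no category name was found.
toy = pd.DataFrame({
    "Category": ["Help Needed", "Help Needed", "Mouse", "Help Needed"],
    "Keywords": ["Discord", "", "Mouse", "Crm"],
})

# Relabel only meta-category tickets that carry a keyword.
mask = (toy["Category"] == "Help Needed") & (toy["Keywords"] != "")
toy.loc[mask, "Category"] = toy.loc[mask, "Keywords"]

# Capitalizing the keywords turned "crm" into "Crm"; map it back to the label.
toy["Category"] = toy["Category"].replace({"Crm": "CRM"})

print(toy)
```

Tickets in "Help Needed" without any keyword keep their label, which matches the 6 tickets that remain in that category above.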

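We have relabeled and cleaned our data, but we shouldn't call ourselves data scientists if we don't run at least one scientific experiment and test the impact of our work on the final classification. We will do so by implementing the Complement Naive Bayes classifier in sklearn. Feel free to try other, more complex algorithms. Also, be aware that further data cleaning is possible; for example, we could also drop all the tickets left in the "Help Needed" category.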

model = make_pipeline(TfidfVectorizer(), ComplementNB())

# old df
df_o = pd.read_excel("ITSM_data.xlsx")

important_categories = ["Text", "Category"]
for cat in important_categories:
    df_o.drop(df_o[df_o[cat].isna()].index, inplace=True)

df_o.name = "dataset just without missing"
df.name = "dataset after deeper cleaning"

for dataframe in [df_o, df]:
    # Split dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(dataframe.Text, dataframe.Category, test_size=0.2, random_state=1)

    # Training the model with train data
    model.fit(X_train, y_train)

    # Predict the response for test dataset
    y_pred = model.predict(X_test)

    print(f"Accuracy of Complement Naive Bayes classifier model on {dataframe.name} is: {round(metrics.accuracy_score(y_test, y_pred),2)}")

Accuracy of Complement Naive Bayes classifier model on dataset just without missing is: 0.48
Accuracy of Complement Naive Bayes classifier model on dataset after deeper cleaning is: 0.65

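Pretty impressive, right? The dataset we used is small (on purpose, so you can easily see what happens at each step), so different random seeds might produce different results, but in the vast majority of cases the model will perform significantly better on the cleaned dataset than on the original one. We did a good job!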
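
Because a single split of a small dataset is noisy, one way to check the claim about random seeds is to repeat the split with several seeds and look at the spread. The following is a sketch under stated assumptions: the toy corpus and the helper `accuracy_over_seeds` are hypothetical, and you would pass df.Text and df.Category instead of the toy lists:

```python
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with df.Text and df.Category.
texts = ["outlook will not open", "mail is not arriving", "mouse stopped moving",
         "mouse cursor freezes", "outlook mail error", "mouse button is stuck"] * 5
labels = ["Mail", "Mail", "Mouse", "Mouse", "Mail", "Mouse"] * 5

def accuracy_over_seeds(texts, labels, seeds):
    """Train and score a fresh pipeline once per random seed."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.2, random_state=seed)
        model = make_pipeline(TfidfVectorizer(), ComplementNB())
        model.fit(X_train, y_train)
        scores.append(metrics.accuracy_score(y_test, model.predict(X_test)))
    return np.array(scores)

scores = accuracy_over_seeds(texts, labels, seeds=range(10))
print(f"mean accuracy: {scores.mean():.2f}, std: {scores.std():.2f}")
```

Comparing the mean and standard deviation for the original and the cleaned dataset gives a steadier picture than a single seed.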

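Nikola Greb has been coding for more than four years, and for the past two years he has specialized in NLP. Before turning to data science, he was successful in sales, HR, writing, and chess.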