Dealing with Noisy Labels in Text Data



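With the rising interest in natural language processing, more and more practitioners are hitting a wall, not because they can't build or fine-tune LLMs, but because their data is messy.

We will show simple yet very effective coding procedures for fixing noisy labels in text data. We will deal with 2 common scenarios in real-world text data:

Having a category that contains a mix of examples from several other categories. I like to call this kind of category a meta category.
Having 2 or more categories that should be merged into 1 category, because the texts belonging to them refer to the same topic.

We will use an ITSM (IT Service Management) dataset created for this tutorial (CC0 license). It is available on Kaggle at the link below: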

https://www.kaggle.com/datasets/nikolagreb/small-itsm-dataset

Let's start by importing all the necessary libraries and doing some basic data exploration. Brace yourself, code is coming!

import pandas as pd
import numpy as np
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics

df = pd.read_excel("ITSM_data.xlsx")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID_request             118 non-null    int64
 1   Text                   117 non-null    object
 2   Category               115 non-null    object
 3   Solution               115 non-null    object
 4   Date_request_recieved  118 non-null    datetime64[ns]
 5   Date_request_solved    118 non-null    datetime64[ns]
 6   ID_agent               118 non-null    int64
dtypes: datetime64[ns](2), int64(2), object(3)
memory usage: 6.6+ KB

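Each row represents one entry in the ITSM database. We will try to predict the category of a ticket based on the text of the ticket written by a user. Let's take a closer look at the fields most important for the described business use case.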

for text, category in zip(df.Text.sample(3, random_state=2), df.Category.sample(3, random_state=2)):
    print("TEXT:")
    print(text)
    print("CATEGORY:")
    print(category)
    print("-"*100)

TEXT:
I just want to talk to an agent, there are too many problems on my pc to be explained in one ticket. Please call me when you see this, whoever you are. (talk to agent)
CATEGORY:
Asana
----------------------------------------------------------------------------------------------------
TEXT:
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
CATEGORY:
Help Needed
----------------------------------------------------------------------------------------------------
TEXT:
My mail stopped to work after I updated Windows.
CATEGORY:
Outlook
----------------------------------------------------------------------------------------------------
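
As an aside on the loop above: it keeps texts and labels aligned only because both columns are sampled with the same size and the same random_state. A minimal sketch of a variation that samples whole rows once, so the pairing holds by construction. The toy DataFrame is hypothetical and only stands in for the ITSM data.

```python
import pandas as pd

# Hypothetical toy tickets standing in for the ITSM data.
toy = pd.DataFrame({
    "Text": ["Asana is down", "Mouse is broken", "Mail not syncing", "Discord has no sound"],
    "Category": ["Asana", "Mouse", "Mail", "Discord"],
})

# Sample whole rows once, so each text stays paired with its own label.
sampled = toy[["Text", "Category"]].sample(3, random_state=2)

for text, category in zip(sampled.Text, sampled.Category):
    print("TEXT:")
    print(text)
    print("CATEGORY:")
    print(category)
    print("-" * 100)
```

Sampling rows rather than columns also removes the risk of the two sample sizes drifting apart when the code is edited later.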

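If we look at the first two tickets, although one ticket is in German, we can see that the problems described refer to the same software, Asana, but they carry different labels. This is the starting distribution of our categories: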

df.Category.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"

Outlook 19.1%
Discord 13.9%
CRM 12.2%
Internet Browser 10.4%
Mail 9.6%
Keyboard 9.6%
Asana 8.7%
Mouse 8.7%
Help Needed 7.8%
Name: Category, dtype: object

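"Help Needed" looks suspicious, like a category that can contain tickets from multiple other categories. Also, the categories Outlook and Mail sound similar; maybe they should be merged into one category. Before diving deeper into these categories, we will get rid of missing values in the columns of interest.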

important_columns = ["Text", "Category"]
for cat in important_columns:
    df.drop(df[df[cat].isna()].index, inplace=True)
df.reset_index(inplace=True, drop=True)
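
The same cleanup can also be written with pandas' dropna and its subset parameter. This is a minimal sketch on a hypothetical toy DataFrame, not the tutorial's dataset; it counts the missing values first, so it is clear how many rows the drop is expected to remove.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one ticket without text, one without a label.
toy = pd.DataFrame({
    "Text": ["Asana is down", None, "Mail not syncing", "Keyboard keys stuck"],
    "Category": ["Asana", "Mouse", np.nan, "Keyboard"],
    "Solution": [None, "Replaced", "Restarted", None],
})

# How many values are missing per column of interest before cleaning.
missing_before = toy[["Text", "Category"]].isna().sum()

# Drop rows where Text or Category is missing; an empty Solution is tolerated.
cleaned = toy.dropna(subset=["Text", "Category"]).reset_index(drop=True)

print(missing_before)
print(cleaned)
```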

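There is no valid substitute for examining data with the bare eye. The fancy function for doing so in pandas is .sample(), so we will do exactly that once more, this time for the suspicious category: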

meta = df[df.Category == "Help Needed"]

for text in meta.Text.sample(5, random_state=2):
    print(text)
    print("-"*100)

Discord emojis aren't available to me, I would like to have this option enabled like other team members have.
---------------------------------------------------------------------------
Bitte reparieren Sie mein Hubspot CRM. Seit gestern funktioniert es nicht mehr
---------------------------------------------------------------------------
My headphones aren't working. I would like to order new.
---------------------------------------------------------------------------
Bundled problems with Office since restart:
Messages not sent
Outlook does not connect, mails do not arrive
Error 0x8004deb0 appears when Connection attempt, see attachment
The company account is affected: AB123
Access via Office.com seems to be possible.
---------------------------------------------------------------------------
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
---------------------------------------------------------------------------

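Obviously, we have tickets about Discord, Asana, and CRM. So the category of these tickets should be changed from "Help Needed" to the existing, more specific categories. As the first step of the reassignment process, we will create a new column "Keywords" that tells us whether the ticket contains a word from the list of categories in its "Text" column.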

words_categories = np.unique([word.strip().lower() for word in df.Category])  # list of categories

def keywords(row):
    list_w = []
    for word in row.translate(str.maketrans("", "", string.punctuation)).lower().split():
        if word in words_categories:
            list_w.append(word)
    return list_w

df["Keywords"] = df.Text.apply(keywords)

# since our output is in the list, this function will give us better looking final output.
def clean_row(row):
    row = str(row)
    row = row.replace("[", "")
    row = row.replace("]", "")
    row = row.replace("'", "")
    row = string.capwords(row)
    return row

df["Keywords"] = df.Keywords.apply(clean_row)

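Also, note that using "if word in str(words_categories)" instead of "if word in words_categories" would catch words from categories with more than 1 word (Internet Browser in our case), but would also require more data preprocessing. To keep things simple and straight to the point, we will go with the code for categories made of just one word. This is how our dataset looks now: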

df.head(2)

[Output of df.head(2), shown as an image in the original]
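
The multi-word case mentioned above (Internet Browser) can be handled without a raw substring test, which would also match fragments such as "net". What follows is a rough sketch under stated assumptions, not the tutorial's method: the function find_keywords and the category list are hypothetical, and matching is done on whole words or phrases with a regular expression.

```python
import re
import string

# Hypothetical category names; "internet browser" consists of two words.
categories = ["asana", "crm", "internet browser", "mail"]

def find_keywords(text, categories):
    """Return the category names that occur in text as whole words or phrases."""
    # Lowercase, strip punctuation, and collapse runs of whitespace.
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    cleaned = " ".join(cleaned.split())
    found = []
    for cat in categories:
        # \b keeps "mail" from matching inside "mailbox".
        if re.search(r"\b" + re.escape(cat) + r"\b", cleaned):
            found.append(cat)
    return found

print(find_keywords("My Internet   Browser crashes when I open CRM!", categories))
# → ['crm', 'internet browser']
print(find_keywords("The mailbox is full.", categories))
# → []
```

The result follows the order of the category list, not the order of appearance in the text.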

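After extracting the Keywords column, we will make an assumption about the quality of the tickets. Our hypothesis:

Tickets with just 1 keyword in the Text field that is the same as the category the ticket belongs to would be easy to classify.
Tickets with multiple keywords in the Text field, where at least one of the keywords is the same as the category the ticket belongs to, would be easy to classify in the majority of cases.
Tickets that have keywords, but none of them matches the name of the category the ticket belongs to, are probably noisy-label cases.
Other tickets are neutral based on keywords.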

cl_list = []

for category, keywords in zip(df.Category, df.Keywords):
    if category.lower() == keywords.lower() and keywords != "":
        cl_list.append("easy_classification")
    elif category.lower() in keywords.lower():  # to deal with multiple keywords in the ticket
        cl_list.append("probably_easy_classification")
    elif category.lower() != keywords.lower() and keywords != "":
        cl_list.append("potential_problem")
    else:
        cl_list.append("neutral")

df["Ease_classification"] = cl_list
df.Ease_classification.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"

neutral 45.6%
easy_classification 37.7%
potential_problem 9.6%
probably_easy_classification 7.0%
Name: Ease_classification, dtype: object
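
To see which categories hold the suspicious tickets, the new column can be cross-tabulated against the labels. This is a minimal sketch with pd.crosstab on a hypothetical toy DataFrame; the column names match the tutorial, the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy labels and ease-of-classification buckets.
toy = pd.DataFrame({
    "Category": ["Help Needed", "Help Needed", "Mail", "Outlook", "Asana", "Mouse"],
    "Ease_classification": ["potential_problem", "neutral", "potential_problem",
                            "easy_classification", "easy_classification", "neutral"],
})

# Rows: categories; columns: buckets; cells: ticket counts.
table = pd.crosstab(toy.Category, toy.Ease_classification)
print(table)

# Categories ordered by how many potentially mislabeled tickets they hold.
suspects = table["potential_problem"].sort_values(ascending=False)
print(suspects[suspects > 0])
```

A category that dominates the potential_problem column is a natural first candidate for manual inspection.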

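We have our new distribution, and now it is time to examine the tickets classified as a potential problem. In practice, the following step would require much more sampling and looking at larger chunks of data with the bare eye, but the rationale would be the same. You are supposed to find problematic tickets and decide whether you can improve their quality or whether you should drop them from the dataset. When you are facing a large dataset, stay calm, and don't forget that data examination and data preparation usually take much more time than building ML algorithms!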

pp = df[df.Ease_classification == "potential_problem"]

# sample the same number of rows from both columns so texts and labels stay paired
for text, category in zip(pp.Text.sample(3, random_state=2), pp.Category.sample(3, random_state=2)):
    print("TEXT:")
    print(text)
    print("CATEGORY:")
    print(category)
    print("-"*100)

TEXT:
outlook issue , I did an update Windows and I have no more outlook on my notebook ? Please help !
 
Outlook
CATEGORY:
Mail
--------------------------------------------------------------------
TEXT:
Please relase blocked attachements from the mail I got from name.surname@company.com. These are data needed for social media marketing campaing.
CATEGORY:
Outlook
--------------------------------------------------------------------
TEXT:
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
CATEGORY:
Help Needed
--------------------------------------------------------------------

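We can see that tickets from the Outlook and Mail categories relate to the same problems, so we will merge these 2 categories to improve the results of our future ML classification algorithm.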

mail_categories_to_merge = ["Outlook", "Mail"]

sum_mail_cluster = 0
for x in mail_categories_to_merge:
    sum_mail_cluster += len(df[df["Category"] == x])

print("Number of categories to be merged into new cluster: ", len(mail_categories_to_merge))
print("Expected number of tickets in the new cluster: ", sum_mail_cluster)

def rename_to_mail_cluster(category):
    if category in mail_categories_to_merge:
        category = "Mail_CLUSTER"
    else:
        category = category
    return category

df["Category"] = df["Category"].apply(rename_to_mail_cluster)

df.Category.value_counts()

Number of categories to be merged into new cluster: 2
Expected number of tickets in the new cluster: 33
Mail_CLUSTER 33
Discord 15
CRM 14
Internet Browser 12
Keyboard 11
Asana 10
Mouse 10
Help Needed 9
Name: Category, dtype: int64
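
When more than one group of categories has to be merged, a label mapping can be easier to maintain than one rename function per cluster. Here is a minimal sketch using Series.replace on a hypothetical toy Series; the dictionary merge_map is an assumption for illustration, and labels not listed in it are left unchanged.

```python
import pandas as pd

# Hypothetical toy labels.
toy = pd.Series(["Outlook", "Mail", "Discord", "Outlook", "CRM"], name="Category")

# Hypothetical mapping from old label to merged label.
merge_map = {"Outlook": "Mail_CLUSTER", "Mail": "Mail_CLUSTER"}

# Number of tickets we expect to end up in the new cluster.
expected = toy.isin(list(merge_map)).sum()
merged = toy.replace(merge_map)

# Sanity check: the new cluster holds exactly the tickets of the merged categories.
assert (merged == "Mail_CLUSTER").sum() == expected
print(merged.value_counts())
```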

Last but not least, we want to relabel some tickets from the meta category "Help Needed" to the proper category.

# True where the cleaned ticket text contains at least one category name
has_keyword = pd.Series(
    [bool(set(x).intersection(words_categories))
     for x in df["Text"].str.lower().str.replace(r"[^\w\s]", "", regex=True).str.split()],
    index=df.index,
)

df.loc[(df["Category"] == "Help Needed") & has_keyword, "Category"] = "Change"

def cat_name_change(cat, keywords):
    if cat == "Change":
        cat = keywords
    else:
        cat = cat
    return cat

df["Category"] = df.apply(lambda x: cat_name_change(x.Category, x.Keywords), axis=1)
df["Category"] = df["Category"].replace({"Crm": "CRM"})

df.Category.value_counts(dropna=False)

Mail_CLUSTER 33
Discord 16
CRM 15
Internet Browser 12
Asana 11
Keyboard 11
Mouse 10
Help Needed 6
Name: Category, dtype: int64
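
One caveat with relabeling from keywords: a ticket that mentions several category names has no single obvious target label. The sketch below shows a more conservative variant, assuming keywords are kept as lists; the function relabel and the canonical mapping are hypothetical. It moves a ticket out of the meta category only when exactly one known keyword was found.

```python
import pandas as pd

# Hypothetical canonical spellings of target categories, keyed by lowercase keyword.
canonical = {"asana": "Asana", "crm": "CRM", "discord": "Discord"}

# Hypothetical toy tickets with keywords stored as lists.
toy = pd.DataFrame({
    "Category": ["Help Needed", "Help Needed", "Help Needed", "Mouse"],
    "Keywords": [["asana"], ["crm", "discord"], [], ["asana"]],
})

def relabel(category, keywords):
    """Relabel a meta-category ticket only if exactly one known keyword was found."""
    unique = set(keywords)
    if category == "Help Needed" and len(unique) == 1:
        (word,) = unique
        return canonical.get(word, category)
    return category

toy["Category"] = [relabel(c, k) for c, k in zip(toy.Category, toy.Keywords)]
print(toy.Category.tolist())
# → ['Asana', 'Help Needed', 'Help Needed', 'Mouse']
```

Ambiguous tickets stay in "Help Needed", where they can be reviewed by hand or dropped.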

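We did our data relabeling and cleaning, but we shouldn't call ourselves data scientists if we don't do at least one scientific experiment and test the impact of our work on the final classification. We will do so by implementing the Complement Naive Bayes classifier in sklearn. Feel free to try other, more complex algorithms. Also, be aware that further data cleaning could be done; for example, we could also drop all the tickets left in the "Help Needed" category.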

model = make_pipeline(TfidfVectorizer(), ComplementNB())

# old df
df_o = pd.read_excel("ITSM_data.xlsx")

important_categories = ["Text", "Category"]
for cat in important_categories:
    df_o.drop(df_o[df_o[cat].isna()].index, inplace=True)

df_o.name = "dataset just without missing"
df.name = "dataset after deeper cleaning"

for dataframe in [df_o, df]:
    # Split dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(dataframe.Text, dataframe.Category, test_size=0.2, random_state=1)

    # Training the model with train data
    model.fit(X_train, y_train)

    # Predict the response for test dataset
    y_pred = model.predict(X_test)

    print(f"Accuracy of Complement Naive Bayes classifier model on {dataframe.name} is: {round(metrics.accuracy_score(y_test, y_pred),2)}")

Accuracy of Complement Naive Bayes classifier model on dataset just without missing is: 0.48
Accuracy of Complement Naive Bayes classifier model on dataset after deeper cleaning is: 0.65

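Pretty impressive, right? The dataset we used is small (on purpose, so you can easily see what happens in each step), so different random seeds might produce different results, but in the vast majority of cases the model will perform significantly better on the dataset after cleaning than on the original dataset. We did a good job!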
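
Because results on such a small dataset depend on the random seed, one way to check the claim is to repeat the split with several seeds and look at the mean and spread of the accuracy. The following is a minimal sketch, not part of the original experiment: the toy corpus and the helper accuracy_over_seeds are hypothetical, and in the tutorial the inputs would be the Text and Category columns of each DataFrame.

```python
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df.Text and df.Category.
texts = ["mail not sending", "outlook mail error", "mail inbox empty", "cannot open mail",
         "mouse not moving", "mouse click broken", "new mouse needed", "mouse cursor frozen"] * 3
labels = (["Mail"] * 4 + ["Mouse"] * 4) * 3

def accuracy_over_seeds(texts, labels, seeds):
    """Repeat the train/test split with several seeds and collect the accuracies."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.2, random_state=seed)
        # A fresh pipeline per seed, so no state carries over between runs.
        model = make_pipeline(TfidfVectorizer(), ComplementNB())
        model.fit(X_train, y_train)
        scores.append(metrics.accuracy_score(y_test, model.predict(X_test)))
    return np.array(scores)

scores = accuracy_over_seeds(texts, labels, seeds=range(10))
print(f"mean accuracy: {scores.mean():.2f}, std: {scores.std():.2f}")
```

Running the helper once on the original data and once on the cleaned data gives two distributions of accuracy to compare, instead of two single numbers.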

Nikola Greb has been coding for years and has most recently specialized in NLP. Before turning to data science, he was successful in sales, HR, writing, and chess.