Handling Noisy Labels in Text Data

Republished by Plato


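With the rising interest in natural language processing, more and more practitioners are hitting the wall, not because they can't build or fine-tune LLMs, but because their data is messy!

We will show simple yet very effective coding procedures for fixing noisy labels in text data. We will deal with two common scenarios in real-world text data:

A category contains mixed examples from a few other categories. I like to call this kind of category a meta category.
Two or more categories should be merged into one category because the texts belonging to them refer to the same topic.

We will use an ITSM (IT Service Management) dataset created for this tutorial (CC0 license). It is available on Kaggle at the link below: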

https://www.kaggle.com/datasets/nikolagreb/small-itsm-dataset

It's time to start by importing all the libraries we need and doing a basic data inspection. Brace yourself, code is coming!

import pandas as pd
import numpy as np
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics

df = pd.read_excel("ITSM_data.xlsx")
df.info()


RangeIndex: 118 entries, 0 to 117
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID_request             118 non-null    int64
 1   Text                   117 non-null    object
 2   Category               115 non-null    object
 3   Solution               115 non-null    object
 4   Date_request_recieved  118 non-null    datetime64[ns]
 5   Date_request_solved    118 non-null    datetime64[ns]
 6   ID_agent               118 non-null    int64
dtypes: datetime64[ns](2), int64(2), object(3)
memory usage: 6.6+ KB

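Each row represents one entry in the ITSM database. We will try to predict the category of a ticket based on the text of the ticket written by a user. Let's take a closer look at the fields that matter most for this business use case.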

for text, category in zip(df.Text.sample(3, random_state=2), df.Category.sample(3, random_state=2)):
    print("TEXT:")
    print(text)
    print("CATEGORY:")
    print(category)
    print("-"*100)

TEXT:
I just want to talk to an agent, there are too many problems on my pc to be explained in one ticket. Please call me when you see this, whoever you are. (talk to agent)
CATEGORY:
Asana
----------------------------------------------------------------------------------------------------
TEXT:
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
CATEGORY:
Help Needed
----------------------------------------------------------------------------------------------------
TEXT:
My mail stopped to work after I updated Windows.
CATEGORY:
Outlook
----------------------------------------------------------------------------------------------------

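If we look at the first two tickets, we can see that, although one ticket is in German, the described problems refer to the same software (Asana), yet the tickets carry different labels. This is the starting distribution of our categories: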

df.Category.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"

Outlook 19.1%
Discord 13.9%
CRM 12.2%
Internet Browser 10.4%
Mail 9.6%
Keyboard 9.6%
Asana 8.7%
Mouse 8.7%
Help Needed 7.8%
Name: Category, dtype: object

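Help Needed looks suspicious, like a category that could contain tickets from multiple other categories. Also, the Outlook and Mail categories sound similar, so maybe they should be merged into one category. Before diving deeper into these categories, we will get rid of missing values in the columns we are interested in.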

important_columns = ["Text", "Category"]
for cat in important_columns:
    df.drop(df[df[cat].isna()].index, inplace=True)
df.reset_index(inplace=True, drop=True)
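
As a side note, the same clean-up can probably be written more compactly with `dropna(subset=...)`. The sketch below uses a small made-up DataFrame (the rows are hypothetical, not taken from the ITSM dataset) to show the idea.

```python
import pandas as pd

# Toy stand-in for the ITSM data (hypothetical rows, not from the real dataset)
toy = pd.DataFrame({
    "Text": ["Outlook is down", None, "Mouse is broken", "Asana login fails"],
    "Category": ["Outlook", "Mail", None, "Asana"],
})

# Drop rows where either column of interest is missing, then renumber the index
cleaned = toy.dropna(subset=["Text", "Category"]).reset_index(drop=True)
print(cleaned)
```

Only rows with both a text and a category survive, which matches what the loop over the important columns does.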

There is no valid substitute for examining the data with the bare eye. The handy pandas function for doing so is .sample(), so we will use it once more, this time on the suspicious category:

meta = df[df.Category == "Help Needed"]

for text in meta.Text.sample(5, random_state=2):
    print(text)
    print("-"*100)

Discord emojis aren't available to me, I would like to have this option enabled like other team members have.
---------------------------------------------------------------------------
Bitte reparieren Sie mein Hubspot CRM. Seit gestern funktioniert es nicht mehr
---------------------------------------------------------------------------
My headphones aren't working. I would like to order new.
---------------------------------------------------------------------------
Bundled problems with Office since restart:
Messages not sent
Outlook does not connect, mails do not arrive
Error 0x8004deb0 appears when Connection attempt, see attachment
The company account is affected: AB123
Access via Office.com seems to be possible.
---------------------------------------------------------------------------
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
---------------------------------------------------------------------------

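Obviously, we have tickets about Discord, Asana, and CRM. So the category of these tickets should be changed from "Help Needed" to the existing, more specific categories. As the first step of the reassignment process, we will create a new column, "Keywords", which tells us whether the "Text" column of a ticket contains a word from the list of categories.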

words_categories = np.unique([word.strip().lower() for word in df.Category])  # list of categories

def keywords(row):
    list_w = []
    for word in row.translate(str.maketrans("", "", string.punctuation)).lower().split():
        if word in words_categories:
            list_w.append(word)
    return list_w

df["Keywords"] = df.Text.apply(keywords)

# since our output is in the list, this function will give us better looking final output.
def clean_row(row):
    row = str(row)
    row = row.replace("[", "")
    row = row.replace("]", "")
    row = row.replace("'", "")
    row = string.capwords(row)
    return row

df["Keywords"] = df.Keywords.apply(clean_row)

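Also, note that using "if word in str(words_categories)" instead of "if word in words_categories" would catch words from categories with more than one word (Internet Browser in our case), but it would also require more data preprocessing. To keep things simple and straight to the point, we will go with the code that handles categories made of just one word. This is how our dataset looks now: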

df.head(2)

[Image: output of df.head(2)]
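
For readers who do want to catch multi-word categories such as Internet Browser, one possible approach is whole-phrase matching with word boundaries. This is only a sketch under assumptions: the function name `find_category_keywords` and the category list are illustrative, and unlike the tutorial code it replaces punctuation with spaces instead of deleting it.

```python
import re
import string

# Hypothetical category names, mirroring some of the ones in this tutorial
categories = ["Outlook", "Mail", "Internet Browser", "CRM"]

def find_category_keywords(text, categories):
    """Return category names that occur in the text as whole words or phrases."""
    # Replace punctuation with spaces, lowercase, and collapse whitespace
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    normalized = " ".join(text.translate(table).lower().split())
    found = []
    for cat in categories:
        # \b keeps "mail" from matching inside "mails" or "gmail"
        pattern = r"\b" + re.escape(cat.lower()) + r"\b"
        if re.search(pattern, normalized):
            found.append(cat)
    return found

print(find_category_keywords("My internet browser crashes; Outlook is fine.", categories))
```

The word boundaries avoid the false positives that a plain substring check against `str(words_categories)` could produce.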

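After extracting the Keywords column, we will make an assumption about the quality of the tickets. Our hypothesis:

A ticket with just one keyword in the Text field, where that keyword is the same as the category the ticket belongs to, is easy to classify.
A ticket with multiple keywords in the Text field, where at least one keyword is the same as the category the ticket belongs to, is easy to classify in the majority of cases.
A ticket that has keywords, none of which equals the name of the category the ticket belongs to, is probably a noisy label case.
All other tickets are neutral with respect to keywords.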

cl_list = []

for category, keywords in zip(df.Category, df.Keywords):
    if category.lower() == keywords.lower() and keywords != "":
        cl_list.append("easy_classification")
    elif category.lower() in keywords.lower():  # to deal with multiple keywords in the ticket
        cl_list.append("probably_easy_classification")
    elif category.lower() != keywords.lower() and keywords != "":
        cl_list.append("potential_problem")
    else:
        cl_list.append("neutral")

df["Ease_classification"] = cl_list
df.Ease_classification.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"

neutral 45.6%
easy_classification 37.7%
potential_problem 9.6%
probably_easy_classification 7.0%
Name: Ease_classification, dtype: object
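
The four cases of the hypothesis can also be expressed as a small standalone function that works on a list of keywords rather than on the cleaned string. This is a sketch, not the code used in this tutorial: `ease_of_classification` is a hypothetical helper, and because it uses sets, a keyword repeated twice in a ticket counts as a single keyword.

```python
def ease_of_classification(category, keywords):
    """Label a ticket using the four cases of the hypothesis.

    `keywords` is a list of category words found in the ticket text.
    """
    cat = category.strip().lower()
    kw = {k.strip().lower() for k in keywords}
    if not kw:
        return "neutral"                       # no keywords at all
    if kw == {cat}:
        return "easy_classification"           # one keyword, equal to the label
    if cat in kw:
        return "probably_easy_classification"  # several keywords, label among them
    return "potential_problem"                 # keywords present, label not among them

examples = [
    ("Asana", ["asana"]),
    ("Outlook", ["outlook", "mail"]),
    ("Help Needed", ["asana"]),
    ("Mouse", []),
]
for category, keywords in examples:
    print(category, keywords, ease_of_classification(category, keywords))
```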

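We have our new distribution, and now it is time to examine the tickets classified as a potential problem. In practice, this step would require much more sampling and looking at larger chunks of data with the bare eye, but the rationale would be the same. You are supposed to find the problematic tickets and decide whether you can improve their quality or whether you should drop them from the dataset. When you are facing a large dataset, stay calm, and don't forget that data examination and data preparation usually take much more time than building ML algorithms!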

pp = df[df.Ease_classification == "potential_problem"]

# sample the same number of texts and categories so that every pair stays aligned
for text, category in zip(pp.Text.sample(3, random_state=2), pp.Category.sample(3, random_state=2)):
    print("TEXT:")
    print(text)
    print("CATEGORY:")
    print(category)
    print("-"*100)

TEXT:
outlook issue , I did an update Windows and I have no more outlook on my notebook ? Please help !
 
Outlook
CATEGORY:
Mail
--------------------------------------------------------------------
TEXT:
Please relase blocked attachements from the mail I got from name.surname@company.com. These are data needed for social media marketing campaing.
CATEGORY:
Outlook
--------------------------------------------------------------------
TEXT:
Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
CATEGORY:
Help Needed
--------------------------------------------------------------------
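
When inspecting text and label pairs like these, sampling whole rows in a single call is arguably a safer habit than sampling two columns separately, because each text is guaranteed to stay with its own label. A minimal sketch on made-up rows (the toy tickets are hypothetical):

```python
import pandas as pd

# Hypothetical tickets, only to make the sketch runnable
toy = pd.DataFrame({
    "Text": ["outlook issue", "release blocked attachments", "Asana stopped working", "mouse lags"],
    "Category": ["Mail", "Outlook", "Help Needed", "Mouse"],
})

# One sample() call on the whole frame keeps each text with its own label
sampled = toy.sample(3, random_state=2)

for row in sampled.itertuples(index=False):
    print("TEXT:")
    print(row.Text)
    print("CATEGORY:")
    print(row.Category)
    print("-" * 100)
```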

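We can see that tickets from the Outlook and Mail categories relate to the same problems, so we will merge these two categories, which should improve the results of our future ML classification algorithm.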

mail_categories_to_merge = ["Outlook", "Mail"]

sum_mail_cluster = 0
for x in mail_categories_to_merge:
    sum_mail_cluster += len(df[df["Category"] == x])

print("Number of categories to be merged into new cluster: ", len(mail_categories_to_merge))
print("Expected number of tickets in the new cluster: ", sum_mail_cluster)

def rename_to_mail_cluster(category):
    if category in mail_categories_to_merge:
        category = "Mail_CLUSTER"
    else:
        category = category
    return category

df["Category"] = df["Category"].apply(rename_to_mail_cluster)

df.Category.value_counts()

Number of categories to be merged into new cluster: 2
Expected number of tickets in the new cluster: 33
Mail_CLUSTER 33
Discord 15
CRM 14
Internet Browser 12
Keyboard 11
Asana 10
Mouse 10
Help Needed 9
Name: Category, dtype: int64
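
If several groups of categories need merging, a mapping dictionary passed to `Series.replace` may scale better than one helper function per cluster. The sketch below runs on a toy column (hypothetical values) and keeps the same sanity check on the expected cluster size.

```python
import pandas as pd

# Hypothetical labels, only to make the sketch runnable
toy = pd.DataFrame({"Category": ["Outlook", "Mail", "Discord", "Outlook", "CRM"]})

# Map every category that should be merged to the name of the new cluster
merge_map = {"Outlook": "Mail_CLUSTER", "Mail": "Mail_CLUSTER"}

# Count the affected tickets before renaming, so the merge can be verified
expected = toy["Category"].isin(list(merge_map)).sum()
toy["Category"] = toy["Category"].replace(merge_map)

# Sanity check: the new cluster holds exactly the tickets of the merged categories
assert (toy["Category"] == "Mail_CLUSTER").sum() == expected
print(toy["Category"].value_counts())
```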

Last but not least, we want to relabel some tickets from the meta category "Help Needed" to the proper category.

# True for tickets whose text contains at least one one-word category name
# (the regex strips everything except word characters and whitespace)
has_keyword = pd.Series(
    [bool(set(x).intersection(words_categories))
     for x in df["Text"].str.lower().str.replace(r"[^\w\s]", "", regex=True).str.split()],
    index=df.index,
)
df.loc[(df["Category"] == "Help Needed") & has_keyword, "Category"] = "Change"

def cat_name_change(cat, keywords):
    if cat == "Change":
        cat = keywords
    else:
        cat = cat
    return cat

df["Category"] = df.apply(lambda x: cat_name_change(x.Category, x.Keywords), axis=1)
df["Category"] = df["Category"].replace({"Crm":"CRM"})

df.Category.value_counts(dropna=False)

Mail_CLUSTER 33
Discord 16
CRM 15
Internet Browser 12
Asana 11
Keyboard 11
Mouse 10
Help Needed 6
Name: Category, dtype: int64
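
A more conservative variant of this relabeling step, shown here only as a sketch, would relabel a meta-category ticket only when it has exactly one keyword and leave tickets with several keywords for manual review. The toy rows and the single-keyword rule are assumptions, not part of the procedure above.

```python
import pandas as pd

# Hypothetical tickets; Keywords uses the same comma-separated format as in this tutorial
toy = pd.DataFrame({
    "Category": ["Help Needed", "Help Needed", "Mouse", "Help Needed"],
    "Keywords": ["Asana", "", "Mouse", "Discord, Crm"],
})

# Relabel only meta-category tickets with exactly one keyword;
# tickets with zero or several keywords keep their label for manual review
single_keyword = toy["Keywords"].ne("") & ~toy["Keywords"].str.contains(",")
mask = toy["Category"].eq("Help Needed") & single_keyword

toy.loc[mask, "Category"] = toy.loc[mask, "Keywords"]
print(toy)
```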

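We have relabeled and cleaned our data, but we shouldn't call ourselves data scientists if we don't run at least one scientific experiment to test the impact of our work on the final classification. We will do so by implementing the Complement Naive Bayes classifier in sklearn. Feel free to try other, more complex algorithms. Also, be aware that further data cleaning could be done; for example, we could also drop all tickets left in the "Help Needed" category.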

model = make_pipeline(TfidfVectorizer(), ComplementNB())

# old df
df_o = pd.read_excel("ITSM_data.xlsx")

important_categories = ["Text", "Category"]
for cat in important_categories:
    df_o.drop(df_o[df_o[cat].isna()].index, inplace=True)

df_o.name = "dataset just without missing"
df.name = "dataset after deeper cleaning"

for dataframe in [df_o, df]:
    # Split dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(dataframe.Text, dataframe.Category, test_size=0.2, random_state=1)

    # Training the model with train data
    model.fit(X_train, y_train)

    # Predict the response for test dataset
    y_pred = model.predict(X_test)

    print(f"Accuracy of Complement Naive Bayes classifier model on {dataframe.name} is: {round(metrics.accuracy_score(y_test, y_pred),2)}")

Accuracy of Complement Naive Bayes classifier model on dataset just without missing is: 0.48
Accuracy of Complement Naive Bayes classifier model on dataset after deeper cleaning is: 0.65

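Pretty impressive, right? The dataset we used is small (on purpose, so you can easily see what happens at each step), so different random seeds might produce different results, but in the vast majority of cases the model will perform significantly better on the cleaned dataset than on the original dataset. We did a good job!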
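
Because results on a small dataset depend on the random seed, one way to check the claim is to repeat the train/test split over several seeds and look at the mean and spread of the accuracy. The sketch below uses a tiny synthetic set of tickets (hypothetical, not the ITSM data) and a hypothetical helper `accuracy_over_seeds`; with the real data you would pass the Text and Category columns of each DataFrame instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Tiny synthetic tickets (hypothetical, only to make the sketch runnable)
texts = ["outlook will not open", "mail not arriving in outlook", "outlook crashes on start",
         "cannot send mail from outlook", "outlook keeps asking for password",
         "mouse cursor freezes", "mouse wheel is broken", "need a new mouse",
         "mouse buttons stick", "wireless mouse disconnects"]
labels = ["Mail_CLUSTER"] * 5 + ["Mouse"] * 5

def accuracy_over_seeds(texts, labels, seeds):
    """Fit and score the same pipeline on several random train/test splits."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.2, random_state=seed, stratify=labels)
        # A fresh pipeline per seed, so no state leaks between runs
        model = make_pipeline(TfidfVectorizer(), ComplementNB())
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.array(scores)

scores = accuracy_over_seeds(texts, labels, seeds=range(10))
print(f"mean accuracy: {scores.mean():.2f}, std: {scores.std():.2f}")
```

Comparing the mean accuracy of the original and the cleaned dataset over many seeds gives a steadier picture than a single split.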

Nikola Greb has been coding for more than XNUMX years and has specialized in NLP for the past XNUMX years. Before turning to data science, he was successful in sales, HR, writing, and chess.