How to Train a BERT Model From Scratch



Meet FiliBERTo, BERT's Italian cousin.


By James Briggs, Data Scientist



BERT, but in Italy — image by author

 

Many of my articles have been focused on BERT — the model that came and dominated the world of natural language processing (NLP) and marked a new age for language models.

For those of you who may not have used transformer models (e.g. BERT) before, the process looks a little like this:

  • pip install transformers
  • Initialize a pre-trained transformers model — from_pretrained.
  • Test it on some data.
  • Maybe fine-tune the model (train it some more).

Now, this is a great approach, but if we only ever do this, we lack the understanding behind creating our own transformers models.

And, if we cannot create our own transformer models — we must rely on there being a pre-trained model that fits our problem, and this is not always the case:



A few comments asking about non-English BERT models

 

So in this article, we will explore the steps we must take to build our own transformer model — specifically a further developed version of BERT, called RoBERTa.

An Overview

 
 
There are a few steps to the process, so before we dive in let’s first summarize what we need to do. In total, there are four key parts:

  • Getting the data
  • Building a tokenizer
  • Creating an input pipeline
  • Training the model

Once we have worked through each of these sections, we will take the tokenizer and model we have built — and save them both so that we can then use them in the same way we usually would with from_pretrained.

Getting The Data

 
 
As with any machine learning project, we need data. In terms of data for training a transformer model, we really are spoilt for choice — we can use almost any text data.



Video walkthrough for downloading OSCAR dataset using HuggingFace’s datasets library

 

And, if there’s one thing that we have plenty of on the internet — it’s unstructured text data.

One of the largest datasets in the domain of text scraped from the internet is the OSCAR dataset.

The OSCAR dataset boasts a huge number of different languages — and one of the clearest use-cases for training from scratch is so that we can apply BERT to some less commonly used languages, such as Telugu or Navajo.

Unfortunately, the only language I can speak with any degree of competency is English — but my girlfriend is Italian, and so she — Laura — will be assessing the results of our Italian-speaking BERT model — FiliBERTo.

So, to download the Italian segment of the OSCAR dataset we will be using HuggingFace’s datasets library — which we can install with pip install datasets. Then we download OSCAR_IT with:
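As a rough sketch, assuming the dataset identifiers 'oscar' and 'unshuffled_deduplicated_it' from HuggingFace's hub, the download might look like this:

```python
from datasets import load_dataset

# download the Italian, deduplicated portion of OSCAR
dataset = load_dataset('oscar', 'unshuffled_deduplicated_it')
```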

Let's take a look at the dataset object:
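Assuming the download above, a quick inspection could be:

```python
print(dataset)
# expect a DatasetDict with a 'train' split containing 'id' and 'text' columns
print(dataset['train'][0]['text'][:200])  # preview the first sample
```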

Great, now let's store our data in a format that we can use when building our tokenizer. We need to create a set of plaintext files containing just the text feature from our dataset, and we will split each sample using a newline character \n.
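A sketch of that export, writing 10,000 samples per file (the batch size and the data/text/oscar_it path are choices made here, not requirements):

```python
import os
from tqdm.auto import tqdm

os.makedirs('data/text/oscar_it', exist_ok=True)

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    # strip internal newlines so every sample sits on exactly one line
    text_data.append(sample['text'].replace('\n', ' '))
    if len(text_data) == 10_000:
        with open(f'data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# write whatever is left over
with open(f'data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
```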

Over in our data/text/oscar_it directory we will find:


A screenshot displaying a Windows explorer window full of .txt files — representing the plaintext OSCAR data
The directory containing our plaintext OSCAR files

 

Building a Tokenizer

 
 
Next up is the tokenizer! When using transformers we typically load a tokenizer, alongside its respective transformer model — the tokenizer is a key component in the process.



Video walkthrough for building our custom tokenizer

 

When building our tokenizer we will feed it all of our OSCAR data, specify our vocabulary size (number of tokens in the tokenizer), and any special tokens.

Now, the RoBERTa special tokens look like this:
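They are the start, padding, end, unknown, and mask tokens:

```python
# RoBERTa's special tokens
special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>']
```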

So, we make sure to include them within the special_tokens parameter of our tokenizer’s train method call.
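A sketch of the training call using the tokenizers library's ByteLevelBPETokenizer; the 30,522 vocabulary size and min_frequency of 2 are assumptions chosen to mirror a BERT-sized vocabulary:

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# collect the plaintext OSCAR files we wrote earlier
paths = [str(p) for p in Path('data/text/oscar_it').glob('*.txt')]

tokenizer = ByteLevelBPETokenizer()

# train a byte-level BPE vocabulary on the Italian text
tokenizer.train(
    files=paths,
    vocab_size=30_522,
    min_frequency=2,
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>']
)
```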

Our tokenizer is now ready, and we can save it to file for later use:
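Saving is a single call; 'filiberto' is simply the output directory assumed throughout these sketches:

```python
import os

os.makedirs('filiberto', exist_ok=True)
# writes vocab.json and merges.txt into the directory
tokenizer.save_model('filiberto')
```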

Now we have two files that define our new FiliBERTo tokenizer:

  • merges.txt — performs the initial mapping of text to tokens
  • vocab.json — maps the tokens to token IDs

And with those, we can move on to initializing our tokenizer so that we can use it as we would use any other from_pretrained tokenizer.

Initializing the Tokenizer

 
 
We first initialize the tokenizer using the two files we built before — using a simple from_pretrained:
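A sketch of that initialization, assuming the two files sit in the filiberto directory and using RobertaTokenizerFast (any RoBERTa-style tokenizer class that reads vocab.json and merges.txt would work):

```python
from transformers import RobertaTokenizerFast

# load the tokenizer from the directory containing vocab.json and merges.txt
tokenizer = RobertaTokenizerFast.from_pretrained('filiberto', model_max_length=512)
```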

Now that our tokenizer is ready, we can try encoding some text with it. When encoding we use the same two methods we would typically use, encode and encode_batch.

From the encodings object tokens we will be extracting the input_ids and attention_mask tensors for use with FiliBERTo.
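As a quick sanity check, encoding a single (arbitrary) sentence with the transformers-style call might look like this:

```python
tokens = tokenizer(
    'ciao, come va?',
    max_length=512,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

print(tokens.input_ids.shape)       # torch.Size([1, 512])
print(tokens.attention_mask.shape)  # torch.Size([1, 512])
```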

Creating the Input Pipeline

 
 
The input pipeline of our training process is the most complex part of the entire process. It consists of us taking our raw OSCAR training data, transforming it, and loading it into a DataLoader ready for training.



Video walkthrough of the MLM input pipeline

 

Data Preparation

 
 
We’ll start with a single sample and work through the preparation logic.

First, we need to open our files — the same files that we saved as .txt files earlier. We split each based on newline characters \n, as this indicates the individual samples.

Then we encode our data using the tokenizer — making sure to include key parameters like max_length, padding, and truncation.
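Putting those two steps together, roughly (a single file is shown for brevity; in practice we loop over every .txt file):

```python
with open('data/text/oscar_it/text_0.txt', 'r', encoding='utf-8') as fp:
    # one sample per line
    lines = fp.read().split('\n')

# encode every sample, padding/truncating each to 512 tokens
batch = tokenizer(
    lines,
    max_length=512,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)
```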

And now we can move onto creating our tensors — we will be training our model through masked-language modeling (MLM). So, we need three tensors:

  • input_ids — our token_ids with ~15% of tokens masked using the mask token <mask>.
  • attention_mask — a tensor of 1s and 0s, marking the position of ‘real’ tokens and padding tokens — used in attention calculations.
  • labels — our token_ids with no masking.

If you’re not familiar with MLM, I’ve explained it here.

Our attention_mask and labels tensors are simply extracted from our batch. The input_ids tensors require more attention, however: for this tensor we mask ~15% of the tokens — assigning them the token ID 3.
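A sketch of that masking step. The article hard-codes the mask token ID as 3; tokenizer.mask_token_id is used here instead so the logic holds whatever ID the trained vocabulary assigned, and the IDs excluded from masking (0, 1, 2 for <s>, <pad>, </s>) are an assumption based on the special-token order we trained with:

```python
import torch

def mlm_mask(input_ids, mask_token_id, special_ids=(0, 1, 2)):
    """Clone input_ids, keep the clean copy as labels, and mask ~15% of ordinary tokens."""
    labels = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    mask_arr = rand < 0.15            # candidate positions, ~15% of tokens
    for sid in special_ids:
        mask_arr &= input_ids != sid  # never mask special tokens
    masked = input_ids.clone()
    masked[mask_arr] = mask_token_id
    return masked, labels

input_ids, labels = mlm_mask(batch.input_ids, tokenizer.mask_token_id)
attention_mask = batch.attention_mask
```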

In the final output, we can see part of an encoded input_ids tensor. The very first token ID is 1 - the [CLS] token. Dotted around the tensor we have several 3 token IDs — these are our newly added [MASK] tokens.

Building the DataLoader

 
 
Next, we define our Dataset class — which we use to initialize our three encoded tensors as PyTorch torch.utils.data.Dataset objects.
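A minimal version of such a wrapper (the class name OscarDataset is just a placeholder):

```python
import torch

class OscarDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # encodings: dict with 'input_ids', 'attention_mask', and 'labels' tensors
        self.encodings = encodings

    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return one sample as a dict of tensors, ready for the model
        return {key: tensor[i] for key, tensor in self.encodings.items()}

train_dataset = OscarDataset({
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'labels': labels
})
```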

Finally, our dataset is loaded into a PyTorch DataLoader object — which we use to load our data into our model during training.
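And the loader itself; the batch size of 8 is an assumption, tune it to the GPU memory available:

```python
from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```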

Training the Model

 
 
We need two things for training, our DataLoader and a model. The DataLoader we have — but no model.



Initializing the Model

 
 
For training, we need a raw (not pre-trained) BERTLMHeadModel. To create that, we first need to create a RoBERTa config object to describe the parameters we’d like to initialize FiliBERTo with.
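A sketch of that config; these hyperparameters (a base-sized 12-layer, 12-head, 768-dimensional network, with the vocabulary size taken from our tokenizer) are assumptions chosen to mirror a standard RoBERTa base:

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,   # must match the tokenizer we trained
    max_position_embeddings=514,       # 512 tokens plus RoBERTa's two offset positions
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1
)
```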

Then, we import and initialize our RoBERTa model with a language modeling (LM) head.
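RobertaForMaskedLM is used in this sketch as the RoBERTa model with an LM head; passing the config (rather than calling from_pretrained) gives us randomly initialized weights:

```python
from transformers import RobertaForMaskedLM

# a raw, untrained model built from the config above
model = RobertaForMaskedLM(config)
```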

Preparing For Training

 
 
Before moving onto our training loop we need to set up a few things. First, we set up GPU/CPU usage. Then we activate the training mode of our model — and finally, initialize our optimizer.
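Roughly, with torch's built-in AdamW and a learning rate of 1e-4 (the learning rate is an assumption):

```python
import torch

# use a GPU if one is available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# activate training mode (enables dropout etc.)
model.train()

# AdamW optimizer over all model parameters
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
```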

Training

 
 
Finally — training time! We train just as we usually would when training via PyTorch.
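A sketch of that loop, with a tqdm progress bar; two epochs is an arbitrary choice here, and the final save_pretrained call writes the trained weights next to our tokenizer files:

```python
from tqdm.auto import tqdm

epochs = 2

for epoch in range(epochs):
    loop = tqdm(loader, leave=True)
    for batch in loop:
        optim.zero_grad()
        # move the batch tensors onto the training device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass: the LM head computes the MLM loss against the labels
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # backward pass and weight update
        loss.backward()
        optim.step()
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

# save the trained model alongside the tokenizer
model.save_pretrained('filiberto')
```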

If we head on over to TensorBoard we’ll find our loss over time — it looks promising.



Loss / time — multiple training sessions have been threaded together in this chart

 

The Real Test

 
 
Now it’s time for the real test. We set up an MLM pipeline — and ask Laura to assess the results. You can watch the video review at 22:44 here:



We first initialize a pipeline object, using the 'fill-mask' argument. Then begin testing our model like so:
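Assuming both the model and tokenizer were saved into the filiberto directory, the setup and the first test (masking the word we hope comes back as 'come') look roughly like this:

```python
from transformers import pipeline

fill = pipeline('fill-mask', model='filiberto', tokenizer='filiberto')

# mask the middle word of 'ciao come va?' and see what the model predicts
fill(f'ciao {fill.tokenizer.mask_token} va?')
```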

“ciao come va?” is the right answer! That’s as advanced as my Italian gets — so, let’s hand it over to Laura.

We start with “buongiorno, come va?” - or “good day, how are you?”:

The first answer, “buongiorno, chi va?” means “good day, who is there?” — i.e. nonsensical. But, our second answer is correct!

Next up, a slightly harder phrase, “ciao, dove ci incontriamo oggi pomeriggio?” - or “hi, where are we going to meet this afternoon?”:

And we return some more positive results:

✅ "hi, where do we see each other this afternoon?"
✅ "hi, where do we meet this afternoon?"
❌ "hi, where here we are this afternoon?"
✅ "hi, where are we meeting this afternoon?"
✅ "hi, where do we meet this afternoon?"

Finally, one more, harder sentence, “cosa sarebbe successo se avessimo scelto un altro giorno?” — or “what would have happened if we had chosen another day?”:

We return a few more good answers here too:

✅ "what would have happened if we had chosen another day?"
✅ "what would have happened if I had chosen another day?"
✅ "what would have happened if they had chosen another day?"
✅ "what would have happened if you had chosen another day?"
❌ "what would have happened if another day was chosen?"

Overall, it looks like our model passed Laura’s tests — and we now have a competent Italian language model called FiliBERTo!

That’s it for this walkthrough of training a BERT model from scratch!

We’ve covered a lot of ground, from getting and formatting our data — all the way through to using language modeling to train our raw BERT model.

I hope you enjoyed this article! If you have any questions, let me know via Twitter or in the comments below. If you’d like more content like this, I post on YouTube too.

Thanks for reading!

 


*All images are by the author except where stated otherwise

 
Bio: James Briggs is a data scientist specializing in natural language processing, working in the finance sector and based in London, UK. He is also a freelance mentor, writer, and content creator. You can reach the author via email (jamescalam94@gmail.com).

Original. Reposted with permission.


Source: https://www.kdnuggets.com/2021/08/train-bert-model-scratch.html
