Similarity Metrics in NLP


By James Briggs, Data Scientist



Image by author

 

When we convert language into a machine-readable format, the standard approach is to use dense vectors.

Dense vectors are typically generated by neural networks. They allow us to represent words and sentences as high-dimensional vectors, organized so that each vector's geometric position carries meaning.



The well-known language arithmetic example showing that Queen = King − Man + Woman

 

There is a particularly well-known example of this, where we take the vector for King, subtract the vector for Man, and add the vector for Woman. The closest match to the resulting vector is Queen.

We can apply the same logic to longer sequences, too, like sentences or paragraphs — and we will find that similar meaning corresponds with proximity/orientation between those vectors.

So, similarity is important — and what we will cover here are the three most popular metrics for calculating that similarity.

Euclidean Distance

 
Euclidean distance (often called the L2 distance, as it is based on the L2 norm) is the most intuitive of the metrics. Let's define three vectors:



Three vector examples

 

Just by looking at these vectors, we can confidently say that a and b are nearer to each other, and we see this even more clearly when visualizing each on a chart:



Vectors a and b are close to the origin; vector c is much more distant

 

Clearly, a and b are closer together, and we can quantify that using Euclidean distance:



Euclidean distance formula: d(u, v) = √(Σᵢ (uᵢ − vᵢ)²)

 

To apply this formula to our two vectors, a and b, we do:



Calculation of Euclidean distance between vectors a and b

 

And we get a distance of 0.014. Performing the same calculation for d(a, c) returns 1.145, and d(b, c) returns 1.136. Clearly, a and b are nearer in Euclidean space.
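As a minimal NumPy sketch of this calculation: the article's own vectors appear only as images, so the values of a, b, and c below are stand-ins chosen to reproduce the three distances quoted above.

```python
import numpy as np

# Stand-in vectors: a and b are near-identical, c sits much further away.
a = np.array([0.01, 0.07, 0.1])
b = np.array([0.01, 0.08, 0.11])
c = np.array([0.91, 0.57, 0.6])

def euclidean(u, v):
    # d(u, v) = sqrt(sum_i (u_i - v_i)^2)
    return np.sqrt(np.sum((u - v) ** 2))

print(round(euclidean(a, b), 3))  # 0.014
print(round(euclidean(a, c), 3))  # 1.145
print(round(euclidean(b, c), 3))  # 1.136
```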

Dot Product

 
One drawback of Euclidean distance is that orientation plays no part in the calculation; it is based solely on magnitude. This is where our other two metrics come in. The first of those is the dot product.

The dot product considers direction (orientation) and also scales with vector magnitude.

We care about orientation because similar meaning (as we will often find) can be represented by the direction of the vector — not necessarily the magnitude of it.

For example, we may find that a vector's magnitude correlates with the frequency of the word it represents in our dataset. Now, the word hi means the same as hello, and this may not be represented if our training data contained the word hi 1,000 times and hello just twice.

So, vectors' orientation is often seen as being just as important as distance (if not more so).

The dot product is calculated using:



Dot product formula: u · v = ‖u‖ ‖v‖ cos θ = Σᵢ uᵢvᵢ

 

The dot product considers the angle between vectors: where the angle is ~0°, the cos θ component of the formula equals ~1. Where the angle is nearer to 90° (orthogonal/perpendicular vectors), cos θ equals ~0, and at 180° (opposing vectors) it equals −1.

This is why the cos θ component increases the result where there is less of an angle between the two vectors. So, a higher dot product correlates with closer alignment between the vectors.

Again, let's apply this formula to our two vectors, a and b:



Calculation of dot product for vectors a and b

 

Clearly, the dot product calculation is straightforward (the simplest of the three) — and this gives us benefits in terms of computation time.
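In NumPy this is a one-liner; here is a sketch using the same stand-in vectors as before:

```python
import numpy as np

a = np.array([0.01, 0.07, 0.1])
b = np.array([0.01, 0.08, 0.11])

# u . v = sum_i u_i * v_i : a single pass of multiplies and adds
print(np.dot(a, b))  # ~0.0167
```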

However, there is one drawback. It is not normalized — meaning larger vectors will tend to score higher dot products, despite being less similar.

For example, if we calculate a·a, we would expect a higher score than for a·c (since a is an exact match to itself). But that's not how it works, unfortunately.



The dot product isn’t so great when our vectors have differing magnitudes.
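With the same stand-in vectors, a quick check makes the problem concrete:

```python
import numpy as np

a = np.array([0.01, 0.07, 0.1])
c = np.array([0.91, 0.57, 0.6])

# a is a perfect match to itself, yet the much larger c scores higher:
print(np.dot(a, a))  # ~0.015
print(np.dot(a, c))  # ~0.109 -- magnitude dominates the score
```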

 

So, in reality, the dot-product is used to identify the general orientation of two vectors — because:

  • Two vectors that point in a similar direction return a positive dot product.
  • Two perpendicular vectors return a dot product of zero.
  • Vectors that point in opposing directions return a negative dot product.
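A tiny sketch with hand-picked 2D vectors illustrates all three cases:

```python
import numpy as np

u = np.array([1.0, 1.0])

print(np.dot(u, np.array([2.0, 1.0])))    # 3.0  -> similar direction
print(np.dot(u, np.array([1.0, -1.0])))   # 0.0  -> perpendicular
print(np.dot(u, np.array([-1.0, -1.0])))  # -2.0 -> opposing direction
```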

Cosine Similarity

 
Cosine similarity considers vector orientation, independent of vector magnitude.



Cosine similarity formula: cos θ = (u · v) / (‖u‖ ‖v‖)

 

The first thing we should be aware of in this formula is that the numerator is, in fact, the dot product, which considers both magnitude and direction.

In the denominator, we have the strange double vertical bars; these mean 'the length of'. So, we have the length of u multiplied by the length of v. The length, of course, considers magnitude.

When we take a function that considers both magnitude and direction and divide it by a function that considers just magnitude, those two magnitudes cancel out, leaving us with a function that considers direction independent of magnitude.

We can think of cosine similarity as a normalized dot product! And it clearly works. The cosine similarity of a and b is close to 1 (an almost perfect match):



Calculation of cosine similarity for vectors a and b
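That calculation, as a minimal NumPy sketch with the stand-in vectors:

```python
import numpy as np

a = np.array([0.01, 0.07, 0.1])
b = np.array([0.01, 0.08, 0.11])

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(a, b))  # ~0.9999 -- a near-perfect match
```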

 

And using the sklearn implementation of cosine similarity to compare a and c again gives us much better results:



Cosine similarity can often provide much better results than the dot product.
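A usage sketch of that sklearn call, again with the stand-in vectors (note that sklearn expects 2D arrays of shape (n_samples, n_features)). The ~0.72 score for a and c sits well below the near-1.0 score for a and b, whereas the raw dot product had ranked c as the closer match:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[0.01, 0.07, 0.1]])  # 2D: one sample per row
c = np.array([[0.91, 0.57, 0.6]])

print(cosine_similarity(a, c))  # [[~0.72]] -- correctly far from 1
```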

 

That’s all for this article covering the three distance/similarity metrics — Euclidean distance, dot product, and cosine similarity.

It’s worth being aware of how each works and their pros and cons — as they’re all used heavily in machine learning, and particularly NLP.

You can find Python implementations of each metric in this notebook.

I hope you've enjoyed the article. Let me know if you have any questions or suggestions via Twitter or in the comments below. If you're interested in more content like this, I post on YouTube too.

Thanks for reading!

 
*All images are by the author, except where stated otherwise

 
Bio: James Briggs is a data scientist specializing in natural language processing and working in the finance sector, based in London, UK. He is also a freelance mentor, writer, and content creator. You can reach the author via email (jamescalam94@gmail.com).

Original. Reposted with permission.


Source: https://www.kdnuggets.com/2021/05/similarity-metrics-nlp.html
