This article was published as a part of the Data Science Blogathon
Introduction
One of the common tasks of natural language processing is to learn how to parse the texts of user queries and find the entities of the working model in them. Entities like geo, date, money, etc. can be found easily by choosing a suitable NER component and using its functionality. But if you need to find an element specific to your particular model, or you need improved search quality for a given element, you will have to create your own NER component or train an existing one for your purposes.
If you work with systems like Alexa or Google Dialogflow, the learning process is reduced to creating a simple working configuration. For each entity in the model, you create a list of synonyms; then neural networks come into play. It is fast, simple, and very convenient, and everything works right away. The downsides are that you have no control over the settings of the neural networks, and a common problem with these systems is the probabilistic nature of the search. These disadvantages may be completely unimportant for your model, especially if it looks for one or a few substantially different entities. But if the model has many elements, and especially if they overlap in some way, the problem becomes more serious.
If you are designing your own system and training and configuring the search components yourself, for example from Apache OpenNLP, Stanford NLP, Google Language API, Spacy, or Apache NlpCraft, there are a few more problems, but your control over such a system is considerably greater.
Below we will talk about how entities are searched with the help of neural networks in the Apache NlpCraft project. To begin with, we will briefly describe all the search capabilities in the system.
Finding Custom Entities in Apache NlpCraft
When building systems based on Apache NlpCraft, you can use the following options to find your own items:
- Built-in search components based on the configuration of element synonyms. Integration with any of the aforementioned external components is already provided; you just connect them in the configuration.
- Composite entities, i.e. the ability to build new NER components on top of existing ones.
- The lowest-level option: programming and connecting your own parser to the system. This task boils down to implementing an interface and adding the implementation to the system via IoC. The implemented component receives everything needed to write the logic for finding entities in the request: the request itself and its NLP representation, the model, and all the entities already found in the request text by other components. This implementation is the place for connecting neural networks, using your own algorithms, integrating with external systems, and so on; that is, the point of complete control over the search.
The first approach, based on configuring synonyms, requires no programming or text corpora to train the model, and it is the simplest, most versatile, and fastest to develop.
Below is a fragment of the configuration of the “smart home” model (you can read more about macros and the synonym DSL here).
```yaml
macros:
  - name: "<ACTION>"
    macro: "{turn|switch|dial|control|let|set|get|put}"
  - name: "<ENTIRE_OPT>"
    macro: "{entire|full|whole|total|*}"
  - name: "<LIGHT>"
    macro: "{all|*} {it|them|light|illumination|lamp|lamplight}"
...
- id: "ls:on"
  description: "Light switch ON action."
  synonyms:
    - "<ACTION> {on|up|*} <LIGHT> {on|up|*}"
    - "<LIGHT> {on|up}"
```
The “ls:on” element is described very compactly, yet this description yields over 3000 synonyms. Here is a small sample of them: “set lamp”, “light on”, “control lamp”, “put them”, “switch it all”… The synonyms are configured in a very compressed form and remain quite readable.
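To see why such a compact notation expands into thousands of variants, here is a minimal sketch of how `{a|b|*}` groups multiply out. This is an illustrative toy, not NlpCraft’s actual DSL engine (which also supports macros, regexes, sparse matching, etc.), and the pattern below only approximates the configuration above:

```python
import re
from itertools import product

def expand(pattern):
    """Expand a simplified "{a|b|*}" synonym pattern into all phrase variants.
    "*" means the group may be omitted entirely."""
    groups = []
    for token in pattern.split():
        m = re.fullmatch(r"\{(.+)\}", token)
        alternatives = m.group(1).split("|") if m else [token]
        groups.append(["" if a == "*" else a for a in alternatives])
    for combo in product(*groups):
        phrase = " ".join(word for word in combo if word)
        if phrase:
            yield phrase

# Roughly the ACTION + LIGHT + {on|up|*} shape from the configuration above.
action = "{turn|switch|dial|control|let|set|get|put}"
light = "{all|*} {it|them|light|illumination|lamp|lamplight}"
variants = set(expand(action + " " + light + " {on|up|*}"))
print(len(variants))             # 288: one compact line, hundreds of phrases
print("switch it on" in variants)  # True
```

Multiplying just a few alternative groups (8 × 2 × 6 × 3 here) is what pushes the full “ls:on” description past 3000 variants.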
A few notes:
- Of course, when searching by synonyms, the base forms of words (lemmas, stems) and the stop words of the query text are taken into consideration, support for non-contiguous multi-word synonyms can be configured, and so on.
- Some of the generated synonyms will not make any meaningful sense; this is the expected price for the compactness of the notation. If memory usage becomes a bottleneck (here we are talking about millions or more synonym variants per entity), it is worth thinking about optimization. You will receive all the necessary warnings at system startup.
So you completely control the process of searching for your elements in the query text. This process is deterministic and therefore debuggable, controllable, and open to incremental improvement. Now all you need to do is build a sufficient list of synonyms. At the start of a project you can limit yourself to just a few basic synonyms per element, just one or two words, and everything will work; but in the end you will, of course, want to maintain the most complete list of synonyms possible to ensure the highest recognition quality.
But what can tell us which synonyms are missing from the configuration, given that the quality of our system depends directly on its completeness?
Expanding the list of synonyms
The first obvious direction is to manually track logs and analyze unanswered questions.
The second is to look in a dictionary of synonyms, which can be useful for obvious cases. One of the most famous such dictionaries is WordNet.
Working in manual mode can bring some benefit, but the process of finding and configuring additional synonyms for elements is clearly not automated here.
In addition, the developers of Apache NlpCraft have added the sugsyn tool to the project, which suggests additional synonyms for model elements.
The sugsyn component works in the standard way via the REST API, interacting with an additional server supplied in the binary releases, ContextWordServer.
Description of ContextWordServer
ContextWordServer allows you to search for synonyms for a word in a given context. As a request, the user sends a sentence with a marked word for which you want to find synonyms, and the server, using the selected model, returns a list of the most suitable substitute words.
Initially, word2vec (the skip-gram model) was used as the basic model; it builds vector representations of words (embeddings). The idea was to calculate the word embeddings and, in the resulting vector space, select the words closest to the target.
In general, this approach worked satisfactorily, but the context of the words was not taken into account well enough, even with large n-gram windows. It was then proposed to use Bert for masked language modeling, i.e. finding the most suitable words that can be substituted for a mask in a sentence. In the sentence submitted by the user, the target word was masked, and Bert’s output was the answer. The disadvantage of using Bert alone was that the resulting words, while fitting the context of use, were not necessarily close synonyms of the given word.
The solution was to combine the two methods. Now Bert’s output is filtered by discarding words whose vector distance to the target word is above a certain threshold.
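The filtering step can be sketched as follows: given candidate substitutes proposed by a masked LM, keep only those whose cosine distance to the target word in the embedding space is below a threshold. The tiny hand-made vectors below stand in for real fasttext embeddings, so the words and numbers are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for a pre-trained fasttext model (assumption:
# in practice these come from fasttext, not hand-written numbers).
vectors = {
    "rain":  [0.9, 0.1, 0.0],
    "storm": [0.8, 0.2, 0.1],
    "snow":  [0.7, 0.3, 0.0],
    "radar": [0.1, 0.9, 0.3],
}

def filter_candidates(target, candidates, max_dist=0.3):
    """Keep only masked-LM candidates whose cosine distance
    (1 - cosine similarity) to the target word is below max_dist."""
    tv = vectors[target]
    return [w for w in candidates
            if w in vectors and 1 - cosine(tv, vectors[w]) < max_dist]

print(filter_candidates("rain", ["storm", "snow", "radar"]))
# ['storm', 'snow'] -- "radar" fits rainy contexts but is not a near-synonym
```

This mirrors the idea in the text: the masked LM supplies context-appropriate words, and the embedding distance throws out the ones that are not actual near-synonyms.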
Later, based on experimental results, word2vec was replaced by fasttext (the logical successor to word2vec), which showed better results. A ready-made model pre-trained on Wikipedia articles was used, and the Bert model was replaced with an improved one, Roberta (see the link to the pre-trained model).
The final algorithm does not strictly require fasttext or Roberta as its components. Both subsystems can be replaced with alternatives capable of solving similar problems, and the pre-trained models can be fine-tuned or trained from scratch for better results.
Sugsyn: the tool description
The most convenient way to work with sugsyn is through the CLI. You need to download the latest binary release; using NlpCraft via Maven Central alone will not be enough in this case.
The general principle of sugsyn is simple. It collects the usage examples from all the intents of the requested model and combines them with all the configured synonyms of each model element. The generated hybrid sentences are sent to ContextWordServer, which returns recommendations for additional words usable in the requested context. Note that while training a neural network requires prepared and labeled data, sugsyn also requires prepared data, namely usage examples; the advantage is that very little data is needed, and we can reuse the examples that already exist in the system tests.
An example. Suppose an element of some model is configured with two synonyms, “ping” and “buzz” (a shortened version of the model from the alarm clock example), and the model’s only intent contains two examples: “Ping me in 3 minutes” and “In an hour and 15mins, buzz me.” Then sugsyn will send two requests to ContextWordServer:
- text="ping me in 3 minutes", index=0
- text="buzz me in 3 minutes", index=0
This means the following: for each sentence (the “text” field), suggest additional suitable words for the position given by the “index” field, in addition to the word already there. ContextWordServer returns several options sorted by total weight (a combination of the results of both models; the influence of each model can be configured) and by how often each word is proposed across contexts.
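The generation of these hybrid requests can be sketched as follows. This is a simplification of what sugsyn actually does (the real tool also handles multi-word synonyms, lemmas, multiple intents, etc.):

```python
def hybrid_requests(intent_examples, element_synonyms):
    """For every intent example, find the word matching one of the element's
    configured synonyms and substitute each synonym into that position,
    recording the index of the word to ask ContextWordServer about."""
    requests = []
    for example in intent_examples:
        words = example.lower().split()
        for i, word in enumerate(words):
            if word in element_synonyms:
                for synonym in element_synonyms:
                    requests.append({
                        "text": " ".join(words[:i] + [synonym] + words[i + 1:]),
                        "index": i,
                    })
    return requests

reqs = hybrid_requests(["Ping me in 3 minutes"], ["ping", "buzz"])
print(reqs)
# [{'text': 'ping me in 3 minutes', 'index': 0},
#  {'text': 'buzz me in 3 minutes', 'index': 0}]
```

Run on the first alarm clock example, this reproduces exactly the two requests listed above.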
Using the Sugsyn Tool: Getting Started
- Launch ContextWordServer, the server with the ready-made, pre-trained models:
> cd ~/apache/incubator-nlpcraft/nlpcraft/src/main/python/ctxword
> ./bin/start_server.sh
Note that ContextWordServer must be installed beforehand and requires Python 3.6–3.8 to work. See “install_dependencies.sh” for Linux and macOS, or the installation manual for Windows. Keep in mind that the installation downloads model files of substantial size.
- Launch the CLI, start the probe with a prepared model, and connect to it; see the Quick Start section for details. For initial experiments you can prepare your own model or use the examples already provided in the distribution.
- Use the sugsyn command with one required parameter, the identifier of the model, which should already be deployed in the probe. The second, optional parameter is the minimum confidence score of the results; we will discuss it below.
- Everything described above is covered step by step in the manual at the link.
Getting Results for Different Models
Let’s start with the weather forecast example. The wt:phen element is configured with many synonyms, including “rain”, “storm”, “sun”, “sunshine”, “cloud”, “dry”, etc., and the wt:fcast element with “future”, “forecast”, “prognosis”, “prediction”, etc.
Here is part of sugsyn’s answer to a query with minScore=0:
> sugsyn --md=nlpcraft.weather.ex --minScore=0

"wt:phen": [
  { "score": 1.00000, "synonym": "flooding" },
  ...
  { "score": 0.55013, "synonym": "freezing" },
  ...
  { "score": 0.09613, "synonym": "stop" },
  { "score": 0.09520, "synonym": "crash" },
  { "score": 0.09207, "synonym": "radar" },
  ...
]
"wt:fcast": [
  { "score": 1.00000, "synonym": "outlook" },
  { "score": 0.69549, "synonym": "news" },
  { "score": 0.68009, "synonym": "trend" },
  ...
  { "score": 0.04898, "synonym": "where" },
  { "score": 0.04848, "synonym": "classification" },
  { "score": 0.04826, "synonym": "the" },
  ...
]
As we can see from the answer, the higher the score, the more valuable the proposed synonyms for the model elements. Here, the interesting synonyms start at a score of about 0.5. The confidence score is an integral indicator that takes into account the outputs of both models and the frequency with which the system proposes a synonym in different contexts.
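How such an integral score might be computed can be sketched as follows. The exact weighting inside NlpCraft is not documented here, so the formula, the 50/50 weights, and the sample numbers are assumptions for illustration only:

```python
def aggregate_scores(suggestions, ft_weight=0.5, bert_weight=0.5):
    """Combine per-context model scores for each suggested synonym (one tuple
    per hybrid sentence in which it was proposed) and normalize so the best
    suggestion gets 1.0. A word proposed in many contexts accumulates weight."""
    totals = {}
    for word, ft_score, bert_score in suggestions:
        totals[word] = totals.get(word, 0.0) + ft_weight * ft_score + bert_weight * bert_score
    best = max(totals.values())
    return {word: round(total / best, 5) for word, total in totals.items()}

scores = aggregate_scores([
    ("flooding", 0.9, 0.8),   # proposed in two different contexts
    ("flooding", 0.8, 0.9),
    ("radar",    0.4, 0.2),   # proposed once, with weak scores
])
print(scores)
# {'flooding': 1.0, 'radar': 0.17647}
```

This reproduces the shape of the output above: repeated, strongly-scored suggestions float to 1.0 while one-off weak matches sink toward 0.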
Let’s see what is suggested as additional synonyms for the elements of the “smart home” example. For “ls:loc”, which describes the location of lighting elements (synonyms: “kitchen”, “library”, “closet”, etc.), the options with scores above 0.5 also look noteworthy:
"lc:loc": [ { "score": 1.00000, "synonym": "apartment" }, { "score": 0.96921, "synonym": "bed" }, { "score": 0.93816, "synonym": "area" }, { "score": 0.91766, "synonym": "hall" }, ... { "score": 0.53512, "synonym": "attic" }, { "score": 0.51609, "synonym": "restroom" }, { "score": 0.51055, "synonym": "street" }, { "score": 0.48782, "synonym": "lounge" },
…
But already around the 0.5 mark we come across outright rubbish for our model, such as “street”.
For the “x:alarm” element of the alarm clock model, with synonyms “ping”, “buzz”, “wake”, “call”, “hit”, etc., we have the following result:
"x:alarm": [ { "score": 1.00000, "synonym": "ask" }, { "score": 0.94770, "synonym": "join" }, { "score": 0.73308, "synonym": "remember" }, ... { "score": 0.51398, "synonym": "stop" }, { "score": 0.51369, "synonym": "kill" }, { "score": 0.50011, "synonym": "send" },
…
That is, for the elements of this model, the quality of the proposed synonyms degrades much faster as the score decreases.
For the “x:time” element of the “current time” model (1, 2), with synonyms like “what time”, “clock”, “date-time”, “date and time”, etc., we have the following result:
"x:time": [ { "score": 1.00000, "synonym": "night" }, { "score": 0.92325, "synonym": "year" }, { "score": 0.58671, "synonym": "place" }, { "score": 0.55458, "synonym": "month" }, { "score": 0.54937, "synonym": "events" }, { "score": 0.54466, "synonym": "pictures"},
…
The quality of the proposed synonyms turned out to be unsatisfactory even with high coefficients.
Evaluation of results
Let us list the factors that determine the quality of the synonyms sugsyn suggests for finding model elements in text:
- The number of user-defined synonyms in the element configuration.
- The number and quality of the example requests attached to the intents. Quality here means the naturalness and prevalence of the added examples. Ask Google “what time is it” and the number of results returned will be “approximately 2,630,000”; such a query is of higher quality and better suited as an example.
- The most important and least predictable factor: the quality depends on the nature of the element itself and of its model.
In other words, even with other things being equal, such as a limited number of pre-configured synonyms and use cases, for some types of entities the neural network will provide a better set of synonyms and for others a worse one, and this depends on the nature of the entities and models themselves. That is, to get a comparable level of search quality when choosing synonyms from the proposed lists, we must use different values of the minimum confidence score for different elements and different models. This is quite understandable, since even the same words in different semantic contexts can be replaced by somewhat different sets of substitute words.

The pre-trained model shipped by default with ContextWordServer may suit some entity types better than others, and reconfiguring it may not improve the final result. Because Apache NlpCraft is an open-source solution, you can always change all the model settings, both the parameters and the model itself, taking the specifics of your domain into consideration.
When working with sugsyn, besides the model identifier you can pass only one parameter, minScore, which simply bounds the sample of synonyms sorted in decreasing order of quality. This simplification is meant to make life easier for those responsible for expanding the synonym lists of a custom model: the abundance of configuration parameters inherent in working with neural networks would only confuse the users of the system.
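The effect of minScore is easy to picture: it truncates each element’s sorted suggestion list at the threshold. A hypothetical sketch over output shaped like the weather example above:

```python
def apply_min_score(suggestions, min_score):
    """Keep, per model element, only the suggestions whose score is at least
    min_score. Conceptually mirrors passing --minScore to sugsyn; the data
    layout here just imitates the tool's JSON output."""
    return {element: [s for s in synonyms if s["score"] >= min_score]
            for element, synonyms in suggestions.items()}

result = apply_min_score(
    {"wt:phen": [{"score": 1.00000, "synonym": "flooding"},
                 {"score": 0.55013, "synonym": "freezing"},
                 {"score": 0.09613, "synonym": "stop"}]},
    min_score=0.5,
)
print([s["synonym"] for s in result["wt:phen"]])
# ['flooding', 'freezing']
```

A single threshold keeps the workflow simple: raise it for models where quality degrades quickly (like “x:alarm”), lower it where even mid-range suggestions are usable.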
Conclusion
If you completely delegate the process of searching for your entities to neural networks, then for different elements and different types of models you will receive results of noticeably different recognition quality. You may not even be able to observe this difference, especially if you do not control the settings of the network; and tuning the desired trigger threshold yourself will not be a simple task, usually because there is not enough data for correct training and testing of the network. Using Apache NlpCraft’s mechanism for finding entities through a set of synonyms, together with the tool that suggests options for enriching this list, you can independently control the reliability of your model’s synonyms and select from the network’s suggestions only the options you approve.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Source: https://www.analyticsvidhya.com/blog/2021/09/search-by-synonyms-in-the-text-using-nlp/
- "
- 7
- 9
- Account
- Action
- Additional
- ADvantage
- Alexa
- algorithm
- algorithms
- All
- analytics
- Apache
- api
- AREA
- article
- articles
- Automated
- BEST
- build
- Building
- cases
- change
- classification
- Cloud
- component
- Crash
- data
- delivery
- develop
- developers
- distance
- Early
- enriching
- etc
- events
- expanding
- FAST
- First
- fit
- form
- General
- here
- High
- How
- How To
- HTTPS
- Hybrid
- idea
- image
- Including
- incubator
- integral
- integration
- intent
- issuance
- IT
- Java
- join
- language
- large
- latest
- launch
- LEARN
- learning
- Level
- light
- Limited
- Line
- LINK
- linux
- List
- mac
- Macro
- mask
- Media
- model
- modeling
- money
- Near
- network
- networks
- Neural
- neural network
- neural networks
- news
- nlp
- Offers
- opportunities
- Option
- Options
- order
- Other
- Others
- Outlook
- payment
- ping
- princeton
- probe
- Programming
- project
- Python
- quality
- radar
- Results
- returns
- Science
- Search
- selected
- sense
- set
- setting
- Short
- Simple
- Size
- small
- SOLVE
- Space
- stanford
- start
- startup
- Storm
- street
- sunshine
- support
- Switch
- system
- Systems
- talking
- Target
- Testing
- tests
- Thinking
- time
- Training
- us
- users
- value
- WHO
- Wikipedia
- windows
- words
- Work
- works
- worth
- writing
- X
- year