Text Analysis in Natural Language Processing using Julia


This article was published as a part of the Data Science Blogathon

Overview of Text Analysis in Julia

This article focuses on making you comfortable with Julia's text-processing tools, with brief explanations of how to use them in projects. We look at Julia's TextAnalysis package, its functionality, and the document types it supports. The aim is to balance a concise listing of capabilities, for readers already experienced in NLP who simply want to see the Julia tooling, with more detailed explanations and usage examples for those just starting out in Natural Language Processing (NLP). If you are new to Julia, you can refer to Analytics Vidhya.

Table of Contents

  1. Introduction

  2. Text Analysis in Julia

  3. Documents

3.1. File Document

3.2. Token Document

3.3. String Document

3.4. NGram Document

  4. Document preprocessing

  5. Classification of documents

  6. Conclusion

Introduction

Julia is one of the fastest-growing languages in the field of data science. Here we not only use Julia to learn NLP but also walk through a step-by-step implementation of text processing in NLP. Text processing is the foundation of Natural Language Processing and includes text analysis, feature extraction, data cleaning, tokenization, and more. In NLP, analyzing and processing text is a constantly recurring task: one that has been solved, is being solved, and will keep being solved. Here we discuss how to approach this problem specifically in the Julia language. Julia is attractive for NLP because of its syntactic simplicity and its excellent mathematical tooling, which make it easy to immerse yourself in tasks such as clustering and classifying texts.


Text Analysis in Julia

The primary library implementing a minimal set of typical text-processing functions in Julia is the TextAnalysis.jl package. We can explore this library through some common examples from its official documentation.

Official documentation: https://juliatext.github.io/TextAnalysis.jl/dev/

Note: .jl is the file extension used for Julia source files and, by convention, Julia package names.

Text Documents Supported by Julia

There are 4 types of documents supported by TextAnalysis.jl for analysis purposes:

3.1. File Document – A document represented as a plain text file on disk.

julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"

julia> fd = FileDocument(pathname)
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: S S’ SD SD SD’S SHIA

3.2. Token Document – A document that is a sequence of UTF-8 tokens (individual words or symbols). The TokenDocument structure stores the set of tokens, so the full original text cannot be recovered without loss, because the separators between tokens are discarded.

julia> my_tokens = String["To", "clean", "or", "not", "to", "clean..."]
6-element Array{String,1}:
 "To"
 "clean"
 "or"
 "not"
 "to"
 "clean..."

julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

3.3. String Document – A document represented as a UTF-8 string and stored in main memory (RAM). The StringDocument structure stores the entire text.

julia> stringI = "cute or not cute..."
"cute or not cute..."

julia> sd = StringDocument(stringI)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: cute or not cute...

3.4. NGram Document – A document represented as a set of n-grams in UTF-8, that is, contiguous sequences of n tokens (or characters), together with a counter recording how often each one occurs. This representation is one of the simplest ways to sidestep some problems of language morphology, typos, and idiosyncratic constructions in the analyzed texts.

julia> my_ngrams = Dict{String, Int}("To" => 1, "clean" => 2, "or" => 1,
                                     "not" => 1, "to" => 1, "clean..." => 1)
Dict{String,Int64} with 6 entries:
  "or"       => 1
  "clean..." => 1
  "not"      => 1
  "to"       => 1
  "To"       => 1
  "clean"    => 2

julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
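To make concrete what kind of table an NGram Document stores, here is a plain-Julia sketch with no packages; `count_ngrams` is an illustrative helper written for this article, not part of the TextAnalysis.jl API:

```julia
# Build a token-count dictionary like the one an NGramDocument stores.
# For n = 1 each distinct token maps to its number of occurrences;
# for n > 1 the keys are space-joined runs of n consecutive tokens.
function count_ngrams(tokens::Vector{String}, n::Int=1)
    counts = Dict{String,Int}()
    for i in 1:(length(tokens) - n + 1)
        gram = join(tokens[i:i+n-1], " ")          # the i-th n-gram
        counts[gram] = get(counts, gram, 0) + 1    # increment its counter
    end
    return counts
end

unigrams = count_ngrams(["to", "clean", "or", "not", "to", "clean"])
# unigrams["to"] == 2, unigrams["clean"] == 2, unigrams["or"] == 1

bigrams = count_ngrams(["to", "clean", "or", "not", "to", "clean"], 2)
# bigrams["to clean"] == 2
```

A dictionary like `unigrams` is exactly the shape of input that the NGramDocument constructor above accepts.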

We can also create a document with the generic Document constructor, and the library will select the appropriate concrete document type.

julia> Document("cute or not cute...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: cute or not cute...

julia> Document("/usr/share/dict/words")
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: S S’ SD SD SD’S SHIA

julia> Document(String["To", "clean", "or", "not", "to", "clean..."])
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

julia> Document(Dict{String, Int}("a" => 1, "b" => 3))
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

A document object carries everything: the text, the tokens, even the metadata. But what if you want nothing but the text? To obtain the text of a document, use the method text(..):

julia> td = TokenDocument("cute or not cute...")
TokenDocument{String}(["cute", "or", "not", "cute"],
TextAnalysis.DocumentMetadata(Languages.English(),
"Untitled Document", "Unknown Author", "Unknown Time"))

julia> text(td)

julia> tokens(td)
4-element Array{String,1}:
 "cute"
 "or"
 "not"
 "cute"

The automatically parsed tokens match the example shown in the official documentation. Notice that the call text(td) issues a warning, since a TokenDocument does not store word separators and can only approximate the original text. The call tokens(td) returns just the extracted tokens.

This is the way to request metadata from a document:

julia> sd = StringDocument("This document has too many words")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This document has too many words

julia> language(sd)
Languages.English()

julia> title(sd)
"Untitled Document"

julia> author(sd)
"Unknown Author"

julia> timestamp(sd)
"Unknown Time"

All of these can be changed with the corresponding mutating functions. Julia follows the same convention as Ruby: a function that modifies its argument carries an exclamation mark (!) as a suffix:

julia> using Languages

julia> language!(sd, Languages.English())
Languages.English()

julia> title!(sd, "Document")
"Document"

julia> author!(sd, "Shia")
"Shia"

julia> import Dates: now

julia> timestamp!(sd, string(now()))
"2021-10-11T22:53:38.383"

Text Document Preprocessing in Julia

In document preprocessing we will perform various operations like removal of punctuation marks, conversion to lowercase, removal of garbage words, etc.

A). Removal of punctuation marks – If the document's text came from some external representation, it is quite possible that the byte stream contains encoding errors. The function remove_corrupt_utf8!(sd) eliminates them. The prepare!(..) method is the workhorse of document processing in the TextAnalysis package of Julia. We can use it to remove punctuation marks from the text:

julia> str = StringDocument("high usage of punctuations here !!!...")

julia> prepare!(str, strip_punctuation)

julia> text(str)
"high usage of punctuations here"

B). Lowercase Conversion – Converting all letters to lowercase is very important in text processing, as it makes it easier to compare words with each other. We must understand that this can lose important information about the text, for example that a word is a proper noun or marks a sentence boundary; it all depends on the downstream processing model. Lowercase conversion is done by the function remove_case!().

julia> sd = StringDocument("Rachit is mad")
A StringDocument{String}

julia> remove_case!(sd)

julia> text(sd)
"rachit is mad"

C). Garbage Removal – In TextAnalysis.jl we can remove garbage words, that is, words that are not useful for information retrieval or match analysis. This can be done explicitly using the function remove_words!() and an array of those stop words.

julia> sd = StringDocument("Rachit is mad")

julia> remove_words!(sd, ["Rachit"])

julia> text(sd)
" is mad"

D). Others – Among the words to be deleted there are many entities that are useless for analysis: articles, prepositions, pronouns, numbers, and plain stop words that occur with parasitic frequency. The Languages.jl package provides individual word lists for many specific languages. Numbers are a critical case: in a future term-document model they can greatly increase the dimension of the matrix without improving, for example, the clustering of texts. In search tasks, however, it is not always possible to simply drop numbers.
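The idea of stop-word filtering can be sketched in plain Julia; the STOPWORDS set and remove_stopwords helper below are hand-made illustrations, not the Languages.jl word lists that TextAnalysis uses internally:

```julia
# A tiny hand-made stop-word list; real word lists (Languages.jl) are far larger.
const STOPWORDS = Set(["a", "an", "the", "is", "of", "to", "in"])

# Keep only the tokens whose lowercased form is not a stop word.
remove_stopwords(tokens) = [t for t in tokens if !(lowercase(t) in STOPWORDS)]

remove_stopwords(["The", "cat", "is", "in", "the", "house"])
# → ["cat", "house"]
```

Note that the filter lowercases each token before the lookup, so "The" and "the" are both dropped; this is why lowercase conversion usually precedes stop-word removal.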

The options available for the cleaning methods are as follows:

  • prepare!(stringDocument, strip_html_tags) # to remove HTML tags

  • prepare!(stringDocument, strip_indefinite_articles) # to remove indefinite articles

  • prepare!(stringDocument, strip_definite_articles) # to remove definite articles

  • prepare!(stringDocument, strip_prepositions) # to remove prepositions

  • prepare!(stringDocument, strip_pronouns) # to remove pronouns

  • prepare!(stringDocument, strip_stopwords) # to remove stopwords

  • prepare!(stringDocument, strip_numbers) # to remove numbers

  • prepare!(stringDocument, strip_non_letters) # to remove non-letters

  • prepare!(stringDocument, strip_articles) # to remove articles

  • prepare!(stringDocument, strip_sparse_terms) # to remove sparse terms

  • prepare!(stringDocument, strip_frequent_terms) # to remove frequent terms

You can also combine two or more options in a single call. For example, one prepare! call can simultaneously remove articles, numbers, and HTML tags: prepare!(sd, strip_articles | strip_numbers | strip_html_tags).

Another kind of processing is stemming: selecting the stem of each word by removing endings and suffixes. This lets you merge different word forms and dramatically reduce the dimension of the document representation model. It requires dictionaries, so the language of the documents must be indicated explicitly. An example of processing in English:

julia> sd = StringDocument("cat gnawed on sweet cakes")
A StringDocument{String}

julia> language!(sd, Languages.English())
Languages.English()

julia> stem!(sd)

julia> text(sd)
"cat gnaw on sweet cake"
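To show the idea behind stemming in isolation, here is a deliberately naive plain-Julia sketch; toy_stem is a hypothetical helper invented for this article, nothing like the real language-aware stemmer behind stem!, which applies proper dictionary- and rule-based suffix handling:

```julia
# Crude suffix stripping: drop a common English suffix if enough of the
# word remains. Real stemmers (e.g. Porter/Snowball) use ordered rule
# sets and language dictionaries; this is only a teaching toy.
function toy_stem(word::AbstractString)
    for suffix in ("ing", "ed", "s")
        if endswith(word, suffix) && length(word) > length(suffix) + 2
            return chop(word, tail=length(suffix))  # remove the suffix
        end
    end
    return word  # no rule applied
end

map(toy_stem, ["gnawed", "cakes", "sweet", "cleaning"])
# → ["gnaw", "cake", "sweet", "clean"]
```

Even this toy version shows the payoff: "gnawed" and "gnawing" would collapse to the same stem, shrinking the vocabulary of a term-document matrix.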

Text Document Classification in Julia

The Naive Bayes Classifier is the only built-in classifier offered by TextAnalysis.jl. If you are not familiar with it, please refer to the official documentation. You can easily wire it into your project without any extra machinery: all you need is a model with a vocabulary and the classes to which samples should be assigned. Use the constructor NaiveBayesClassifier() to create the model and the method fit!() to train it.

using TextAnalysis: NaiveBayesClassifier, fit!, predict
m = NaiveBayesClassifier([:crazy, :cute])
fit!(m, "this is a crazy cat", :crazy)
fit!(m, "this is a cute cat", :cute)

In order to check the training results, you have to use the method predict:

julia> predict(m, "we can consider that this cat is cute")
Dict{Symbol,Float64} with 2 entries:
  :cute  => 0.666667
  :crazy => 0.333333

From this output you can easily conclude that :cute is the more likely class for the text, which is absolutely correct. You might wonder why, if Julia is so rich in packages, it offers only one classifier. It is not so: the Julia ecosystem has classifiers such as AdaBoost, Bagging, Bernoulli NB, Complement NB, Constant, XGBoost, Decision Tree, and many more. But here we are only discussing the TextAnalysis package, which itself contains only one classifier; all the others live in other Julia packages such as MLJ.jl and Flux.jl.
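To see roughly what such a classifier does under the hood, here is a minimal from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing in plain Julia. TinyNB, train!, and classify are illustrative names invented here, not TextAnalysis.jl API, and a uniform class prior is assumed:

```julia
# Per-class word counts, per-class totals, and a shared vocabulary.
struct TinyNB
    counts::Dict{Symbol,Dict{String,Int}}
    totals::Dict{Symbol,Int}
    vocab::Set{String}
end
TinyNB(classes) = TinyNB(Dict(c => Dict{String,Int}() for c in classes),
                         Dict(c => 0 for c in classes), Set{String}())

# Count the words of one labelled sample into the model.
function train!(m::TinyNB, text::String, class::Symbol)
    for w in split(lowercase(text))
        m.counts[class][w] = get(m.counts[class], w, 0) + 1
        m.totals[class] += 1
        push!(m.vocab, w)
    end
    return m
end

# Score each class by summed log-likelihood with add-one smoothing
# (uniform prior assumed), then normalize to probabilities.
function classify(m::TinyNB, text::String)
    V = length(m.vocab)
    scores = Dict{Symbol,Float64}()
    for c in keys(m.counts)
        logp = 0.0
        for w in split(lowercase(text))
            logp += log((get(m.counts[c], w, 0) + 1) / (m.totals[c] + V))
        end
        scores[c] = logp
    end
    z = sum(exp, values(scores))
    return Dict(c => exp(s) / z for (c, s) in scores)
end

m = TinyNB([:crazy, :cute])
train!(m, "this is a crazy cat", :crazy)
train!(m, "this is a cute cat", :cute)
probs = classify(m, "cute cat")
# probs[:cute] > probs[:crazy], mirroring the predict(..) result above
```

The smoothing term (+1 in the numerator, +V in the denominator) is what keeps unseen words from zeroing out a class's probability, which is the same reason the library classifier degrades gracefully on new vocabulary.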

Conclusion

The TextAnalysis package of Julia provides many ready-made functions that make NLP much easier, and it takes the stress out of learning difficult techniques for cleaning data, such as text scraped from the web.


The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Source: https://www.analyticsvidhya.com/blog/2021/10/text-analysis-in-natural-language-processing-using-julia/
