Python String Matching Without Complex RegEx Syntax


I have a love-hate relationship with regular expressions (RegEx), especially in Python. I love how you can extract or match strings without writing multiple logical functions. It is even better than the built-in string search functions.

What I don’t like is how hard it is for me to learn and understand RegEx patterns. I can deal with simple string matching, such as extracting all alphanumeric characters and cleaning text for NLP tasks. Things get harder when it comes to extracting IP addresses, emails, and IDs from junk text. You have to write a complex RegEx pattern to extract the required item.

To make complex RegEx tasks simple, we will learn about a simple Python Package called pregex. Furthermore, we will also look at a few examples of extracting dates and emails from a long string of text.

PRegEx is a higher-level API built on top of the `re` module. It lets you build regular expressions without writing raw RegEx patterns, which makes them easier for any programmer to understand and remember. Moreover, you don’t have to group patterns or escape metacharacters yourself, and the patterns are modular.

You can install the library using pip.

pip install pregex

To test the powerful functionality of PRegEx, we will use modified sample code from the documentation.

In the example below, we are extracting either an HTTP URL or an IPv4 address with a port number. We don’t have to create complex logic for it. We can use the built-in `HttpUrl` and `IPv4` classes.

Create a port number using AnyDigit(). The first digit of the port should not be zero, and the next three can be any digit.
Use Either() to combine the alternatives, so the pattern matches either an HTTP URL or an IP address with a port number.

from pregex.core.pre import Pregex
from pregex.core.classes import AnyDigit
from pregex.core.operators import Either
from pregex.meta.essentials import HttpUrl, IPv4

port_number = (AnyDigit() - '0') + 3 * AnyDigit()

pre = Either(
    HttpUrl(capture_domain=True, is_extensible=True),
    IPv4(is_extensible=True) + ':' + port_number
)
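To make the port-number step concrete, here is a minimal sketch of the same rule written by hand with the standard `re` module. It assumes that `(AnyDigit() - '0') + 3 * AnyDigit()` corresponds to `[1-9]\d{3}`, which is how the port appears at the end of the generated pattern; the constant `PORT_PATTERN` and the helper `is_four_digit_port` are illustrative names, not part of PRegEx.

```python
import re

# Hand-written counterpart of (AnyDigit() - '0') + 3 * AnyDigit():
# one digit from 1-9 followed by exactly three more digits.
PORT_PATTERN = re.compile(r"[1-9]\d{3}")


def is_four_digit_port(candidate: str) -> bool:
    """Return True when the whole string is a four-digit port without a leading zero."""
    return PORT_PATTERN.fullmatch(candidate) is not None


print(is_four_digit_port("8000"))  # accepted: starts with 8
print(is_four_digit_port("0800"))  # rejected: leading zero
print(is_four_digit_port("443"))   # rejected: only three digits
```

Note that this rule only accepts ports with exactly four digits, so common ports such as 80 or 443 would need a looser pattern.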

We will use a long string of text with characters and descriptions.

text = """IPV4--192.168.1.1:8000-- address--https://www.abid.works-- website--https://kdnuggets.com--text"""

Before we extract the matching string, let’s look at the RegEx pattern.

regex_pattern = pre.get_pattern()
print(regex_pattern)

Output

As we can see, it is hard to read or even understand what is going on. This is where PRegEx shines: it provides a human-friendly API for performing complex regular expression tasks.

(?:https?:\/\/)?(?:www\.)?(?:[a-z\dA-Z][a-z\-\dA-Z]{,61}[a-z\dA-Z]\.)*([a-z\dA-Z][a-z\-\dA-Z]{,61}[a-z\dA-Z])\.[a-z]{2,6}(?::\d{1,4})?(?:\/[!-.0-~]+)*\/?(?:(?<=[!-\/\[-`{-~:-@])|(?<=\w))|(?:(?:\d|[1-9]\d|1\d{2}|2(?:[0-4]\d|5[0-5]))\.){3}(?:\d|[1-9]\d|1\d{2}|2(?:[0-4]\d|5[0-5])):[1-9]\d{3}

Similar to `re.findall`, `.get_matches(text)` returns every substring of the text that matches the pattern.

results = pre.get_matches(text)
print(results)

Output

We have extracted both the IP address with port number and two web URLs.

['192.168.1.1:8000', 'https://www.abid.works', 'https://kdnuggets.com']
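For comparison, the IPv4-with-port half of the pattern can be reproduced with plain `re`. The sketch below is hand-written and covers only that half; it assumes each octet follows the 0-255 alternation visible in the generated pattern, and the names `OCTET` and `IPV4_WITH_PORT` are introduced here for illustration.

```python
import re

# One octet in the range 0-255, following the alternation in the generated pattern.
OCTET = r"(?:\d|[1-9]\d|1\d{2}|2(?:[0-4]\d|5[0-5]))"

# Three "octet." groups, a final octet, a colon, and a four-digit port.
IPV4_WITH_PORT = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}:[1-9]\d{{3}}")

text = (
    "IPV4--192.168.1.1:8000-- address--https://www.abid.works-- "
    "website--https://kdnuggets.com--text"
)

# Only non-capturing groups are used, so findall returns whole matches.
ip_matches = IPV4_WITH_PORT.findall(text)
print(ip_matches)
```

Even this half requires careful escaping and doubled braces inside the f-string, which is the kind of detail the PRegEx version hides.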

Let’s look at a couple of examples where we can understand the full potential of PRegEx.

In this example, we will be extracting certain kinds of date patterns from the text below.

text = """ 04-15-2023 2023-08-15 06-20-2023 06/24/2023 """

By using Exactly() and AnyDigit(), we will create the day, month, and year of the date. The day and month have two digits, whereas the year has 4 digits. They are separated by “-” dashes.

After creating the pattern, we will run `get_matches` to extract the matching strings.

from pregex.core.classes import AnyDigit
from pregex.core.quantifiers import Exactly

day_or_month = Exactly(AnyDigit(), 2)
year = Exactly(AnyDigit(), 4)

pre = (
    day_or_month + "-" + day_or_month + "-" + year
)

results = pre.get_matches(text)
print(results)

Output

['04-15-2023', '06-20-2023']

Let’s look at the RegEx pattern by using the `get_pattern()` function.

regex_pattern = pre.get_pattern()
print(regex_pattern)

Output

As we can see, it has a simple RegEx syntax.

\d{2}-\d{2}-\d{4}
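Because the pattern checks only the shape of the string, something like 99-99-2023 would also match. One possible follow-up, sketched below with the standard library only, is to filter the candidates through `datetime.strptime`. The MM-DD-YYYY ordering, the extra invalid sample value, and the helper `is_real_date` are assumptions made for this illustration.

```python
import re
from datetime import datetime

# Same shape as the generated pattern: two digits, two digits, four digits.
DATE_PATTERN = re.compile(r"\d{2}-\d{2}-\d{4}")

# The tutorial's sample text plus one impossible date (an added test value).
text = " 04-15-2023 2023-08-15 06-20-2023 06/24/2023 99-99-2023 "


def is_real_date(candidate: str) -> bool:
    """Return True if the candidate parses as a calendar date in MM-DD-YYYY order (assumed)."""
    try:
        datetime.strptime(candidate, "%m-%d-%Y")
        return True
    except ValueError:
        return False


candidates = DATE_PATTERN.findall(text)
valid_dates = [c for c in candidates if is_real_date(c)]

print(candidates)
print(valid_dates)
```

The regular expression finds the candidates, and the date parser rejects those that are not real calendar dates.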

The second example is a bit more complex: we will extract valid email addresses from junk text.

text = """
    user1@abid.works
    editorial@@kdnuggets.com
    lover@python.gg.
    editorial1@kdnuggets.com
    """

Create a user pattern with `OneOrMore()`. We will use `AnyButFrom()` to exclude “@” and space from the allowed characters.
Similar to the user pattern, we create a company pattern by also excluding the additional character “.”.
For the domain, we will use `MatchAtLineEnd()` to anchor the match at the end of the line, with two or more characters other than “@”, space, and full stop.
Combine all three to create the final pattern: user@company.domain.

from pregex.core.classes import AnyButFrom
from pregex.core.quantifiers import OneOrMore, AtLeast
from pregex.core.assertions import MatchAtLineEnd

user = OneOrMore(AnyButFrom("@", ' '))
company = OneOrMore(AnyButFrom("@", ' ', '.'))
domain = MatchAtLineEnd(AtLeast(AnyButFrom("@", ' ', '.'), 2))

pre = (
    user +
    "@" +
    company +
    '.' +
    domain
)

results = pre.get_matches(text)
print(results)

Output

As we can see, PRegEx has identified two valid email addresses.

['user1@abid.works', 'editorial1@kdnuggets.com']
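The three-part logic (user, company, domain anchored at the end of the line) can also be approximated by hand with the `re` module. The sketch below is not the pattern PRegEx generates: as a simplifying assumption it excludes all whitespace rather than only the space character, and it relies on `re.MULTILINE` so that `$` matches at the end of each line. `EMAIL_PATTERN` is an illustrative name.

```python
import re

# user: one or more characters other than "@" or whitespace
# company: the same, but "." is excluded as well
# domain: at least two such characters, anchored at the end of a line
EMAIL_PATTERN = re.compile(
    r"[^@\s]+@[^@\s.]+\.[^@\s.]{2,}$",
    flags=re.MULTILINE,
)

# One candidate per line, mirroring the tutorial's sample text.
text = "\n".join([
    "user1@abid.works",
    "editorial@@kdnuggets.com",
    "lover@python.gg.",
    "editorial1@kdnuggets.com",
])

emails = EMAIL_PATTERN.findall(text)
print(emails)
```

The line with the doubled “@” fails because the company part cannot start with “@”, and the line ending in a full stop fails because the domain must run up to the end of the line.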

Note: both code examples are modified versions of work by The PyCoach.

If you are a data scientist, analyst, or NLP enthusiast, you should use PRegEx to clean text and create simple matching logic. It will reduce your dependency on NLP frameworks, as most of the matching can be done using a simple API.

In this mini tutorial, we have learned about the Python package PRegEx and its use cases with examples. You can learn more by reading the official documentation or solving a wordle problem using programmable regular expressions.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.