Brief Introduction to Natural Language Processing & Regular Expressions (Part 1)


[Image: originally from Tertiary Courses]

0.0 Setup

This guide was written in Python 3.6.

0.1 Python & Anaconda

Download Python and Pip.

0.2 Libraries

We'll be working with the re library for regular expressions and nltk for natural language processing techniques. The re module ships with Python's standard library, so only nltk needs to be installed. Enter the following command into your terminal:

pip3 install nltk

0.3 Other

Since we'll be working on textual analysis, we'll be using datasets that are already well established and widely used. To gain access to these datasets, enter the following command into your command line: (Note that this might take a few minutes!)

sudo python3 -m nltk.downloader all

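Sentence boundary detection with spaCy requires the dependency parse, which in turn requires model data to be installed. The command below matches the spaCy 1.x releases that were current when this guide was written; newer spaCy versions use python3 -m spacy download followed by a model name instead. Enter the following command in your terminal.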

python3 -m spacy.en.download all

Lastly, download the data we'll be working with in this example!

Positive Tweets

Negative Tweets

Now you're all set to begin!

1.0 Background

1.1 What is NLP?

Natural Language Processing, or NLP, is an area of computer science that focuses on developing techniques to produce machine-driven analyses of text.

1.2 Why is Natural Language Processing Important?

NLP expands the sheer amount of data that can be used for insight. Since so much of the data we have available is in the form of text, this is extremely important to data science!

A common application of NLP is machine translation. Each time you use a language conversion tool, the techniques used to accurately convert text from one language to another fall under the umbrella of "natural language processing."

1.3 Why is NLP a "hard" problem?

Language is inherently ambiguous. One person's interpretation of a sentence may very well differ from another person's interpretation. Because language cannot be made consistently unambiguous, it's hard to build an NLP technique that works perfectly.

1.4 Glossary

Here is some common terminology that we'll encounter throughout the workshop:

Corpus: (Plural: Corpora) a collection of written texts that serve as our datasets.

nltk: (Natural Language Toolkit) the Python module we'll be using repeatedly; it has a lot of useful built-in NLP techniques.

Token: a string of contiguous characters between two spaces, or between a space and punctuation marks. A token can also be an integer, real, or a number with a colon.
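To make the token definition concrete, here is a minimal sketch of a regex-based tokenizer. The helper name simple_tokenize and the sample sentence are made up for illustration; this is not how nltk tokenizes, only one simple way to split text along the lines described above.

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: split text into numbers, words, and punctuation marks."""
    # \d+(?:[.:]\d+)*  keeps numbers such as 3.14 or 10:30 together
    # \w+              matches a run of letters, digits, or underscores
    # [^\w\s]          matches a single punctuation mark
    return re.findall(r"\d+(?:[.:]\d+)*|\w+|[^\w\s]", text)

tokens = simple_tokenize("The meeting starts at 10:30, not 9.")
print(tokens)
# ['The', 'meeting', 'starts', 'at', '10:30', ',', 'not', '9', '.']
```

Note that the number with a colon stays together as one token, while the comma and the final period become tokens of their own.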

2.0 Regular Expressions

A regular expression is a sequence of characters that defines a search pattern, that is, the set of strings it matches.

2.1 Simplest Form

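The simplest form of a regular expression is a plain sequence of characters, which matches exactly that sequence. Some notations write a pattern between two forward slashes (for example /python/), but Python's re module takes the pattern as an ordinary string with no delimiters. For example, python would be

python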
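In Python, a literal pattern like this is passed to the re module as a plain string. The sketch below uses made-up sample sentences to show the difference between re.search, which scans the whole string, and re.match, which only checks the beginning.

```python
import re

pattern = re.compile("python")

# search() looks for the pattern anywhere in the string
found = pattern.search("I like python a lot")
print(bool(found))  # True

# match() only succeeds if the pattern occurs at the very start
at_start = pattern.match("I like python a lot")
print(bool(at_start))  # False

# No occurrence at all
missing = pattern.search("I like Java a lot")
print(bool(missing))  # False
```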

2.2 Case Sensitivity

Regular expressions are case sensitive, which means

p and P

are distinguishable from each other. This means python and Python would have to be represented differently, as follows:

python and Python

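We can check these are different by running:

import re
# Compile the lowercase pattern, then try it against the capitalized word
re1 = re.compile('python')
print(bool(re1.match('Python')))  # prints False: 'p' does not match 'P'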
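When case should not matter, Python's re module accepts the re.IGNORECASE flag. The following is a short sketch contrasting the default behaviour with the flag; the variable names are only illustrative.

```python
import re

strict = re.compile("python")                  # case sensitive (default)
relaxed = re.compile("python", re.IGNORECASE)  # ignores letter case

strict_result = bool(strict.match("Python"))
relaxed_result = bool(relaxed.match("Python"))

print(strict_result)   # False
print(relaxed_result)  # True
```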

2.3 Disjunctions

If you want a regular expression to represent both python and Python, however, you can use brackets or the pipe symbol as the disjunction of the two forms. For example,

[Pp]ython or Python|python

could represent either python or Python. Likewise,

[0123456789]

would represent a single digit. The pipe symbol is typically used for interchangeable strings, such as in the following example:

dog|cat
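The disjunctions above can be tried out with a short sketch like the following. The word list and sample sentences are made up for illustration.

```python
import re

bracket_form = re.compile(r"[Pp]ython")   # one character class, then "ython"
pipe_form = re.compile(r"Python|python")  # two whole alternatives

words = ["python", "Python", "PYTHON"]
bracket_hits = [w for w in words if bracket_form.fullmatch(w)]
pipe_hits = [w for w in words if pipe_form.fullmatch(w)]
print(bracket_hits)  # ['python', 'Python']
print(pipe_hits)     # ['python', 'Python']

# A bracketed list of digits matches one digit at a time
digits = re.findall(r"[0123456789]", "room 42")
print(digits)  # ['4', '2']

# The pipe picks out either whole word
pets = re.findall(r"dog|cat", "my dog chased the cat past another dog")
print(pets)  # ['dog', 'cat', 'dog']
```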

2.4 Ranges

If we want a regular expression to express the disjunction of a range of characters, we can use a dash. For example, instead of the previous example, we can write

[0-9]

Similarly, we can represent any single lowercase letter of the alphabet with

[a-z]
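As a quick check of how ranges behave, here is a small sketch using re.findall on two made-up strings. Each range matches one character at a time, and [a-z] skips uppercase letters.

```python
import re

# [0-9] matches each digit individually
digits = re.findall(r"[0-9]", "Python 3.6 was released in 2016")
print(digits)  # ['3', '6', '2', '0', '1', '6']

# [a-z] matches lowercase letters only, so N, L, P (twice) are skipped
lowercase = re.findall(r"[a-z]", "NLP with Python")
print(lowercase)  # ['w', 'i', 't', 'h', 'y', 't', 'h', 'o', 'n']
```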

2.5 Exclusions

Brackets can also be used to represent what an expression cannot be if you combine it with the caret sign. For example, the expression

[^p]

represents any single character other than p, special characters included.
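A minimal sketch of the exclusion in action, using a made-up input string: every character that is not a lowercase p is returned, including the digit and the exclamation mark.

```python
import re

# [^p] matches any single character that is not a lowercase p
not_p = re.findall(r"[^p]", "pip3!")
print(not_p)  # ['i', '3', '!']
```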

2.6 Question Marks

A question mark represents zero or one instances of the previous character. For example,

colou?r

represents either color or colour. Question marks are often used for plurals. For example,

computers?

can be either computers or computer. If you want to extend this to more than one character, you can put the sequence within parentheses, like this:

Feb(ruary)?

This would match either February or Feb.
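The three optional patterns can be checked with re.fullmatch, which requires the whole string to fit the pattern. This is only a sketch; the candidate word lists are made up for illustration.

```python
import re

colour = re.compile(r"colou?r")      # the u is optional
plural = re.compile(r"computers?")   # the trailing s is optional
month = re.compile(r"Feb(ruary)?")   # the whole group "ruary" is optional

colour_hits = [w for w in ["color", "colour", "colouur"] if colour.fullmatch(w)]
plural_hits = [w for w in ["computer", "computers", "computerss"] if plural.fullmatch(w)]
month_hits = [w for w in ["Feb", "February", "Febru"] if month.fullmatch(w)]

print(colour_hits)  # ['color', 'colour']
print(plural_hits)  # ['computer', 'computers']
print(month_hits)   # ['Feb', 'February']
```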

2.7 Kleene Star

To represent zero or more instances of the previous character, we use an asterisk, known as the Kleene star. To represent the set of strings a, ab, abb, abbb, ..., the following regular expression would be used:

ab*
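A short sketch of the Kleene star, filtering a made-up list of candidate strings with re.fullmatch so that only strings consisting of an a followed by zero or more b characters remain.

```python
import re

star = re.compile(r"ab*")  # an a, then zero or more b characters

candidates = ["a", "ab", "abbb", "b", "ba"]
star_hits = [w for w in candidates if star.fullmatch(w)]
print(star_hits)  # ['a', 'ab', 'abbb']
```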

2.8 Wildcards

A wildcard represents any single character (by default, anything except a newline) and is written as a period. For example,

beg.n

This regular expression matches the strings begun, begin, began, etc.
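The wildcard can be sketched as follows with a made-up list of candidates. Note that the period stands for exactly one character, so begn does not match, and by default it does not match a newline.

```python
import re

wildcard = re.compile(r"beg.n")  # the period stands for exactly one character

candidates = ["begin", "began", "begun", "beg!n", "begn", "beg\nn"]
wildcard_hits = [w for w in candidates if wildcard.fullmatch(w)]
print(wildcard_hits)  # ['begin', 'began', 'begun', 'beg!n']
```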

2.9 Kleene+

To represent one or more instances of the previous character, we use a plus sign. To represent the set of strings ab, abb, abbb, ..., the following regular expression would be used:

ab+
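The only difference from the Kleene star is that the plus sign requires at least one occurrence. A minimal sketch, again on a made-up list of candidates, shows that a on its own is accepted by the star but rejected by the plus.

```python
import re

star = re.compile(r"ab*")  # zero or more b characters
plus = re.compile(r"ab+")  # one or more b characters

candidates = ["a", "ab", "abbb"]
star_matches = [w for w in candidates if star.fullmatch(w)]
plus_matches = [w for w in candidates if plus.fullmatch(w)]

print(star_matches)  # ['a', 'ab', 'abbb']
print(plus_matches)  # ['ab', 'abbb']
```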

Resources In The Meantime

Natural Language Processing With Python

Regular Expressions Cookbook

NLP from Scratch

GloVe