Introduction
Continuing this tutorial series on natural language processing, I'll introduce WordNet with the Python module nltk. For reference, you can check out previous posts here and here.
0.0 Setup
This guide was written in Python 3.6.
0.1 Python & Pip
If you haven't already, please download Python and Pip.
0.2 Libraries
We'll be working with the re library for regular expressions and nltk for natural language processing techniques. The re library ships with Python, so only nltk needs to be installed. To install it, enter the following command into your terminal:
pip3 install nltk==3.2.4
0.3 Other
Sentence boundary detection requires the dependency parse, which in turn requires spaCy's model data to be installed. Assuming spaCy is already installed, enter the following command in your terminal:
python3 -m spacy.en.download all
Cool, now we're ready to start!
1.0 Background
1.1 Polarity Flippers
Polarity flippers are words that change positive expressions into negative ones or vice versa.
1.1.1 Negation
Negations directly change an expression's sentiment by preceding the word they modify. An example would be
The cat is not nice.
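To make the idea concrete, here is a minimal sketch of negation handling. The lexicon, the negation list, and the function name are all made up for illustration, and the flipping rule (reverse the sign of the next sentiment-bearing word) is a simplifying assumption, not a complete treatment of negation.

```python
# Hypothetical toy lexicon: word -> polarity score (illustrative values only).
LEXICON = {"nice": 1.0, "good": 1.0, "evil": -1.0, "bad": -1.0}
# Hypothetical list of negation tokens.
NEGATIONS = {"not", "no", "never"}

def score_with_negation(sentence):
    """Sum word polarities, flipping the next sentiment word after a negation."""
    tokens = sentence.lower().replace(".", "").split()
    total = 0.0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True  # remember that a negation is pending
            continue
        polarity = LEXICON.get(token, 0.0)
        if polarity != 0.0:
            total += -polarity if negate else polarity
            negate = False  # the pending negation has been used up
    return total

print(score_with_negation("The cat is nice."))      # 1.0
print(score_with_negation("The cat is not nice."))  # -1.0
```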
1.1.2 Contrastive Discourse Connectives
Contrastive discourse connectives are words, such as "but", that indirectly change an expression's meaning. An example would be
I usually like cats, but this cat is evil.
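One common heuristic, sketched below, is to give the clause after "but" more weight than the clause before it. The lexicon, the function names, and the weight of 2.0 are assumptions chosen for illustration only.

```python
# Hypothetical toy lexicon: word -> polarity score (illustrative values only).
LEXICON = {"like": 1.0, "love": 1.0, "evil": -1.0, "hate": -1.0}

def clause_score(clause):
    """Sum the polarity of every known word in a clause."""
    words = clause.lower().replace(",", " ").replace(".", " ").split()
    return sum(LEXICON.get(word, 0.0) for word in words)

def score_with_but(sentence, but_weight=2.0):
    """Weight the clause after the first ' but ' more heavily than the one before."""
    before, connective, after = sentence.partition(" but ")
    if not connective:  # no "but" present: score the sentence as a whole
        return clause_score(sentence)
    return clause_score(before) + but_weight * clause_score(after)

print(score_with_but("I usually like cats, but this cat is evil."))  # -1.0
```

Without the extra weight, the positive "like" and the negative "evil" would cancel out to zero, which hides the fact that the speaker's overall opinion of this cat is negative.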
1.2 Multiword Expressions
Multiword expressions are important because, depending on the context, they can be considered positive or negative. For example,
This song is shit.
is definitely considered negative. Whereas
This song is the shit.
is actually considered positive, simply because of the addition of 'the' before the word 'shit'.
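A simple way to account for this, sketched here under the assumption of a small hand-built phrase lexicon, is to look for the longest matching phrase before falling back to single words. The lexicon entries and the function name are hypothetical.

```python
# Hypothetical phrase lexicon: tuple of tokens -> polarity (illustrative values only).
PHRASE_LEXICON = {
    ("the", "shit"): 1.0,  # multiword expression, positive
    ("shit",): -1.0,       # single word, negative
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASE_LEXICON)

def score_phrases(sentence):
    """Score a sentence, trying the longest phrase first at each position."""
    tokens = sentence.lower().replace(".", "").split()
    total, i = 0.0, 0
    while i < len(tokens):
        for length in range(MAX_PHRASE_LEN, 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in PHRASE_LEXICON:
                total += PHRASE_LEXICON[candidate]
                i += len(candidate)  # skip the tokens the phrase consumed
                break
        else:
            i += 1  # no phrase starts here; move to the next token
    return total

print(score_phrases("This song is shit."))      # -1.0
print(score_phrases("This song is the shit."))  # 1.0
```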
1.3 WordNet
WordNet is an English lexical database with an emphasis on synonymy, sort of like a thesaurus. Specifically, nouns, verbs, adjectives, and adverbs are grouped into synonym sets.
1.3.1 Synsets
nltk has a built-in WordNet that we can use to find synonyms. We import it as such:
from nltk.corpus import wordnet as wn
If we feed a word to the synsets() method, the return value will be a list of the synsets to which the word belongs. For example, if we call the method on motorcar,
print(wn.synsets('motorcar'))
we get:
[Synset('car.n.01')]
Awesome stuff! But if we want to take it a step further, we can. We've previously learned what lemmas are - if you want to obtain the lemmas for a given synonym set, you can use the following method:
print(wn.synset('car.n.01').lemma_names())
This will get you:
['car', 'auto', 'automobile', 'machine', 'motorcar']
Even more, you can do things like get the definition of a word:
print(wn.synset('car.n.01').definition())
This will get you:
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
Again, pretty neat stuff.
1.3.2 Negation
With WordNet, we can easily detect negations. This is great because it's not only fast, but it also requires no training data and has fairly good predictive accuracy. On the other hand, it's not able to handle context well or work with multiword phrases.
1.4 SentiWordNet
Based on WordNet synsets, SentiWordNet is a lexical resource for opinion mining, where each synset is assigned three sentiment scores: positivity, negativity, and objectivity.
from nltk.corpus import sentiwordnet as swn
cat = swn.senti_synset('cat.n.03')
print(cat.pos_score())  # positivity score
print(cat.neg_score())  # negativity score
print(cat.obj_score())  # objectivity score
1.5 Stop Words
Stop words are extremely common words that would be of little value in our analysis and are often excluded from the vocabulary entirely. Some common examples are determiners like the, a, an, and another, but your list of stop words (or stop list) depends on the context of the problem you're working on.
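As a rough sketch, filtering stop words can be as simple as a set lookup. The stop list below is a small hand-picked assumption for illustration; real stop lists are much longer.

```python
# A small hand-picked stop list (an assumption for illustration only).
STOP_WORDS = {"the", "a", "an", "another", "is", "this"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased tokens of `text` that are not in the stop list."""
    tokens = text.lower().replace(".", "").split()
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words("The cat is not nice."))  # ['cat', 'not', 'nice']
```

Note that "not" is deliberately left out of this stop list. For sentiment work, removing a polarity flipper would change the meaning of the sentence, which is one reason the stop list depends on the problem.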
2.0 Information Extraction
Information Extraction is the process of acquiring meaning from text in a computational manner.
2.1 Data Forms
2.1.1 Structured Data
Structured data has a regular and predictable organization of entities and relationships.
2.1.2 Unstructured Data
Unstructured data, as the name suggests, assumes no organization. This is the case with most written textual data.
2.2 What is Information Extraction?
With that said, information extraction is the means by which you acquire structured data from a given unstructured dataset. There are a number of ways in which this can be done, but generally, information extraction consists of searching for specific types of entities and relationships between those entities.
For example, suppose we are given the following text:
Martin received a 98% on his math exam, whereas Jacob received an 84%. Eli, who also took the same test, received an 89%. Lastly, Ojas received a 72%.
This is clearly unstructured. It requires reading for any logical relationships to be extracted. Through the use of information extraction techniques, however, we could output structured data such as the following:
Name    Grade
Martin  98
Jacob   84
Eli     89
Ojas    72
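For this particular text, a regular expression is enough to pull out the name and grade pairs. The sketch below uses the re library; the pattern and the function name are assumptions tailored to this one example, and would not generalize to arbitrary text.

```python
import re

TEXT = (
    "Martin received a 98% on his math exam, whereas Jacob received an 84%. "
    "Eli, who also took the same test, received an 89%. "
    "Lastly, Ojas received a 72%."
)

# A capitalized name, then any text without a period, percent sign, or another
# capital letter, then "received a/an <number>%". Tailored to this example only.
GRADE_PATTERN = re.compile(r"\b([A-Z][a-z]+)\b[^.%A-Z]*?\breceived an? (\d+)%")

def extract_grades(text):
    """Return a list of (name, grade) pairs found in `text`."""
    return [(name, int(grade)) for name, grade in GRADE_PATTERN.findall(text)]

for name, grade in extract_grades(TEXT):
    print(f"{name:<8}{grade}")
```

Excluding capital letters between the name and "received" keeps sentence-initial words like "Lastly" from being mistaken for a name.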
Final Words
In the next tutorial, we'll go deeper into information extraction, named entity extraction, and relationship extraction. Stay tuned for more!