Software Requirements:
PyCharm (or any preferred code editor)
Repository: Python Open Source Repository
What you will learn
In this tutorial you will learn how to make your own word counter in Python, backed by a dictionary, using:
-Modules
-Functions
-Arrays (Python lists)
Difficulty: Intermediate
Tutorial
In this tutorial you will learn how to make a word counter backed by a dictionary. You can point it at a website of your choice to check how many times each word appears; the count is printed beside the word. The code as written reads HTML pages, so a PDF document would first need its text extracted by other means. This is handy in research work.
The modules we will use are requests, BeautifulSoup (from the bs4 package), and operator, which are imported below:
import requests
from bs4 import BeautifulSoup
import operator
The requests module is a third-party package (it is not built in and is installed with pip install requests) used to make HTTP calls; here it downloads the page we want to count.
The BeautifulSoup class, from the third-party bs4 package, parses HTML content into a structure that is easy to search, read, and edit.
The operator module is part of the standard library. It provides function versions of Python's operators, such as addition and subtraction, along with helpers such as itemgetter, which we use to sort the word counts.
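As a rough illustration of the part of operator this tutorial relies on, here is a minimal sketch of sorting (word, count) pairs with operator.itemgetter. The counts dictionary is made-up sample data, not output from a real page.

```python
import operator

# Hypothetical word counts, made up for illustration.
counts = {"python": 3, "code": 1, "word": 2}

# itemgetter(1) builds a callable that returns element 1 of each
# (word, count) pair, so sorted() orders the pairs by count.
by_count = sorted(counts.items(), key=operator.itemgetter(1))

for word, count in by_count:
    print(word, count)
```

The pairs come out from least to most frequent; passing reverse=True to sorted() would flip the order.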
Necessary Functions To Be Used And How To Make Them
Next we write the functions, the most vital part of our code.
First is def search(url):
In this function we make an array (a Python list) to store the words we get from the website; in this code it is list = []. (NOTE: You can use any array name of your choice. Because list is also the name of a Python built-in, a different name avoids hiding it.)
Then we request the text of the page and assign it to a variable (data = requests.get(url).text), and pass it to BeautifulSoup (soup = BeautifulSoup(data, 'lxml')) to make it easier to search and use. The url argument specifies the site that the code reads.
Then we specify the tag and the class that our code searches for (for postedtext in soup.findAll('a', {'class': 'news-info'}):).
Then we take the text of each match as a string, which leaves out the HTML markup (plaintext = postedtext.string). I also recommend lowercasing the text and splitting it into words before adding it to the array already created. This is optional but recommended (words = plaintext.lower().split()). Finally we add each word to the array we first created in our function using append (list.append(eachtext)).
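To try these steps without touching the network, the sketch below runs the same kind of lookup on a small hand-written HTML string. The sample_html content and the extract_words helper are hypothetical names used only for illustration, and the parser here is "html.parser" (bundled with Python) rather than the lxml parser the tutorial uses.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (requests.get(url).text).
sample_html = """
<div>
  <a class="news-info" href="#">Python Tops The Charts</a>
  <a class="other" href="#">Ignored link</a>
  <a class="news-info" href="#">Charts Are Fun</a>
</div>
"""


def extract_words(html):
    """Collect lowercased words from every <a class="news-info"> element."""
    words = []
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", {"class": "news-info"}):
        text = link.string
        # .string can be None, e.g. when the tag holds several child nodes.
        if text is None:
            continue
        words.extend(text.lower().split())
    return words


print(extract_words(sample_html))
```

Only the two links with the news-info class contribute words; the link with the other class is skipped.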
We then create another function, def clean_up_words(list):, which as the name implies cleans the text we already have by removing symbols that aren't needed, and then adds each cleaned word to another array (cleaned_up_list.append(word)).
In this function each symbol is replaced by an empty string (word = word.replace(symbols[w], "")). So if a word consists only of symbols, it ends up as an empty string. To avoid storing such words, we use
if len(word) > 0:
    cleaned_up_list.append(word)
so a word that is empty after cleaning is not added to the array.
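The clean-up step can be tried on its own with a few sample words. This is a minimal sketch: strip_symbols, SYMBOLS, and raw_words are hypothetical names and data chosen for illustration, and the symbol set is a shortened version of the one in the tutorial.

```python
# Characters to strip; a shortened, assumed version of the tutorial's symbol set.
SYMBOLS = "!@#$^&*()_+=][';/.,><?"


def strip_symbols(word):
    """Remove every character listed in SYMBOLS from the word."""
    for symbol in SYMBOLS:
        word = word.replace(symbol, "")
    return word


raw_words = ["hello,", "world!", "???", "it's"]

cleaned = []
for raw in raw_words:
    word = strip_symbols(raw)
    if len(word) > 0:  # "???" becomes an empty string and is dropped
        cleaned.append(word)

print(cleaned)
```

Note that the apostrophe is in the symbol set, so "it's" is stored as "its".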
The last function we create is
def dictionary(cleaned_up_list):
which holds a dictionary (rather than an array) carrying the final words and their counts, printed when the function is called. The dictionary used in this code is
words_list = {}
(NOTE: You can use any name of your choice.) We also create a "for" loop to count the number of times each word appears in the text. This is shown below:
for word in cleaned_up_list:
    if word in words_list:
        words_list[word] += 1
    else:
        words_list[word] = 1
What this loop does is:
If the word is already in the dictionary (words_list), it increases the value of the word by 1. If the word isn't in the dictionary yet, the loop adds the word with a value of 1. Lastly, each word is printed with its value (in this case the number of times the word appears), sorted from the least to the most frequent:
print(key, value)
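For comparison, the same tally can be sketched with collections.Counter from the standard library. This is an alternative to the manual loop, not what the tutorial's code uses, and the sample words are made up for illustration.

```python
from collections import Counter

# Made-up, already cleaned words for illustration.
cleaned_up_list = ["python", "code", "python", "word", "code", "python"]

# The manual tally described in the tutorial.
words_list = {}
for word in cleaned_up_list:
    if word in words_list:
        words_list[word] += 1
    else:
        words_list[word] = 1

# Counter builds the same word-to-count mapping in one call.
counter = Counter(cleaned_up_list)

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = counter.most_common(2)
print(top_two)
```

The manual loop makes the counting logic explicit, which suits a tutorial; Counter is shorter once that logic is understood.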
Just as an example, I passed a sample URL to the search function.
So we are done with our word counter.
Below is the whole code all together:
import requests
from bs4 import BeautifulSoup
import operator


def search(url):
    list = []  # NOTE: this name hides Python's built-in list; any other name works
    data = requests.get(url).text
    # We specify lxml as the parser; it needs the lxml package installed
    # and is generally faster than the built-in html.parser
    soup = BeautifulSoup(data, 'lxml')
    for postedtext in soup.findAll('a', {'class': 'news-info'}):
        plaintext = postedtext.string
        if plaintext is None:  # skip links that do not hold a single text string
            continue
        words = plaintext.lower().split()
        for eachtext in words:
            list.append(eachtext)
    clean_up_words(list)


def clean_up_words(list):
    cleaned_up_list = []
    for word in list:
        symbols = "!@#$^&*())_+=][\';/.,><?'"
        for w in range(0, len(symbols)):
            word = word.replace(symbols[w], "")
        if len(word) > 0:  # drop words that were made up only of symbols
            cleaned_up_list.append(word)
    dictionary(cleaned_up_list)


def dictionary(cleaned_up_list):
    words_list = {}
    for word in cleaned_up_list:
        if word in words_list:
            words_list[word] += 1
        else:
            words_list[word] = 1
    # Print each word with its count, from least to most frequent
    for key, value in sorted(words_list.items(), key=operator.itemgetter(1)):
        print(key, value)


search('https://www.thenetnaija.com/news')