Intro To Pandas Library - Python

in StemSocial, 5 months ago


Pandas is an open-source Python library for data manipulation. Wes McKinney started the package in 2008 to support data analysis in Python. A working knowledge of pandas is essential if you want to build a career as a data scientist, data analyst, or machine learning engineer. Other popular packages such as NumPy can also handle data manipulation, but pandas is usually preferred because it serves as a single hub for the whole data analysis workflow: loading data, cleaning it, modelling it, analyzing it, and manipulating it.

Pandas is built on top of the NumPy library, so it works well with other third-party libraries such as SciPy and matplotlib. On top of that, pandas makes most data processing tasks fast, easy, and efficient. With that short introduction out of the way, we will now focus on the core components of pandas.
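Before moving on, here is a small sketch of what "built on top of NumPy" looks like in practice. The variable names are made up for illustration. It shows that a Series can hand back its values as a NumPy array and that NumPy functions can be applied to a Series directly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for this illustration
temps_c = pd.Series([5, 7, 10])

# The underlying values can be taken out as a plain NumPy array
values = temps_c.to_numpy()
print(type(values))

# NumPy functions accept a Series and return a Series
doubled = np.multiply(temps_c, 2)
print(doubled.tolist())  # [10, 14, 20]
```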

Pandas has two main data structures. The first is the Series, which works like a one-dimensional array. The difference is that every value in a Series is labelled with an index. By default the labels are the integers 0, 1, 2 and so on, but they can also be other things such as strings.
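To make the difference from a plain array concrete, here is a minimal sketch. The names and scores are invented for the example. The NumPy array is accessed by position, while the Series is accessed by its label.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three test scores
scores_array = np.array([82, 91, 77])
scores_series = pd.Series([82, 91, 77], index=["alice", "bob", "carol"])

print(scores_array[1])       # 91, looked up by position
print(scores_series["bob"])  # 91, looked up by label
```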

The second data structure, and the one used most often, is the DataFrame. It is a two-dimensional structure that organizes data into rows and columns, like a table. In other words, it resembles a table in a relational (SQL) database. A DataFrame can also be seen as a collection of Series, one per column, that share the same index.
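Here is a short sketch of the "collection of Series" idea, assuming two small Series made up for this example. Each Series becomes one column, and pulling a column back out gives a Series again.

```python
import pandas as pd

# Hypothetical columns, used only for illustration
countries = pd.Series(["US", "Egypt", "Japan"])
capitals = pd.Series(["Washington, D.C.", "Cairo", "Tokyo"])

# Each Series becomes one column of the DataFrame
df = pd.DataFrame({"country": countries, "capital": capitals})
print(df)

# Selecting a single column returns a Series
capital_column = df["capital"]
print(type(capital_column))
```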

For this post, we will focus only on Series. First, you need both the NumPy and pandas libraries installed. Open a terminal or command prompt and run pip install pandas, or pip3 install pandas, depending on how Python is set up on your system. The same goes for NumPy: run pip install numpy or pip3 install numpy. (pip normally installs NumPy automatically as a dependency of pandas.) If you installed Anaconda, both packages come pre-installed and there is nothing more to do; you can simply import the library. Here is the code to create a simple Series object:

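import pandas as pd

# Index labels and the values they point to
index_list = [0, 1, 2, 3, 4, 5]
value_list = ["US", "Egypt", "Turkey", "Japan", "India", "France"]

series_1 = pd.Series(index=index_list, data=value_list)

# In a notebook, a bare variable name displays the Series
series_1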

The output lists each index label next to its value, followed by the data type of the Series.

To create a Series, you call the Series constructor, which takes several arguments. The two most important are index, which specifies the labels, and data, which specifies the value for each label. Note that names in Python are case-sensitive: if you write pd.series in lowercase, you get an AttributeError saying that the module 'pandas' has no attribute 'series'. Accessing a value in a Series works much like it does for a list or an array. To get the value "Japan", write series_1[3].
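The points above can be sketched as follows. This rebuilds series_1 so the block runs on its own. The .loc and .iloc accessors are not used in the original post; they are shown here only as the explicit ways to say "by label" and "by position". The lowercase_failed flag is a hypothetical helper for the demonstration.

```python
import pandas as pd

series_1 = pd.Series(
    data=["US", "Egypt", "Turkey", "Japan", "India", "France"],
    index=[0, 1, 2, 3, 4, 5],
)

print(series_1[3])       # Japan (label 3)
print(series_1.loc[3])   # Japan (explicitly by label)
print(series_1.iloc[3])  # Japan (explicitly by position)

# The lowercase name does not exist, so Python raises AttributeError
lowercase_failed = False
try:
    pd.series(["US"])
except AttributeError as err:
    lowercase_failed = True
    print(err)
```

In this Series the labels and positions happen to be the same numbers, which is why all three lookups agree.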

One more thing: if you do not specify the index, pandas assigns a default one that starts at 0 and increases by 1 for each value in the list. For example, pd.Series(data=value_list) produces the same Series as the code above, just without the explicit index argument.
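A minimal sketch of the default index, reusing the same list of countries. The name series_2 is made up for this example.

```python
import pandas as pd

value_list = ["US", "Egypt", "Turkey", "Japan", "India", "France"]

# No index argument: pandas numbers the values starting from 0
series_2 = pd.Series(data=value_list)

print(list(series_2.index))  # [0, 1, 2, 3, 4, 5]
print(series_2[3])           # Japan
```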

You can also convert a Python dictionary into a Series very easily. Let's say we have the following dictionary of average monthly temperatures for a particular year:

import pandas as pd

temperature_dict = {
    'January': 5,
    'February': 7,
    'March': 10,
    'April': 15,
    'May': 20,
    'June': 25,
    'July': 30,
    'August': 29,
    'September': 24,
    'October': 18,
    'November': 12,
    'December': 7
}

print(temperature_dict)

temperature_series = pd.Series(data=temperature_dict)

temperature_series

When you convert a dictionary to a Series, you only need to pass the dictionary as the data argument. Pandas uses the dictionary keys as the index and the dictionary values as the data, so passing the dictionary again as index is redundant. The output shows each month next to its temperature, followed by the data type.
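Here is a short sketch of how the dictionary keys end up as the index, using a shortened three-month dictionary so the block stays small. It also shows one use of the index argument that goes beyond the original example: passing a list of labels selects and orders the entries, and a label that is not in the dictionary gets a missing value (NaN). The name subset is made up for this illustration.

```python
import pandas as pd

temperature_dict = {"January": 5, "February": 7, "March": 10}

# Keys become the index, values become the data
temperature_series = pd.Series(temperature_dict)
print(list(temperature_series.index))  # ['January', 'February', 'March']

# An explicit list of labels picks and orders entries;
# "April" is not in the dictionary, so its value is NaN
subset = pd.Series(temperature_dict, index=["March", "January", "April"])
print(subset)
```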

The data type of this Series is int64, a 64-bit integer. You can change the data type by passing the additional dtype parameter to the Series constructor:

temperature_series = pd.Series(data=temperature_dict, dtype=float)

The values are now stored as floating-point numbers (5.0, 7.0 and so on), and the data type is reported as float64.
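As a small sketch of working with the data type, the block below sets it at creation time and then converts an existing Series. The astype method is not mentioned in the original post; it is shown here as one common way to convert after the fact. The names temps and temps_int are made up.

```python
import pandas as pd

# Set the data type when the Series is created
temps = pd.Series({"January": 5, "February": 7}, dtype=float)
print(temps.dtype)  # float64

# Convert an existing Series to another data type
temps_int = temps.astype("int64")
print(temps_int.dtype)  # int64
```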

There is another way to index values that was not covered above. You can access several values at once by passing a list of index labels. For example, say we want the temperatures for January, March and August from the temperature Series above. We can get them with:

temperature_series[["January", "March", "August"]]

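Make sure you use double brackets when selecting multiple values from a Series: the outer pair is the indexing operator and the inner pair is the list of labels. The result is a new Series containing only the entries for January, March and August (5, 10 and 29).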
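A self-contained sketch of selecting several labels at once, using a shortened temperature Series. The .loc form is not in the original post; it is included as the explicit label-based equivalent.

```python
import pandas as pd

temperature_series = pd.Series(
    {"January": 5, "February": 7, "March": 10, "August": 29}
)

# Inner brackets: a list of labels. Outer brackets: the indexing operator.
labels = ["January", "March", "August"]
selected = temperature_series[labels]
print(selected.tolist())  # [5, 10, 29]

# The same selection written with .loc
same = temperature_series.loc[labels]
```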

Now let's say we want the temperatures from May through December. We can write temperature_series["May":], just as we would slice a list or another sequence. The result is a Series with the eight entries from May to December. One difference from list slicing is that a slice by label includes its end label, so temperature_series["May":"July"] returns May, June and July.
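Here is a minimal sketch of slicing, using a four-month Series so the results are easy to check. It compares slicing by label, which includes the end label, with slicing by position through .iloc, which excludes the end position like a list does. The variable names are made up for the example.

```python
import pandas as pd

temperature_series = pd.Series(
    {"April": 15, "May": 20, "June": 25, "July": 30}
)

# From "May" to the end
from_may = temperature_series["May":]
print(from_may.tolist())  # [20, 25, 30]

# Label slice: the end label "June" is included
may_to_june = temperature_series["May":"June"]
print(may_to_june.tolist())  # [20, 25]

# Position slice: position 2 is excluded, as with a list
first_two = temperature_series.iloc[0:2]
print(list(first_two.index))  # ['April', 'May']
```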
