Pandas Interview Questions and Answers

Pandas is a Python library for data manipulation and analysis. Companies of every size, including small businesses, now collect large amounts of data every day, largely because storing data has become cheaper, particularly on cloud platforms such as AWS and Microsoft Azure.

The next question for a business is how to use that data for decision making. Otherwise, there is little point in paying cloud providers to store it.

Collected data is only valuable when it leads to decisions that improve either how the business is run or the products it offers.

To get there, the collected data needs to be analysed well, so businesses are continuously hiring people who can analyse data.

Many tools are available for this kind of analysis, and the Pandas Python library is one of them.

In this article, I’ve collected some Pandas questions that are commonly asked in job interviews.

To give some context on why learning Pandas is important, I searched LinkedIn jobs for the keyword “Pandas” and filtered the results to “Past Month” and “United States”. The search returned “2701 results”, meaning at least 2701 jobs mentioning Pandas were posted on LinkedIn in the USA in the last month. This gives a rough idea of how many businesses are looking for people who know the Pandas Python library.

LinkedIn Job Search dashboard showing list of jobs for keyword "Pandas" for USA

The same is true for other countries such as Canada, Australia, and India.

So in this day and age, it’s crucial to learn the Pandas Python library if you want a job as a Data Engineer or Data Scientist.

Q. 1 – What is Pandas Python Library?

Pandas is a Python library that makes it easier to represent data in memory and analyse it. It offers fast, convenient data structures for representing and processing data.

Q. 2 – How does Pandas represent data?

Pandas represents data much like an Excel sheet, as a table made up of rows and columns.

Pandas table showing rows and columns
  • Columns in Pandas are known as Series
  • A collection of Series is called a DataFrame

Q. 3 – How to create Series in Pandas?

Pandas Code for creating Series
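
As a minimal sketch, a Series can be created from a list (with or without explicit labels) or from a dictionary. The variable names and values below are illustrative only.

```python
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2, ...)
marks = pd.Series([72, 85, 91])

# From a list with explicit labels
labelled_marks = pd.Series([72, 85, 91], index=['maths', 'physics', 'chemistry'])

# From a dictionary: the keys become the index
population = pd.Series({'India': 1100000, 'China': 45679000, 'USA': 3400000})

print(labelled_marks)
print(population['China'])
```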

Q. 4 – How to create Data Frame in Pandas?

A DataFrame in Pandas can be created either directly from a dictionary or by combining several Series.

import pandas as pd
country_population = {'India': 1100000, 'China': 45679000, 'USA': 3400000}
population = pd.Series(country_population)
#print(population)

country_land = {'India': '2000 hectares', 'China': '4000 hectares', 'USA': '3000 hectares'}
area = pd.Series(country_land)
#print(area)

df = pd.DataFrame({'Population': population, 'SpaceOccupied': area})
print(df)

Output of Above Code

Pandas DataFrame shown on Mac's Terminal Window

Q. 5 – How are missing values represented in Pandas DataFrame?

In a Pandas DataFrame, missing values are represented as NaN (Not a Number).

import pandas as pd
missing = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
# The rows come from {'a': 1, 'b': 2} and {'b': 3, 'c': 4}
# 'c' is missing from the first dictionary and 'a' from the second, so those cells become NaN
print(missing)

Output of Above Code

Pandas DataFrame in Mac's Terminal showing NaN value at some data positions

Q. 6 – Explain the process of creating indexes in Pandas.

Indexes can be created with the pd.Index constructor. Index objects support set operations such as intersection and union.

import pandas as pd
index_A = pd.Index([1, 3, 5, 7, 9])
index_B = pd.Index([2, 3, 5, 7, 11])
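
Continuing with the same two indexes, the set operations mentioned above can be sketched as follows (the names common and combined are just illustrative):

```python
import pandas as pd

index_A = pd.Index([1, 3, 5, 7, 9])
index_B = pd.Index([2, 3, 5, 7, 11])

common = index_A.intersection(index_B)   # labels present in both indexes
combined = index_A.union(index_B)        # labels present in either index

print(common)
print(combined)
```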

Q. 7 – Explain various attributes associated with Pandas Series.

| Pandas Series attribute or method | Description |
| --- | --- |
| Series.axes | Returns a list containing the row axis labels (the index) |
| Series.dtype | Data type of the values |
| Series.empty | True if the Series has no elements |
| Series.ndim | Number of dimensions, which is always 1 for a Series |
| Series.size | Number of elements |
| Series.values | Values as an array, typically a NumPy ndarray |
| Series.head() | Method that returns the first n rows (5 by default) |
| Series.tail() | Method that returns the last n rows (5 by default) |
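
A short sketch that reads these attributes from a small, made-up Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s.axes)     # list holding the row index
print(s.dtype)    # data type of the values
print(s.empty)    # False, because the Series holds data
print(s.ndim)     # a Series is always one-dimensional
print(s.size)     # number of elements
print(s.values)   # underlying array of values
print(s.head(2))  # first two rows
print(s.tail(2))  # last two rows
```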

Q. 8 – Explain various statistical measures supported by Pandas.

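| Statistical Measure in Pandas | Description |
| --- | --- |
| axes | Returns the row index and the column index (an attribute rather than a statistic) |
| sum() | Sum of the values in each column (Series) |
| mean() | Mean of the values in each column |
| median() | Median of the values in each column |
| std() | Standard deviation of the values in each column |
| count() | Number of non-missing values in each column |
| cumsum() | Cumulative sum down each column |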
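
As a rough illustration, here are the same measures applied to a small DataFrame with made-up values, including one missing value to show how it is treated:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10.0, 20.0, None, 40.0]})

print(df.sum())      # column totals; NaN is skipped by default
print(df.mean())     # column means
print(df.median())   # column medians
print(df.std())      # sample standard deviation per column
print(df.count())    # number of non-missing values per column
print(df.cumsum())   # running total down each column
```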

Q. 9 – Explain reindexing in Pandas.

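Reindexing conforms a DataFrame to a new set of row or column labels. With reindex_like, the index of one DataFrame is changed to match another DataFrame that is used as a reference; labels that did not exist before are filled with NaN.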
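
A minimal sketch of both forms, using two small hypothetical DataFrames (prices and template are illustrative names):

```python
import pandas as pd

prices = pd.DataFrame({'price': [10.0, 12.5]}, index=['apple', 'banana'])
template = pd.DataFrame({'price': [0.0, 0.0, 0.0]}, index=['apple', 'banana', 'cherry'])

# Conform prices to an explicit list of labels
by_labels = prices.reindex(['banana', 'cherry'])

# Conform prices to the index and columns of another DataFrame
like_template = prices.reindex_like(template)

print(by_labels)
print(like_template)
```

In both results the 'cherry' row is NaN, because prices has no value for that label.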

Q. 10 – Explain bfill and ffill.

Reindexing can introduce NaN values. The bfill and ffill methods are used to fill them.

  • bfill – Backward fill: each NaN is replaced with the next valid value that comes after it
  • ffill – Forward fill: each NaN is replaced with the last valid value that comes before it

Reindexing without a fill method (bfill or ffill)

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1))

Output of Above Code

Pandas DataFrame showing in Mac's Terminal

Reindexing with a fill method (bfill or ffill)

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1, method='ffill')) # Or method='bfill'

Output of above code if method=’ffill’

Output of above code if method=’bfill’
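
Because the examples above use random numbers, the effect of each method is easier to see on fixed values. The following sketch uses a small Series with hypothetical data and Series.reindex, which accepts the same method values:

```python
import pandas as pd

s = pd.Series([10, 20], index=[0, 3])

plain = s.reindex(range(5))                      # new labels get NaN
forward = s.reindex(range(5), method='ffill')    # carry the last known value forward
backward = s.reindex(range(5), method='bfill')   # pull the next known value backward

print(plain.tolist())
print(forward.tolist())
print(backward.tolist())
```

With bfill, the last label still ends up as NaN because there is no later value to pull back.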

Q. 11 – Which types of iteration does a Pandas DataFrame provide?

| Iterator for Pandas DataFrame | Description |
| --- | --- |
| items() | Iterate over (column name, Series) pairs; it replaces iteritems(), which was removed in pandas 2.0 |
| iterrows() | Iterate over the rows as (index, Series) pairs |
| itertuples() | Iterate over the rows as namedtuples |
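
The three iterators can be sketched on a small, made-up DataFrame as follows:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'John'], 'age': [24, 34]})

# Column by column: (column name, Series)
for column_name, column in df.items():
    print(column_name, column.tolist())

# Row by row: (index label, Series)
for index, row in df.iterrows():
    print(index, row['name'], row['age'])

# Row by row as namedtuples; the index is available as row.Index
for row in df.itertuples():
    print(row.Index, row.name, row.age)
```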

Q. 12 – How to sort a DataFrame in Pandas?

  • sort_index – Sorts by the row labels (axis=0) or by the column labels (axis=1)
  • sort_values – Sorts by the values in one or more columns

Creating a Sample DataFrame

import pandas as pd
d = {'col1': [10, 93, 16, 23, 81, 283, 10], 'col2': [19, 145, 195, 1952, 785, 543, 83782]}
df = pd.DataFrame(data=d)
print(df)

Output of Above Code

Pandas DataFrame

Using sort_index for Sorting Sample Data Frame

sorted_df = df.sort_index(ascending=False)
print(sorted_df)

Output of Above Code

Pandas DataFrame showing three columns. Rows of DataFrame are arranged in Descending Order by Index Column

Using sort_values for Sorting Sample Data Frame

sorted_df = df.sort_values(by='col1')
print(sorted_df)

Output of Above Code

Pandas DataFrame showing three columns. Rows of DataFrame are arranged in Ascending Order by column named Col1

Q. 13 – How to override default reload option in Pandas?

Pandas Code for overriding default reload option

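Q. 14 – Explain the various DataFrame slicing options available in Pandas.

  • .loc[] – Slicing a DataFrame by label; both ends of a label slice are included
  • .iloc[] – Slicing a DataFrame by integer position; the stop position is excluded
  • .ix[] – Slicing a DataFrame by both label and integer; it was deprecated and then removed in pandas 1.0, so use .loc[] or .iloc[] instead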
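
A brief sketch of label-based and position-based slicing, using an illustrative DataFrame indexed by name:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [24, 34, 18, 26]},
    index=['Bob', 'John', 'Garry', 'Smith'],
)

by_label = df.loc['John':'Garry']      # label slice: both end labels are included
by_position = df.iloc[1:3]             # position slice: the stop position is excluded
single_value = df.loc['Smith', 'Age']  # one cell, selected by row and column label

print(by_label)
print(by_position)
print(single_value)
```

Here both slices select the same two rows (John and Garry).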

Q. 15 – How can we handle NaN values in a Pandas DataFrame?

NaN values in a Pandas DataFrame can be handled in the following three ways:

  • dropna – Removes the rows of the DataFrame that contain NaN values
  • pad – Forward fill (ffill): replaces each NaN with the previous non-NaN value, that is, the value just above it in the same column
  • backfill – Backward fill (bfill): replaces each NaN with the next non-NaN value, that is, the value just below it in the same column
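
The three approaches can be sketched on a small DataFrame with made-up values. This sketch uses the ffill() and bfill() methods for padding and backfilling:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})

dropped = df.dropna()     # keep only the rows without any NaN
padded = df.ffill()       # pad: copy the value above into each NaN
backfilled = df.bfill()   # backfill: copy the value below into each NaN

print(dropped)
print(padded)
print(backfilled)
```

A NaN in the first row cannot be padded and a NaN in the last row cannot be backfilled, because there is no neighbouring value to copy from.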

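Q. 16 – Explain the “group by” function in Pandas.

The groupby method splits the rows of a DataFrame into groups based on the values in one or more columns, so that an aggregation such as sum or mean can be applied to each group.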
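
As a minimal sketch, grouping by a single column and by multiple columns might look like this (the sales data is hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'product': ['A', 'B', 'A', 'A'],
    'units': [10, 5, 7, 3],
})

# Total units per region
by_region = sales.groupby('region')['units'].sum()

# Total units per (region, product) combination
by_region_product = sales.groupby(['region', 'product'])['units'].sum()

print(by_region)
print(by_region_product)
```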

Q. 17 – Explain the “merge function” in Pandas.

DataFrames in Pandas support merge operations, in which related data from different DataFrames is brought together in a single view.
There are several ways in which DataFrames can be merged. Some of them are shown below:

Merging using a column as id

Suppose we have two DataFrames, df1 and df2, holding the data in the following tables. Merging on ‘Name’ creates a new DataFrame containing only the rows whose Name appears in both df1 and df2, because merge performs an inner join by default.
The tables below should make the scenario a little clearer.

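df1

| Index | Name | Age |
| --- | --- | --- |
| 0 | Bob | 24 |
| 1 | John | 34 |
| 2 | Garry | 18 |
| 3 | Smith | 26 |

df2

| Index | Name | Birth Place |
| --- | --- | --- |
| 0 | Bob | Austin |
| 1 | John | Miami |
| 2 | Garry | New York |

import pandas as pd
df1 = pd.DataFrame({'Name': ['Bob', 'John', 'Garry', 'Smith'], 'Age': [24, 34, 18, 26]})
df2 = pd.DataFrame({'Name': ['Bob', 'John', 'Garry'], 'Birth Place': ['Austin', 'Miami', 'New York']})
# Inner merge (the default): keep only the names found in both DataFrames
merged_dataframe = pd.merge(df1, df2, on='Name')
print(merged_dataframe)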

Output of Above Code

| Index | Name | Age | Birth Place |
| --- | --- | --- | --- |
| 0 | Bob | 24 | Austin |
| 1 | John | 34 | Miami |
| 2 | Garry | 18 | New York |

Doing a Left Merge of DataFrames

In a left merge, all rows from the left DataFrame are kept and only the matching rows from the right DataFrame are brought in; where there is no match, the right-hand columns are filled with NaN.
Below is the code for a left merge of the two DataFrames df1 and df2. (See the picture just below the code to better understand how a left merge works in Pandas.)

import pandas as pd
merged_dataframe = pd.merge(df1, df2, on='Name', how='left')
# df1 being left side
# df2 being right side 
print(merged_dataframe)

Doing a Right Merge of DataFrames

In a right merge, all rows from the right DataFrame are kept and only the matching rows from the left DataFrame are brought in; where there is no match, the left-hand columns are filled with NaN. Below is the code for a right merge of the two DataFrames df1 and df2. (See the picture just below the code to better understand how a right merge works in Pandas.)

import pandas as pd
merged_dataframe = pd.merge(df1, df2, on='Name', how='right')
# df1 being left side
# df2 being right side 
print(merged_dataframe)

Doing an Outer Merge of DataFrames

In an outer merge, rows from both the left and the right DataFrames are kept, and any values that do not exist on one side are filled with NaN. Below is the code for an outer merge of the two DataFrames df1 and df2. (See the picture just below the code to better understand how an outer merge works in Pandas.)
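
A minimal sketch of the outer merge, following the same pattern as the left and right merges above. The df1 and df2 definitions are rebuilt here from the sample tables so that the snippet runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'John', 'Garry', 'Smith'], 'Age': [24, 34, 18, 26]})
df2 = pd.DataFrame({'Name': ['Bob', 'John', 'Garry'], 'Birth Place': ['Austin', 'Miami', 'New York']})

# df1 being left side
# df2 being right side
merged_dataframe = pd.merge(df1, df2, on='Name', how='outer')
print(merged_dataframe)
```

With this sample data every Name in df2 also appears in df1, so the result contains the same four people as the left merge, with NaN as the Birth Place for Smith. The outer merge would differ from the left merge if df2 contained names that df1 lacks.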

Q. 18 – Explain Pandas’s concat method.

The concat method combines two or more DataFrames either along rows or along columns.

  • To combine rows by stacking one DataFrame on top of another, use pd.concat([Top DataFrame, Bottom DataFrame]).
  • To combine columns by placing one DataFrame to the right of another, use pd.concat([Left DataFrame, Right DataFrame], axis=1).

To better understand the concat method, let’s look at two examples. Suppose we have two DataFrames, df1 and df2, containing the following data.

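df1

| Index | Name | Age |
| --- | --- | --- |
| 0 | Bob | 24 |
| 1 | John | 34 |
| 2 | Garry | 18 |
| 3 | Smith | 26 |

df2

| Index | Name | Birth Place |
| --- | --- | --- |
| 0 | Bob | Austin |
| 1 | John | Miami |
| 2 | Garry | New York |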

Putting one DataFrame on Top of another – pd.concat([Top DataFrame, Bottom DataFrame])

print(pd.concat([df1, df2]))

Output of Above Code

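| Index | Name | Age | Birth Place |
| --- | --- | --- | --- |
| 0 | Bob | 24.0 | NaN |
| 1 | John | 34.0 | NaN |
| 2 | Garry | 18.0 | NaN |
| 3 | Smith | 26.0 | NaN |
| 0 | Bob | NaN | Austin |
| 1 | John | NaN | Miami |
| 2 | Garry | NaN | New York |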

Putting one DataFrame on the Left side of another – pd.concat([Left DataFrame, Right DataFrame], axis=1)

print(pd.concat([df1, df2], axis=1))

Output of Above Code

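| Index | Name | Age | Name | Birth Place |
| --- | --- | --- | --- | --- |
| 0 | Bob | 24 | Bob | Austin |
| 1 | John | 34 | John | Miami |
| 2 | Garry | 18 | Garry | New York |
| 3 | Smith | 26 | NaN | NaN |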

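Q. 19 – Which data file types can Pandas read or write?

| Data File Type | Pandas Function For Reading | Pandas Function For Writing |
| --- | --- | --- |
| CSV | read_csv | to_csv |
| JSON | read_json | to_json |
| HTML | read_html | to_html |
| MS Excel | read_excel | to_excel |
| HDF5 Format | read_hdf | to_hdf |
| Feather Format | read_feather | to_feather |
| Parquet Format | read_parquet | to_parquet |
| Msgpack | read_msgpack | to_msgpack |
| Stata | read_stata | to_stata |
| SAS | read_sas | No function for this |
| Python Pickle Format | read_pickle | to_pickle |
| SQL | read_sql | to_sql |
| Google BigQuery | read_gbq | to_gbq |

Note that read_msgpack and to_msgpack were removed in pandas 1.0, and read_gbq and to_gbq have been deprecated since pandas 2.2 in favour of the separate pandas-gbq package.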
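
As a small sketch of one read/write pair from the table, the CSV functions can round-trip a DataFrame. To keep the example self-contained it writes to a string and reads from an in-memory buffer instead of a file on disk; the data is made up.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'John'], 'Age': [24, 34]})

# to_csv returns the CSV text as a string when no path is given
csv_text = df.to_csv(index=False)

# read_csv accepts a file path or any file-like object
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

In practice a file path is usually passed instead, for example df.to_csv('people.csv', index=False), where 'people.csv' is a hypothetical file name.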

Q. 20 – Compare functions in Pandas and R.

Filtering, Sampling, Querying Functions in Pandas versus R

| R | Pandas |
| --- | --- |
| dim(dataframe) | dataframe.shape |
| head(dataframe) | dataframe.head() |
| slice(dataframe, 1:100) | dataframe.iloc[:100] |
| filter(dataframe, column1 == 1, column2 == 1) | dataframe.query('column1 == 1 & column2 == 1') |
| select(dataframe, column1, column2) | dataframe[['column1', 'column2']] |
| select(dataframe, column1:column3) | dataframe.loc[:, 'column1':'column3'] |
| distinct(select(dataframe, column1)) | dataframe[['column1']].drop_duplicates() |
| sample_n(dataframe, 10) | dataframe.sample(n=10) |
| sample_frac(dataframe, 0.01) | dataframe.sample(frac=0.01) |

Sorting Functions in Pandas versus R

| R | Pandas |
| --- | --- |
| arrange(dataframe, column1, column2) | dataframe.sort_values(['column1', 'column2']) |
| arrange(dataframe, desc(column1)) | dataframe.sort_values('column1', ascending=False) |

Transforming Functions in Pandas versus R

| R | Pandas |
| --- | --- |
| select(dataframe, col_one = column1) | dataframe.rename(columns={'column1': 'col_one'})['col_one'] |
| rename(dataframe, col_one = column1) | dataframe.rename(columns={'column1': 'col_one'}) |
| mutate(dataframe, c = a - b) | dataframe.assign(c=dataframe.a - dataframe.b) |

Aggregate/Grouping Functions in Pandas versus R

| R | Pandas |
| --- | --- |
| summary(dataframe) | dataframe.describe() |
| gdataframe <- group_by(dataframe, column1) | gdataframe = dataframe.groupby('column1') |
| summarise(gdataframe, avg=mean(column1, na.rm=TRUE)) | dataframe.groupby('column1').agg({'column1': 'mean'}) |
| summarise(gdataframe, total = sum(column1)) | dataframe.groupby('column1').sum() |
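
To make a few of the Pandas entries from these tables concrete, here is a short sketch on a made-up DataFrame. The column names a and b are illustrative, and the R equivalents are noted in the comments.

```python
import pandas as pd

dataframe = pd.DataFrame({'a': [5, 3, 8], 'b': [1, 3, 2]})

# Like filter(dataframe, a > 4, b < 3) in R
filtered = dataframe.query('a > 4 and b < 3')

# Like arrange(dataframe, desc(a)) in R
ordered = dataframe.sort_values('a', ascending=False)

# Like mutate(dataframe, c = a - b) in R
with_c = dataframe.assign(c=dataframe.a - dataframe.b)

print(filtered)
print(ordered)
print(with_c)
```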

Gagan

Hi there, I'm the founder of ComputerScienceHub (started to bring useful Computer Science information together in one place). I've been doing JavaScript and Python development since 2015, have worked on a couple of web development projects, and have done some data science work using Python. Nowadays I primarily work as a freelance JavaScript developer (web developer) while also managing a team of Computer Science specialists at ComputerScienceHub.io
