Exploring Pandas in Python: Filter and Pivot Operations with Sample Data

Pawan Kumar Ganjhu
5 min readNov 15, 2023

--

Image from- Center Pivot

Pandas is a powerful data manipulation library in Python, and the filter and pivot functions are useful tools for working with DataFrame objects. Let's go through each of them using sample data.

import pandas as pd

# Sample Data
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Category': ['A', 'B', 'A', 'B'],
'Value': [10, 20, 30, 40]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

This creates a simple DataFrame with columns ‘Date’, ‘Category’, and ‘Value’. Now, let’s explore filter and pivot.

Output:

Original DataFrame:
Date Category Value
0 2023-01-01 A 10
1 2023-01-01 B 20
2 2023-01-02 A 30
3 2023-01-02 B 40

Filter

The filter function is used to subset the DataFrame based on column labels. You can use it to select specific columns.

# Filtering columns
filtered_df = df.filter(items=['Date', 'Value'])
print("\nFiltered DataFrame:")
print(filtered_df)

This will result in a DataFrame containing only the ‘Date’ and ‘Value’ columns.

Filtered DataFrame:
Date Value
0 2023-01-01 10
1 2023-01-01 20
2 2023-01-02 30
3 2023-01-02 40

Example 1: Filtering Rows Based on a Condition

You can use filter to select rows that meet a specific condition. For example, let's filter rows where the 'Value' is greater than 20.

# Filtering rows based on a condition
filtered_rows = df[df['Value'] > 20]
print("\nFiltered Rows:")
print(filtered_rows)
Filtered Rows:
Date Category Value
2 2023-01-02 A 30
3 2023-01-02 B 40

Example 2: Using like Parameter

The like parameter allows you to select columns based on partial string matching. For example, let's select columns that contain the string 'Dat'.

# Using the like parameter
selected_columns = df.filter(like='Dat')
print("\nSelected Columns:")
print(selected_columns)
Selected Columns:
Date
0 2023-01-01
1 2023-01-01
2 2023-01-02
3 2023-01-02

Example 3: Using regex Parameter

The regex parameter enables you to use regular expressions to match column names. Let's select columns that start with 'C' or 'V'.

# Using the regex parameter
selected_columns_regex = df.filter(regex='^C|^V')
print("\nSelected Columns with Regex:")
print(selected_columns_regex)
Selected Columns with Regex:
Category Value
0 A 10
1 B 20
2 A 30
3 B 40

Example 4: Filtering Columns Based on a List

You can use filter to select columns based on a list of column names.

# Filtering columns based on a list
selected_columns_list = df.filter(items=['Date', 'Value'])
print("\nSelected Columns with List:")
print(selected_columns_list)
Selected Columns with List:
Date Value
0 2023-01-01 10
1 2023-01-01 20
2 2023-01-02 30
3 2023-01-02 40

Example 5: Using items Parameter

The items parameter can be used to select columns based on a list of substrings

# Using the items parameter
selected_columns_items = df.filter(items=['Date', 'Val'])
print("\nSelected Columns with Items:")
print(selected_columns_items)
Selected Columns with Items:
Date
0 2023-01-01
1 2023-01-01
2 2023-01-02
3 2023-01-02

These examples showcase different ways to use the filter function in Pandas for both column and row selection based on various conditions and criteria.

Pivot

The pivot function is used to reshape the DataFrame by pivoting the values in one column into new columns. Let's pivot the data based on the 'Category' column.

# Pivoting the DataFrame
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print("\nPivoted DataFrame:")
print(pivot_df)

This will create a new DataFrame where unique values in the ‘Category’ column become new columns, and the ‘Value’ column provides the corresponding values.

Pivoted DataFrame:
Category A B
Date
2023-01-01 10 20
2023-01-02 30 40

Pivot Table

If you have duplicate entries for a combination of index and columns in the pivot operation, you can use the pivot_table function with an aggregation function to handle the duplicates. Let's add another entry for '2023-01-01' and 'Category' 'A' to demonstrate this.

# Adding a duplicate entry
df = df.append({'Date': '2023-01-01', 'Category': 'A', 'Value': 15}, ignore_index=True)

# Using pivot_table to handle duplicates
pivot_table_df = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
print("\nPivot Table DataFrame:")
print(pivot_table_df)

This will create a pivot table where the values are aggregated using the sum function for duplicate entries.

Pivot Table DataFrame:
Category A B
Date
2023-01-01 25 20
2023-01-02 30 40

Example 1: Multi-level Indexing

You can create a multi-level index by passing a list of columns to the index parameter.

# Using pivot with multi-level indexing
multi_level_pivot = df.pivot(index=['Date', 'Category'], columns='Value')
print("\nMulti-level Index Pivot:")
print(multi_level_pivot)

This creates a DataFrame with a multi-level index based on ‘Date’ and ‘Category’.

Multi-level Index Pivot:
Empty DataFrame
Columns: []
Index: [(2023-01-01, A), (2023-01-01, B), (2023-01-02, A), (2023-01-02, B)]

Example 2: Handling Missing Values with fillna

You can use the fillna function to replace missing values after pivoting.

# Adding a duplicate entry
df = df.append({'Date': '2023-01-01', 'Category': 'A', 'Value': 15}, ignore_index=True)

# Using groupby and pivot_table to handle duplicates
pivot_fillna = df.groupby(['Date', 'Category'])['Value'].sum().unstack(fill_value=0)
print("\nPivot with Missing Values Filled:")
print(pivot_fillna)
Pivot Table with Sum for Duplicate Entries:
Category A B
Date
2023-01-01 25 20
2023-01-02 30 40

Example 3: Aggregating with aggfunc

The aggfunc parameter allows you to specify an aggregation function for duplicate entries. Let's use the aggfunc parameter to calculate the average value for duplicate entries.

# Using pivot_table to calculate the average for duplicate entries
pivot_avg = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='mean')
print("\nPivot Table with Average for Duplicate Entries:")
print(pivot_avg)
Pivot Table with Average for Duplicate Entries:
Category A B
Date
2023-01-01 12.5 20.0
2023-01-02 30.0 40.0

Example 4: Resetting Index after Pivot

To reset the index after pivoting, you need to use pivot_table with an aggregation function for duplicate entries.

# Using pivot_table to handle duplicate entries and resetting index
pivot_table_df = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
pivot_reset_index = pivot_table_df.reset_index()
print("\nPivot Table with Reset Index:")
print(pivot_reset_index)
Pivot Table with Reset Index:
Category Date A B
0 2023-01-01 10 20
1 2023-01-02 30 40

Example 5: Pivoting with Multiple Value Columns

You can pivot on multiple value columns by specifying a list for the values parameter.

# Sample Data with an additional 'Value2' column
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Category': ['A', 'B', 'A', 'B'],
'Value': [10, 20, 30, 40],
'Value2': [100, 200, 300, 400]
}

df = pd.DataFrame(data)

# Pivoting with multiple value columns
multi_value_pivot = df.pivot(index='Date', columns='Category', values=['Value', 'Value2'])
print("\nPivot with Multiple Value Columns:")
print(multi_value_pivot)
Pivot with Multiple Value Columns:
Value Value2
Category A B A B
Date
2023-01-01 10 20 100 200
2023-01-02 30 40 300 400

--

--

Pawan Kumar Ganjhu

Data Engineer | Data & AI | R&D | Data Science | Data Analytics | Cloud