Exploring Pandas in Python: Filter and Pivot Operations with Sample Data
Pandas is a powerful data manipulation library in Python, and the filter
and pivot
functions are useful tools for working with DataFrame objects. Let's go through each of them using sample data.
import pandas as pd
# Sample Data
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Category': ['A', 'B', 'A', 'B'],
'Value': [10, 20, 30, 40]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
This creates a simple DataFrame with columns ‘Date’, ‘Category’, and ‘Value’. Now, let’s explore filter
and pivot
.
Output:
Original DataFrame:
Date Category Value
0 2023-01-01 A 10
1 2023-01-01 B 20
2 2023-01-02 A 30
3 2023-01-02 B 40
Filter
The filter
function is used to subset the DataFrame based on column labels. You can use it to select specific columns.
# Filtering columns
filtered_df = df.filter(items=['Date', 'Value'])
print("\nFiltered DataFrame:")
print(filtered_df)
This will result in a DataFrame containing only the ‘Date’ and ‘Value’ columns.
Filtered DataFrame:
Date Value
0 2023-01-01 10
1 2023-01-01 20
2 2023-01-02 30
3 2023-01-02 40
Example 1: Filtering Rows Based on a Condition
You can use filter
to select rows that meet a specific condition. For example, let's filter rows where the 'Value' is greater than 20.
# Filtering rows based on a condition
filtered_rows = df[df['Value'] > 20]
print("\nFiltered Rows:")
print(filtered_rows)
Filtered Rows:
Date Category Value
2 2023-01-02 A 30
3 2023-01-02 B 40
Example 2: Using like
Parameter
The like
parameter allows you to select columns based on partial string matching. For example, let's select columns that contain the string 'Dat'.
# Using the like parameter
selected_columns = df.filter(like='Dat')
print("\nSelected Columns:")
print(selected_columns)
Selected Columns:
Date
0 2023-01-01
1 2023-01-01
2 2023-01-02
3 2023-01-02
Example 3: Using regex
Parameter
The regex
parameter enables you to use regular expressions to match column names. Let's select columns that start with 'C' or 'V'.
# Using the regex parameter
selected_columns_regex = df.filter(regex='^C|^V')
print("\nSelected Columns with Regex:")
print(selected_columns_regex)
Selected Columns with Regex:
Category Value
0 A 10
1 B 20
2 A 30
3 B 40
Example 4: Filtering Columns Based on a List
You can use filter
to select columns based on a list of column names.
# Filtering columns based on a list
selected_columns_list = df.filter(items=['Date', 'Value'])
print("\nSelected Columns with List:")
print(selected_columns_list)
Selected Columns with List:
Date Value
0 2023-01-01 10
1 2023-01-01 20
2 2023-01-02 30
3 2023-01-02 40
Example 5: Using items
Parameter
The items
parameter can be used to select columns based on a list of substrings
# Using the items parameter
selected_columns_items = df.filter(items=['Date', 'Val'])
print("\nSelected Columns with Items:")
print(selected_columns_items)
Selected Columns with Items:
Date
0 2023-01-01
1 2023-01-01
2 2023-01-02
3 2023-01-02
These examples showcase different ways to use the filter
function in Pandas for both column and row selection based on various conditions and criteria.
Pivot
The pivot
function is used to reshape the DataFrame by pivoting the values in one column into new columns. Let's pivot the data based on the 'Category' column.
# Pivoting the DataFrame
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print("\nPivoted DataFrame:")
print(pivot_df)
This will create a new DataFrame where unique values in the ‘Category’ column become new columns, and the ‘Value’ column provides the corresponding values.
Pivoted DataFrame:
Category A B
Date
2023-01-01 10 20
2023-01-02 30 40
Pivot Table
If you have duplicate entries for a combination of index and columns in the pivot operation, you can use the pivot_table
function with an aggregation function to handle the duplicates. Let's add another entry for '2023-01-01' and 'Category' 'A' to demonstrate this.
# Adding a duplicate entry
df = df.append({'Date': '2023-01-01', 'Category': 'A', 'Value': 15}, ignore_index=True)
# Using pivot_table to handle duplicates
pivot_table_df = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
print("\nPivot Table DataFrame:")
print(pivot_table_df)
This will create a pivot table where the values are aggregated using the sum function for duplicate entries.
Pivot Table DataFrame:
Category A B
Date
2023-01-01 25 20
2023-01-02 30 40
Example 1: Multi-level Indexing
You can create a multi-level index by passing a list of columns to the index
parameter.
# Using pivot with multi-level indexing
multi_level_pivot = df.pivot(index=['Date', 'Category'], columns='Value')
print("\nMulti-level Index Pivot:")
print(multi_level_pivot)
This creates a DataFrame with a multi-level index based on ‘Date’ and ‘Category’.
Multi-level Index Pivot:
Empty DataFrame
Columns: []
Index: [(2023-01-01, A), (2023-01-01, B), (2023-01-02, A), (2023-01-02, B)]
Example 2: Handling Missing Values with fillna
You can use the fillna
function to replace missing values after pivoting.
# Adding a duplicate entry
df = df.append({'Date': '2023-01-01', 'Category': 'A', 'Value': 15}, ignore_index=True)
# Using groupby and pivot_table to handle duplicates
pivot_fillna = df.groupby(['Date', 'Category'])['Value'].sum().unstack(fill_value=0)
print("\nPivot with Missing Values Filled:")
print(pivot_fillna)
Pivot Table with Sum for Duplicate Entries:
Category A B
Date
2023-01-01 25 20
2023-01-02 30 40
Example 3: Aggregating with aggfunc
The aggfunc
parameter allows you to specify an aggregation function for duplicate entries. Let's use the aggfunc
parameter to calculate the average value for duplicate entries.
# Using pivot_table to calculate the average for duplicate entries
pivot_avg = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='mean')
print("\nPivot Table with Average for Duplicate Entries:")
print(pivot_avg)
Pivot Table with Average for Duplicate Entries:
Category A B
Date
2023-01-01 12.5 20.0
2023-01-02 30.0 40.0
Example 4: Resetting Index after Pivot
To reset the index after pivoting, you need to use pivot_table
with an aggregation function for duplicate entries.
# Using pivot_table to handle duplicate entries and resetting index
pivot_table_df = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
pivot_reset_index = pivot_table_df.reset_index()
print("\nPivot Table with Reset Index:")
print(pivot_reset_index)
Pivot Table with Reset Index:
Category Date A B
0 2023-01-01 10 20
1 2023-01-02 30 40
Example 5: Pivoting with Multiple Value Columns
You can pivot on multiple value columns by specifying a list for the values
parameter.
# Sample Data with an additional 'Value2' column
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Category': ['A', 'B', 'A', 'B'],
'Value': [10, 20, 30, 40],
'Value2': [100, 200, 300, 400]
}
df = pd.DataFrame(data)
# Pivoting with multiple value columns
multi_value_pivot = df.pivot(index='Date', columns='Category', values=['Value', 'Value2'])
print("\nPivot with Multiple Value Columns:")
print(multi_value_pivot)
Pivot with Multiple Value Columns:
Value Value2
Category A B A B
Date
2023-01-01 10 20 100 200
2023-01-02 30 40 300 400