The groupby() function in Pandas is a powerful tool used for data aggregation, transformation, and filtering. It allows you to group data based on specific criteria and then perform operations on each group. This is especially useful in real-world data analysis when you want to summarize large datasets by category.
Basic Concept of GroupBy
The concept of groupby() follows a split-apply-combine strategy:
- Split: Divide the data into groups based on certain criteria.
- Apply: Apply a function (e.g., sum, mean, count) to each group.
- Combine: Merge the results into a new DataFrame or Series.
This method is fundamental in exploratory data analysis to extract insights and identify patterns.
Basic Syntax
import pandas as pd
# Sample dataset
data = {
‘Department’: [‘HR’, ‘IT’, ‘HR’, ‘IT’, ‘Finance’, ‘Finance’],
‘Employee’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘David’, ‘Edward’, ‘Frank’],
‘Salary’: [50000, 70000, 52000, 75000, 60000, 61000]
}
df = pd.DataFrame(data)
# Using groupby to find the average salary per department
avg_salary = df.groupby(‘Department’)[‘Salary’].mean()
print(avg_salary)
Real-World Use Cases of groupby()
1. Sales and Retail Analytics
Retail businesses generate massive amounts of transactional data. Grouping by region helps assess sales performance.
sales_data = pd.DataFrame({
‘Region’: [‘North’, ‘South’, ‘North’, ‘East’, ‘West’, ‘East’],
‘Product’: [‘A’, ‘B’, ‘C’, ‘A’, ‘B’, ‘C’],
‘Quantity’: [10, 20, 30, 40, 50, 60],
‘Revenue’: [100, 200, 300, 400, 500, 600]
})
region_group = sales_data.groupby(‘Region’).agg({
‘Quantity’: ‘sum’,
‘Revenue’: ‘sum’
}).reset_index()
Use Case: Compare performance across different regions.
2. Monthly Revenue Trend
Seasonal trends affect sales. Grouping by month reveals peaks and dips.
sales_data[‘Date’] = pd.date_range(start=’2023-01-01′, periods=6, freq=’M’)
monthly_revenue = sales_data.groupby(pd.Grouper(key=’Date’, freq=’M’)).agg({
‘Revenue’: ‘sum’
}).reset_index()
Use Case: Identify seasonal trends in revenue.
3. Customer Segmentation
Identifying high-value and low-value customers.
customer_data = pd.DataFrame({
‘CustomerID’: [1, 2, 1, 3, 2, 4],
‘OrderValue’: [250, 100, 150, 300, 200, 400]
})
customer_segment = customer_data.groupby(‘CustomerID’)[‘OrderValue’].sum().reset_index()
Use Case: Segment customers based on total spend.
4. Product Performance Evaluation
Finding top-performing products based on total and average sales.
product_data = pd.DataFrame({
‘Product’: [‘A’, ‘B’, ‘A’, ‘C’, ‘B’, ‘A’],
‘Sales’: [100, 200, 150, 300, 250, 350]
})
product_performance = product_data.groupby(‘Product’)[‘Sales’].agg([‘sum’, ‘mean’]).reset_index()
Use Case: Helps in marketing and inventory decisions.
5. Employee Performance in HR Analytics
Analyzing department-wise employee performance scores.
employee_data = pd.DataFrame({
‘Department’: [‘HR’, ‘IT’, ‘HR’, ‘Finance’, ‘IT’, ‘Finance’],
‘PerformanceScore’: [3.2, 4.0, 3.8, 4.2, 3.9, 4.5]
})
department_performance = employee_data.groupby(‘Department’)[‘PerformanceScore’].mean().reset_index()
Use Case: Identify departments that need training or recognition.
6. Financial Reporting
Summarizing revenue and expenses by account type.
finance_data = pd.DataFrame({
‘AccountType’: [‘Revenue’, ‘Expense’, ‘Revenue’, ‘Expense’],
‘Amount’: [5000, 3000, 4000, 2000]
})
financial_summary = finance_data.groupby(‘AccountType’)[‘Amount’].sum().reset_index()
Use Case: Support budgeting and profit/loss tracking.
7. Finance and Banking: Monthly Customer Transactions
Tracking monthly spend by each customer.
df_transactions = pd.DataFrame({
‘CustomerID’: [101, 102, 101, 103, 102, 101],
‘Month’: [‘Jan’, ‘Jan’, ‘Feb’, ‘Feb’, ‘Mar’, ‘Mar’],
‘Amount’: [500, 700, 300, 1000, 650, 400]
})
monthly_spending = df_transactions.groupby([‘CustomerID’, ‘Month’])[‘Amount’].sum()
print(monthly_spending)
Use Case: Monitor spending behavior, detect anomalies.
8. Healthcare and Medical Analysis
Tracking the number of patients diagnosed with specific conditions.
df_healthcare = pd.DataFrame({
‘Diagnosis’: [‘Diabetes’, ‘Hypertension’, ‘Diabetes’, ‘Asthma’, ‘Hypertension’],
‘PatientID’: [2001, 2002, 2003, 2004, 2005]
})
diagnosis_count = df_healthcare.groupby(‘Diagnosis’)[‘PatientID’].count()
print(diagnosis_count)
Use Case: Allocate resources based on disease frequency.
9. Employee Salary Analysis in HR
Ensuring compensation fairness across departments.
df_employees = pd.DataFrame({
‘Department’: [‘HR’, ‘IT’, ‘Finance’, ‘HR’, ‘IT’],
‘Salary’: [50000, 70000, 60000, 52000, 75000]
})
avg_salary_dept = df_employees.groupby(‘Department’)[‘Salary’].mean()
print(avg_salary_dept)
Use Case: Identify salary gaps across departments.
10. Customer Feedback and Sentiment Analysis
Grouping sentiment to understand product satisfaction.
df_reviews = pd.DataFrame({
‘Product’: [‘Laptop’, ‘Phone’, ‘Laptop’, ‘Tablet’, ‘Phone’],
‘Sentiment’: [‘Positive’, ‘Negative’, ‘Neutral’, ‘Positive’, ‘Negative’]
})
sentiment_count = df_reviews.groupby(‘Sentiment’)[‘Product’].count()
print(sentiment_count)
Use Case: Identify areas of customer dissatisfaction or praise.
Conclusion
The groupby() function in Pandas is a must-know tool for data analysts. Whether you’re working in finance, healthcare, HR, or retail, groupby() helps you extract meaningful insights from complex datasets with ease.
If you’re planning to become a data analyst, consider enrolling in Console Flare. You’ll learn from industry experts, gain hands-on experience with real-world data, and become job-ready to solve actual business problems. With strong placement support, you’ll be well-positioned to land a high-paying role in the analytics field.
For more such content and regular updates, follow us on Facebook, Instagram, LinkedIn