Part 8: Data Manipulation in Grouping and Aggregation
Last Updated on March 11, 2026 by Editorial Team. Author(s): Raj kumar. Originally published on Towards AI.

Every business decision starts with a question. What are our total sales by region? Which product categories generate the most revenue? How do customer segments compare in profitability? These questions all share something in common: they require grouping data and calculating aggregates.

Grouping and aggregation are the backbone of business intelligence. They transform raw transactions into actionable insights. They turn thousands of individual data points into clear summaries that executives can understand and act upon. Without these operations, you’re drowning in details. With them, you see patterns, trends, and opportunities.

But here’s the critical point that many analysts miss: how you group and aggregate your data fundamentally shapes the story it tells. The same dataset can produce wildly different conclusions depending on the aggregation method you choose. Average revenue per customer masks the difference between whale customers and small buyers. Total sales by quarter hides monthly volatility. Median income tells a different story than mean income.

This is where the subtitle becomes relevant: summaries, KPIs, and portfolio metrics aren’t just constructed through grouping and aggregation. They can be distorted by them. Not through malice, but through thoughtless application of default methods. An analyst who blindly uses mean() when median() would be more appropriate isn’t lying. They’re just telling an incomplete or misleading story.

Why Grouping and Aggregation Matter

Consider a retail company analyzing store performance. They have transaction-level data: every sale, every product, every timestamp. This granular data is valuable, but it’s overwhelming. To make decisions about which stores to expand, which to close, and which need intervention, they need summaries. They group by store and calculate total revenue. Simple enough.
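That store-level summary is a one-line groupby. The sketch below uses invented store names and revenue figures (not the article's dataset) just to show the shape of the operation:

```python
import pandas as pd

# Hypothetical transaction-level data: one row per sale
transactions = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'B', 'C'],
    'revenue': [120.0, 80.0, 45.0, 60.0, 55.0, 300.0],
})

# Collapse thousands of transactions into one total per store
revenue_by_store = transactions.groupby('store')['revenue'].sum()
print(revenue_by_store)
# store
# A    200.0
# B    160.0
# C    300.0
```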
But what if some stores are larger than others? Revenue per square foot becomes the metric. What if some stores are in expensive real estate markets? Revenue per dollar of rent matters. What if customer traffic varies? Revenue per customer visit tells another story.

Each grouping and aggregation choice highlights different aspects of performance. None are wrong. All are partial views. The skill lies in choosing the right aggregation for the question at hand and being honest about what it shows and what it hides.

The Power of Pandas GroupBy

Pandas provides groupby operations that mirror SQL’s GROUP BY but with more flexibility and power. The pattern is consistent and intuitive:

df.groupby('column').aggregation_function()

This simple pattern unlocks complex analysis. You can group by one column or many. You can apply one aggregation or several. You can use built-in functions or write custom logic. You can filter groups, transform values within groups, or create entirely new features based on group statistics.

The operations covered in this guide represent the essential toolkit for group-based analysis: single-column grouping for basic summaries, multiple-column grouping for dimensional analysis, multiple aggregations for comprehensive views, custom functions for specialized calculations, pivot tables for reshaping, cross-tabulation for categorical relationships, and rolling windows for time-based aggregates.

What You’ll Learn

This guide walks through ten fundamental grouping and aggregation patterns with practical business examples. You’ll see how to create sales reports by region, analyze customer behavior by segment, build portfolio performance metrics, and construct financial KPIs. More importantly, you’ll understand when each method is appropriate and what assumptions it carries. The complete example at the end demonstrates a realistic business scenario: analyzing e-commerce performance across multiple dimensions.
This reflects how you’ll actually use these operations in practice, combining multiple grouping strategies to answer layered business questions.

A Warning About Aggregates

Before we dive into the mechanics, remember this: aggregates destroy information. When you calculate an average, you lose the distribution. When you sum values, you lose individual contributions. When you count records, you lose their content.

This isn’t a flaw. It’s the point. Aggregation trades detail for clarity. But you need to be conscious of what you’re trading away. Always ask: what am I losing by summarizing this way? Would a different aggregation tell a different story? Am I inadvertently hiding important variation? The best analysts don’t just group and aggregate. They think critically about whether their chosen method serves the analysis honestly.

Let’s start with the simplest case: grouping by a single column.

Understanding Basic Grouping

The foundation of all grouping operations is the single-column group with a single aggregation function.
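Before the numbered patterns, a quick sketch of the warning above, using invented numbers: two groups can share an identical mean while hiding completely different distributions.

```python
import pandas as pd

# Two groups with the same mean but very different variation
df_spread = pd.DataFrame({
    'group': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'value': [99, 100, 101, 10, 100, 190],
})

summary = df_spread.groupby('group')['value'].agg(['mean', 'std', 'min', 'max'])
print(summary)
# Both groups average exactly 100, but X barely moves (std 1)
# while Y swings from 10 to 190 (std 90)
```

Reporting only the mean would make X and Y look interchangeable; pairing it with a spread measure keeps the summary honest.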
1. Group by Single Column

```python
import pandas as pd
import numpy as np

# Sample sales data
data = {
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'sales': [1200, 800, 600, 1500, 900, 700, 1300, 850],
    'quantity': [2, 4, 3, 2, 3, 4, 2, 5]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
print()

# Group by region and calculate mean sales
region_avg = df.groupby('region')['sales'].mean()
print("Average Sales by Region:")
print(region_avg)
print()

# Group by region with multiple statistics
region_summary = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print("Complete Region Summary:")
print(region_summary)
```

Output:

```
Original Data:
  region product  sales  quantity
0  North  Laptop   1200         2
1  South   Phone    800         4
2   East  Tablet    600         3
3   West  Laptop   1500         2
4  North   Phone    900         3
5  South  Tablet    700         4
6   East  Laptop   1300         2
7   West   Phone    850         5

Average Sales by Region:
region
East      950.0
North    1050.0
South     750.0
West     1175.0
Name: sales, dtype: float64

Complete Region Summary:
         sum    mean  count
region
East    1900   950.0      2
North   2100  1050.0      2
South   1500   750.0      2
West    2350  1175.0      2
```

Use case: Basic performance summaries, regional comparisons, identifying top performers.
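For the "identifying top performers" use case, `idxmax` and `nlargest` are common follow-ups to a grouped series. This sketch rebuilds the relevant columns of the sample data so it runs standalone:

```python
import pandas as pd

# Same sample data as above (product/quantity columns omitted)
df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'sales': [1200, 800, 600, 1500, 900, 700, 1300, 850],
})

region_avg = df.groupby('region')['sales'].mean()

# Label of the best-performing region
print(region_avg.idxmax())        # West

# The two strongest regions, highest first
print(region_avg.nlargest(2))
```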
2. Group by Multiple Columns

```python
# Group by region and product
region_product_sales = df.groupby(['region', 'product'])['sales'].sum()
print("Sales by Region and Product:")
print(region_product_sales)
print()

# Convert to readable format
region_product_df = region_product_sales.reset_index()
print("As DataFrame:")
print(region_product_df)
```

Output:

```
Sales by Region and Product:
region  product
East    Laptop     1300
        Tablet      600
North   Laptop     1200
        Phone       900
South   Phone       800
        Tablet      700
West    Laptop     1500
        Phone       850
Name: sales, dtype: int64

As DataFrame:
  region product  sales
0   East  Laptop   1300
1   East  Tablet    600
2  North  Laptop   1200
3  North   Phone    900
4  South   Phone    800
5  South  Tablet    700
6   West  Laptop   1500
7   West   Phone    850
```

Use case: Multi-dimensional analysis, cross-category comparisons, hierarchical reporting.

Advanced Aggregation Techniques

3. Group by with Multiple Aggregations

```python
# Apply different aggregations to different columns
multi_agg = df.groupby('region').agg({
    'sales': ['sum', 'mean', 'max'],
    'quantity': ['sum', 'mean']
})
print("Multiple Aggregations by Region:")
print(multi_agg)
print()

# Flatten column names for easier access
multi_agg.columns = […]
```
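When you need flat, readable column names, pandas (0.25+) also supports named aggregation, which avoids the MultiIndex flattening step entirely. A sketch on the same sample data, with the output column names (`total_sales`, etc.) chosen here for illustration:

```python
import pandas as pd

# Same sample data as above (product column omitted)
df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'sales': [1200, 800, 600, 1500, 900, 700, 1300, 850],
    'quantity': [2, 4, 3, 2, 3, 4, 2, 5],
})

# Named aggregation: output_column=(input_column, function)
named = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    max_sale=('sales', 'max'),
    total_qty=('quantity', 'sum'),
)
print(named)   # columns are already flat strings, no MultiIndex
```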
