Towards AI

Part 16: Data Manipulation in Data Validation and Quality Control

Thursday, April 2, 2026 · Raj kumar
Last Updated on April 2, 2026 by Editorial Team. Originally published on Towards AI.

Data quality issues are the silent killers of production systems. A single malformed record can crash your pipeline. A gradual drift in data distributions can slowly degrade model performance. Missing values that sneak through validation can corrupt downstream analytics. The cost of poor data quality is measured not just in failed jobs, but in wrong business decisions, customer frustration, and lost revenue.

Data validation and cleaning are not optional preprocessing steps. They are your first line of defense against data degradation. This article explores practical techniques for ensuring data quality through validation rules, type enforcement, and systematic cleaning operations. We will look at how to catch issues early, handle them gracefully, and build data contracts that prevent silent failures.

Remove Duplicates

Duplicate records inflate dataset sizes, skew statistics, and create incorrect aggregations. Removing duplicates is often the first step in data cleaning, but doing it correctly requires understanding which columns define uniqueness and which duplicate to keep.
```python
import pandas as pd
import numpy as np

# Create sample data with duplicates
data = {
    'customer_id': [101, 102, 103, 101, 104, 102],
    'transaction_date': ['2024-01-15', '2024-01-16', '2024-01-17',
                         '2024-01-15', '2024-01-18', '2024-01-19'],
    'amount': [250.0, 150.0, 300.0, 250.0, 400.0, 150.0],
    'status': ['completed', 'completed', 'pending',
               'completed', 'completed', 'failed']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print(f"\nShape: {df.shape}")

# Remove complete duplicates (all columns match)
df_no_dup = df.drop_duplicates()
print("\n\nAfter removing complete duplicates:")
print(df_no_dup)
print(f"Shape: {df_no_dup.shape}")

# Remove duplicates based on specific column (customer_id)
df_unique_customers = df.drop_duplicates(subset=['customer_id'], keep='first')
print("\n\nKeeping first occurrence per customer:")
print(df_unique_customers)

# Keep last occurrence instead
df_last = df.drop_duplicates(subset=['customer_id'], keep='last')
print("\n\nKeeping last occurrence per customer:")
print(df_last)
```

Output:

```
Original data:
   customer_id transaction_date  amount     status
0          101       2024-01-15   250.0  completed
1          102       2024-01-16   150.0  completed
2          103       2024-01-17   300.0    pending
3          101       2024-01-15   250.0  completed
4          104       2024-01-18   400.0  completed
5          102       2024-01-19   150.0     failed

Shape: (6, 4)


After removing complete duplicates:
   customer_id transaction_date  amount     status
0          101       2024-01-15   250.0  completed
1          102       2024-01-16   150.0  completed
2          103       2024-01-17   300.0    pending
4          104       2024-01-18   400.0  completed
5          102       2024-01-19   150.0     failed
Shape: (5, 4)


Keeping first occurrence per customer:
   customer_id transaction_date  amount     status
0          101       2024-01-15   250.0  completed
1          102       2024-01-16   150.0  completed
2          103       2024-01-17   300.0    pending
4          104       2024-01-18   400.0  completed


Keeping last occurrence per customer:
   customer_id transaction_date  amount     status
2          103       2024-01-17   300.0    pending
3          101       2024-01-15   250.0  completed
4          104       2024-01-18   400.0  completed
5          102       2024-01-19   150.0     failed
```

The subset parameter defines uniqueness criteria.
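Before dropping anything, it can also be worth inspecting which rows pandas considers duplicates. A minimal sketch using the duplicated method (the sample data mirrors the example above, trimmed to two columns):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 103, 101, 104, 102],
    'amount': [250.0, 150.0, 300.0, 250.0, 400.0, 150.0],
})

# Boolean mask: True for every row that repeats an earlier row exactly
mask = df.duplicated()
print(f"Complete duplicate rows: {mask.sum()}")

# keep=False flags ALL members of a duplicate group, not just the repeats,
# which is useful for reviewing the group before deciding which row to keep
all_dups = df[df.duplicated(subset=['customer_id'], keep=False)]
print(all_dups)
```

Reviewing the full duplicate groups first often reveals whether "duplicates" are true copies or legitimate repeat events that should not be dropped.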
The keep parameter controls which duplicate to retain: 'first', 'last', or False to remove all duplicates.

Check Data Types

Type mismatches cause runtime errors and incorrect calculations. Before performing any analysis, verify that each column contains the expected data type. This prevents treating numeric strings as numbers or attempting mathematical operations on text fields.

```python
import pandas as pd

# Create data with mixed types
data = {
    'user_id': [1, 2, 3, 4, 5],
    'age': ['25', '30', 'unknown', '45', '28'],
    'salary': [50000, 65000, 70000, 80000, 55000],
    'join_date': ['2020-01-15', '2019-03-22', '2021-07-10',
                  '2018-11-05', '2022-02-18'],
    'active': [True, False, True, True, False]
}
df = pd.DataFrame(data)

# Check current data types
print("Current data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")

# Detailed type information
print("Detailed information:")
print(df.info())
print("\n" + "="*50 + "\n")

# Check for specific type
print("Checking if 'salary' is numeric:")
print(pd.api.types.is_numeric_dtype(df['salary']))
print("\nChecking if 'age' is numeric:")
print(pd.api.types.is_numeric_dtype(df['age']))

# Identify columns with object type (potential mixed types)
object_columns = df.select_dtypes(include=['object']).columns
print(f"\nColumns with object type: {list(object_columns)}")
```

Output:

```
Current data types:
user_id       int64
age          object
salary        int64
join_date    object
active         bool
dtype: object

==================================================

Detailed information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   user_id    5 non-null      int64
 1   age        5 non-null      object
 2   salary     5 non-null      int64
 3   join_date  5 non-null      object
 4   active     5 non-null      bool
dtypes: bool(1), int64(2), object(2)
memory usage: 297.0+ bytes
None

==================================================

Checking if 'salary' is numeric:
True

Checking if 'age' is numeric:
False

Columns with object type: ['age', 'join_date']
```

The dtypes attribute reveals the current type of each column. Object type often indicates string data or mixed types that need conversion.

Convert Data Types

Once you identify type mismatches, convert columns to their correct types. Type conversion must handle invalid values gracefully to avoid crashing on malformed data.

```python
import pandas as pd
import numpy as np

# Create sample data
data = {
    'product_id': ['1001', '1002', '1003', '1004'],
    'price': ['29.99', '45.50', '19.99', '99.99'],
    'quantity': ['10', '25', '30', '15'],
    'rating': ['4.5', '3.8', '4.9', '4.2'],
    'in_stock': ['true', 'false', 'true', 'true']
}
df = pd.DataFrame(data)
print("Before conversion:")
print(df.dtypes)
print("\n")
print(df)
print("\n" + "="*50 + "\n")

# Convert to appropriate types
df['product_id'] = df['product_id'].astype('int64')
df['price'] = df['price'].astype('float64')
df['quantity'] = df['quantity'].astype('int32')
df['rating'] = df['rating'].astype('float32')
df['in_stock'] = df['in_stock'].map({'true': True, 'false': False})

print("After conversion:")
print(df.dtypes)
print("\n")
print(df)
```

Output:

```
Before conversion:
product_id    object
price         object
quantity      object
rating        object
in_stock      object
dtype: object

  product_id  price quantity rating in_stock
0       1001  29.99       10    4.5     true
1       1002  45.50       25    3.8    false
2       1003  19.99       30    4.9     true
3       1004  99.99       15    4.2     true

==================================================

After conversion:
product_id      int64
price         float64
quantity        int32
rating        float32
in_stock         bool
dtype: object

   product_id  price  quantity  rating  in_stock
0        1001  29.99        10     4.5      True
1        1002  45.50        25     3.8     False
2        1003  19.99        30     4.9      True
3        1004  99.99        15     4.2      True
```

The astype method converts column types explicitly. For boolean conversions, map provides better control over string-to-boolean mapping than astype.

Handle Mixed Types

Real-world data often contains mixed types in a single column. A numeric column might contain error codes as strings, or a date column might have 'N/A' entries.
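Dates follow the same pattern. As a brief sketch (the sample data here is invented for illustration), pd.to_datetime with errors='coerce' turns unparseable entries into NaT, the datetime counterpart of NaN:

```python
import pandas as pd

# Invented sample: a date column contaminated with non-date entries
df = pd.DataFrame({
    'event_date': ['2024-01-15', 'N/A', '2024-03-02', 'not a date']
})

# errors='coerce' turns unparseable strings into NaT instead of raising
df['event_date'] = pd.to_datetime(df['event_date'], errors='coerce')

# Count how many rows failed to parse
print(f"Unparseable dates: {df['event_date'].isna().sum()}")
```

Once the column is a proper datetime64 dtype, the `.dt` accessor and date arithmetic become available, and the NaT rows can be reviewed or imputed like any other missing values.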
The to_numeric function handles these scenarios with an errors parameter.

```python
import pandas as pd
import numpy as np

# Create data with mixed types
data = {
    'sensor_id': [1, 2, 3, 4, 5],
    'temperature': ['72.5', '68.3', 'ERROR', '75.1', '70.0'],
    'pressure': ['1013', '1015', '1012', 'null', '1014'],
    'humidity': ['45', '52', '48', '50', 'N/A']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
print("\n" + "="*50 + "\n")

# Attempt conversion with errors='coerce' (invalid values become NaN)
df['temperature'] = pd.to_numeric(df['temperature'], errors='coerce')
df['pressure'] = pd.to_numeric(df['pressure'], errors='coerce')
df['humidity'] = pd.to_numeric(df['humidity'], errors='coerce')

print("After conversion (errors='coerce'):")
print(df)
print("\n")
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")

# Check for conversion failures
print("Rows with NaN values:")
print(df[df.isna().any(axis=1)])
```

Output:

```
Original data:
   sensor_id temperature pressure humidity
0          1        72.5     1013       45
1          2        68.3     1015       52
2          3       ERROR     1012       48
3          4        75.1     null       50
4          5        70.0     1014      N/A

==================================================

After conversion (errors='coerce'):
   sensor_id  temperature  pressure  humidity
0          1         72.5    1013.0      45.0
1          2         68.3    1015.0      52.0
2          3          NaN    1012.0      48.0
3          4         75.1       NaN      50.0
4          5         70.0    1014.0       NaN

Data types:
sensor_id        int64
temperature    float64
pressure       float64
humidity       float64
dtype: object

==================================================

Rows with NaN values:
   sensor_id  temperature  pressure  humidity
2          3          NaN    1012.0      48.0
3          4         75.1       NaN      50.0
4          5         70.0    1014.0       NaN
```

The errors='coerce' parameter converts invalid values to NaN instead of raising an exception. This allows the pipeline to […]
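A practical follow-up to errors='coerce' is to report exactly what was lost. A hedged sketch (the helper name coerce_with_report is invented for illustration) that compares null counts before and after conversion so coercion failures are logged rather than silently absorbed:

```python
import pandas as pd

def coerce_with_report(series: pd.Series) -> pd.Series:
    # Hypothetical helper: coerce to numeric and report how many values
    # were newly turned into NaN by the conversion
    before = series.isna().sum()
    converted = pd.to_numeric(series, errors='coerce')
    newly_missing = converted.isna().sum() - before
    if newly_missing:
        print(f"{series.name}: {newly_missing} value(s) failed to convert")
    return converted

readings = pd.Series(['72.5', 'ERROR', '70.0'], name='temperature')
clean = coerce_with_report(readings)
# prints: temperature: 1 value(s) failed to convert
```

Subtracting the pre-existing null count matters: without it, values that were already missing before conversion would be miscounted as coercion failures.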