Combining DataFrames in Pandas: Merge, Join, and Concatenate Explained¶
In data analysis, your data rarely resides in a single table. You often need to combine information from multiple sources or tables based on common fields or simply stack datasets together. Pandas, the cornerstone library for data manipulation in Python, provides powerful and flexible functions for these tasks, primarily pd.merge() and pd.concat().
This post will guide you through the common methods for combining DataFrames in Pandas, focusing on database-style joins using merge and stacking using concat.
Why Combine DataFrames?¶
- Enriching Data: Adding columns from one DataFrame to another based on a shared key (e.g., adding customer details to a sales transaction table).
- Comparing Datasets: Identifying common or unique records between two tables.
- Appending Data: Stacking data from multiple files or time periods into a single DataFrame (e.g., combining monthly sales reports).
Merging DataFrames with pd.merge() (Database-Style Joins)¶
The pd.merge() function is the primary tool for performing database-like joins on DataFrames based on one or more common columns (keys) or indices.
Core Syntax:
pd.merge(left_df, right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))
Key Parameters:
left_df,right_df: The two DataFrames to merge.how: Specifies the type of join. Common options:'inner'(default): Keeps only rows where the key(s) exist in both DataFrames.'left': Keeps all rows from theleft_dfand matching rows from theright_df. Fills withNaNwhere no match exists in the right DataFrame.'right': Keeps all rows from theright_dfand matching rows from theleft_df. Fills withNaNwhere no match exists in the left DataFrame.'outer': Keeps all rows from both DataFrames. Fills withNaNwhere matches don't exist in either DataFrame.
on: Column name(s) to join on. Must be present in both DataFrames. IfNone, Pandas uses the intersection of columns in both DataFrames as keys.left_on,right_on: Column name(s) from the left and right DataFrames respectively to join on (used when key column names differ).left_index,right_index: Use the index of the left or right DataFrame as the join key.suffixes: Tuple of strings to append to overlapping column names (that are not join keys) in the left and right DataFrames respectively (default:('_x', '_y')).
Example Setup:
Let's create two simple DataFrames to illustrate merging:
import pandas as pd
# Employee data
employees = pd.DataFrame({
'employee_id': [101, 102, 103, 104],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'department_id': [1, 2, 1, 3]
})
# Department data
departments = pd.DataFrame({
'department_id': [1, 2, 4],
'department_name': ['HR', 'Engineering', 'Marketing']
})
print("Employees:\n", employees)
print("\nDepartments:\n", departments)
Merge Examples:
-
Inner Join (Default): Keep only employees whose department exists in the
departmentstable.inner_merged = pd.merge(employees, departments, on='department_id', how='inner') print("\nInner Merge:\n", inner_merged) # Output: Alice, Bob, Charlie (David's department 3 and Marketing dept 4 are excluded) -
Left Join: Keep all employees, add department names where available.
left_merged = pd.merge(employees, departments, on='department_id', how='left') print("\nLeft Merge:\n", left_merged) # Output: All employees included. David's department_name will be NaN. Marketing dept is excluded. -
Right Join: Keep all departments, add employee names where available.
right_merged = pd.merge(employees, departments, on='department_id', how='right') print("\nRight Merge:\n", right_merged) # Output: All departments included. Marketing's employee info will be NaN. David is excluded. -
Outer Join: Keep all employees and all departments.
outer_merged = pd.merge(employees, departments, on='department_id', how='outer') print("\nOuter Merge:\n", outer_merged) # Output: All employees and all departments included. NaN where matches don't exist.
Concatenating DataFrames with pd.concat() (Stacking)¶
Concatenation is about stacking multiple DataFrames together either vertically (appending rows) or horizontally (adding columns). It doesn't rely on keys like merge.
Core Syntax:
pd.concat(objs, axis=0, join='outer', ignore_index=False)
Key Parameters:
objs: A sequence (like a list) of DataFrames to concatenate.axis: The axis to concatenate along.0(default): Stack vertically (append rows). Columns are aligned by name.1: Stack horizontally (append columns). Rows are aligned by index.
join: How to handle indices/columns on the other axis.'outer'(default): Keep all unique indices/columns from all DataFrames. Fills non-matching areas withNaN.'inner': Keep only indices/columns shared across all DataFrames.
ignore_index: IfTrue, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. Useful when stacking rows and you want a clean new index.
Example: Stacking Rows (Vertical Concatenation)
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
stacked = pd.concat([df1, df2], ignore_index=True) # ignore_index creates a new 0..n-1 index
print("\nStacked Vertically:\n", stacked)
Example: Combining Multiple CSV Files
This is a common use case for concat – combining data from several files with the same structure.
import pandas as pd
import os
import glob # For finding files matching a pattern
# --- Create dummy CSV files for the example ---
if not os.path.exists('temp_csvs'):
os.makedirs('temp_csvs')
pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}).to_csv('temp_csvs/data_part1.csv', index=False)
pd.DataFrame({'col1': [5, 6], 'col2': [7, 8]}).to_csv('temp_csvs/data_part2.csv', index=False)
# --- End of dummy file creation ---
# Specify the folder containing the CSV files
folder_path = 'temp_csvs'
# Find all csv files in the folder
all_files = glob.glob(os.path.join(folder_path, "*.csv"))
# Read and concatenate all CSV files into a single DataFrame
df_list = [pd.read_csv(file) for file in all_files]
if df_list: # Check if the list is not empty
merged_result = pd.concat(df_list, ignore_index=True)
print("\nConcatenated CSVs:\n", merged_result)
else:
print(f"No CSV files found in {folder_path}")
# --- Clean up dummy files ---
# import shutil
# shutil.rmtree('temp_csvs')
# --- End of cleanup ---
Merge vs. Concat: When to Use Which?¶
- Use
pd.merge()when you need to combine DataFrames based on the values in one or more common columns (like SQL joins). - Use
pd.concat()when you want to stack DataFrames on top of each other (axis=0) or side-by-side (axis=1), aligning by index or column names.
Understanding these functions is fundamental to effective data manipulation in Pandas. Choose the right tool based on how you need to combine your datasets.