Skip to content

Pandas basics

Before we start, dont scroll through really fast and think, this is too difficult. It is easy easy when you get the hang of it. Think of it as a powerful version of Excel that you can give it a recipe to read and it will follow every step that you tell it. In order to bake that cake (your report), we first need to understand how to speak the same language. This section will be providing you a baseline on what the tool is doing when you tell it to open a csv file. When you open one in Excel, it just opens and nothing else. Here are are opening it and telling it to tell us what type of data that we have here (dates, numbers, letters, so on).

We'll start with the basics, how to start pandas and what types of data you can load into it. In this example we'll load the sales_data_2023+2024.csv that we crated in the Data/data.md page.

#!/usr/bin/env python3 
# dont add that if you're using Windows

import pandas as pd

# specify your filename (you can use csv, xlsx, etc)
df = pd.read_csv('sales_data_2023+2024.csv')

print(df)

Output of the script:

                             Transaction_ID        Date     Customer_Name Region  ...     Product Quantity  Price_per_Unit  Total_Sales
0      09792c2c-fee9-425f-8b7f-13b2e5a0b854  2024-08-27       Jesse Gibbs   West  ...       Table       19             100         1900
1      839d70a3-8621-428a-a3f6-3a43995e466f  2022-12-15    Anthony Torres   East  ...       Shirt       13              20          260
2      286a6ffb-5c0e-4ce7-a5fe-4ba874d52df8  2024-11-13  Dr. Patrick Meza  South  ...     Fiction        1              20           20
3      d561f11c-d446-4b06-99f0-f32cf964af60  2024-03-28        Lori Tyler  South  ...  Vegetables       10             100         1000
4      dc035123-334c-43f8-b706-c5228dc11a2d  2023-03-10   Madison Whitney  South  ...  Headphones       14             500         7000
...                                     ...         ...               ...    ...  ...         ...      ...             ...          ...
24995  b6a05480-3a4b-4b4a-9f07-ba91e1604681  2024-04-14     Colleen Lyons  South  ...   Beverages        9              50          450
24996  0aff1f74-4d12-487a-80d3-54ab6f3d0ab5  2024-08-17    Joseph Buckley  North  ...    Magazine       17             200         3400
24997  9acb0b7e-ba92-4b73-8d74-0bc3ccdc0bd1  2024-10-16  Nathaniel Carter  North  ...    Textbook       10             100         1000
24998  ad951ce0-f994-47ef-9fe7-51009fb73c36  2023-08-23     Caitlin Black  South  ...       Shoes       14             200         2800
24999  736b9109-0e65-4135-b2e0-eab5049c7db1  2023-02-19      Ashley Davis   East  ...    Textbook       12             500         6000

[25000 rows x 9 columns]

Depending on your terminal or notebook settings, some columns may be truncated (cut off) to fit the screen. Don't worry—pandas still has all the data. Just looking at the raw data doesn't always give you the full picture. We often want to know how many rows and columns there are, the types of data stored in each column, and if there are any missing values. Pandas makes this easy with the df.info() function:

#!/usr/bin/env python3

import pandas as pd

df = pd.read_csv('sales_data_2023+2024.csv')

print(df.info())

And the output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction_ID    25000 non-null  object
 1   Date              25000 non-null  object
 2   Customer_Name     25000 non-null  object
 3   Region            25000 non-null  object
 4   Product_Category  25000 non-null  object
 5   Product           25000 non-null  object
 6   Quantity          25000 non-null  int64 
 7   Price_per_Unit    25000 non-null  int64 
 8   Total_Sales       25000 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 1.7+ MB
None

Breaking down what all this means:

  • # <class 'pandas.core.frame.DataFrame'> is a docstring that provides information about the type of the object being described where pandas.core.frame.DataFrame specifically refers to the DataFrame class within the pandas library.
  • RangeIndex: 25000 entries, 0 to 24999 describes the index of the DataFrame, which is a sequence of values used as the row labels in the data structure.
  • Data columns (total 9 columns): This line announces the total number of columns (or features/columns) present within this DataFrame.
  • 25000 non-null: This indicates that there are 25,000 entries in the column that are not missing or null. All values are present and valid.
  • object: The data type of the column. An object typically means the data is stored as a generic type, usually representing strings or mixed types.
  • int64: This is the data type of the column. int64 indicates that the values are stored as 64-bit integers, meaning they are whole numbers.

Each column's data type and non-null count give insights into the kind of data it holds and how complete that data is.

The number of rows (transactions) and columns in the DataFrame. The data type of each column (e.g., string, integer). The total memory usage of the DataFrame.

Here are some common data types in pandas DataFrames:

  • Integers: Represent whole numbers (no decimals). Examples: int64, int32.
  • Floats: Represent numbers with decimals (floating-point numbers). Examples: float64, float32.
  • Boolean: Represents True or False values. Example: bool.
  • Strings: Represent text data (words, sentences, etc.). Common data types for strings include:
  • object (most common): Generally used for text or mixed data types.
  • category: A special data type used for columns with a limited number of possible values (like a list of categories).
  • string_: A newer, more explicit type for string data in pandas.
  • Dates and Times:
  • datetime64: Represents both dates and times as a single value (e.g., '2024-12-11 14:30:00').
  • timedelta64: Represents the difference or interval between two dates or times (e.g., '5 days', '3 hours').