Pandas basics¶
Before we start, dont scroll through really fast and think, this is too difficult. It is easy easy when you get the hang of it. Think of it as a powerful version of Excel that you can give it a recipe to read and it will follow every step that you tell it. In order to bake that cake (your report), we first need to understand how to speak the same language. This section will be providing you a baseline on what the tool is doing when you tell it to open a csv file. When you open one in Excel, it just opens and nothing else. Here are are opening it and telling it to tell us what type of data that we have here (dates, numbers, letters, so on).
We'll start with the basics, how to start pandas and what types of data you can load into it. In this example we'll load the sales_data_2023+2024.csv that we crated in the Data/data.md page.
#!/usr/bin/env python3
# dont add that if you're using Windows
import pandas as pd
# specify your filename (you can use csv, xlsx, etc)
df = pd.read_csv('sales_data_2023+2024.csv')
print(df)
Output of the script:
Transaction_ID Date Customer_Name Region ... Product Quantity Price_per_Unit Total_Sales
0 09792c2c-fee9-425f-8b7f-13b2e5a0b854 2024-08-27 Jesse Gibbs West ... Table 19 100 1900
1 839d70a3-8621-428a-a3f6-3a43995e466f 2022-12-15 Anthony Torres East ... Shirt 13 20 260
2 286a6ffb-5c0e-4ce7-a5fe-4ba874d52df8 2024-11-13 Dr. Patrick Meza South ... Fiction 1 20 20
3 d561f11c-d446-4b06-99f0-f32cf964af60 2024-03-28 Lori Tyler South ... Vegetables 10 100 1000
4 dc035123-334c-43f8-b706-c5228dc11a2d 2023-03-10 Madison Whitney South ... Headphones 14 500 7000
... ... ... ... ... ... ... ... ... ...
24995 b6a05480-3a4b-4b4a-9f07-ba91e1604681 2024-04-14 Colleen Lyons South ... Beverages 9 50 450
24996 0aff1f74-4d12-487a-80d3-54ab6f3d0ab5 2024-08-17 Joseph Buckley North ... Magazine 17 200 3400
24997 9acb0b7e-ba92-4b73-8d74-0bc3ccdc0bd1 2024-10-16 Nathaniel Carter North ... Textbook 10 100 1000
24998 ad951ce0-f994-47ef-9fe7-51009fb73c36 2023-08-23 Caitlin Black South ... Shoes 14 200 2800
24999 736b9109-0e65-4135-b2e0-eab5049c7db1 2023-02-19 Ashley Davis East ... Textbook 12 500 6000
[25000 rows x 9 columns]
Depending on your terminal or notebook settings, some columns may be truncated (cut off) to fit the screen. Don't worry—pandas still has all the data. Just looking at the raw data doesn't always give you the full picture. We often want to know how many rows and columns there are, the types of data stored in each column, and if there are any missing values. Pandas makes this easy with the df.info() function:
#!/usr/bin/env python3
import pandas as pd
df = pd.read_csv('sales_data_2023+2024.csv')
print(df.info())
And the output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Transaction_ID 25000 non-null object
1 Date 25000 non-null object
2 Customer_Name 25000 non-null object
3 Region 25000 non-null object
4 Product_Category 25000 non-null object
5 Product 25000 non-null object
6 Quantity 25000 non-null int64
7 Price_per_Unit 25000 non-null int64
8 Total_Sales 25000 non-null int64
dtypes: int64(3), object(6)
memory usage: 1.7+ MB
None
Breaking down what all this means:
# <class 'pandas.core.frame.DataFrame'>is a docstring that provides information about the type of the object being described wherepandas.core.frame.DataFramespecifically refers to the DataFrame class within the pandas library.RangeIndex: 25000 entries, 0 to 24999describes the index of the DataFrame, which is a sequence of values used as the row labels in the data structure.Data columns (total 9 columns): This line announces the total number of columns (or features/columns) present within this DataFrame.25000 non-null: This indicates that there are 25,000 entries in the column that are not missing or null. All values are present and valid.object: The data type of the column. Anobjecttypically means the data is stored as a generic type, usually representing strings or mixed types.int64: This is the data type of the column.int64indicates that the values are stored as 64-bit integers, meaning they are whole numbers.
Each column's data type and non-null count give insights into the kind of data it holds and how complete that data is.
The number of rows (transactions) and columns in the DataFrame. The data type of each column (e.g., string, integer). The total memory usage of the DataFrame.
Here are some common data types in pandas DataFrames:
- Integers: Represent whole numbers (no decimals). Examples:
int64,int32. - Floats: Represent numbers with decimals (floating-point numbers). Examples:
float64,float32. - Boolean: Represents True or False values. Example:
bool. - Strings: Represent text data (words, sentences, etc.). Common data types for strings include:
object(most common): Generally used for text or mixed data types.category: A special data type used for columns with a limited number of possible values (like a list of categories).string_: A newer, more explicit type for string data in pandas.- Dates and Times:
datetime64: Represents both dates and times as a single value (e.g., '2024-12-11 14:30:00').timedelta64: Represents the difference or interval between two dates or times (e.g., '5 days', '3 hours').