Pandas is a powerful and versatile data manipulation library for Python, designed to handle structured data efficiently. It provides high-performance, easy-to-use data structures and tools for working with various types of data, including CSV files, Excel spreadsheets, SQL databases, and more.
At the core of pandas are two primary data structures:
- Series: a one-dimensional labeled array capable of holding data of any type (e.g., integers, floats, strings).
- DataFrame: a two-dimensional labeled data structure with columns of potentially different types. It can be thought of as a table or spreadsheet-like structure.
These data structures allow for intuitive and efficient data manipulation, making pandas an essential tool for data scientists, analysts, and developers working with structured data.
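For a quick sense of how the two structures look in code, here is a minimal sketch (the labels and values are made up for illustration):

import pandas as pd

# A Series: one-dimensional, with an index of labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)

# A DataFrame: two-dimensional, built here from a dict of columns
df = pd.DataFrame({'x': [1, 2, 3], 'color': ['red', 'green', 'blue']})
print(df)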
Some key features of pandas include:
- Fast and efficient DataFrame object with default and customizable indexing
- Tools for loading data from various file formats and databases
- Data alignment and integrated handling of missing data
- Reshaping and pivoting of datasets
- Label-based slicing, fancy indexing, and subsetting of large datasets
- Data merging and joining
- Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure
- Time series functionality, including date range generation, frequency conversion, moving window statistics, and date shifting (see the short sketch after this list)
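To give a flavor of the time series tools from the last item, here is a small sketch using standard pandas calls (the dates and values are invented for the example):

import numpy as np
import pandas as pd

# Generate a daily date range and attach some sample values
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts = pd.Series(np.arange(10, dtype=float), index=dates)

# Frequency conversion: downsample to weekly means
print(ts.resample('W').mean())

# Moving window statistics: 3-day rolling average
print(ts.rolling(window=3).mean())

# Date shifting: move values forward by one day
print(ts.shift(1))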
To illustrate the power of pandas, here’s a simple example of creating a DataFrame and performing basic operations:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Calculate the mean age
mean_age = df['Age'].mean()
print(f"Mean age: {mean_age}")

# Filter the DataFrame
young_people = df[df['Age'] < 30]
print("People younger than 30:")
print(young_people)
This example demonstrates how easily you can create a DataFrame, perform calculations, and filter data using pandas. As we dive deeper into the library’s capabilities, you’ll discover how it can significantly streamline your data analysis workflows and make working with structured data in Python a breeze.
Importing pandas and reading data
To begin working with pandas and reading data, you first need to import the library. The convention is to import pandas with the alias ‘pd’. Here’s how you can import pandas:
import pandas as pd
Once pandas is imported, you can use its various functions to read data from different file formats. The most common function for reading CSV files is read_csv(). Here’s a basic example of how to use it:
# Reading a CSV file
df = pd.read_csv('data.csv')
This line of code reads the contents of ‘data.csv’ and stores it in a DataFrame called ‘df’. The read_csv() function automatically infers the structure of your data, including column names and data types.
You can also read data from other file formats using similar functions:
- read_excel() for Excel files
- read_json() for JSON files
- read_sql() for SQL databases
- read_html() for HTML tables
Here’s an example of reading an Excel file:
# Reading an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
After reading the data, you can quickly check its contents using various pandas functions:
# Display the first few rows
print(df.head())

# Get information about the DataFrame
print(df.info())

# Display summary statistics
print(df.describe())
These functions provide a quick overview of your data, including column names, data types, and basic statistics.
When dealing with large datasets, you might want to read only a portion of the file. You can do this by specifying the nrows parameter:
# Read only the first 1000 rows
df_sample = pd.read_csv('large_data.csv', nrows=1000)
This can be particularly useful when working with very large files or when you want to quickly explore the structure of your data without loading the entire dataset into memory.
Remember that pandas provides many options to customize how data is read. We’ll explore more advanced features in the following sections, but these basics should get you started with importing pandas and reading data from various file formats.
Specifying file path and format options
When working with pandas.read_csv(), specifying the correct file path and format options is essential for loading your data successfully. Let’s explore some common scenarios and options you might encounter.
Specifying the File Path:
You can provide either a relative or absolute path to your CSV file. Here are some examples:
# Relative path
df = pd.read_csv('data/sales.csv')

# Absolute path
df = pd.read_csv('C:/Users/YourName/Documents/data/sales.csv')

# Using a raw string for Windows paths
df = pd.read_csv(r'C:\Users\YourName\Documents\data\sales.csv')
Handling Different Delimiters:
While CSV stands for “Comma-Separated Values,” not all CSV files use commas as separators. You can specify a different delimiter using the sep parameter:
# Tab-separated file
df = pd.read_csv('data.tsv', sep='\t')

# Semicolon-separated file
df = pd.read_csv('data.csv', sep=';')
Specifying Column Names:
If your CSV file doesn’t have column names in the first row, you can provide them using the names parameter:
column_names = ['Date', 'Product', 'Quantity', 'Price']
df = pd.read_csv('sales_data.csv', names=column_names)
Skipping Rows:
Sometimes, you may need to skip certain rows in your CSV file. Use the skiprows parameter for this:
# Skip the first 2 rows
df = pd.read_csv('data.csv', skiprows=2)

# Skip specific rows
df = pd.read_csv('data.csv', skiprows=[0, 2, 3])
Specifying Data Types:
You can explicitly set the data types for columns using the dtype parameter:
dtypes = {'Date': str, 'Price': float, 'Quantity': int}
df = pd.read_csv('sales_data.csv', dtype=dtypes)
Handling Dates:
For date columns, you can use the parse_dates parameter to automatically convert strings to datetime objects:
df = pd.read_csv('sales_data.csv', parse_dates=['Date'])
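Once parsed, the column has a datetime64 dtype and supports the .dt accessor. A brief sketch, assuming the 'Date' column from the example above:

# The parsed column now supports datetime operations
print(df['Date'].dtype)          # datetime64[ns]
print(df['Date'].dt.year)        # extract the year from each date
print(df['Date'].dt.day_name())  # e.g., 'Monday'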
Handling Encodings:
If your CSV file uses a specific encoding, you can specify it using the encoding parameter:
df = pd.read_csv('data.csv', encoding='utf-8')
df = pd.read_csv('data.csv', encoding='latin-1')
Reading Compressed Files:
Pandas can directly read compressed CSV files:
# Compression is inferred from the file extension by default,
# but it can also be set explicitly
df = pd.read_csv('data.csv.gz', compression='gzip')
df = pd.read_csv('data.csv.zip', compression='zip')
Combining Multiple Options:
You can combine multiple options to handle complex CSV files:
df = pd.read_csv(
    'sales_data.csv',
    sep=';',
    encoding='utf-8',
    parse_dates=['Date'],
    dtype={'Price': float, 'Quantity': int},
    skiprows=1
)
By using these options, you can efficiently handle a wide variety of CSV file formats and structures, ensuring that your data is loaded correctly into a pandas DataFrame for further analysis and manipulation.
Handling missing data and duplicates
When working with real-world datasets, it is common to encounter missing values and duplicate entries. Pandas provides several tools to handle these issues effectively. Let’s explore how to deal with missing data and duplicates using pandas.
Handling Missing Data:
Pandas represents missing data using NaN (Not a Number) values. Here are some common techniques for handling missing data:
1. Detecting missing values:
# Check for missing values
print(df.isnull().sum())

# Get rows with any missing values
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows)
2. Dropping missing values:
# Drop rows with any missing values
df_cleaned = df.dropna()

# Drop rows where all columns are missing
df_cleaned = df.dropna(how='all')

# Keep only columns with at least 50% non-missing values
df_cleaned = df.dropna(axis=1, thresh=int(len(df) * 0.5))
3. Filling missing values:
# Fill missing values with a specific value
df_filled = df.fillna(0)

# Fill missing values with the mean of each numeric column
df_filled = df.fillna(df.mean(numeric_only=True))

# Forward fill: propagate the last valid value
# (fillna(method='ffill') is deprecated in recent pandas)
df_filled = df.ffill()

# Backward fill: use the next valid value
df_filled = df.bfill()
Handling Duplicates:
Duplicate entries can skew your analysis, so it’s important to identify and handle them appropriately. Here’s how you can work with duplicates:
1. Detecting duplicates:
# Check for duplicate rows
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)

# Check for duplicates based on specific columns
duplicate_rows = df[df.duplicated(subset=['Name', 'Age'])]
print(duplicate_rows)
2. Removing duplicates:
# Remove duplicate rows, keeping the first occurrence
df_unique = df.drop_duplicates()

# Remove duplicate rows, keeping the last occurrence
df_unique = df.drop_duplicates(keep='last')

# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['Name', 'Age'])
Combining Missing Data and Duplicate Handling:
In practice, you might want to combine these techniques to clean your data efficiently:
# Load the CSV file
df = pd.read_csv('data.csv')

# Drop rows where every column is missing
df_cleaned = df.dropna(how='all')

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Fill remaining missing values with the column means
df_cleaned = df_cleaned.fillna(df_cleaned.mean(numeric_only=True))

# Check the result
print(df_cleaned.info())
print(df_cleaned.head())
By applying these techniques, you can effectively handle missing data and duplicates in your datasets, ensuring that your data is clean and ready for analysis. Remember to always consider the implications of your data cleaning choices on your analysis and results.
Exploring and manipulating the loaded data
Once you have loaded your data into a pandas DataFrame, you can start exploring and manipulating it. Pandas offers a wide range of powerful tools for data analysis and transformation. Let’s look at some common operations you can perform on your loaded data.
1. Basic DataFrame Information:
# Display basic information about the DataFrame
print(df.info())

# Display the first few rows
print(df.head())

# Display the last few rows
print(df.tail())

# Display summary statistics
print(df.describe())

# Display column names
print(df.columns)

# Get the shape of the DataFrame (rows, columns)
print(df.shape)
2. Selecting and Filtering Data:
# Select a single column
age_column = df['Age']

# Select multiple columns
subset = df[['Name', 'Age', 'City']]

# Filter rows based on a condition
young_people = df[df['Age'] < 30]

# Filter with multiple conditions
ny_adults = df[(df['Age'] > 25) & (df['City'] == 'New York')]

# Select specific rows and columns with .loc
subset = df.loc[0:5, ['Name', 'Age']]
3. Adding and Modifying Columns:
# Add a new column with binned age groups
df['Age_Group'] = pd.cut(df['Age'],
                         bins=[0, 18, 35, 60, 100],
                         labels=['Child', 'Young Adult', 'Adult', 'Senior'])

# Modify an existing column
df['Name'] = df['Name'].str.upper()

# Apply a function to a column
df['Age_Squared'] = df['Age'].apply(lambda x: x**2)
4. Grouping and Aggregating Data:
# Group by a column and calculate the mean
mean_age_by_city = df.groupby('City')['Age'].mean()

# Multiple aggregations
summary = df.groupby('City').agg({
    'Age': ['mean', 'min', 'max'],
    'Name': 'count'
})

# Pivot tables
pivot_table = pd.pivot_table(df, values='Age', index='City',
                             columns='Gender', aggfunc='mean')
5. Sorting Data:
# Sort by a single column
df_sorted = df.sort_values('Age')

# Sort by multiple columns
df_sorted = df.sort_values(['City', 'Age'], ascending=[True, False])
6. Handling Missing Values:
# Drop rows with missing values
df_cleaned = df.dropna()

# Fill missing values in a column
# (assignment is preferred over inplace=True on a column selection)
df['Age'] = df['Age'].fillna(df['Age'].mean())
7. Merging and Joining DataFrames:
# Merge two DataFrames on a shared key
merged_df = pd.merge(df1, df2, on='ID', how='inner')

# Concatenate DataFrames vertically
combined_df = pd.concat([df1, df2], axis=0)
8. Reshaping Data:
# Melt DataFrame from wide to long format
df_long = pd.melt(df, id_vars=['Name'],
                  value_vars=['Math', 'Science', 'English'],
                  var_name='Subject', value_name='Score')

# Pivot DataFrame from long to wide format
df_wide = df_long.pivot(index='Name', columns='Subject', values='Score')
These operations showcase the versatility of pandas in exploring and manipulating data. By combining these techniques, you can perform complex data analysis tasks efficiently. Remember to always keep a copy of your original data and document your data manipulation steps for reproducibility.
Advanced features and customization options
Pandas offers a wealth of advanced features and customization options for reading CSV files, allowing you to handle complex data structures and improve performance. Let’s explore some of these advanced capabilities:
1. Chunking large files:
When dealing with very large CSV files, you can use the chunksize parameter to read the file in smaller chunks, processing each chunk separately:
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    print(chunk.shape)
    # ... perform operations on the chunk
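A common pattern is to reduce each chunk and combine the partial results at the end. Here is a minimal sketch, assuming a hypothetical 'value' column and filter threshold:

filtered_parts = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # Keep only the rows of interest from each chunk
    filtered_parts.append(chunk[chunk['value'] > 0])

# Combine the filtered pieces into a single DataFrame
result = pd.concat(filtered_parts, ignore_index=True)
print(result.shape)

This keeps peak memory proportional to the filtered data rather than the full file.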
2. Memory-efficient data types:
Use the dtype parameter to specify more memory-efficient data types for your columns:
dtypes = {
    'id': 'int32',
    'name': 'category',
    'value': 'float32'
}
df = pd.read_csv('data.csv', dtype=dtypes)
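To verify the savings, you can inspect per-column memory with memory_usage(), a standard DataFrame method:

# Report per-column memory usage in bytes (deep=True counts string contents)
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")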
3. Custom date parsing:
For complex date formats, specify the format explicitly. In pandas 2.0+ this is done with the date_format parameter; the older date_parser argument is deprecated:

df = pd.read_csv('data.csv',
                 parse_dates=['timestamp'],
                 date_format='%Y-%m-%d %H:%M:%S.%f')
4. Handling multi-line fields:
If your CSV contains fields with line breaks, use the quoting parameter:
import csv

df = pd.read_csv('data.csv', quoting=csv.QUOTE_ALL, quotechar='"')
5. Specifying column subsets:
Load only specific columns to reduce memory usage:
columns_to_use = ['id', 'name', 'value']
df = pd.read_csv('data.csv', usecols=columns_to_use)
6. Custom NA values:
Specify custom strings to be treated as NA/NaN:
df = pd.read_csv('data.csv', na_values=['NA', 'N/A', 'null', '-'])
7. Handling thousands separators:
For numbers with thousands separators, use the thousands parameter:
df = pd.read_csv('data.csv', thousands=',')
8. Skipping bad lines:
To handle files with occasional formatting errors, use the on_bad_lines parameter (it replaced the deprecated error_bad_lines and warn_bad_lines arguments in pandas 1.3+):

# Skip malformed lines instead of raising an error
df = pd.read_csv('data.csv', on_bad_lines='skip')

# Or warn about each bad line while skipping it
df = pd.read_csv('data.csv', on_bad_lines='warn')
9. Using converters:
Apply custom functions to specific columns during import:
def uppercase_converter(x):
    return x.upper()

converters = {'name': uppercase_converter}
df = pd.read_csv('data.csv', converters=converters)
10. Handling hierarchical columns:
For CSV files with multi-level column headers:
df = pd.read_csv('data.csv', header=[0, 1])
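The result has a MultiIndex on the columns, which you select with tuples. A short sketch (the 'Sales' and 'Q1' labels are hypothetical, standing in for your two header rows):

# With header=[0, 1], columns form a MultiIndex; select with a tuple
print(df.columns)           # MultiIndex of (level 0, level 1) pairs
print(df[('Sales', 'Q1')])  # hypothetical column from the two header rows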
These advanced features and customization options allow you to handle a wide variety of CSV file structures and improve the efficiency of your data loading process. By using these capabilities, you can tailor the read_csv() function to your specific needs and optimize your data analysis workflow.