Data Analytics: Using Pandas, Polars, and PySpark
Pandas, Polars, and PySpark are the leading Python libraries for data processing, each excelling in different scenarios depending on dataset size, performance needs, and computational resources. Often the best approach is a hybrid one that combines these tools to leverage their respective strengths.
The challenges of big data are often summarized by the 4 V's: Volume, Velocity, Variety, and Veracity. These describe its key characteristics: its large size, the speed at which it is generated and processed, the many different types of data it includes, and its trustworthiness or accuracy. Some models add a fifth V, Value, which represents the importance of deriving useful information from the data.
- Volume: The enormous amount of data that is generated and collected every second.
- Velocity: The speed at which new data is created, gathered, and processed.
- Variety: The wide range of data types, which can be structured (like a database), semi-structured, or unstructured (like text or images).
- Veracity: The degree of data accuracy and trustworthiness, which can be a challenge to ensure for a large, diverse dataset.
- Value: The importance of deriving useful information from the data.
Here is a breakdown of when to use which library:
| Feature | Pandas | Polars | PySpark |
|---|---|---|---|
| Dataset Size | Small to Medium (<10GB, fits in memory) | Medium to Large (GBs to 100GBs, single machine) | Massive (100GB to PBs, distributed) |
| Execution | Eager, single-threaded | Eager/Lazy, multi-threaded | Lazy, distributed |
| Performance | Good for small data, struggles with scale | Very fast on single machine, memory efficient | Scalable and fault-tolerant for big data |
| Complexity | Simple, intuitive API, low learning curve | Pythonic API, moderate learning curve | Complex setup and management |
| Primary Use | EDA, prototyping, ML integration | Performance-critical single-machine tasks | Enterprise ETL, large-scale ML, streaming |
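The table's size thresholds can be turned into a rough rule of thumb in code. The sketch below is illustrative only: the function name `pick_engine` and the gigabyte cutoffs (taken from the table above) are assumptions, not a standard API, and real decisions should also weigh available RAM and cluster access.

```python
import os

def pick_engine(path: str) -> str:
    """Pick a processing library from file size alone (rule of thumb only)."""
    size_gb = os.path.getsize(path) / 1e9
    if size_gb < 10:
        return "pandas"    # fits comfortably in RAM on one machine
    elif size_gb < 100:
        return "polars"    # single machine, multi-threaded, memory efficient
    else:
        return "pyspark"   # distributed processing required
```

In practice the boundaries are fuzzy: Polars handles small files well too, and a 50 GB file may still need PySpark if the machine has little memory.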
1. Pandas:
Best for: Exploratory Data Analysis (EDA), quick analysis, and prototyping on small to medium
datasets that fit comfortably in your computer's RAM (typically under 10GB). It integrates
seamlessly with popular machine learning libraries like scikit-learn.
- While powerful, its single-threaded nature and in-memory operations make it a bottleneck for larger datasets.
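One common mitigation for the memory bottleneck is chunked reading: pandas' `chunksize` parameter yields the file in fixed-size pieces so peak memory stays bounded. A minimal sketch, using a small inline CSV as a stand-in for a file too large to load at once:

```python
import io
import pandas as pd

# A small inline CSV stands in for a file too large to load whole.
csv_data = io.StringIO("city,salary\nNY,70000\nLA,90000\nNY,55000\nLA,120000\n")

total = 0.0
count = 0
# chunksize yields DataFrames of at most N rows, bounding peak memory use.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['salary'].sum()
    count += len(chunk)

print(total / count)  # mean salary computed without loading the whole file
```

This streaming pattern works for aggregations that decompose over chunks (sums, counts, min/max); operations needing the whole dataset at once, like a global sort, still require another tool.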
import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, np.nan, 22, 28, 45, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'New York', 'Los Angeles', np.nan],
    'Salary': [70000, 90000, 60000, 55000, 80000, 120000, 90000]
}
df = pd.DataFrame(data)
Example (Pandas): the most frequently used Pandas functions, complete with practical examples using the DataFrame named df defined above.
A simple example covering the 16 most-used Pandas functions can be seen in my Colab Pandas Notebook, available through GitHub.
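As a taste of what that notebook covers, here is a short sketch of a few of the most common operations, applied to the same df defined above (a subset, not all 16):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, np.nan, 22, 28, 45, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'New York', 'Los Angeles', np.nan],
    'Salary': [70000, 90000, 60000, 55000, 80000, 120000, 90000]
})

print(df.head(3))                       # first rows
df.info()                               # dtypes and non-null counts
df_clean = df.dropna(subset=['City'])   # drop rows with missing City
df_filled = df.fillna({'Age': df['Age'].mean()})  # impute missing Age
by_city = df.groupby('City')['Salary'].mean()     # average salary per city
top = df.sort_values('Salary', ascending=False).head(3)  # top earners
```

Note that `groupby` silently drops the row with a missing City (Grace), which is why cleaning with `dropna`/`fillna` usually comes first.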
2. Polars:
Polars is a library for data manipulation, built around an OLAP query engine implemented in Rust that uses the Apache Arrow columnar format as its memory model. Although the core is written in Rust, Polars exposes Python, Node.js, R, and SQL API interfaces.
The daily_weather.parquet file used for this example contains 27.6 million records.
Performance:
Pandas read_parquet time (avg over 10 runs): 66.4953 seconds
Polars eager read_parquet time (avg over 10 runs): 33.3602 seconds
Polars lazy scan_parquet time (avg over 10 runs): 32.3216 seconds
import pandas as pd
import polars as pl
import numpy as np
import timeit
import os
# --- 1. Create a sample Parquet file ---
file_path = '/content/drive/MyDrive/Colab Notebooks/daily_weather.parquet'
if not os.path.exists(file_path):
    # Create a large DataFrame for a meaningful comparison
    data = {
        'id': np.arange(1000000),
        'value': np.random.rand(1000000),
        'category': np.random.choice(['A', 'B', 'C'], size=1000000)
    }
    df_create = pd.DataFrame(data)
    df_create.to_parquet(file_path, index=False)
    print(f"Created a sample parquet file: {file_path}\n")

# --- 2. Define functions for timeit ---
def read_pandas():
    """Read the parquet file using pandas."""
    # Use the pyarrow engine for better performance and compatibility
    return pd.read_parquet(file_path, engine='pyarrow')

def read_polars_eager():
    """Read the parquet file using Polars (eagerly)."""
    return pl.read_parquet(file_path)

def read_polars_lazy():
    """Read the parquet file using Polars (lazily, then collect)."""
    return pl.scan_parquet(file_path).collect()

# --- 3. Time the operations ---
# The timeit module runs each function multiple times and reports the time.
pandas_time = timeit.timeit(read_pandas, number=10)  # Run 10 times
print(f"Pandas read_parquet time (avg over 10 runs): {pandas_time:.4f} seconds")

polars_eager_time = timeit.timeit(read_polars_eager, number=10)
print(f"Polars eager read_parquet time (avg over 10 runs): {polars_eager_time:.4f} seconds")

polars_lazy_time = timeit.timeit(read_polars_lazy, number=10)
print(f"Polars lazy scan_parquet time (avg over 10 runs): {polars_lazy_time:.4f} seconds")

# --- 4. Clean up (optional) ---
# os.remove(file_path)
# print(f"\nRemoved the sample file: {file_path}")
A simple example covering the 14 most-used Polars functions can be seen in my Colab Polars Notebook, available through GitHub.
3. PySpark
PySpark is a Python API for Apache Spark, a distributed computing framework for big data processing.
It allows users to write Spark applications using Python, leveraging Spark's capabilities for
parallel processing, fault tolerance, and in-memory computation.
Key Components and Concepts:
- SparkSession: The entry point to programming Spark with the Dataset and DataFrame API. It provides a unified entry point for Spark functionality.
- DataFrame: A distributed collection of data organized into named columns, similar to a table in a relational database or a DataFrame in Pandas. It provides a higher-level abstraction than RDDs and offers numerous advantages for data processing and analysis.
- RDD (Resilient Distributed Dataset): The fundamental data structure in Spark, representing an immutable, fault-tolerant, and distributed collection of objects that can be processed in parallel. While DataFrames are generally preferred for structured data, RDDs are still useful for lower-level control and unstructured data.
- Transformations: Operations that create a new DataFrame or RDD from an existing one without immediately computing the result (e.g., select, filter, groupBy).
- Actions: Operations that trigger the execution of transformations and return a result to the driver program or write data to an external storage (e.g., show, count, collect, write).
import findspark
findspark.init() # Initializes findspark to locate Spark installation
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySparkJupyterExample") \
    .master("local[*]") \
    .getOrCreate()
print("SparkSession created successfully!")
Example (PySpark): the most frequently used PySpark functions, complete with practical examples using a hypothetical DataFrame named df.
Simple examples of the most-used PySpark functions can be seen in my Colab PySpark Notebook, available through GitHub.