In Business Intelligence (BI), SQL is used to extract and manipulate data from databases, while Python adds flexibility for data processing, visualization, and automation. Combining both enables you to build powerful, automated BI pipelines and dashboards.
What You Need to Get Started
Before diving in, ensure you have the following installed:
- Python 3.x
- pip (Python package manager)
- A SQL database (e.g., PostgreSQL, MySQL, SQLite, or Microsoft SQL Server)
Recommended Python Libraries:
pip install pandas sqlalchemy matplotlib seaborn
pip install pyspark
- pandas: Data manipulation
- SQLAlchemy: Connect to SQL databases
- matplotlib/seaborn: Data visualization
- PySpark: Distributed data processing and Spark SQL
1. Connecting Python to Relational Databases
Here’s how to connect to a database using SQLAlchemy (for most DBs) or SQLite (for quick prototyping).
Example 1: SQLite
import sqlite3
# Create a connection to the SQLite database file
con = sqlite3.connect('example.db')
cur = con.cursor()
# The result of cursor.execute can be iterated over row by row
for row in cur.execute('SELECT * FROM tablename;'):
    print(row)
# Be sure to close the connection
con.close()
Alternatively, use Pandas and SQLite
import sqlite3
import pandas as pd
# Create connection to SQLite
con = sqlite3.connect('example.db')
# Read data from SQL into Pandas
df = pd.read_sql_query("SELECT * FROM tablename", con)
print(df.head())
con.close()
Example 2: PostgreSQL (or MySQL)
import pandas as pd
import sqlalchemy as db
# Format: dialect+driver://username:password@host:port/database
# For MySQL, use e.g. 'mysql+pymysql://user:password@localhost:3306/mydatabase'
engine = db.create_engine('postgresql://user:password@localhost:5432/mydatabase')
df = pd.read_sql_query('SELECT * FROM tablename', engine)
print(df.head())
Example 3: MS SQL
import pyodbc
import pandas as pd
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=your_db_server,your_db_server_port;DATABASE=your_db_name;UID=your_db_username;PWD=your_db_password')
query = "SELECT * FROM database.tablename;"
df = pd.read_sql(query, cnxn)
print(df)
You can also use SQLAlchemy to make the connection:
import pandas as pd
import sqlalchemy as sql
import pyodbc
server = '100.10.10.10'
driver = 'ODBC+Driver+17+for+SQL+Server'  # URL-encoded ODBC driver name
db = 'master'
# Format: mssql+pyodbc://server/database?driver=...
engine = sql.create_engine('mssql+pyodbc://{}/{}?driver={}'.format(server, db, driver))
df = pd.read_sql_query('SELECT * FROM tablename', engine)
2. Using PySpark for Spark SQL
Ideal for big data workloads, whether on distributed clusters (e.g., Databricks, AWS EMR) or a local Spark installation.
from pyspark.sql import SparkSession
# Start Spark session
spark = (
    SparkSession.builder
    .appName("BI with Spark SQL")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
# Read CSV or Parquet
df_spark = spark.read.csv("sales.csv", header=True, inferSchema=True)
df_spark.createOrReplaceTempView("sales")
# Run SQL
result = spark.sql("SELECT region, SUM(sales) as total_sales FROM sales GROUP BY region")
result.show()
spark.stop()
3. Common BI Use Cases (SQL + Python + Spark)
Let’s walk through real-world BI tasks using both SQLAlchemy + pandas and Spark SQL.
Use Case 1: Sales Trend Analysis
SQLAlchemy Version
# Same aggregation as the Spark SQL version below
query = """
SELECT order_date, SUM(sales) AS total_sales
FROM sales
GROUP BY order_date
ORDER BY order_date
"""
df = pd.read_sql_query(query, engine)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(df['order_date'], df['total_sales'], marker='o')
plt.title('Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.grid(True)
plt.show()
Spark SQL Version
query = """
SELECT order_date, SUM(sales) AS total_sales
FROM sales
GROUP BY order_date
ORDER BY order_date
"""
result = spark.sql(query)
result.show()
Tip: For Spark, use .toPandas() to convert results to pandas for visualization:
result.toPandas().plot(x='order_date', y='total_sales')
Use Case 2: Top Products by Revenue
Python + SQLAlchemy:
# Example query; assumes a sales table with product_name and revenue columns
query = """
SELECT product_name, SUM(revenue) AS revenue
FROM sales
GROUP BY product_name
ORDER BY revenue DESC
LIMIT 10
"""
df = pd.read_sql_query(query, engine)
import seaborn as sns
sns.barplot(data=df, x='revenue', y='product_name')
plt.title("Top 10 Products by Revenue")
plt.xlabel("Revenue")
plt.ylabel("Product")
plt.show()
Spark SQL (reusing the same query string):
spark.sql(query).toPandas().plot(kind='barh', x='product_name', y='revenue')
Use Case 3: Churn Labeling
Spark SQL:
query = """
SELECT customer_id,
MAX(order_date) AS last_order,
CASE WHEN MAX(order_date) < DATE_SUB(CURRENT_DATE(), 180) THEN 1 ELSE 0 END AS churn
FROM sales
GROUP BY customer_id
"""
spark.sql(query).show()
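For smaller datasets, the same churn label can be computed in pandas. This is a minimal sketch, assuming the sales table has already been loaded into a DataFrame df with a customer_id column and a datetime order_date column:
import pandas as pd
# Latest order per customer (assumes df holds the sales table)
last_order = df.groupby('customer_id')['order_date'].max().reset_index(name='last_order')
# Flag customers with no orders in the last 180 days as churned
cutoff = pd.Timestamp.today() - pd.Timedelta(days=180)
last_order['churn'] = (last_order['last_order'] < cutoff).astype(int)
print(last_order.head())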
Bonus: Automating BI Pipelines
You can automate Spark + SQL pipelines using:
- Apache Airflow for scheduling (see the sketch after the export example below)
- Databricks Jobs or EMR Steps
- Saving results to Delta/Parquet for dashboards
- Emailing PDF/Excel reports with Python
result.toPandas().to_excel("daily_summary.xlsx", index=False)
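For example, a daily Airflow job could wrap the query-and-export steps above. This is a minimal sketch assuming Airflow 2.x; the DAG id, task id, and run_daily_summary callable are hypothetical placeholders, not part of any library:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_daily_summary():
    # Placeholder: run the query and write daily_summary.xlsx as shown above
    pass

with DAG(
    dag_id="daily_bi_summary",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",         # run once per day
    catchup=False,
) as dag:
    summary_task = PythonOperator(
        task_id="build_daily_summary",  # hypothetical task name
        python_callable=run_daily_summary,
    )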
Final Thoughts
Using Python + SQL + Spark SQL gives you:
✅ Traditional database querying (SQL)
✅ Powerful processing at scale (Spark SQL)
✅ Rich analysis, visualization, and automation (Python)
This tech stack is essential for modern BI analysts working with growing data volumes who need deeper insights.