correlation-analysis

aj-geddes/useful-ai-prompts · updated Apr 8, 2026

$npx skills add https://github.com/aj-geddes/useful-ai-prompts --skill correlation-analysis
summary

Correlation analysis measures the strength and direction of relationships between variables, helping identify which features are related and detect multicollinearity.

skill.md

Correlation Analysis

Overview

Correlation analysis measures the strength and direction of relationships between variables, helping identify which features are related and detect multicollinearity.

When to Use

  • Identifying relationships between numerical variables
  • Detecting multicollinearity before regression modeling
  • Exploratory data analysis to understand feature dependencies
  • Feature selection and dimensionality reduction
  • Validating assumptions about variable relationships
  • Comparing linear and non-linear associations
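As an illustration of the feature-selection use case above, the following is a minimal sketch that drops one column from each highly correlated pair. The helper name `drop_highly_correlated`, the 0.9 threshold, and the demo columns are assumptions chosen for this example, not part of the skill itself.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from each pair whose |Pearson r| exceeds threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    corr = df.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=300),  # near-duplicate of "a"
    "b": rng.normal(size=300),                        # independent of "a"
})

reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print("Dropped:", dropped)
print("Remaining:", list(reduced.columns))
```

Which column of a pair is kept depends on column order here; in practice you may prefer to keep the feature that is easier to interpret or more strongly related to the target.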

Correlation Types

  • Pearson: Linear correlation (continuous variables)
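  • Spearman: Rank-based correlation (ordinal data or monotonic, possibly non-linear, relationships)
  • Kendall: Rank correlation based on concordant and discordant pairs (robust alternative for small samples or many ties)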
  • Cramér's V: Association for categorical variables
  • Mutual Information: Non-linear dependencies
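Cramér's V appears in the list above but not in the implementation section, so here is a minimal sketch of how it could be computed from a contingency table. It assumes the uncorrected form V = sqrt(chi2 / (n * (min(rows, cols) - 1))); the function name `cramers_v` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series (no bias correction applied)."""
    table = pd.crosstab(x, y).to_numpy()
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Toy data (assumed for illustration)
group = pd.Series(np.repeat(["a", "b", "c"], 4))
same_label = group.str.upper()                 # fully determined by group
balanced = pd.Series(np.tile(["u", "v"], 6))   # evenly spread across groups

v_perfect = cramers_v(group, same_label)
v_indep = cramers_v(group, balanced)
print(f"Perfect association: V={v_perfect:.3f}")
print(f"No association:      V={v_indep:.3f}")
```

V ranges from 0 (no association) to 1 (one variable fully determines the other) and, unlike r, carries no sign.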

Key Concepts

  • Correlation Coefficient: Ranges from -1 to +1
  • Positive Correlation: Variables move together
  • Negative Correlation: Variables move oppositely
  • Multicollinearity: High correlations between predictors
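To make the coefficient's range and sign concrete, the sketch below computes Pearson's r directly from its definition (covariance divided by the product of standard deviations). The function name `pearson_r` and the sample arrays are assumptions for illustration; in practice `scipy.stats.pearsonr` or `DataFrame.corr` would normally be used.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from centered values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    # Sum of cross-products over the geometric mean of the sums of squares
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

x = np.arange(10.0)
r_pos = pearson_r(x, 3 * x + 1)    # variables move together
r_neg = pearson_r(x, -2 * x + 5)   # variables move oppositely

rng = np.random.default_rng(0)
u, v = rng.normal(size=100), rng.normal(size=100)
r_manual = pearson_r(u, v)
r_numpy = np.corrcoef(u, v)[0, 1]  # cross-check against NumPy

print(f"r_pos={r_pos:.2f}, r_neg={r_neg:.2f}")
print(f"manual={r_manual:.6f}, numpy={r_numpy:.6f}")
```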

Implementation with Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr, kendalltau

# Sample data
np.random.seed(42)
n = 200
age = np.random.uniform(20, 70, n)
income = age * 2000 + np.random.normal(0, 10000, n)
education_years = age / 2 + np.random.normal(0, 3, n)
satisfaction = income / 50000 + np.random.normal(0, 0.5, n)

df = pd.DataFrame({
    'age': age,
    'income': income,
    'education_years': education_years,
    'satisfaction': satisfaction,
    'years_employed': age - education_years - 6
})

# Pearson correlation (linear)
corr_matrix = df.corr(method='pearson')
print("Pearson Correlation Matrix:")
print(corr_matrix)

# Individual correlation with p-value
corr_coef, p_value = pearsonr(df['age'], df['income'])
print(f"\nPearson correlation (age vs income): r={corr_coef:.4f}, p-value={p_value:.4f}")

# Spearman correlation (rank-based)
spearman_matrix = df.corr(method='spearman')
print("\nSpearman Correlation Matrix:")
print(spearman_matrix)

spearman_coef, p_value = spearmanr(df['age'], df['income'])
print(f"Spearman correlation (age vs income): rho={spearman_coef:.4f}, p-value={p_value:.4f}")

# Kendall tau correlation
kendall_coef, p_value = kendalltau(df['age'], df['income'])
print(f"Kendall correlation (age vs income): tau={kendall_coef:.4f}, p-value={p_value:.4f}")

# Correlation heatmap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pearson heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, ax=axes[0], vmin=-1, vmax=1)
axes[0].set_title('Pearson Correlation Heatmap')

# Spearman heatmap
sns.heatmap(spearman_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, ax=axes[1], vmin=-1, vmax=1)
axes[1].set_title('Spearman Correlation Heatmap')

plt.tight_layout()
plt.show()

# Correlation with significance testing
from itertools import combinations

def correlation_with_pvalue(df):
    rows = []
    # Test each unordered pair of columns exactly once
    for col1, col2 in combinations(df.columns, 2):
        r, p = pearsonr(df[col1], df[col2])
        rows.append({
            'Variable 1': col1,
            'Variable 2': col2,
            'Correlation': r,
            'P-value': p,
            'Significant': 'Yes' if p < 0.05 else 'No'
        })
    return pd.DataFrame(rows)

corr_table = correlation_with_pvalue(df)
print("\nCorrelation with P-values:")
print(corr_table)

# Scatter plots with regression lines
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

pairs = [('age', 'income'), ('age', 'education_years'),
         ('income', 'satisfaction'), ('education_years', 'years_employed')]

for idx, (var1, var2) in enumerate(pairs):
    ax = axes[idx // 2, idx % 2]
    ax.scatter(df[var1], df[var2], alpha=0.5)

    # Add regression line
    z = np.polyfit(df[var1], df[var2], 1)
    p = np.poly1d(z)
    x_line = np.linspace(df[var1].min(), df[var1].max(), 100)
    ax.plot(x_line, p(x_line), "r--", linewidth=2)

    r, p_val = pearsonr(df[var1], df[var2])
    ax.set_title(f'{var1} vs {var2}\nr={r:.4f}, p={p_val:.4f}')
    ax.set_xlabel(var1)
    ax.set_ylabel(var2)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

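# Multicollinearity detection (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Add an intercept column so each auxiliary regression uses a centered R^2
X = add_constant(df[['age', 'education_years', 'years_employed']])
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns[1:]  # skip the 'const' column
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]

# Note: years_employed is defined as age - education_years - 6, an exact
# linear combination, so expect extremely large (or infinite) VIFs here.
print("\nVariance Inflation Factor (VIF):")
print(vif_data)
print("\nVIF > 10: High multicollinearity")
print("VIF > 5: Moderate multicollinearity")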

# Partial correlation (controlling for confounding)
def partial_correlation(df, x, y, control_vars):
    # Design matrix: intercept column plus the control variables
    Z = np.column_stack([np.ones(len(df)), df[control_vars].to_numpy()])
    x_vals = df[x].to_numpy()
    y_vals = df[y].to_numpy()

    # Residuals of x after regressing out the control variables (OLS)
    x_residuals = x_vals - Z @ np.linalg.lstsq(Z, x_vals, rcond=None)[0]

    # Residuals of y after regressing out the control variables (OLS)
    y_residuals = y_vals - Z @ np.linalg.lstsq(Z, y_vals, rcond=None)[0]

    # Correlation of the residuals is the partial correlation
    return pearsonr(x_residuals, y_residuals)[0]

partial_corr = partial_correlation(df, 'income', 'satisfaction', ['age'])
print(f"\nPartial correlation (income vs satisfaction, controlling for age): {partial_corr:.4f}")

# Distance correlation (non-linear relationships)
try:
    from dcor import distance_correlation
    dist_corr = distance_correlation(df['age'].to_numpy(), df['income'].to_numpy())
    print(f"Distance correlation (age vs income): {dist_corr:.4f}")
except ImportError:
    print("dcor library not installed for distance correlation")

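# Correlation stability over time
# Note: the rolling window runs over row order, so this is only meaningful
# when rows are sorted chronologically; the sample data has no time ordering.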
fig, ax = plt.subplots(figsize=(12, 5))

rolling_corr = df['age'].rolling(window=50).corr(df['income'])
ax.plot(rolling_corr.index, rolling_corr.values)
ax.set_title('Rolling Correlation (age vs income, window=50)')
ax.set_ylabel('Correlation Coefficient')
ax.grid(True, alpha=0.3)
plt.show()

Interpretation Guidelines

  • |r| < 0.3: Weak correlation
  • 0.3 ≤ |r| < 0.7: Moderate correlation
  • |r| ≥ 0.7: Strong correlation
  • p < 0.05: Statistically significant at the 5% level
  • High VIF (>10): Multicollinearity problem
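The VIF threshold above is easier to read once VIF is written as 1 / (1 - R²), where R² comes from regressing one predictor on the others; VIF > 10 then corresponds to R² > 0.9. The sketch below computes this directly with NumPy. The function name `vif_from_r2` and the synthetic predictors are assumptions for illustration, not the statsmodels implementation.

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    # With an intercept, residual variance / total variance = 1 - R^2
    r2 = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
for name, vif in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF={vif:.1f}")
```

Here `x1` and `x2` should show large VIFs because each is almost fully explained by the other, while `x3` should stay close to 1.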

Important Notes

  • Correlation ≠ Causation
  • Pearson can miss non-linear relationships
  • Outliers can distort correlations
  • Sample size affects significance: with large samples, even very weak correlations can reach p < 0.05
  • Temporal trends can create spurious correlations
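The last note can be demonstrated with two synthetic series that share nothing except an upward trend. This is a sketch under assumed slopes and noise levels; first-differencing is used here as one common way to remove a trend, not as the only option.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(500)

# Two series with independent noise; only the time trend is shared
series_a = 0.05 * t + rng.normal(size=500)
series_b = 0.03 * t + rng.normal(size=500)

# Correlation of the raw levels is dominated by the common trend
r_levels = np.corrcoef(series_a, series_b)[0, 1]

# Correlation of first differences removes the trend
r_diffs = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]

print(f"Levels:      r={r_levels:.3f}")
print(f"Differences: r={r_diffs:.3f}")
```

The levels correlate strongly even though the noise terms are independent, while the differenced series show little correlation.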

Visualization Strategies

  • Heatmaps for overview
  • Scatter plots for relationships
  • Pair plots for multivariate analysis
  • Rolling correlations for time-varying relationships

Deliverables

  • Correlation matrices (Pearson, Spearman)
  • Correlation heatmaps with annotations
  • Statistical significance table
  • Scatter plots with regression lines
  • Multicollinearity assessment (VIF)
  • Partial correlation analysis
  • Relationship interpretation report

