feature-engineering
aj-geddes/useful-ai-prompts · updated Apr 8, 2026
Feature Engineering
Overview
Feature engineering creates and transforms features to improve model performance, interpretability, and generalization through domain knowledge and mathematical transformations.
When to Use
- When you need to improve model performance beyond using raw features
- When dealing with categorical variables that need encoding for ML algorithms
- When features have different scales and require normalization
- When creating domain-specific features based on business knowledge
- When handling skewed distributions or non-linear relationships
- When preparing data for different types of ML algorithms with specific requirements
Engineering Techniques
- Encoding: Converting categorical to numerical
- Scaling: Normalizing feature ranges
- Polynomial Features: Higher-order terms
- Interactions: Combining features
- Domain-specific: Business-relevant transformations
- Temporal: Time-based features
Key Principles
- Create features based on domain knowledge
- Remove redundant features
- Scale features appropriately
- Handle categorical variables
- Create meaningful interactions
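One possible way to act on the "remove redundant features" principle is to drop one column from every highly correlated pair. The sketch below assumes Pearson correlation and a 0.95 cutoff; the helper name `drop_highly_correlated`, the threshold, and the demo columns are all illustrative choices, not part of the original skill.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 200)
demo = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,          # exact linear copy of height_cm
    "weight_kg": rng.normal(70, 12, 200),   # independent of height
})

reduced = drop_highly_correlated(demo)
print(list(reduced.columns))
```

Correlation only catches linear, pairwise redundancy, so treat the result as a first filter rather than a complete check.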
Implementation with Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures,
    OneHotEncoder, OrdinalEncoder, LabelEncoder
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import seaborn as sns
# Create sample dataset
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.uniform(18, 80, 1000),
    'income': np.random.uniform(20000, 150000, 1000),
    'experience_years': np.random.uniform(0, 50, 1000),
    'category': np.random.choice(['A', 'B', 'C'], 1000),
    'city': np.random.choice(['NYC', 'LA', 'Chicago'], 1000),
    'purchased': np.random.choice([0, 1], 1000),
})
print("Original Data:")
print(df.head())
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
# 1. Categorical Encoding
# One-Hot Encoding
print("\n1. One-Hot Encoding:")
df_ohe = pd.get_dummies(df, columns=['category', 'city'], drop_first=True)
print(df_ohe.head())
# Ordinal Encoding
print("\n2. Ordinal Encoding:")
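# By default the integers follow sorted category order; use this only when that order is meaningful
ordinal_encoder = OrdinalEncoder()
# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it to a single column
df['category_ordinal'] = ordinal_encoder.fit_transform(df[['category']]).ravel()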
print(df[['category', 'category_ordinal']].head())
# Label Encoding (scikit-learn intends LabelEncoder for target labels; shown on a feature here for illustration)
print("\n3. Label Encoding:")
le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])
print(df[['city', 'city_encoded']].head())
# 2. Feature Scaling
print("\n4. Feature Scaling:")
X = df[['age', 'income', 'experience_years']].copy()
# StandardScaler (mean=0, std=1)
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
# MinMaxScaler [0, 1]
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)
# RobustScaler (resistant to outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].hist(X['age'], bins=30, edgecolor='black')
axes[0, 0].set_title('Original Age')
axes[0, 1].hist(X_standard[:, 0], bins=30, edgecolor='black')
axes[0, 1].set_title('StandardScaler Age')
axes[1, 0].hist(X_minmax[:, 0], bins=30, edgecolor='black')
axes[1, 0].set_title('MinMaxScaler Age')
axes[1, 1].hist(X_robust[:, 0], bins=30, edgecolor='black')
axes[1, 1].set_title('RobustScaler Age')
plt.tight_layout()
plt.show()
# 3. Polynomial Features
print("\n5. Polynomial Features:")
X_simple = df[['age']].copy()
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_simple)
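# Take column names from the transformer so they stay correct if degree or inputs change
X_poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out())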
print(X_poly_df.head())
# Visualization
plt.figure(figsize=(12, 5))
plt.scatter(df['age'], df['income'], alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs Income')
plt.grid(True, alpha=0.3)
plt.show()
# 4. Feature Interactions
print("\n6. Feature Interactions:")
df['age_income_interaction'] = df['age'] * df['income'] / 10000
df['age_experience_ratio'] = df['age'] / (df['experience_years'] + 1)
print(df[['age', 'income', 'age_income_interaction', 'age_experience_ratio']].head())
# 5. Domain-specific Transformations
print("\n7. Domain-specific Features:")
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100],
                         labels=['Young', 'Middle', 'Senior', 'Retired'])
df['income_level'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])
df['log_income'] = np.log1p(df['income'])
df['sqrt_experience'] = np.sqrt(df['experience_years'])
print(df[['age', 'age_group', 'income', 'income_level', 'log_income']].head())
# 6. Temporal Features (if date data available)
print("\n8. Temporal Features:")
dates = pd.date_range('2023-01-01', periods=len(df))
df['date'] = dates
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5
print(df[['date', 'year', 'month', 'day_of_week', 'is_weekend']].head())
# 7. Feature Standardization Pipeline
print("\n9. Feature Engineering Pipeline:")
# Separate numerical and categorical features
numerical_features = ['age', 'income', 'experience_years']
categorical_features = ['category', 'city']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features),
    ]
)
X_processed = preprocessor.fit_transform(df[numerical_features + categorical_features])
print(f"Processed shape: {X_processed.shape}")
# 8. Feature Statistics
print("\n10. Feature Statistics:")
X_for_stats = df[numerical_features].copy()
X_for_stats['category_A'] = (df['category'] == 'A').astype(int)
X_for_stats['city_NYC'] = (df['city'] == 'NYC').astype(int)
feature_stats = pd.DataFrame({
    'Feature': X_for_stats.columns,
    'Mean': X_for_stats.mean(),
    'Std': X_for_stats.std(),
    'Min': X_for_stats.min(),
    'Max': X_for_stats.max(),
    'Skewness': X_for_stats.skew(),
    'Kurtosis': X_for_stats.kurtosis(),
})
print(feature_stats)
# 9. Feature Correlations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
X_numeric = df[numerical_features].copy()
X_numeric['purchased'] = df['purchased']
corr_matrix = X_numeric.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0])
axes[0].set_title('Feature Correlation Matrix')
# Distribution of engineered features
axes[1].hist(df['age_income_interaction'], bins=30, edgecolor='black', alpha=0.7)
axes[1].set_title('Age-Income Interaction Distribution')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# 10. Feature Binning / Discretization
print("\n11. Feature Binning:")
df['age_bin_equal'] = pd.cut(df['age'], bins=5)
df['age_bin_quantile'] = pd.qcut(df['age'], q=5)
df['income_bins'] = pd.cut(df['income'], bins=[0, 50000, 100000, 150000])
print("Equal Width Binning:")
print(df['age_bin_equal'].value_counts().sort_index())
print("\nEqual Frequency Binning:")
print(df['age_bin_quantile'].value_counts().sort_index())
# 11. Missing Value Creation and Handling
print("\n12. Missing Value Imputation:")
df_with_missing = df.copy()
missing_indices = np.random.choice(len(df), 50, replace=False)
df_with_missing.loc[missing_indices, 'age'] = np.nan
# Mean imputation
age_mean = df_with_missing['age'].mean()
df_with_missing['age_imputed_mean'] = df_with_missing['age'].fillna(age_mean)
# Median imputation
age_median = df_with_missing['age'].median()
df_with_missing['age_imputed_median'] = df_with_missing['age'].fillna(age_median)
# Forward fill
# Series.ffill() replaces fillna(method='ffill'), which is deprecated in pandas 2.x
df_with_missing['age_imputed_ffill'] = df_with_missing['age'].ffill()
print(df_with_missing[['age', 'age_imputed_mean', 'age_imputed_median']].head(10))
print("\nFeature Engineering Complete!")
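# The sample dataset started with these six columns; everything else was engineered
original_columns = ['age', 'income', 'experience_years', 'category', 'city', 'purchased']
print(f"Original features: {len(original_columns)}")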
print(f"Final features available: {len(df.columns)}")
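The script above imputes missing ages by hand. As a hedged alternative, the same step can live inside the preprocessing pipeline so that the fill value is learned at fit time and reused on new data. This is a minimal sketch under assumptions: the `sample` frame, the median strategy, and `handle_unknown="ignore"` are illustrative choices rather than settings taken from the original.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
sample = pd.DataFrame({
    "age": rng.uniform(18, 80, 200),
    "income": rng.uniform(20000, 150000, 200),
    "city": rng.choice(["NYC", "LA", "Chicago"], 200),
})
sample.loc[sample.index[:20], "age"] = np.nan  # inject missing values

# Numeric columns: fill gaps with the fitted median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(sample)
print(features.shape)

# A new row with a missing age and a city not seen during fit
new_row = pd.DataFrame({"age": [np.nan], "income": [50000.0], "city": ["Boston"]})
encoded = preprocess.transform(new_row)
print(encoded[0, 2:])  # one-hot columns are all zero for the unseen city
```

Keeping imputation inside the pipeline means the median is computed once, from whatever data the pipeline was fitted on, instead of being recomputed on each new batch.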
Best Practices
- Understand your domain before engineering features
- Create features that are interpretable
- Avoid data leakage (for example, using future information or fitting transformations on test data)
- Test feature importance after engineering
- Document all transformations
- Use appropriate scaling for different algorithms
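To make the data-leakage point concrete, one common pattern is to split first and fit every transformation on the training rows only. The sketch below assumes synthetic data and a logistic regression model purely for illustration; none of the names come from the original script.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(500, 3))
y = (X[:, 0] + rng.normal(0, 5, 500) > 50).astype(int)

# Split before any fitting so test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)  # scaler statistics come from X_train only

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")

# The fitted scaler's means match the training split, not the full dataset
scaler_means = model.named_steps["scale"].mean_
print(np.allclose(scaler_means, X_train.mean(axis=0)))
```

Calling `fit_transform` on the full dataset before splitting, as a standalone demo script might, would let test-set statistics leak into the scaled training features.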
Common Transformations
- Log Transform: For skewed distributions
- Polynomial Features: For non-linear relationships
- Interaction Terms: For combined effects
- Binning: For converting a continuous variable into ordered categories
- Normalization: For putting features on comparable scales
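The effect of a log transform on skew can be checked numerically. The synthetic `income` column in the script is uniform, so this sketch instead assumes a log-normal series, which is right-skewed by construction; the parameters are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical right-skewed income values (log-normal by construction)
skewed_income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=5000))

# log1p(x) = log(1 + x), which is also safe when x contains zeros
log_income = np.log1p(skewed_income)

print(f"skew before: {skewed_income.skew():.2f}")
print(f"skew after:  {log_income.skew():.2f}")
```

A skewness near zero after the transform suggests the log scale suits the feature; if the raw skewness is already small, the transform is unlikely to help.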
Deliverables
- Engineered feature dataset
- Feature transformation documentation
- Correlation analysis of new features
- Distribution comparisons (before/after)
- Feature importance rankings
- Preprocessing pipeline code
- Data dictionary with feature descriptions
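For the feature importance rankings deliverable, one possible approach is to rank features by a tree ensemble's impurity-based importances. The following is a sketch on made-up data where only the `signal` column drives the target; the column names and the random forest choice are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "signal": rng.normal(size=600),   # determines the target
    "noise_a": rng.normal(size=600),  # unrelated to the target
    "noise_b": rng.normal(size=600),  # unrelated to the target
})
target = (data["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data, target)

# Impurity-based importances sum to 1 across features
ranking = (
    pd.Series(forest.feature_importances_, index=data.columns)
    .sort_values(ascending=False)
)
print(ranking)
```

Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a reasonable cross-check.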