Machine Learning Engineer
Purpose
Provides MLOps and production ML engineering expertise specializing in end-to-end ML pipelines, model deployment, and infrastructure automation. Bridges data science and production engineering with robust, scalable machine learning systems.
When to Use
- Building end-to-end ML pipelines (Data → Train → Validate → Deploy)
- Deploying models to production (Real-time API, Batch, or Edge)
- Implementing MLOps practices (CI/CD for ML, Experiment Tracking)
- Optimizing model performance (Latency, Throughput, Resource usage)
- Setting up feature stores and model registries
- Implementing model monitoring (Drift detection, Performance tracking)
- Scaling training workloads (Distributed training)
2. Decision Framework
Model Serving Strategy
Need to serve predictions?
│
├─ Real-time (Low Latency)?
│  │
│  ├─ High Throughput? → **Kubernetes (KServe/Seldon)**
│  ├─ Low/Medium Traffic? → **Serverless (Lambda/Cloud Run)**
│  └─ Ultra-low latency (<10ms)? → **C++/Rust Inference Server (Triton)**
│
├─ Batch Processing?
│  │
│  ├─ Large Scale? → **Spark / Ray**
│  └─ Scheduled Jobs? → **Airflow / Prefect**
│
└─ Edge / Client-side?
   │
   ├─ Mobile? → **TFLite / CoreML**
   └─ Browser? → **TensorFlow.js / ONNX Runtime Web**
Training Infrastructure
Training Environment?
│
├─ Single Node?
│  │
│  ├─ Interactive? → **JupyterHub / SageMaker Notebooks**
│  └─ Automated? → **Docker Container on VM**
│
└─ Distributed?
   │
   ├─ Data Parallelism? → **Ray Train / PyTorch DDP**
   └─ Pipeline orchestration? → **Kubeflow / Airflow / Vertex AI**
Feature Store Decision
| Need | Recommendation | Rationale |
|---|---|---|
| Simple / MVP | No Feature Store | Use SQL/Parquet files. Overhead of FS is too high. |
| Team Consistency | Feast | Open source, manages online/offline consistency. |
| Enterprise / Managed | Tecton / Hopsworks | Full governance, lineage, managed SLA. |
| Cloud Native | Vertex/SageMaker FS | Tight integration if already in that cloud ecosystem. |
Red Flags → Escalate to oracle:
- "Real-time" training requirements (online learning) without massive infrastructure budget
- Deploying LLMs (7B+ params) on CPU-only infrastructure
- Training on PII/PHI data without privacy-preserving techniques (Federated Learning, Differential Privacy)
- No validation set or "ground truth" feedback loop mechanism
3. Core Workflows
Workflow 1: End-to-End Training Pipeline
Goal: Automate model training, validation, and registration using MLflow.
Steps:
- Setup Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-prediction-prod")
- Training Script (train.py)
# Continues from the setup above, which provides the mlflow and sklearn imports.
from mlflow.models.signature import infer_signature

# X_train, X_test, y_train, y_test are assumed to be loaded earlier in the script.
def train(max_depth, n_estimators):
    with mlflow.start_run():
        # Record hyperparameters so the run can be reproduced.
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("n_estimators", n_estimators)

        model = RandomForestClassifier(
            max_depth=max_depth,
            n_estimators=n_estimators,
            random_state=42
        )
        model.fit(X_train, y_train)

        # Evaluate on the held-out set and log the metrics.
        preds = model.predict(X_test)
        acc = accuracy_score(y_test, preds)
        prec = precision_score(y_test, preds)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision", prec)

        # Log the model with its input/output schema and register it.
        signature = infer_signature(X_train, preds)
        mlflow.sklearn.log_model(
            model,
            "model",
            signature=signature,
            registered_model_name="churn-model"
        )
        print(f"Run ID: {mlflow.active_run().info.run_id}")

if __name__ == "__main__":
    train(max_depth=5, n_estimators=100)
- Pipeline Orchestration (Bash/Airflow)
#!/bin/bash
python train.py
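The goal of this workflow mentions validation, yet the steps above train and register in a single pass. A minimal sketch of a promotion gate follows, assuming the metrics are available as a plain dict. The names `MIN_THRESHOLDS` and `passes_validation` and the threshold values are hypothetical and are not part of MLflow.

```python
# Hypothetical thresholds; tune them to the business requirement.
MIN_THRESHOLDS = {"accuracy": 0.80, "precision": 0.75}


def passes_validation(metrics, thresholds=MIN_THRESHOLDS):
    """Return (ok, failures); failures lists every metric below its minimum."""
    failures = [
        f"{name}={metrics.get(name)} < {minimum}"
        for name, minimum in thresholds.items()
        # A missing metric counts as a failure rather than a silent pass.
        if metrics.get(name) is None or metrics[name] < minimum
    ]
    return (not failures, failures)


ok, failures = passes_validation({"accuracy": 0.86, "precision": 0.71})
print(ok, failures)
```

One way to use this in `train()` is to run the check before `log_model` and pass `registered_model_name` only when `ok` is true, so that a weak run is still tracked but never registered.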
Workflow 3: Drift Detection (Monitoring)
Goal: Detect if production data distribution has shifted from training data.
Steps:
- Baseline Generation (During Training)
import evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=test_df)
report.save_json("baseline_drift.json")
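- Production Monitoring Job
# load_production_logs, load_training_data, trigger_alert, and trigger_retraining
# are placeholders for project-specific helpers.
def check_drift():
    current_data = load_production_logs()
    reference_data = load_training_data()

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_data, current_data=current_data)

    # The first metric in the preset carries the dataset-level drift flag.
    result = report.as_dict()
    dataset_drift = result['metrics'][0]['result']['dataset_drift']

    if dataset_drift:
        trigger_alert("Data Drift Detected!")
        trigger_retraining()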
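To make "distribution shift" concrete, here is a minimal stdlib-only sketch of the Population Stability Index (PSI) for a single numeric feature. It is a stand-in illustration and not necessarily the test Evidently applies. The function name `psi`, the bin count, the epsilon, and the sample data are all assumptions.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty numeric samples.

    Bin edges are equal-width over the reference range; values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(100)]       # spread across [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # squeezed into [0.5, 1)

print(f"no shift: {psi(reference, reference):.4f}")
print(f"shifted:  {psi(reference, shifted):.4f}")
```

A PSI above roughly 0.2 is often read as significant shift, but that cutoff is a rule of thumb and should be calibrated per feature before it drives alerts or retraining.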
Workflow 5: RAG Pipeline with Vector Database
Goal: Build a production retrieval pipeline using Pinecone/Weaviate and LangChain.
Steps:
- Ingestion (Chunking & Embedding)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(raw_documents)

embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore.from_documents(
    docs,
    embeddings,
    index_name="knowledge-base"
)
- Retrieval & Generation
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)

response = qa_chain.invoke("How do I reset my password?")
print(response['result'])
- Optimization (Hybrid Search)
  - Combine Dense Retrieval (Vectors) with Sparse Retrieval (BM25/Keywords).
  - Use Reranking (Cohere/Cross-Encoder) on the top 20 results to select the best 5.
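One common way to combine the dense and sparse result lists before reranking is Reciprocal Rank Fusion (RRF). The text above does not name a fusion method, so treat this as an illustrative choice. The function name `rrf_fuse`, the document IDs, and `k=60` (a conventional default) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=20):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Returns the top_n IDs, best first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical outputs of a vector search and a BM25 search.
dense_hits = ["doc_a", "doc_b", "doc_c", "doc_d"]
sparse_hits = ["doc_c", "doc_a", "doc_e"]

candidates = rrf_fuse([dense_hits, sparse_hits], top_n=20)
print(candidates)
```

The fused candidates would then be passed to the reranker, which keeps the best 5. RRF uses only rank positions, so it sidesteps the problem that vector similarity and BM25 scores live on different scales.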
5. Anti-Patterns & Gotchas
❌ Anti-Pattern 1: Training-Serving Skew
What it looks like:
- Feature logic implemented in SQL for training, but re-implemented in Java/Python for serving.
- "Mean imputation" value calculated on training set but not saved; serving uses a different default.
Why it fails:
- Model behaves unpredictably in production.
- Debugging is extremely difficult.
Correct approach:
- Use a Feature Store or shared library for transformations.
- Wrap preprocessing logic inside the model artifact (e.g., Scikit-Learn Pipeline, TensorFlow Transform).
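A minimal stdlib-only sketch of the fix for the imputation example: the mean is fitted once on training data, saved next to the model, and loaded unchanged at serving time. The class name `MeanImputer`, the file name, and the sample values are hypothetical. With scikit-learn, a `Pipeline` that contains a `SimpleImputer` serves the same purpose.

```python
import json
import os
import tempfile


class MeanImputer:
    """Fills missing values (None) with a mean learned from training data."""

    def __init__(self, mean=None):
        self.mean = mean

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean = sum(observed) / len(observed)
        return self

    def transform(self, values):
        return [self.mean if v is None else v for v in values]

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"mean": self.mean}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(mean=json.load(f)["mean"])


# Training side: fit on training data and persist the fitted state.
train_ages = [30, None, 50, 40]
path = os.path.join(tempfile.mkdtemp(), "imputer.json")
MeanImputer().fit(train_ages).save(path)

# Serving side: load the same state instead of hard-coding a default.
serving_imputer = MeanImputer.load(path)
print(serving_imputer.transform([None, 25]))
```

Because the serving code never computes or guesses the fill value itself, the two environments cannot drift apart on this feature.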
❌ Anti-Pattern 2: Manual Deployments
What it looks like:
- Data Scientist emails a `.pkl` file to an engineer.
- Engineer manually copies it to a server and restarts the Flask app.
Why it fails:
- No version control.
- No reproducibility.
- High risk of human error.
Correct approach:
- CI/CD Pipeline: Git push triggers b