Open to Opportunities

Krishna Sathvik

Senior Data Engineer · AI/ML & RAG Builder

I'm Krishna Sathvik Mantripragada, a Senior Data Engineer who loves turning complex, messy data into fast, reliable, and usable products. I design and build cloud data platforms, scalable ELT pipelines, and real-time streaming systems using Databricks, PySpark, Kafka, and Azure.

Beyond data engineering, I explore the intersection of AI, GenAI, and RAG — building production-ready applications using LangChain, vector databases, and LLM APIs. I also ship full-stack apps end-to-end, from backend APIs to polished frontend experiences.

I build real, production-grade data and AI systems — from streaming pipelines and analytics foundations to RAG chatbots and full-stack web applications.

Azure Databricks Python Apache Kafka Apache Spark Snowflake dbt Power BI Machine Learning SQL Apache Airflow Tableau AWS Delta Lake Apache Flink Scala LangChain OpenAI HuggingFace MLflow Vector Databases Feature Store Azure Data Factory PostgreSQL Docker Azure DevOps RAG LLM PySpark Java

Featured Projects

Production-Ready
TrailVerse: National Parks Explorer
AI-Powered Platform

TrailVerse

Role: Founder · Full-Stack Developer · Data Engineer

TrailVerse is an AI-powered national parks exploration and trip planning platform for 470+ U.S. park units. It unifies NPS data, interactive maps, real-time weather, events, reviews, and dual LLM trip planning (OpenAI + Claude) into a single production-ready experience.

React 18.3 Node.js MongoDB OpenAI GPT-4
View Live Site
Job Tracking App

ApplyTrak - Enterprise Job Application Tracker

Role: Full-Stack Developer · Automation Engineer

ApplyTrak is a production-ready job application tracking platform that helps modern job seekers manage unlimited applications, goals, and analytics with real-time sync across devices. Built with React, TypeScript, Supabase, and Tailwind, it includes achievements, rich analytics, and a local-to-cloud migration system.

React 19 TypeScript Supabase
View Live App

Other Projects

LLM Engineer

RAG Chatbot - Advanced Interview Preparation Assistant

Role: LLM Engineer · RAG Developer

A dual-persona Retrieval-Augmented Generation (RAG) chatbot for interview preparation that serves 557+ curated knowledge chunks through a FastAPI backend and a modern React frontend. It routes questions across AI/ML, Data Engineering, BI, and Analytics Engineering profiles to deliver structured, interview-ready answers.

React 19 FastAPI RAG
GitHub
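
As a sketch of the retrieval step: the snippet below routes a question to one persona's chunk pool and ranks chunks by similarity. It uses a toy bag-of-words cosine in place of the project's real embeddings and vector store, and the chunks themselves are invented for illustration.

```python
from collections import Counter
import math

# Toy stand-in for the chatbot's retrieval step: a handful of knowledge
# chunks tagged by persona, scored with bag-of-words cosine similarity.
# (The real project uses embeddings and a vector store; all data here is illustrative.)
CHUNKS = [
    {"persona": "data_engineering", "text": "spark partitions shuffle joins tuning"},
    {"persona": "data_engineering", "text": "kafka topics consumer groups offsets"},
    {"persona": "ai_ml", "text": "rag retrieval embeddings vector search"},
    {"persona": "ai_ml", "text": "llm prompt engineering context window"},
]

def _vec(text):
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, persona, k=2):
    """Return the top-k chunks for the routed persona."""
    q = _vec(question)
    pool = [c for c in CHUNKS if c["persona"] == persona]
    return sorted(pool, key=lambda c: _cosine(q, _vec(c["text"])), reverse=True)[:k]

top = retrieve("how does kafka manage consumer offsets", "data_engineering", k=1)
print(top[0]["text"])  # → kafka topics consumer groups offsets
```

Routing first and then retrieving keeps each persona's answers grounded in its own knowledge pool instead of mixing profiles.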
Data Engineer

Real-time Fraud Detection

Role: Data Engineer · ML Engineer (Real-time Streaming · Anomaly Detection)

ML-powered fraud detection pipeline processing millions of transactions with sub-second latency. Built with Kafka for real-time event streaming, Spark for distributed processing, and machine learning models for anomaly detection and fraud classification.

Python Kafka Spark
GitHub
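
The anomaly-scoring idea can be sketched in a few lines. This is a hypothetical rolling z-score check standing in for the pipeline's ML models; the window size and threshold are illustrative, not the project's actual values.

```python
from collections import deque
import statistics

# Minimal per-account anomaly check of the kind a streaming fraud job might
# run: flag a transaction whose amount sits far outside the account's recent
# history. Window size and z-threshold are illustrative assumptions.
class RollingAnomalyDetector:
    def __init__(self, window=50, z_threshold=3.0):
        self.window = deque(maxlen=window)
        self.z_threshold = z_threshold

    def score(self, amount):
        """Return (z_score, is_anomaly) for one transaction amount."""
        if len(self.window) < 5:          # not enough history yet
            self.window.append(amount)
            return 0.0, False
        mean = statistics.fmean(self.window)
        stdev = statistics.pstdev(self.window) or 1e-9
        z = abs(amount - mean) / stdev
        self.window.append(amount)
        return z, z > self.z_threshold

detector = RollingAnomalyDetector()
for amt in [20, 22, 19, 21, 20, 23, 18, 22]:   # typical card activity
    detector.score(amt)
z, flagged = detector.score(5000)               # wildly out-of-pattern amount
print(flagged)  # → True
```

In a real Kafka + Spark deployment this state would be keyed per account and checkpointed, so sub-second scoring survives restarts.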
Data Engineer

Finance Tracker Pipeline

Role: Data Engineer · ETL Developer (Python · Pandas · SQLite)

A personal finance tracking pipeline that ingests CSV transaction data, cleans and categorizes expenses, stores them in SQLite, and exposes interactive summaries through a Streamlit dashboard. It generates monthly breakdowns, category views, and savings trends from raw bank exports.

Python pandas SQLite
GitHub
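
The ingest, categorize, store, and summarize flow condenses to a few lines of stdlib Python. The category rules and sample CSV below are made up; the real project reads bank exports and renders the results in Streamlit.

```python
import csv, io, sqlite3

# Condensed ingest -> categorize -> store -> summarize sketch.
# RULES and SAMPLE are invented for illustration.
RULES = {"WHOLEFDS": "groceries", "NETFLIX": "subscriptions", "SHELL": "fuel"}

SAMPLE = """date,description,amount
2024-01-03,WHOLEFDS MARKET,-54.20
2024-01-10,NETFLIX.COM,-15.49
2024-02-05,SHELL OIL,-40.00
"""

def categorize(description):
    for key, cat in RULES.items():
        if key in description.upper():
            return cat
    return "other"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (date TEXT, description TEXT, amount REAL, category TEXT)")
for row in csv.DictReader(io.StringIO(SAMPLE)):
    conn.execute(
        "INSERT INTO txns VALUES (?, ?, ?, ?)",
        (row["date"], row["description"], float(row["amount"]), categorize(row["description"])),
    )

# Monthly spend per category, like the dashboard's breakdown view.
monthly = conn.execute(
    "SELECT substr(date, 1, 7) AS month, category, SUM(amount) "
    "FROM txns GROUP BY month, category ORDER BY month"
).fetchall()
print(monthly)
```

Keyword rules are a deliberately simple categorizer; swapping in fuzzy matching or a small classifier changes only `categorize`, not the pipeline shape.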
ML Engineer

Stock Price Prediction Pipeline

Role: ML Engineer (LSTM · Time-Series Modeling)

An end-to-end machine learning project for forecasting stock prices using traditional ML (Linear Regression, XGBoost) and deep learning (LSTM), along with time-series forecasting via Facebook Prophet. An interactive Streamlit dashboard makes model outputs, metrics, and visualizations easy to explore.

Python LSTM Streamlit
GitHub
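
The baseline regression idea can be illustrated with a closed-form one-lag fit, a toy stand-in for the project's actual models (Linear Regression, XGBoost, LSTM, Prophet); the price series is invented.

```python
# Toy sketch of lag-feature regression: fit price[t] ≈ a * price[t-1] + b
# by ordinary least squares in closed form, stdlib only.
def fit_lag1(prices):
    """Return (slope, intercept) of the one-lag least-squares fit."""
    x = prices[:-1]          # yesterday's prices
    y = prices[1:]           # today's prices
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var = sum((xi - mx) ** 2 for xi in x)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = cov / var
    b = my - a * mx
    return a, b

prices = [100, 101, 103, 102, 105, 107, 106, 109]   # illustrative closes
a, b = fit_lag1(prices)
next_price = a * prices[-1] + b                      # one-step-ahead forecast
print(round(next_price, 2))
```

The LSTM and Prophet models in the project generalize this same setup: past values in, next value out, evaluated walk-forward rather than on a random split.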
Data Engineer

Market Basket Analysis Pipeline

Role: Data Engineer · ML Engineer (FP-Growth · Association Rules)

An end-to-end Market Basket Analysis pipeline that ingests retail transactions, cleans and filters them, and uses the FP-Growth algorithm to mine frequent itemsets and association rules. A Streamlit dashboard lets users filter by confidence/lift, search by product, and explore top co-occurring items.

Python Streamlit FP-Growth
GitHub
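
The support, confidence, and lift math behind the rules can be shown brute-force on invented transactions; FP-Growth computes the same frequent itemsets without enumerating every candidate.

```python
from itertools import combinations

# Brute-force illustration of frequent itemsets and rule metrics.
# Transactions are invented; FP-Growth would mine the same itemsets efficiently.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Confidence and lift for the rule antecedent -> consequent."""
    conf = support(antecedent | consequent) / support(antecedent)
    lift = conf / support(consequent)
    return conf, lift

items = sorted({i for t in transactions for i in t})
frequent = [set(p) for p in combinations(items, 2) if support(set(p)) >= 0.4]

conf, lift = rule_metrics({"bread"}, {"milk"})
print(round(conf, 2), round(lift, 2))  # → 0.75 0.94
```

A lift below 1.0, as here, means the pairing is actually slightly anti-correlated, which is exactly why the dashboard lets users filter rules by confidence and lift.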
Data Engineer

Real-Time Vehicle Telemetry Pipeline

Role: Data Engineer · IoT Streaming Developer

This project simulates and processes real-time vehicle telemetry (GPS, speed, fuel level, engine temperature) using Kafka, Spark Structured Streaming, Cassandra, and Streamlit. It detects anomalies like overspeeding, overheating, and low fuel, and visualizes live metrics and alerts on a real-time dashboard.

Kafka Spark Cassandra
GitHub
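
The rule checks reduce to plain predicates over a single event. Thresholds and the sample event below are illustrative assumptions; in the project these checks run inside Spark Structured Streaming over Kafka topics.

```python
# Stand-in for the streaming job's rule checks (overspeed, overheat, low fuel).
# Threshold values and the sample event are illustrative.
THRESHOLDS = {"speed_kmh": 120, "engine_temp_c": 110, "fuel_pct_min": 10}

def check_event(event):
    """Return the list of alert names triggered by one telemetry event."""
    alerts = []
    if event["speed_kmh"] > THRESHOLDS["speed_kmh"]:
        alerts.append("overspeed")
    if event["engine_temp_c"] > THRESHOLDS["engine_temp_c"]:
        alerts.append("overheat")
    if event["fuel_pct"] < THRESHOLDS["fuel_pct_min"]:
        alerts.append("low_fuel")
    return alerts

event = {"vehicle_id": "v-42", "speed_kmh": 134.0, "engine_temp_c": 96.0, "fuel_pct": 7.5}
print(check_event(event))  # → ['overspeed', 'low_fuel']
```

Keeping the rules stateless per event is what lets them run as a simple map over the stream, with Cassandra holding history and Streamlit rendering the live alerts.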

AI/ML & GenAI

Personal Research & Projects

Beyond my professional data engineering work, I actively explore AI/ML, GenAI, and RAG technologies through hands-on projects. I build proof-of-concepts with LangChain, vector databases, and LLMs to understand how these tools work in practice and stay current with the AI landscape.

RAG System

RAG Chatbot - Advanced Interview Preparation Assistant

Role: LLM Engineer · RAG Developer

A dual-persona Retrieval-Augmented Generation (RAG) chatbot for interview preparation that serves 557+ curated knowledge chunks through a FastAPI backend and a modern React frontend. It routes questions across AI/ML, Data Engineering, BI, and Analytics Engineering profiles to deliver structured, interview-ready answers.

RAG OpenAI GPT-4 Claude Vector DB LangChain
View Project
GenAI

Generative AI Applications

Building production applications with GPT-4, Claude, and other LLMs. Exploring prompt engineering, fine-tuning, and agent-based architectures for real-world use cases.

OpenAI API Anthropic Claude Prompt Engineering LLM Agents

Ongoing exploration and experimentation

Machine Learning

ML & Deep Learning Projects

Personal ML projects including stock prediction with LSTM networks, fraud detection systems, and time series forecasting. Focus on production-ready implementations and model optimization.

LSTM XGBoost TensorFlow PyTorch
View Projects
Research

AI Research & Continuous Learning

Staying current with latest AI/ML research, experimenting with new architectures, and contributing to open-source AI projects. Regularly building proof-of-concepts and sharing learnings.

Research Papers Open Source Experimentation Knowledge Sharing

Active learning and contribution

AI Powered

Ask my AI Assistant

Query my background, tech stack, or availability. It reads directly from my resume data.

Try asking: "What is your experience with Azure?"

Career Timeline

I'm a Senior Data Engineer focused on building scalable, reliable data platforms that power analytics, machine learning, and real-time decision-making. My experience spans Azure and AWS, large-scale processing with Databricks and PySpark, and streaming architectures using Kafka and event-driven pipelines.

At Walgreens Boots Alliance, I design and operate enterprise data platforms that process terabytes of retail and healthcare data each month. I work across Azure Databricks, Synapse, ADF, and Kafka to build high-availability ETL/ELT workflows, improve data quality, and deliver trusted datasets consumed across finance, supply chain, and product teams.

Outside of work, I explore the intersection of Data Engineering and AI—building real, production-ready applications using GenAI, RAG, LangChain, vector databases, and LLM APIs. I also enjoy shipping full-stack applications end-to-end, which helps me bridge backend data systems with real user experiences.

What drives me: creating systems that scale for tomorrow, remain invisible to end users, and deliver value through clean design, automation, and long-term reliability.

Education

Master of Science in Computer Science

University of North Texas • 2021

Bachelor of Technology in IT

GITAM University • 2019

Senior Data Engineer

Walgreens Boots Alliance / Feb 2022 — Present

  • Designed large-scale data pipelines in Azure Databricks, Synapse, and ADF processing 10TB+ monthly across retail and healthcare systems, enabling reliable analytics and ML workflows for enterprise teams
  • Built scalable PySpark ETL and ELT pipelines integrating 15+ enterprise data sources, delivering high-availability curated datasets consumed by business, engineering, and data science partners
  • Implemented data quality frameworks using Python and SQL with automated validation and remediation, reducing data incidents 45% and eliminating over 200 hours of manual resolution work monthly
  • Partnered with data science teams to deliver feature-ready datasets and streamline ML training workflows by improving data consistency, availability, and lineage tracking across critical pipelines
  • Developed end-to-end pipeline observability with SLA monitoring, lineage metadata, and automated alerting, increasing critical pipeline reliability from 92% to 99.8% across production workloads
  • Automated CI/CD workflows using Azure DevOps to standardize deployments, reduce operational overhead, and improve delivery speed and stability for data engineering releases

Analytics Engineer

CVS Health / Oct 2020 — Dec 2021

  • Built Python and SQL ETL pipelines processing 7M+ daily records across claims, retail, and care data systems, enabling analytics teams to perform forecasting, planning, and operational reporting at scale
  • Automated ingestion workflows using Oracle Cloud and Python, reducing data refresh time from 8 hours to 2 hours and improving reporting freshness for business and analytics stakeholders
  • Designed data validation frameworks with rule-based checks and monitoring, reducing manual intervention by 30% and improving accuracy and trust in enterprise reporting datasets
  • Integrated ML model outputs into production ETL pipelines and delivered clean feature datasets, increasing model accuracy by 16% and enabling more reliable forecasting workflows
  • Delivered analytics-ready datasets consumed in Tableau and Power BI, accelerating leadership reporting cycles and improving decision-making for finance, operations, and care management teams

Data Science Intern

McKesson Corporation / Mar 2020 — Sep 2020

  • Built Python and Spark pipelines processing 2M+ prescription and utilization records, improving data readiness by 22% and enabling accurate analytics, forecasting, and downstream ML modeling
  • Created reusable feature datasets for XGBoost and time-series forecasting models, reducing model training time 40% and improving reliability and consistency across multiple model runs
  • Developed Tableau dashboards monitoring pipeline health and operational KPIs, reducing issue detection time from hours to minutes and improving system transparency for analytics teams
  • Delivered clean and well-documented datasets supporting supply chain optimization, enabling accurate demand forecasting and contributing to a 12% reduction in excess inventory costs

Software Developer

Inditek Pioneer Solutions / Jun 2018 — Dec 2019

  • Built responsive React and TypeScript applications improving page load times by ~30% and enhancing usability for internal healthcare tools used across multiple teams
  • Developed reusable UI components, shared API integration layers, and state management patterns that reduced frontend development time for new features by 25%
  • Implemented UI testing, debugging workflows, and CI/CD practices with backend and QA teams, reducing production UI defects by 20% and improving release stability

React.js · TypeScript / JavaScript (ES6+) · REST API Integration · UI Component Architecture · HTML5 / CSS3 / Responsive Design

Core Competencies

Data Architecture & Design

Enterprise data platform design, Medallion/Lakehouse architecture, dimensional modeling, cloud-native data solutions

ETL/ELT Pipeline Engineering

Real-time streaming with Kafka, batch ELT with PySpark/Databricks, orchestration with Airflow and ADF, scalable transformation layers using dbt-style modeling

Data Quality & Governance

Validation frameworks, automated monitoring, metadata management, lineage tracking, quality rules, SLA enforcement

ML/AI Data Infrastructure

Feature-ready dataset engineering, ML data pipelines, model input/output integration, automation for data used in training and inference

Cloud Data Platforms

Azure (Databricks, Synapse, Data Lake, ADF), AWS (S3), Snowflake (projects), dbt-style transformations, Delta Lakehouse architectures

DataOps & Automation

CI/CD for data pipelines, automated testing and deployment, monitoring and alerting, DevOps workflows for data engineering teams

Technical Skills

Data Engineering & Streaming

Databricks · PySpark · Apache Spark · Delta Lake · Kafka · dbt · Airflow · Azure Data Factory

AI, ML & RAG

Python (ML) · scikit-learn · LSTM · RAG · LangChain · OpenAI API · FAISS · Pinecone

Cloud & Warehouses

Azure · AWS · Snowflake · Azure Synapse · Delta Lakehouse

Programming & Scripting

Python · SQL · Scala · Java · PowerShell

BI & Analytics Tools

Power BI · Tableau · Streamlit

DevOps & Tooling

Git/GitHub · CI/CD · Docker · Azure DevOps

Certifications

Microsoft Azure Data Engineer Associate

Active • Credential ID: 2CA6D7588001CC9F

Designing and implementing data storage, data processing, and data security solutions using Azure services.

Microsoft Azure AI Engineer Associate

Active • Credential ID: 61B6FE700A01EC6

Designing and implementing AI solutions using Azure Cognitive Services, Azure Machine Learning, and Azure Bot Service.

SnowPro Core Certification

In Progress

Validates expertise in Snowflake data warehousing, administration, and analytics.

Databricks Certified Data Engineer

In Progress

Covers Databricks lakehouse architecture, Spark, and data engineering best practices.

dbt Analytics Engineering Certification

Planned

Production-grade dbt transformations, testing, documentation, and analytics engineering.

O'Reilly ChatGPT Data Analysis

Active • 2025

Advanced techniques for using ChatGPT and AI tools for data analysis and business intelligence.

Publications

AI for Electricity Market Design

Book Chapter — Handbook of Smart Energy Systems, Springer (2023)

Published chapter on artificial intelligence applications in electricity market design and optimization.

Status: Available

Let's Work Together

I'm currently exploring roles as a Senior Data Engineer or AI/Data Engineer (DE + GenAI/RAG). If you'd like to chat about data platforms, streaming, or AI products, feel free to reach out.

Location

Jefferson City, MO

Remote (USA)

Stack

Azure / Databricks / Snowflake

Python / SQL / dbt