A Field Guide to Agent Evaluations
Jun 18, 2026
Practical taxonomy of agent evals: unit tests, integration tests, online evals, and benchmarks.
While AI is often associated with digital applications, its role in hardware and physical environments is just as exciting! To explore AI beyond software, I built and trained a robotic arm that learns and replicates tasks using a machine learning model called an Action Chunking Transformer (ACT). I 3D printed and wired both a "leader" and a "follower" arm, which let me teleoperate the follower and record demonstrations such as placing blocks into different buckets. I then trained the ACT model on this dataset, enabling the robotic arm to perform the task autonomously and even generalize to correcting its own mistakes.
LangChain
Member of LangChain's Applied AI team, working on production agents and AI-related features.
Cisco
Technical SME and hands-on developer for GenAI solutions on the MarTech Portfolio & Innovation Team, integrating AI marketing technology across the enterprise stack.
The DTH Media Corp.
Led and trained a 15-rep advertising sales team, designing the commission structure, training program, and new ad products while managing local and national client relationships.
Degree in Economics, Statistics & Information Systems
Practical taxonomy of agent evals: unit tests, integration tests, online evals, and benchmarks.
How the claw agent framework combines messaging, filesystems, and memory to create evolving digital assistants.
Context engineering, context rot, and how LLMs navigate content far exceeding their context window.
The new standard for packaging reusable workflow capabilities for filesystem-based agent harnesses.
Defining, building, and applying LLM evaluations to improve AI products.
Understanding RLVR and creating RL environments for large language models.
LLMs should be given the same tools as humans to interact with the digital world we share.
Banana Bench is a benchmark for evaluating LLMs by pitting them head-to-head in a game of Bananagrams. LLMs must build valid crossword-style boards, demonstrating spatial reasoning, constraint satisfaction, and multi-turn strategic decision making in a competitive environment.
Evaluizer is an interface for evaluating and optimizing LLM prompts. It allows you to visualize outputs against datasets, manually annotate results, and run automated evaluations using both LLM judges and deterministic functions. It features GEPA (Genetic-Pareto), an optimization engine that iteratively evolves your prompts to maximize evaluation scores through reflective feedback loops.
Deep Competitive Analyst (DCA) is a 'deep agent' style LLM assistant that automates the creation of company profiles and competitive analyses. Built on top of LangGraph Platform and Perplexity Search, DCA performs thorough research autonomously and produces detailed business reports in a fraction of the time a human would need. It operates for extended periods, dynamically spawning sub-agents to parallelize research tasks and producing the kind of in-depth competitive analysis that usually costs thousands.
Are you prompting the model or is the model prompting you? rewrAIt offers a unique 'text-editor' style interface for turn-based conversations with large language models that lets you revise any part of an LLM conversation (system, user, or AI messages) at any point in time. Add, modify, and remove context or change providers to your liking. Control the narrative with AI before it controls you.
seb-ocr is a custom-built OCR and entity extraction pipeline for processing unstructured historical documents, part of a larger academic research project in the field of political science. It relies on LLMs to transcribe large volumes of scanned documents from official archives, then performs named entity recognition and extraction for downstream analysis.
QuicKB is an end-to-end machine learning pipeline that turns unstructured text into optimized, semantic knowledge bases with complementary finetuned embedding models ready for RAG/AI retrieval. It combines the latest chunking approaches, synthetic training data generation, and dimension-reduced embedding model finetuning to create personalized domain-specific retrieval systems that are both more accurate and more efficient than generic methods.
NeedleInAVidStack is a lightweight Streamlit app that rapidly identifies, timestamps, and extracts specific content across large video and audio libraries. Instead of manually scrubbing through video and audio files, you can let NeedleInAVidStack use Google's Gemini AI models to find exactly what you're looking for in a fraction of the time.
ppt2desc converts PowerPoint presentations into comprehensive machine-readable text formats using vision language models. It captures the full semantic meaning of slides by interpreting how text, graphics, and charts relate to each other, a crucial part of presentations that traditional scraping tools miss. Compatible with major AI platforms including OpenAI, Gemini, Anthropic, GCP Vertex AI, AWS Bedrock, and Azure AI Foundry.
A deep dive into Stanford's Declarative Self-improving Python (DSPy), showcasing how to program LLMs rather than rely on brittle text-based prompting.
Combine RAG with knowledge graphs for richer insights. Use LLMs to extract entities and relationships, enabling structured reasoning and deeper context.
Can GenAI interact with the real world? Absolutely! I trained a robotic arm using an action chunking transformer, enabling it to autonomously replicate and generalize tasks from teleoperated demonstrations.
Discover how AI transforms text prompts into 3D models! From diffusion models to NeRFs, explore cutting-edge tech merging deep learning with 3D graphics.
While generative language models like GPT have captured public attention, BERT models remain the most widely deployed solution for production NLP tasks. Learn how BERT works and why it's so popular.
Open-source AI image generation now rivals enterprise solutions, offering flexibility and customizability. Learn how I trained FLUX.1 to create personalized images.
Learn how vector databases store embeddings to capture meaning, enable semantic search, and power dynamic RAG systems for accurate LLM responses.
Optimize RAG systems with smarter text chunking. Explore strategies from basic splits to LLM-assisted chunking for better context and performance.
I tested OpenAI's Advanced Voice Mode to build low-latency voice assistants, integrating RAG to retrieve relevant context while maintaining natural, real-time spoken interactions.