
Why Consistency and Faithfulness Matter in AI Explanations

In the growing field of explainable AI, tools like LIME and SHAP have made it possible to peek inside complex models and understand their reasoning. But just because a model can explain itself doesn’t mean every explanation is meaningful or trustworthy.

Evaluating the Quality of Explanations

Not all explanations are created equal. For an explanation to be useful in practice, it must do more than highlight inputs or display weights. It needs to behave reliably and reflect how the model actually works.

Two critical properties help assess that:

1. Consistency

A good explanation should behave consistently. That means:

  • If you train the same model on different subsets of similar data, the explanations should remain relatively stable.
  • Small changes to input data shouldn’t lead to dramatically different explanations.

Inconsistent explanations can confuse users, misrepresent what the model has learned, and signal overfitting or instability in the model itself.

2. Faithfulness

Faithfulness asks a simple but powerful question: Do the features highlighted in the explanation actually influence the model’s prediction?

An explanation is not faithful if it attributes importance to features that, when changed or removed, don’t affect the outcome. This kind of misleading output can erode trust and create false narratives around how the model operates.

Why These Metrics Matter

In sensitive applications like healthcare, lending, or security, misleading explanations are more than just technical flaws. They can have real-world consequences.

  • Imagine a credit scoring model that cites a user’s browser history or favorite color as key decision drivers. Even if the model is technically accurate, such explanations would damage its credibility and raise ethical and legal concerns.
  • In regulated industries, explanations that fail consistency or faithfulness checks can expose organizations to compliance risks and reputational damage.

Real-World Examples

Faithfulness Test: Credit Risk Model

A faithfulness test was applied to a credit risk model used to classify applicants as “high” or “low” risk. The SHAP explanation highlighted feature A (e.g., number of bank accounts) as highly important.

To test faithfulness, this feature was removed and the model’s prediction didn’t change … at all!

What the graph shows:

  • SHAP value for “Number of Bank Accounts” was +0.25 (suggesting a major contribution).
  • But after removing it, the model’s risk prediction stayed the same, proving that this feature wasn’t actually influencing the output.

This revealed a serious problem: the model was producing unfaithful explanations. It was surfacing irrelevant features as important, likely due to correlation artifacts in the training data.
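
To make this concrete, here is a minimal sketch of how such a faithfulness check could be run. The model, the data, and the column name "num_bank_accounts" are hypothetical stand-ins, not the actual credit risk system from the test above.

```python
# A minimal faithfulness check: compare the model's output before and after
# neutralizing the feature that the explanation claims is important.
# `model` (a fitted scikit-learn classifier), `X` (a DataFrame of applicants),
# and the column "num_bank_accounts" are hypothetical stand-ins.
def faithfulness_check(model, X, row_idx, feature):
    row = X.iloc[[row_idx]].copy()
    before = model.predict_proba(row)[0, 1]          # predicted "high risk" probability

    # "Remove" the feature by replacing it with a reference value (the column mean here).
    row[feature] = X[feature].mean()
    after = model.predict_proba(row)[0, 1]

    delta = before - after
    print(f"Prediction change after neutralizing '{feature}': {delta:+.4f}")
    return delta

# If SHAP assigns the feature a large value (e.g. +0.25) but `delta` is near zero,
# the explanation is attributing importance the model is not actually using.
# faithfulness_check(model, X, row_idx=0, feature="num_bank_accounts")
```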

Consistency Test: Credit Risk Model

A credit scoring model was trained on two different but similar subsets of loan application data. Both versions produced the same prediction for an applicant (“high risk”) but gave very different explanations.

What the graph shows:

  • In Training Set A, the top contributing feature was “Credit Utilization” (+0.3).
  • In Training Set B, it was “Employment Type” (+0.28).
  • The SHAP bar charts for the same applicant looked noticeably different, even though the final decision didn’t change.

This inconsistency raised questions about model stability: Can we trust that the model is learning the right patterns, or is it too sensitive to the training data?
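
Here is a rough sketch of how that consistency check could be automated. The variable names (X_subset_a, applicant, and so on) are hypothetical, and the RandomForest-plus-SHAP combination is just one reasonable choice, not necessarily the model behind the test above.

```python
# A rough consistency check: train the same model class on two similar subsets
# and compare the SHAP attributions they produce for the *same* applicant.
import shap
from sklearn.ensemble import RandomForestClassifier

def top_shap_features(X_train, y_train, applicant_row, top_k=3):
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    # Explain the probability of the "high risk" class for this one applicant.
    explainer = shap.Explainer(lambda d: model.predict_proba(d)[:, 1], X_train)
    attributions = explainer(applicant_row).values[0]
    ranked = sorted(zip(applicant_row.columns, attributions),
                    key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]

# top_a = top_shap_features(X_subset_a, y_subset_a, applicant)
# top_b = top_shap_features(X_subset_b, y_subset_b, applicant)
# If the two rankings disagree sharply while the predictions agree,
# the explanations fail the consistency check described above.
```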

Final Thoughts

As AI systems continue to make critical decisions in our lives, explainability is not a luxury, it’s a necessity. Tools like LIME and SHAP offer a valuable window into how models work, but that window needs to be clear and reliable.

Metrics like consistency and faithfulness help us evaluate the strength of those explanations. Without them, we risk mistaking noise for insights, or worse, making important decisions based on misleading information.

Accuracy might get a model deployed, but consistency and faithfulness should decide whether it can be trusted. If you want to learn more about explainability in AI, check out this blog post, where I talk about how LIME and SHAP can help explain model outcomes.


What Made the Model Say That? Real Examples of Explainable AI

When people talk about artificial intelligence, especially deep learning, the conversation usually centers around accuracy and performance. How well does the model classify images? Can it outperform humans in pattern recognition? While these questions are valid, they miss a crucial piece of the puzzle: explainability.

Explainability is about understanding why an AI model makes a specific prediction. In high-stakes domains like healthcare, finance, or criminal justice, knowing the why is just as important as the what. Yet this topic is often overlooked in favor of performance benchmarks.

Why Is Explainability Hard in Deep Learning?

Classical models like decision trees (e.g., CART) offer built-in transparency. You can trace the decision path from root to leaf and understand the model’s logic. But deep learning models are different. They operate through layers of nonlinear transformations and millions of parameters. As a result, even domain experts can find their predictions opaque.

This can lead to problems:

  • Lack of trust from users or stakeholders
  • Difficulty debugging or improving models
  • Potential for hidden biases or unfair decisions

This is where explainability tools come in.

Tools That Help Open the Black Box

Two widely used frameworks for model explanation are LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations). Both aim to provide insights into which features influenced a specific prediction and by how much.

LIME in Action

LIME works by perturbing the input data and observing how the model’s predictions change. For instance, in a text classification task, LIME can highlight which words in an email led the model to flag it as spam. It does this by creating many variations of the email (e.g., removing or replacing words) and observing the output.

Loan Risk Example:

  • A model classifies a loan application as risky. We will use John as an example.
  • We want to understand why his application was labeled as risky.
  • LIME reveals that the applicant’s job status and credit utilization were the two most influential factors.

LIME reveals that the model flagged John’s loan as risky mainly due to his contract employment status and high credit utilization. Although John had no previous defaults and a moderate income, those factors were outweighed by the others in the model’s decision.
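
Here is roughly what that looks like in code. This is a sketch with made-up variable and feature names (X_train, X_test, model, and the loan features), not the exact pipeline behind John’s example.

```python
# A minimal LIME sketch for the loan example. `X_train`, `X_test`, and `model`
# are hypothetical stand-ins for a numeric feature matrix and a fitted classifier.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=["employment_type", "credit_utilization", "income", "previous_defaults"],
    class_names=["low risk", "high risk"],
    mode="classification",
)

# Explain one applicant ("John"): LIME perturbs this row many times and fits a
# simple local model to see which features push the prediction toward "risky".
explanation = explainer.explain_instance(
    data_row=X_test.values[0],
    predict_fn=model.predict_proba,
    num_features=4,
)
print(explanation.as_list())
# e.g. [('credit_utilization > 0.80', 0.31), ('employment_type=contract', 0.27), ...]
```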

SHAP in Practice

SHAP uses concepts from cooperative game theory to assign each feature an importance value. It ensures a more consistent and mathematically grounded explanation. SHAP values can be plotted to show how each input feature pushes the prediction higher or lower.

Medical Diagnosis Example:

  • Let’s use Maria as an example. After her information was entered into the system, the model classified her as high risk.
  • To understand which factors contributed to that classification, we can use SHAP (a minimal code sketch follows this list). SHAP shows that age and blood pressure significantly contributed to the high-risk prediction.
  • These insights help physicians verify if the model aligns with clinical reasoning.
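
As promised, a minimal version of that SHAP workflow might look like the sketch below. Everything here is a hypothetical stand-in: `clf` is assumed to be a fitted binary XGBoost classifier and `X_patients` a DataFrame with columns like "age" and "blood_pressure"; neither comes from the actual system in Maria’s example.

```python
# A minimal SHAP sketch for the medical example (hypothetical model and data).
import shap

explainer = shap.TreeExplainer(clf)
maria = X_patients.iloc[[0]]            # the single patient we want to explain
explanation = explainer(maria)

# Waterfall plot: shows how each feature pushes this prediction above or
# below the model's baseline, i.e. what drove the "high risk" call.
shap.plots.waterfall(explanation[0])
```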

Final Thoughts

The examples of Maria and John illustrate a powerful truth: even highly accurate models are incomplete without explanations. When a model labels someone as high-risk, whether for a disease or a loan default, it’s not enough to accept the outcome at face value. We need to understand why the model made that decision.

Tools like LIME and SHAP make this possible. They open up the black box and allow us to see which features mattered most, giving decision-makers the context they need to trust or challenge the model’s output.

Why Explainability Matters in Business:

  • Builds trust with stakeholders
  • Supports accountability in sensitive decisions
  • Uncovers potential biases or errors in the model
  • Aligns predictions with domain expertise

As AI becomes more embedded in real-world systems, explainability is not optional; it’s essential. It turns predictions into insights, and insights into informed, ethical decisions.

Good AI model evaluation doesn’t stop at explainability. Learn why consistency and faithfulness matter by checking out this post.


Optimizing Traffic Flow: Efficient, but Is It Safe?

Unsignalized intersections are managed without traffic lights. They rely on stop signs and right-of-way rules. These intersections are inherently riskier than signalized ones because there are no lights; everything depends on the driver paying attention to the stop sign, but that is a different matter altogether.

They’re common in suburban or low-traffic areas but increasingly challenged by growing traffic volumes and the emergence of Connected and Automated Vehicles (CAVs).

These intersections are friction points in modern traffic systems. And the problem often starts with one outdated rule: First-Come-First-Served (FCFS).

First-Come-First-Served

FCFS is a simple scheduling principle: vehicles cross in the order they arrive. If multiple vehicles approach, each waits for the ones ahead (yes, you are supposed to wait if the other person arrives at the stop sign before you) even if their paths don’t conflict.

Why It Falls Short

  • No spatial awareness: Vehicles wait even when their paths don’t intersect. This may not be a bad thing if your city or neighborhood has CRAZY drivers, but it is not efficient, right?
  • Ignores vehicle dynamics: No speed adjustments are used to reduce waiting time. Although you may be able to reply to a text or two? NO. Don’t text and drive!
  • Creates bottlenecks: Delays increase when vehicles arrive from different directions in quick succession. Oh well, your precious time.

In the animation above, each vehicle waits for the previous one to clear the intersection, even when there’s no collision risk. The result? Wasted time and unused intersection space. Well, that is if you only care about efficiency. Not so bad from a safety point of view.

Why FCFS Doesn’t Work for CAVs

As vehicles become more intelligent and connected, relying on a static rule like FCFS is inefficient. This is assuming that the person behind the wheel is also intelligent enough to practice caution and obey traffic rules and drives SOBER.

Modern CAVs can:

  • Share real-time location and speed data.
  • Coordinate with one another to avoid collisions.
  • Adjust their behavior dynamically.

FCFS fails to take advantage of these capabilities. It often causes unnecessary queuing, increasing delays even when safe, efficient crossings are possible through minor speed changes. Again, assuming that the drivers are all outstanding citizens with common sense, yes, this is not very efficient and there is room for improvement.

A Smarter Alternative: Conflict-Free, Real-Time Scheduling

A recent paper, “An Optimal Scheduling Model for Connected Automated Vehicles at an Unsignalized Intersection,” proposes a linear-programming-based model to optimize flow at unsignalized intersections. The model is built for CAVs and focuses on minimizing average delay by scheduling optimal crossing times based on:

  • Vehicle location and direction
  • Potential conflict zones

Key Features of the Model

  • Conflict-free scheduling: Ensures no two vehicles with intersecting paths enter at the same time.
  • Rolling horizon optimization: Continuously updates schedules in real time.
  • Delay minimization: Vehicles adjust speed slightly instead of stopping.

In this visualization, vehicles coordinate seamlessly:

  • The red car enters first.
  • The gray car slows slightly to avoid a conflict.
  • The blue car times its approach to maintain flow.

No stopping. No wasted time. Just optimized motion.
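
To make that concrete, here is a toy version of the scheduling idea for the three cars above, written with gurobipy. It is my own simplified sketch (one fixed headway, binary ordering variables, no rolling horizon), not the paper’s exact formulation.

```python
# Toy conflict-free scheduling: each vehicle gets a crossing time, conflicting
# pairs are separated by a headway, and total delay is minimized.
import gurobipy as gp
from gurobipy import GRB

arrivals = {"red": 0.0, "gray": 1.0, "blue": 1.5}      # earliest feasible entry times (s)
conflicts = [("red", "gray"), ("gray", "blue")]        # pairs whose paths intersect
headway, big_m = 2.0, 1000.0

m = gp.Model("intersection")
t = m.addVars(arrivals.keys(), lb=0.0, name="entry_time")
order = m.addVars(conflicts, vtype=GRB.BINARY, name="order")

for v, a in arrivals.items():
    m.addConstr(t[v] >= a)                             # cannot enter before arriving

for i, j in conflicts:
    # Exactly one of the two crossing orders holds, enforced via big-M.
    m.addConstr(t[j] >= t[i] + headway - big_m * (1 - order[i, j]))
    m.addConstr(t[i] >= t[j] + headway - big_m * order[i, j])

m.setObjective(gp.quicksum(t[v] - arrivals[v] for v in arrivals), GRB.MINIMIZE)
m.optimize()
for v in arrivals:
    print(v, round(t[v].X, 2))
```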

Now, that all sounds good to me. It sounds somewhat like a California Stop, if you know what I mean. But how can we trust humans to obey these more intricate optimization suggestions when people don’t even adhere to simpler rules like slowing down in a school zone? OK, maybe a different topic. So let’s assume these are all goody-goodies behind the wheel and continue.

Performance: How the Model Compares to FCFS

According to the study’s simulations:

  • Up to 76.22% reduction in average vehicle delay compared to FCFS.
  • Real-time responsiveness using rolling optimization.
  • Faster computation than standard solvers like Gurobi, making it viable for live deployment.

The result? Smoother traffic, shorter waits, and better use of intersection capacity without traffic signals.

Rethinking the Rules of the Road

FCFS is simple, but simplicity comes at a cost. In a connected, data-driven traffic ecosystem, rule-based systems like FCFS are no longer sufficient.

This study makes the case clear: real-time, model-based scheduling is the future of unsignalized intersection management. As cities move toward CAVs and smarter infrastructure, the ability to optimize traffic flow will become not just beneficial, but essential. That said, complexity also comes at a cost. If all the vehicles are autonomous and controlled by a safe, optimized, and centralized algorithmic command center, this could work. But as soon as you introduce free agency (not a bad thing in itself, but in this context a source of risk, randomness, uncertainty, and CHAOS), one has to weigh efficiency against practicality and safety.

If these CAVs are able to enter a semi-controlled environment once they cross the perimeter of the intersection, perhaps this approach could work. This means that while they are in the grid (defined by a region that leads up to the stop sign), the driver does lose some autonomy and their vehicle is coordinated by a central command … this might be a good solution to implement.

Either way, this is an interesting study. After all, we all want to get from point A to point B in the most efficient way possible. The less time we spend behind the wheel at stop signs, the more time we have for … hopefully not scrolling TikTok. But hey, even that is better than just sitting at a stop sign, right?


Agentic LLMs: The Evolution of AI in Reasoning and Social Interaction

The landscape of artificial intelligence is changing every second. Large Language Models (LLMs) are evolving from passive entities into active, decision-making agents. This shift introduces agentic LLMs. Now, we have all seen people mention agentic this, agentic that, over the last few months. In essence, these systems are endowed with reasoning abilities, interfaces for real-world action, and the capacity to engage with other agents. These advancements are poised to redefine industries such as robotics, medical diagnostics, financial advising, and scientific research.

The Three Pillars of Agentic LLMs

  1. Reasoning Capabilities At the heart of agentic LLMs lies their reasoning ability. Drawing inspiration from human cognition, these systems emulate both rapid, intuitive decisions (System 1 thinking) and slower, analytical deliberation (System 2 thinking). Current research predominantly focuses on enhancing the decision-making processes of individual LLMs.
  2. Interfaces for Action Moving beyond static responses, agentic LLMs are equipped to act within real-world environments. This is achieved through the integration of interfaces that facilitate tool usage, robotic control, or web interactions. Such systems leverage grounded retrieval-augmented techniques and benefit from reinforcement learning, enabling agents to learn through interaction with their environment rather than relying solely on predefined datasets.
  3. Social Environments The third component emphasizes multi-agent interaction, allowing agents to collaborate, compete, build trust, and exhibit behaviors akin to human societies. This fosters a social environment where agents can develop collective intelligence. Concepts like theory of mind and self-reflection enhance these interactions, enabling agents to understand and anticipate the behaviors of others.

A Self-Improving Loop

The interplay between reasoning, action, and interaction creates a continuous feedback loop. As agents engage with their environment and each other, they generate new data for ongoing training and refinement. This dynamic learning process addresses the limitations of static datasets, promoting perpetual improvement.

Here, agents act in the world, generate their own experiences, and learn from the outcomes, without needing a predefined dataset. This approach, used by models from OpenAI and DeepSeek, allows LLMs to capture the full complexity of real-world scenarios, including the consequences of their own actions. Although reinforcement learning introduces challenges like training instability due to feedback loops, these can be mitigated through diverse exploration and cautious tuning. Multi-agent simulations in open-world environments may offer a more scalable and dynamic alternative for generating the diverse experiences required for stable, continual learning.
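
For intuition only, here is a toy sketch of that loop: an “agent” chooses actions, observes outcomes, and updates its estimates from its own experience. The bandit-style environment and action names are invented for illustration and have nothing to do with how production agentic LLMs are actually trained.

```python
# A toy illustration of the self-improving loop: act, observe the outcome,
# and update from self-generated experience. The "environment" is a simple
# bandit and the action names are invented; this is intuition, not a recipe.
import random
from collections import defaultdict

ACTIONS = ["search_web", "call_tool", "answer_directly"]
reward_probs = {"search_web": 0.3, "call_tool": 0.6, "answer_directly": 0.1}  # hidden from the agent

value = defaultdict(float)   # the agent's running estimate of each action's usefulness
counts = defaultdict(int)

for step in range(500):
    # Act: mostly exploit what has been learned so far, occasionally explore.
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: value[a])

    # Interact with the environment and observe the outcome of the agent's own action.
    reward = 1.0 if random.random() < reward_probs[action] else 0.0

    # Learn from that experience -- no predefined dataset involved.
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]

print({a: round(value[a], 2) for a in ACTIONS})
```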

From Individual Intelligence to Collective Behavior

The multi-agent paradigm extends the focus beyond individual reasoning to explore emergent behaviors such as trust, deception, and collaboration. These dynamics are observed in human societies, and insights gained from these studies can inform discussions on artificial superintelligence by modeling how intelligent behaviors emerge from agent interactions.

Conclusion

Agentic LLMs are reshaping the understanding of machine learning and reasoning. By enabling systems to act autonomously and interact socially, researchers are advancing toward creating entities capable of adaptation, collaboration, and evolution within complex environments. The future of AI lies in harmonizing these elements: reasoning, action, and interaction into unified intelligent agent systems that not only respond but also comprehend, decide, and evolve.

What does this mean for fine tuning LLMs? Well, here is where it gets interesting. Unlike traditional LLM fine-tuning, which relies on static datasets curated from the internet and shaped by past human behavior, agentic LLMs can generate new training data through interaction. This marks a shift from supervised learning to a self-learning paradigm rooted in reinforcement learning.

For an in-depth take on agentic LLMs, I highly recommend reading this survey.


How Good is Sentiment Analysis? Even Humans Don’t Always Agree

Sentiment Analysis Is Harder Than It Looks

Sentiment analysis is everywhere: from analyzing customer reviews to tracking political trends. But how accurate is it really? More importantly, how well can machines capture human emotion and nuance in text?

To answer that, I conducted a real-world comparison using a dataset of healthcare-related social media posts. I evaluated two AI models, ChatGPT and a BERT model fine-tuned on Reddit and Twitter data, against human-coded sentiment labels. I used the fine-tuned BERT model and the dataset as part of my Ph.D. dissertation, so both were already available to me. More information on the BERT model can be found in my dissertation.

The results tell an interesting story about the strengths and limitations of sentiment analysis; not just for machines, but for humans too.

ChatGPT vs. BERT: Which One Came Closer to Human Labels?

Overall, ChatGPT performed better than the BERT model:

  • ChatGPT reached 59.83% agreement with human-coded sentiments on preprocessed text (the same dataset I used for the BERT model)
  • On raw text, agreement was 58.52% (here, I used the same dataset I gave the second human coder, with no preprocessing such as lemmatization)
  • The BERT model lagged behind in both scenarios

This shows that large language models like ChatGPT are moving sentiment analysis closer to human-level understanding, but the gains are nuanced. See the image below, which shows ChatGPT vs. the trained BERT agreement levels with my coding for each Reddit post. Note that this comparison used the preprocessed dataset for the ChatGPT output.

Class-by-Class Comparison: Where Each Model Excels

Looking beyond overall scores, I broke down agreement across each sentiment class. Here’s what I found:

  • Neutral Sentiment: ChatGPT led with 64.76% agreement, outperforming BERT’s 44.76%.
  • Positive Sentiment: BERT did better with 66.19% vs. ChatGPT’s 41.73%.
  • Negative Sentiment: Both struggled, with BERT at 26.09% and ChatGPT at 17.39%.

These results suggest that ChatGPT handles nuance and neutrality better, while BERT tends to over-assign positivity; a common pattern in models trained on social platforms like Reddit and Twitter.
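
For anyone who wants to reproduce this kind of breakdown on their own labels, here is a minimal sketch. The file name and column names are hypothetical; the actual dataset comes from my dissertation and is not shared here.

```python
# Overall and per-class agreement between a model's labels and human labels.
import pandas as pd

df = pd.read_csv("labeled_posts.csv")   # hypothetical file: columns human_label, chatgpt_label, bert_label

def agreement(df, model_col, human_col="human_label"):
    overall = (df[model_col] == df[human_col]).mean()
    per_class = (
        df.assign(match=(df[model_col] == df[human_col]))
          .groupby(human_col)["match"]
          .mean()
    )
    return overall, per_class

overall, per_class = agreement(df, "chatgpt_label")
print(f"Overall agreement: {overall:.2%}")
print(per_class.round(4))   # agreement within each human-labeled class (positive / negative / neutral)
```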

But Wait … Even Humans Don’t Fully Agree

Here’s where it gets more interesting! When comparing two human coders on the same dataset, agreement was just 72.79% overall. Class-specific agreement levels were:

  • Neutral: 91.8%
  • Negative: 60%
  • Positive: Only 43.6%

This mirrors the model behavior. The subjectivity of sentiment, especially in borderline cases or ambiguous language, challenges even humans!

Why Sentiment Is So Difficult … Even for Humans

As discussed in my dissertation, sentiment classification is impacted by:

  • Ambiguous or mixed emotions in a single post
  • Sarcasm and figurative language
  • Short posts with little context
  • Different human interpretations of tone and intent

In short: Sentiment is not just about word choice; it’s about context, subtlety, and perception. I tackle this in much more depth in my dissertation. So, if you want to read more about what other researchers are saying, I suggest you refer to Chapter 5, where I talk about sentiment analysis issues, explanations, and implications.

Here is the Spiel

  • ChatGPT outperformed a Reddit-Twitter trained BERT model in both overall accuracy and especially on neutral sentiment.
  • Positive and negative sentiment remain harder to classify, for both models and humans.
  • Even human coders don’t always agree, proving that sentiment is a subjective task by nature.
  • For applications in healthcare, finance, or policy, where precision matters, sentiment analysis needs to be interpreted carefully, not blindly trusted.

Final Thoughts

AI is getting better at understanding us, but it still has blind spots. As we continue to apply sentiment analysis in real-world domains, we must account for ambiguity, human disagreement, and context. More importantly, we need to acknowledge that even “ground truth” isn’t always absolute.

Let’s keep pushing the boundaries, but with a healthy respect for the complexity of human emotion.


The Ultimate Logistics Dashboard: Pricing, Routing, and Freight Insights All in One

Navigating the labyrinth of logistics can often feel like assembling IKEA furniture without the manual: frustrating, time-consuming, and occasionally resulting in a piece that looks nothing like the picture. After wrestling with geography, constraints, and some very opinionated algorithms, I built a dashboard that now supports multi-stop route optimization with as many destinations as you want … well, if you are patient (because pushing it to 6 might melt the Streamlit servers … ask me how I know).

Why You’ll Love It:

  • Multi-Stop Planning: Handle as many destinations as you need in one go. It’s like having a personal assistant who doesn’t require coffee breaks.
  • Fuel Stop Integration: Automatically adds fuel stops when needed, so your trucks won’t run on fumes. Because pushing a semi-truck to the next station isn’t a workout anyone wants.​
  • Efficiency at Its Core: Optimizes for fuel efficiency, distance, and delivery sequence, ensuring your routes are as lean as a marathon runner on a kale diet.​

Under the Hood:

Powered by a constrained optimization model using Gurobi, our dashboard calculates the most cost-effective routes by considering distance, fuel costs, and truck load limits. It’s like having Einstein as your co-pilot, minus the wild hair.​

Let’s face it—logistics planning is often less “fast and furious” and more “slow and suspicious.” But fear not, fellow freight folks. The Route Optimizer Dashboard is here to inject a dose of caffeine into your routing workflow (no judgment if that’s your third cup today).


The Dashy Dashy Dashboard 🙂

This dashboard isn’t just a pretty face. It’s divided into three slick tabs (a stripped-down code sketch follows the list):

  • 📦 Pricing Estimator – Give it your shipment details, and it’ll throw back a fee estimate faster than your intern can Google “freight cost per mile.”
  • 🚚 Route Optimizer – Plug in your origin and your destinations, and it’ll find the most fuel-efficient route. It even adds fuel stops when needed. Because running out of gas mid-delivery is not a vibe.
  • 📈 Dashboard – Visualize freight flows, trends, and performance. Yes, we made charts sexy again.
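
Here is the promised stripped-down sketch of that layout. The pricing formula is a placeholder and the optimizer call is stubbed out; the real app wires these tabs to the Gurobi backend.

```python
# A minimal Streamlit sketch of the three-tab layout (not the full dashboard code).
import streamlit as st

st.title("Logistics Dashboard")
pricing_tab, routing_tab, charts_tab = st.tabs(["📦 Pricing Estimator", "🚚 Route Optimizer", "📈 Dashboard"])

with pricing_tab:
    miles = st.number_input("Distance (miles)", min_value=1.0, value=250.0)
    weight = st.number_input("Load weight (lbs)", min_value=1.0, value=10_000.0)
    if st.button("Estimate fee"):
        # Placeholder rate model; the real app uses a proper pricing estimator.
        st.metric("Estimated fee", f"${miles * 2.1 + weight * 0.01:,.2f}")

with routing_tab:
    origin = st.text_input("Origin", "Dallas, TX")
    stops = st.text_area("Destinations (one per line)", "Austin, TX\nHouston, TX")
    if st.button("Optimize route"):
        st.write("Optimizer output would appear here (the Gurobi model runs in the backend).")

with charts_tab:
    st.write("Freight flows, trends, and performance charts go here.")
```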

🛠️ How It Works

Input Your Destinations:
Enter as many stops as your freight-loving heart desires (and breathe). But let’s keep expectations real—it’s hosted on free Streamlit Cloud, so if you enter half a dozen points, give it 5-10 seconds to do its thing. Perfect time to sip your coffee and contemplate the mysteries of route efficiency.

Review the Suggested Route:
The backend optimizer (powered by Gurobi) will crunch fuel costs, distances, and stop sequences like a logistics nerd on Red Bull. The output? A smart route ready for action.

Hit the Road:
With your optimized plan in hand, your drivers can focus on the journey, not toggling between Google Maps and guesswork. By the way, I used the alternative fuel stations dataset for this, so don’t be alarmed if a fuel station suggestion is sometimes missing or has a funky name … most likely those funky stations are electric. The data is real, but I had to work with what was available.


✨Final Thoughts

In the world of logistics, time is money, and efficiency is the name of the game. Our Route Optimizer Dashboard ensures you’re not just playing—you’re winning.​

This dashboard is for the data-minded logistics folks who want to stop reacting and start optimizing. Whether you’re pricing a load, planning a route, or just want to admire some interactive charts, it’s all there.

Try it out here: logistics-kk34nzr4hekiwm2tpxhrmx.streamlit.app

And hey, if it saves you even one angry client call about late deliveries, I’ll consider my job done. Need help integrating something similar for your operation? Let’s chat. This stuff’s kinda my thing.
✉️ Connect with me on LinkedIn

#Logistics #RouteOptimization #Efficiency #FuelSavings #FreightTech #SupplyChain #SmartRouting