
Agentic LLMs: The Evolution of AI in Reasoning and Social Interaction

The landscape of artificial intelligence is evolving at a remarkable pace. Large Language Models (LLMs) are shifting from passive entities into active, decision-making agents, and this shift is what introduces agentic LLMs. We have all seen people mention agentic this and agentic that over the last few months. In essence, these systems are endowed with reasoning abilities, interfaces for real-world action, and the capacity to engage with other agents. These advancements are poised to redefine industries such as robotics, medical diagnostics, financial advising, and scientific research.

The Three Pillars of Agentic LLMs

  1. Reasoning Capabilities: At the heart of agentic LLMs lies their reasoning ability. Drawing inspiration from human cognition, these systems emulate both rapid, intuitive decisions (System 1 thinking) and slower, analytical deliberation (System 2 thinking). Current research predominantly focuses on enhancing the decision-making processes of individual LLMs.
  2. Interfaces for Action: Moving beyond static responses, agentic LLMs are equipped to act within real-world environments. This is achieved through the integration of interfaces that facilitate tool usage, robotic control, or web interactions (a toy sketch of such a tool interface follows this list). Such systems leverage grounded retrieval-augmented techniques and benefit from reinforcement learning, enabling agents to learn through interaction with their environment rather than relying solely on predefined datasets.
  3. Social Environments: The third component emphasizes multi-agent interaction, allowing agents to collaborate, compete, build trust, and exhibit behaviors akin to human societies. This fosters a social environment where agents can develop collective intelligence. Concepts like theory of mind and self-reflection enhance these interactions, enabling agents to understand and anticipate the behaviors of others.
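
To make the second pillar concrete, here is a toy sketch of a reason-act-observe loop with a tool interface. The llm() stub, the tool names, and the TOOL action format are hypothetical placeholders for illustration, not any specific framework's API.

```python
# A minimal sketch of the "interfaces for action" idea: an agent loop in
# which an LLM chooses a tool, observes the result, and reasons over the
# outcome. Everything here is a hypothetical stand-in, not a real API.

from typing import Callable

# Hypothetical tool interface: tool name -> callable the agent may invoke.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"(stub) top results for '{q}'",
    "calculator": lambda expr: str(eval(expr)),  # demo only; unsafe in production
}

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned tool request."""
    return 'TOOL search "agentic LLMs"'

def agent_step(task: str, history: list[str]) -> str:
    """One reason-act-observe cycle."""
    prompt = f"Task: {task}\nHistory: {history}\nDecide the next action."
    action = llm(prompt)
    if action.startswith("TOOL"):
        _, name, arg = action.split(" ", 2)
        observation = TOOLS[name](arg.strip('"'))
        history.append(f"{action} -> {observation}")
        return observation
    return action  # no tool requested: treat as the final answer

history: list[str] = []
print(agent_step("Summarize recent work on agentic LLMs", history))
```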

A Self-Improving Loop

The interplay between reasoning, action, and interaction creates a continuous feedback loop. As agents engage with their environment and each other, they generate new data for ongoing training and refinement. This dynamic learning process addresses the limitations of static datasets, promoting perpetual improvement.

Here, agents act in the world, generate their own experiences, and learn from the outcomes, without needing a predefined dataset. This approach, used by models from OpenAI and DeepSeek, allows LLMs to capture the full complexity of real-world scenarios, including the consequences of their own actions. Although reinforcement learning introduces challenges like training instability due to feedback loops, these can be mitigated through diverse exploration and cautious tuning. Multi-agent simulations in open-world environments may offer a more scalable and dynamic alternative for generating the diverse experiences required for stable, continual learning.
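
As a rough illustration of this loop, the sketch below has a toy agent generate its own experiences and nudge its policy toward rewarded actions. The environment, reward, and update rule are schematic stand-ins under my own assumptions, not any lab's actual training recipe.

```python
# A minimal sketch of the self-improving loop described above: the agent
# acts, its trajectories become new training data, and the policy is
# refined on the outcomes. All components here are toy placeholders.

import random

def environment(action: str) -> float:
    """Stub world: reliably rewards one action, is noisy for the other."""
    return 1.0 if action == "explore" else random.random() - 0.5

policy = {"explore": 0.5, "exploit": 0.5}  # toy two-action policy

replay_buffer = []  # experiences generated by the agent itself
for episode in range(100):
    action = random.choices(list(policy), weights=policy.values())[0]
    reward = environment(action)
    replay_buffer.append((action, reward))

    # "Training" on self-generated data: shift probability mass toward
    # actions that earned reward (a crude policy-gradient flavour).
    policy[action] = max(0.05, policy[action] + 0.01 * reward)
    total = sum(policy.values())
    policy = {a: p / total for a, p in policy.items()}

print(policy)  # probability mass has shifted toward the rewarded action
```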

From Individual Intelligence to Collective Behavior

The multi-agent paradigm extends the focus beyond individual reasoning to explore emergent behaviors such as trust, deception, and collaboration. These dynamics are observed in human societies, and insights gained from these studies can inform discussions on artificial superintelligence by modeling how intelligent behaviors emerge from agent interactions.

Conclusion

Agentic LLMs are reshaping the understanding of machine learning and reasoning. By enabling systems to act autonomously and interact socially, researchers are advancing toward creating entities capable of adaptation, collaboration, and evolution within complex environments. The future of AI lies in harmonizing reasoning, action, and interaction into unified intelligent agent systems that not only respond but also comprehend, decide, and evolve.

What does this mean for fine-tuning LLMs? Well, here is where it gets interesting. Unlike traditional LLM fine-tuning, which relies on static datasets curated from the internet and shaped by past human behavior, agentic LLMs can generate new training data through interaction. This marks a shift from supervised learning to a self-learning paradigm rooted in reinforcement learning.

For an in-depth take on agentic LLMs, I highly recommend reading this survey.

[Figure: Bar chart showing ChatGPT and BERT agreement with researcher sentiment labels for the positive, negative, and neutral categories.]

How Good is Sentiment Analysis? Even Humans Don’t Always Agree

Sentiment Analysis Is Harder Than It Looks

Sentiment analysis is everywhere: from analyzing customer reviews to tracking political trends. But how accurate is it really? More importantly, how well can machines capture human emotion and nuance in text?

To answer that, I conducted a real-world comparison using a dataset of healthcare-related social media posts. I evaluated two AI models, ChatGPT and a BERT model fine-tuned on Reddit and Twitter data, against human-coded sentiment labels. I had built the fine-tuned BERT model and the dataset as part of my Ph.D. dissertation, so both were already available to me. More information on the BERT model can be found in my dissertation.
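
For readers who want to reproduce the ChatGPT side of such a comparison today, here is a minimal sketch using the OpenAI Python SDK. The model name, prompt wording, and label set are illustrative assumptions on my part, not the exact setup used in the study.

```python
# A hedged sketch of automating sentiment labeling with an LLM.
# Assumes the openai package (>= 1.0) and an OPENAI_API_KEY in the
# environment; the model and prompt are illustrative choices.

from openai import OpenAI

client = OpenAI()

def label_sentiment(post: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the post as exactly one "
                        "of: positive, negative, neutral. Reply with the "
                        "label only."},
            {"role": "user", "content": post},
        ],
        temperature=0,  # keep labels as deterministic as possible
    )
    return response.choices[0].message.content.strip().lower()

print(label_sentiment("The new clinic staff were incredibly helpful!"))
```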

The results tell an interesting story about the strengths and limitations of sentiment analysis, not just for machines but for humans too.

ChatGPT vs. BERT: Which One Came Closer to Human Labels?

Overall, ChatGPT performed better than the BERT model:

  • ChatGPT reached 59.83% agreement with human-coded sentiments on preprocessed text (the same preprocessed dataset I used for the BERT model)
  • On raw text, agreement was 58.52% (the same dataset I gave the second human coder, without any preprocessing such as lemmatization)
  • The BERT model lagged behind in both scenarios

This shows that large language models like ChatGPT are moving sentiment analysis closer to human-level understanding, but the gains are nuanced. See the image below, which shows ChatGPT's and the trained BERT model's agreement with my coding for each Reddit post. Note that this comparison used the preprocessed dataset to generate the ChatGPT output.

Class-by-Class Comparison: Where Each Model Excels

Looking beyond overall scores, I broke down agreement across each sentiment class. Here’s what I found:

  • Neutral Sentiment: ChatGPT led with 64.76% agreement, outperforming BERT’s 44.76%.
  • Positive Sentiment: BERT did better with 66.19% vs. ChatGPT’s 41.73%.
  • Negative Sentiment: Both struggled, with BERT at 26.09% and ChatGPT at 17.39%.

These results suggest that ChatGPT handles nuance and neutrality better, while BERT tends to over-assign positivity, a common pattern in models trained on social platforms like Reddit and Twitter.
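
For concreteness, here is a minimal sketch of how agreement figures like these can be computed with pandas. The column names, the toy labels, and the per-class definition (the share of posts with a given human label that the model matched) are my assumptions, not the dissertation's actual code.

```python
# Overall and per-class agreement between human and model labels.
# The five-row DataFrame is toy data for illustration only.

import pandas as pd

df = pd.DataFrame({
    "human": ["positive", "neutral", "negative", "neutral", "positive"],
    "model": ["positive", "neutral", "neutral",  "neutral", "negative"],
})

# Overall agreement: fraction of posts where the two label columns match.
overall = (df["human"] == df["model"]).mean()
print(f"overall agreement: {overall:.2%}")

# Per-class agreement: among posts the human labeled with a class,
# the fraction the model labeled the same way.
for label in ["positive", "negative", "neutral"]:
    subset = df[df["human"] == label]
    per_class = (subset["model"] == label).mean()
    print(f"{label} agreement: {per_class:.2%}")

# The same computation over two human coders' columns yields the
# inter-coder agreement figures discussed below.
```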

But Wait … Even Humans Don’t Fully Agree

Here’s where it gets more interesting! When comparing two human coders on the same dataset, agreement was just 72.79% overall. Class-specific agreement levels were:

  • Neutral: 91.8%
  • Negative: 60%
  • Positive: Only 43.6%

This mirrors the model behavior. The subjectivity of sentiment, especially in borderline cases or ambiguous language, challenges even humans!

Why Sentiment Is So Difficult … Even for Humans

As discussed in my dissertation, sentiment classification is impacted by:

  • Ambiguous or mixed emotions in a single post
  • Sarcasm and figurative language
  • Short posts with little context
  • Different human interpretations of tone and intent

In short: sentiment is not just about word choice; it is about context, subtlety, and perception. I tackle this in much more depth in my dissertation. If you want to read more about what other researchers are saying, refer to Chapter 5, where I discuss sentiment analysis issues, explanations, and implications.

Here is the Spiel

  • ChatGPT outperformed a Reddit-Twitter trained BERT model in both overall accuracy and especially on neutral sentiment.
  • Positive and negative sentiment remain harder to classify, for both models and humans.
  • Even human coders don’t always agree, proving that sentiment is a subjective task by nature.
  • For applications in healthcare, finance, or policy, where precision matters, sentiment analysis needs to be interpreted carefully, not blindly trusted.

Final Thoughts

AI is getting better at understanding us, but it still has blind spots. As we continue to apply sentiment analysis in real-world domains, we must account for ambiguity, human disagreement, and context. More importantly, we need to acknowledge that even “ground truth” isn’t always absolute.

Let’s keep pushing the boundaries, but with a healthy respect for the complexity of human emotion.

[Figure: Diagram illustrating how a large language model (LLM) answers questions using ontology embeddings, Chain-of-Thought prompting, and Retrieval-Augmented Generation from a knowledge graph.]

Revolutionizing Data Interaction: How AI Can Comprehend Your Evolving Data Without Retraining

In the rapidly evolving landscape of enterprise AI, organizations often grapple with a common challenge: enabling large language models (LLMs) to interpret and respond to queries based on structured data, such as knowledge graphs, without necessitating frequent retraining as the data evolves.

A novel approach addresses this issue by integrating three key methodologies:

  1. Ontology embeddings: Transform structured data into formats that LLMs can process, facilitating an understanding of relationships, hierarchies, and schema definitions within the data.
  2. Chain-of-Thought prompting: Encourage LLMs to engage in step-by-step reasoning, enhancing their ability to navigate complex data structures and derive logical conclusions.
  3. Retrieval-Augmented Generation (RAG): Equip models to retrieve pertinent information from databases or knowledge graphs prior to generating responses, ensuring that outputs are both accurate and contextually relevant.

By synergizing these techniques, organizations can develop more intelligent and efficient systems for querying knowledge graphs without the need for continuous model retraining.
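
As a concrete starting point, the sketch below verbalizes a handful of knowledge-graph triples and embeds them with a sentence-embedding model, which is one simple way to realize the ontology-embedding step. The toy graph, the verbalization template, and the model choice are illustrative assumptions, not the configuration of any production system.

```python
# Verbalize knowledge-graph triples and embed them for retrieval.
# Assumes the sentence-transformers package; graph and model are toy
# illustrative choices.

from sentence_transformers import SentenceTransformer

# Toy ontology/graph as (subject, predicate, object) triples.
triples = [
    ("Parcel_42", "locatedIn", "Amsterdam"),
    ("Parcel_42", "hasOwner", "ACME_BV"),
    ("Amsterdam", "partOf", "Noord-Holland"),
]

# Verbalize each triple into a sentence the embedding model understands.
sentences = [f"{s} {p} {o}." for s, p, o in triples]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # one vector per triple, ready for retrieval
```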

Implementation Strategy

  • Combining Ontology Embeddings with Chain-of-Thought Prompting: This fusion allows LLMs to grasp structured knowledge and reason through it methodically, which is particularly beneficial when dealing with intricate data relationships.
  • Integrating within a RAG Framework: Traditionally used for unstructured data, RAG can be adapted to retrieve relevant segments from knowledge graphs, providing LLMs with the necessary context for informed response generation (a minimal sketch of this retrieval-and-prompting step follows this list).
  • Facilitating Zero/Few-Shot Reasoning: This approach minimizes the need for retraining by utilizing well-structured prompts, enabling LLMs to generalize across various datasets and schemas effectively.
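
Building on the embedded triples above, this sketch shows the retrieval and prompt-assembly step: the triples most similar to a question are retrieved and wrapped in a prompt that asks the model to reason step by step. The prompt wording and the toy question are illustrative assumptions.

```python
# Retrieve the most relevant verbalized triples for a question and
# assemble a Chain-of-Thought prompt; no model retraining involved.

import numpy as np
from sentence_transformers import SentenceTransformer

# Verbalized triples from the toy graph in the previous sketch.
sentences = [
    "Parcel_42 locatedIn Amsterdam.",
    "Parcel_42 hasOwner ACME_BV.",
    "Amsterdam partOf Noord-Holland.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, normalize_embeddings=True)

question = "Who owns the parcel in Amsterdam?"
q_vec = model.encode([question], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
top_k = np.argsort(embeddings @ q_vec)[::-1][:2]
context = "\n".join(sentences[i] for i in top_k)

prompt = (
    "Use only the facts below, retrieved from the knowledge graph.\n"
    f"Facts:\n{context}\n\n"
    f"Question: {question}\n"
    "Think step by step, then state the final answer."
)
print(prompt)  # this prompt is sent to the LLM as-is
```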

Organizational Benefits

Adopting this methodology offers several advantages:

  • Reduced Need for Retraining: Systems can adapt to evolving data without the overhead of continuous model updates.
  • Enhanced Explainability: The step-by-step reasoning process provides transparency in AI-driven decisions.
  • Improved Performance with Complex Data: The model’s ability to comprehend and navigate structured data leads to more accurate responses.
  • Adaptability to Schema Changes: The system remains resilient amidst modifications in data structures.
  • Efficient Deployment Across Domains: LLMs can be utilized across various sectors without domain-specific fine-tuning.

Practical Applications

This approach has been successfully implemented in large-scale systems, such as the Dutch national cadastral knowledge graph (Kadaster), demonstrating its viability in real-world scenarios. For instance, it enables the deployment of a chatbot capable of:

  • Understanding domain-specific relationships without explicit programming.
  • Updating its knowledge base in tandem with data evolution.
  • Operating seamlessly across departments with diverse taxonomies.
  • Delivering transparent and traceable answers in critical domains.

Conclusion

By integrating ontology-aware prompting, systematic reasoning, and retrieval-enhanced generation, organizations can develop AI systems that interact with structured data more effectively. This strategy not only streamlines the process but also enhances the reliability and adaptability of AI applications in data-intensive industries. For a comprehensive exploration of this methodology, refer to Bolin Huang’s Master’s thesis.

[Figure: A Knowledge Graph Question Answering (KGQA) framework that integrates ontology embeddings, Chain-of-Thought prompting, and Retrieval-Augmented Generation (RAG), showing the flow from user query to LLM reasoning and response generation based on structured data from a knowledge graph.]