Sentiment Analysis Is Harder Than It Looks
Sentiment analysis is everywhere, from analyzing customer reviews to tracking political trends. But how accurate is it, really? More importantly, how well can machines capture human emotion and nuance in text?
To answer that, I ran a real-world comparison on a dataset of healthcare-related social media posts, evaluating two AI models, ChatGPT and a BERT model fine-tuned on Reddit and Twitter data, against human-coded sentiment labels. Both the fine-tuned BERT model and the dataset come from my Ph.D. dissertation, so they were already available to me; more information on the BERT model can be found there.
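For context, here is a minimal sketch of how posts can be scored with a fine-tuned BERT model via the Hugging Face transformers library. The checkpoint name is a placeholder, not my actual dissertation model:

```python
# A sketch of scoring posts with a fine-tuned BERT sentiment model
# using the Hugging Face `transformers` pipeline API.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/bert-reddit-twitter-sentiment",  # hypothetical checkpoint
)

posts = [
    "The new medication finally lets me sleep through the night.",
    "Waited three hours at the clinic and still no answers.",
]

for post in posts:
    result = classifier(post)[0]  # e.g. {'label': 'positive', 'score': 0.94}
    print(f"{result['label']}  ({result['score']:.2f})  {post}")
```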
The results tell an interesting story about the strengths and limitations of sentiment analysis, not just for machines, but for humans too.
ChatGPT vs. BERT: Which One Came Closer to Human Labels?
Overall, ChatGPT performed better than the BERT model:
- ChatGPT reached 59.83% agreement with human-coded sentiments on preprocessed text (the same preprocessed dataset I used for the BERT model)
- On raw text, agreement was 58.52% (the same dataset I gave the second human coder, with no preprocessing such as lemmatization; a sketch of what that preprocessing looks like follows this list)
- The BERT model lagged behind in both scenarios
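To make "preprocessed" concrete, here is a rough sketch of the kind of cleanup involved: lowercasing, stripping URLs and punctuation, and lemmatizing. My dissertation's exact pipeline differs in its details; this just illustrates the idea:

```python
# A rough sketch of text preprocessing: lowercase, drop URLs and
# punctuation, then lemmatize each token with NLTK's WordNet lemmatizer.
import re

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lookup data for the lemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.split())

print(preprocess("Doctors said the tests were fine! https://example.com"))
# -> doctor said the test were fine
```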
This suggests that large language models like ChatGPT are moving sentiment analysis closer to human-level understanding, but the gains are nuanced. The image below shows ChatGPT's and the trained BERT model's agreement with my coding for each Reddit post; note that these results use the preprocessed dataset to generate the ChatGPT output.
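For concreteness, computing the headline numbers is straightforward: overall agreement is just the share of posts where the model's label matches the human-coded label. The lists below are toy stand-ins for the real data:

```python
# Overall agreement: fraction of posts where model and human labels match.
human_labels = ["neutral", "positive", "negative", "neutral", "positive"]
model_labels = ["neutral", "neutral", "negative", "neutral", "negative"]

agreement = sum(h == m for h, m in zip(human_labels, model_labels)) / len(human_labels)
print(f"Overall agreement: {agreement:.2%}")  # 60.00% on this toy sample
```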

Class-by-Class Comparison: Where Each Model Excels
Looking beyond overall scores, I broke down agreement across each sentiment class. Here’s what I found:

- Neutral Sentiment: ChatGPT led with 64.76% agreement, outperforming BERT’s 44.76%.
- Positive Sentiment: BERT did better with 66.19% vs. ChatGPT’s 41.73%.
- Negative Sentiment: Both struggled, with BERT at 26.09% and ChatGPT at 17.39%.
These results suggest that ChatGPT handles nuance and neutrality better, while BERT tends to over-assign positivity, a common pattern in models trained on data from social platforms like Reddit and Twitter. A sketch of how this per-class breakdown can be computed follows.
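Assuming "per-class agreement" means, of the posts a human coded as class X, the fraction the model also labeled X, this corresponds to per-class recall, which scikit-learn reports directly. The labels below are illustrative only:

```python
# Per-class agreement read as per-class recall: for each human-coded
# class, what fraction of those posts did the model label the same way?
from sklearn.metrics import classification_report

human_labels = ["neutral", "neutral", "neutral", "positive", "positive", "negative"]
model_labels = ["neutral", "neutral", "positive", "positive", "negative", "neutral"]

print(classification_report(human_labels, model_labels, zero_division=0))
```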
But Wait … Even Humans Don’t Fully Agree
Here’s where it gets more interesting! When comparing two human coders on the same dataset, agreement was just 72.79% overall. Class-specific agreement levels were:
- Neutral: 91.8%
- Negative: 60%
- Positive: Only 43.6%
This mirrors the models' behavior. The subjectivity of sentiment, especially in borderline cases or ambiguous language, is challenging even for humans! A quick sketch of how inter-coder agreement can be quantified follows.
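Percent agreement does not correct for agreement expected by chance; Cohen's kappa is the standard complement for two coders. Kappa is not reported in this post, so the sketch below (with toy labels) only shows how it would be computed:

```python
# Raw percent agreement vs. Cohen's kappa (chance-corrected) for two coders.
from sklearn.metrics import cohen_kappa_score

coder_a = ["neutral", "positive", "neutral", "negative", "neutral", "positive"]
coder_b = ["neutral", "positive", "positive", "negative", "neutral", "negative"]

raw = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(f"Raw agreement: {raw:.2%}")
print(f"Cohen's kappa: {cohen_kappa_score(coder_a, coder_b):.3f}")
```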
Why Sentiment Is So Difficult … Even for Humans
As discussed in my dissertation, sentiment classification is impacted by:
- Ambiguous or mixed emotions in a single post
- Sarcasm and figurative language
- Short posts with little context
- Different human interpretations of tone and intent
In short: sentiment is not just about word choice; it's about context, subtlety, and perception. I tackle this in much more depth in my dissertation, so if you want to read more about what other researchers are saying, see Chapter 5, where I discuss sentiment analysis issues, explanations, and implications.
Here is the Spiel
- ChatGPT outperformed a Reddit- and Twitter-trained BERT model in overall agreement, and especially on neutral sentiment.
- Positive and negative sentiment remain harder to classify, for both models and humans.
- Even human coders don’t always agree, proving that sentiment is a subjective task by nature.
- In applications where precision matters, such as healthcare, finance, or policy, sentiment analysis needs to be interpreted carefully, not blindly trusted.
Final Thoughts
AI is getting better at understanding us, but it still has blind spots. As we continue to apply sentiment analysis in real-world domains, we must account for ambiguity, human disagreement, and context. More importantly, we need to acknowledge that even “ground truth” isn’t always absolute.
Let’s keep pushing the boundaries, but with a healthy respect for the complexity of human emotion.
