Can LLMs Really Explain Themselves? A Look at ChatGPT's Explanatory Abilities

A recent study found that Large Language Models (LLMs) like ChatGPT can self-generate feature attribution explanations, but their effectiveness, compared to traditional methods, varies. The study finds no clear winner across different faithfulness metrics, and the explanations show high disagreement. Additionally, the explanation values from LLMs tend to be well-rounded and lack fine-grained variation, suggesting a human-like reasoning approach but raising questions about their precision and utility.

Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce “helpful” responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as “fantastic” and “memorable” in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT’s self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.

arXiv.orgShiyuan Huang

Key Findings

Background: The paper investigates the ability of LLMs like ChatGPT to generate self-explanations, particularly in sentiment analysis tasks, comparing them to traditional explanation methods like occlusion and LIME.
Self-Explanations in LLMs: These models can spontaneously generate explanations for their decisions, such as identifying key words in a sentiment analysis task.
Experiment Design: The study uses two types of self-explanations: full feature attribution for each word and top-k explanations highlighting key words. It compares these with traditional methods like occlusion saliency and LIME.
Accuracy and Trade-offs: The accuracy of predictions varies with different self-explanation approaches. Generating explanations first (Explain-then-Predict) can lower performance, indicating a trade-off between accuracy and interpretability.
Faithfulness of Explanations: No single explanation method consistently outperforms others across various metrics. Self-explanations perform comparably to traditional methods but differ significantly in terms of agreement metrics.
Model Behavior and Output: ChatGPT’s explanations and predictions show well-rounded values and are less sensitive to word removals, reflecting a human-like reasoning process but potentially lacking in detailed precision.
Implications and Future Work: The findings suggest the need for better methods to elicit self-explanations and rethink evaluation practices. Comparative studies with other LLMs and different explanation types could provide further insights.

Self-explanations in Large Language Models (LLMs)

Self-explanations in Large Language Models (LLMs) like ChatGPT means that these models are capable of spontaneously generating explanations for their decisions, particularly in tasks that involve understanding and interpreting text, such as sentiment analysis.

Detailed Discussion on Self-Explanations in LLMs:

Nature of Self-Explanations: LLMs can articulate the rationale behind their decisions in a human-like manner. This involves identifying and highlighting elements in the input data that most significantly influenced the model's decision. In sentiment analysis, this could mean pinpointing specific words or phrases that led to a particular sentiment classification.
Examples in Sentiment Analysis:
- Positive Sentiment: For a movie review like, "The film was a breathtaking journey with outstanding performances," an LLM might explain its positive sentiment classification by identifying key phrases such as "breathtaking journey" and "outstanding performances." It interprets these phrases as indicators of positive sentiment.
- Negative Sentiment: Conversely, in a review stating, "The movie was a dull and uninspired waste of time," the LLM may highlight "dull and uninspired" and "waste of time" as the reasons for categorizing the sentiment as negative.
Mechanism of Generating Explanations: LLMs, particularly those like ChatGPT, utilize their vast understanding of language and context to generate these explanations. They are trained on large datasets that include not only linguistic information but also contextual understanding, allowing them to identify not just the words but also their connotations and implications in various contexts.
Importance of Context: The context plays a crucial role in how these models generate explanations. For instance, the word "unpredictable" in a movie review could be positive ("The plot was delightfully unpredictable") or negative ("The acting was unpredictably bad"), and the LLMs are adept at discerning these nuances based on the surrounding text.
Self-Explanations and Model Training: The ability of LLMs to provide self-explanations is partly a result of their training, which often involves not just predicting outcomes but also generating explanations. This is achieved through advanced techniques like reinforcement learning from human feedback, where models are fine-tuned to produce outputs that are not only accurate but also interpretable.
Benefits and Limitations:
- Enhanced Transparency and Trust: Self-explanations make LLMs more transparent and trustworthy, as users can understand the reasoning behind a model's decision.
- Educational Tool: In educational settings, such explanations can help students learn how to analyze texts.
- Limitations: However, the quality of these explanations can vary, and they may not always reflect the true complexity of the model's decision-making process. There's a risk of oversimplification or even misleading explanations if the model's training data is biased or incomplete.
Future Directions: Ongoing research aims to refine these self-explanation capabilities, ensuring they are accurate, relevant, and truly reflective of the models' decision-making processes.

Self-explanations in LLMs are a blend of linguistic proficiency and contextual awareness, enabling these models to communicate their decision-making process in a transparent and human-like manner. While the technology holds immense promise, it also underscores the need for careful training and evaluation to ensure the explanations are as reliable and informative as possible.

Accuracy and Trade-offs

This refers to the balance between the accuracy of the model’s predictions and the ability to generate interpretable self-explanations. Different approaches to generating these explanations can impact the model's performance, highlighting a trade-off between accuracy and interpretability.

Detailed Discussion on Accuracy and Trade-offs:

Explain-then-Predict vs. Predict-and-Explain:
- Explain-then-Predict: In this approach, the model first generates an explanation for a decision and then makes the prediction based on that explanation.
- Predict-and-Explain: Conversely, in the predict-and-explain approach, the model first makes a prediction and then generates an explanation for that prediction.
Impact on Performance:
- When a model adopts the explain-then-predict approach, it might show reduced accuracy in its predictions. This could be because the model is constrained by the need to align its prediction with the pre-generated explanation.
- In contrast, the predict-and-explain approach tends to result in higher accuracy since the model is not restricted by any prior explanation while making its prediction.
Examples Illustrating the Trade-off:
- Sentiment Analysis Task: Consider a sentiment analysis task where the model is asked to evaluate the sentiment of a movie review and then explain its reasoning.
  - Explain-then-Predict: If the model first notes key sentiment-indicating words (e.g., "brilliant", "poor") and then predicts the overall sentiment, it might overlook the broader context or miss subtle cues, leading to less accurate sentiment predictions.
  - Predict-and-Explain: If the model first determines the sentiment based on the entire review and then identifies key words or phrases justifying its decision, it tends to make more accurate predictions, as it isn’t confined to the initial words or phrases it identified.
Reasons for Trade-offs:
- Cognitive Load: Generating an explanation first can place a cognitive load on the model, as it has to maintain consistency with the explanation when making a prediction.
- Contextual Flexibility: Predict-and-explain allows the model to use the full context and its entire understanding of the text to make a prediction, leading to potentially higher accuracy.
Implications in Real-World Applications:
- The trade-off has significant implications in real-world scenarios where both accuracy and interpretability are crucial. For instance, in healthcare or financial settings, inaccurate predictions could have serious consequences, even if they are well-explained.
- However, in educational or customer service settings, interpretability might be more valuable, even if it means a slight compromise in accuracy.
Balancing the Trade-off:
- Ongoing research is focused on minimizing this trade-off, aiming to develop models that can both explain their reasoning and maintain high accuracy in their predictions.

In summary, the approach LLMs take in generating self-explanations can significantly impact their performance. The explain-then-predict approach might lead to lower accuracy due to the constraints of aligning the prediction with the explanation, while the predict-and-explain approach usually results in higher accuracy as the model is free to use the full context for its predictions. The choice between these approaches depends on the specific requirements of accuracy and interpretability in different applications.

Faithfulness of Explanations

This refers to how accurately the explanations provided by the model reflect the actual reasons behind its decisions. In the context of self-explanations and traditional explanation methods, faithfulness assesses whether these explanations truly represent why a model made a specific prediction.

Detailed Discussion on Faithfulness of Explanations:

Comparison of Explanation Methods:
- Traditional methods like occlusion saliency and LIME (Local Interpretable Model-agnostic Explanations) have been standard in providing insights into model decisions.
- Self-explanations generated by LLMs are a newer approach where the model generates its own rationale for its predictions.
Variability Across Metrics:
- Different metrics are used to evaluate the faithfulness of explanations, such as comprehensiveness, sufficiency, and agreement metrics.
- No single method (self-explanations or traditional) consistently outperforms the others across these various metrics. This indicates variability in how different methods capture the reasoning process of LLMs.
Examples and Explanation:
- Example of a Sentiment Analysis Task:
  - Self-Explanation: A model might explain its positive sentiment classification of a movie review by highlighting words like "captivating" and "masterpiece." However, this explanation could be based more on the presence of these positive words than on a deeper understanding of the overall context.
  - Occlusion Method: This method might identify the same words as critical by showing a significant change in sentiment prediction when these words are removed. However, this doesn't necessarily mean that the model's decision was based solely or primarily on these words.
  - LIME Method: LIME might provide a different set of important features, perhaps considering more the sentence structure or less obvious words that contribute to the sentiment. Again, this might not perfectly align with the actual reasoning of the model.
- Comparative Analysis: In this example, each method gives a different perspective on what's important for the model's decision. Self-explanations might appear more intuitive but might not always capture the subtle nuances that traditional methods like LIME could reveal.
Agreement Metrics:
- Agreement metrics assess how much different explanation methods concur with each other.
- Significant differences in agreement metrics between self-explanations and traditional methods suggest that they might be capturing different aspects of the model's reasoning process.
- For instance, if occlusion and self-explanation methods give different importance to the same words in a text classification task, it indicates a lack of consensus on what factors are most influential in the model’s decision-making.
Implications:
- These findings imply that different explanation methods may be suitable for different purposes or audiences. For example, self-explanations might be more accessible to laypersons, while traditional methods might provide deeper insights for expert analysis.
- The variability also underscores the complexity of LLMs' decision-making processes and the challenge in developing universally reliable explanation methods.

The faithfulness of explanations in LLMs is a complex and multi-faceted issue. No single explanation method is universally superior across all metrics, and significant differences in agreement metrics indicate that different methods may be capturing distinct aspects of the models' reasoning processes. This variability suggests a need for multiple explanation methods depending on the specific context and requirements of accuracy, interpretability, and audience understanding.

Model Behavior and Output

This is the characteristics and nuances of how these models generate explanations and predictions. A notable feature of ChatGPT's behavior is the generation of well-rounded values in its explanations and a reduced sensitivity to word removals. This reflects a more human-like reasoning process but also brings into question the detailed precision of these models.

Detailed Discussion on Model Behavior and Output:

Well-Rounded Values in Explanations:
- ChatGPT often produces explanations with attribution values that are rounded or simplified, such as 0.5, 0.75, etc., instead of more complex, precise decimals.
- Example: In a sentiment analysis task, ChatGPT might assign a sentiment score of 0.5 to the word "interesting" in a movie review. This rounded value is easier for humans to interpret but might not precisely represent the word's actual impact on the sentiment analysis.
Less Sensitivity to Word Removals:
- ChatGPT's predictions tend to be less affected by the removal of individual words from the input text.
- Example: Consider a sentence "The movie was surprisingly dull and uninspiring." Even if key adjectives like "surprisingly" or "dull" are removed, ChatGPT might still maintain its original sentiment prediction, showing a lack of sensitivity to these changes. This could be due to the model's ability to infer the overall sentiment from the remaining context.
Reflecting Human-Like Reasoning:
- This behavior of generating rounded values and being less sensitive to word removals is indicative of a reasoning process more akin to human thinking, where exact numerical precision is less common, and the overall context is given more weight than individual words.
- Human Analogy: Humans often assess situations based on general impressions rather than exact calculations. For instance, a person might describe a movie as generally enjoyable despite some flaws, without quantifying each aspect's contribution to their overall opinion.
Potential Lack of Detailed Precision:
- While these characteristics make the model's explanations more accessible and understandable, they might lack the detailed precision necessary for certain applications.
- In critical applications like medical diagnosis or legal analysis, where nuanced understanding and precision are essential, this lack of sensitivity could be a significant limitation.
Balancing Interpretability and Precision:
- The challenge lies in balancing the interpretability of the model's output with the need for detailed precision. This balance is crucial for the model's applicability across various domains.
- Future developments might focus on enhancing the model's ability to provide more nuanced and detailed explanations without compromising its user-friendly and interpretable nature.

ChatGPT's behavior and output, characterized by well-rounded explanation values and reduced sensitivity to word removals, reflect a human-like approach to processing and interpreting information. While this makes the model's outputs more relatable and easier to understand for users, it raises questions about the precision and depth of the model's understanding, especially in scenarios where detailed analysis is crucial. Addressing this challenge involves developing methods to enhance the model's precision without sacrificing its interpretability.

Strategies for Prompt Engineering Using Self Explanation

The insights gained from understanding ChatGPT's model behavior and output can be extremely valuable in prompt engineering. Prompt engineering involves crafting queries or inputs in a way that effectively guides the model to provide the most useful and relevant responses.

By understanding how ChatGPT generates explanations, assigns values, and interprets inputs, one can tailor prompts to leverage these behaviors for better outcomes.

Leveraging Well-Rounded Values:
- Knowing that ChatGPT tends to provide well-rounded explanation values, prompts can be designed to ask for explanations in a format that aligns with this tendency.
- Example: Instead of asking for highly detailed or technical explanations, one might prompt for more general, overview-style explanations that the model is more adept at providing.
Utilizing Human-Like Reasoning:
- Since ChatGPT’s reasoning mimics human thought processes, prompts can be framed in a conversational, human-like manner to yield better responses.
- Example: Using prompts that mimic natural human inquiry, like "Can you explain why this might be a good choice?" instead of asking for highly specific technical details.
Designing for Contextual Sensitivity:
- With the understanding that ChatGPT may be less sensitive to the removal of individual words, prompts can be structured to focus on broader context or to specifically highlight key aspects that need to be considered.
- Example: When seeking analysis of text, including more contextual information in the prompt can guide the model to consider the broader narrative or argument rather than focusing solely on specific keywords.
Balancing Precision and Interpretability:
- In cases where precision is crucial, prompts can be specifically designed to ask the model to focus on or clarify particular details.
- Example: For technical or scientific queries, one might ask, "What are the critical factors affecting this outcome, and could you rank them in order of importance?"
Direct Queries for Clarifications and Elaborations:
- Understanding that the model may generate simplified explanations, one can design prompts to ask for further clarifications or detailed elaborations where necessary.
- Example: Following up a model's response with prompts like "Can you expand on how you reached this conclusion?" or "What are some other factors that might be relevant?"
Iterative Prompting:
- Use an iterative approach where initial responses from the model are used to refine and redirect subsequent prompts for more precise or comprehensive information.
- Example: After receiving an initial response, one might ask, "Based on what you've just mentioned, how would this change if [specific condition] were different?"

Effective prompt engineering with ChatGPT involves understanding and adapting to the model's strengths and limitations in terms of its explanation style, sensitivity to context, and human-like reasoning process. By tailoring prompts to these characteristics, one can guide the model to provide more useful, accurate, and contextually relevant responses.

Take Advantage of Self Explanation when Developing Prompts for ChatGPT:

Define the Objective:
- Clearly identify what you want to achieve with the prompt. This could be information gathering, problem-solving, learning, creativity enhancement, etc.
Understand Model Behavior:
- Recognize that ChatGPT tends to provide well-rounded, human-like explanations and may not be highly sensitive to every single word in the prompt.
- Acknowledge that the model's responses are generally more holistic and context-driven.
Craft an Initial Prompt:
- Create a basic prompt that directly addresses your objective, keeping in mind the model's tendency to provide generalized responses.
- Ensure clarity and specificity in the prompt to guide the model's response in the right direction.
Iterate and Refine:
- Based on the initial response from ChatGPT, refine the prompt to address any gaps or to steer the conversation towards the desired detail or clarity.
- This iterative process may involve follow-up questions or rephrasing of the original prompt for precision.
Consider Contextual and Detailed Elements:
- If the objective requires detailed or technical responses, include contextual details in the prompt to guide the model's focus.
- For broader, more general responses, prompts should be open-ended but guided by the overarching topic.
Balance between Open-Ended and Specific Questions:
- Decide on the balance between open-ended questions that encourage broader thinking and specific questions that elicit targeted responses.
Test and Evaluate:
- Test the prompt to see if it generates the desired type of response.
- Evaluate the effectiveness of the prompt in achieving the objective and make adjustments as necessary.

Example of Prompt Development:

Objective: To get suggestions for healthy meal planning for a week.

Initial Prompt: "Can you provide some healthy meal ideas?"
- Response: ChatGPT might offer a generic list of healthy foods or meal suggestions.
Refined Prompt: "What are five healthy meals I can prepare for dinner next week that are both nutritious and easy to make?"
- Response: This prompts ChatGPT to provide more specific meal ideas tailored to the criteria of being both healthy and easy to prepare.
Further Iteration: "For each meal suggestion, could you also mention the primary nutritional benefits and approximate preparation time?"
- Response: This additional refinement pushes ChatGPT to include nutritional information and preparation time, adding valuable details to each meal suggestion.

By following this process, one can develop effective prompts that leverage ChatGPT's strengths while mitigating its limitations, leading to more satisfying and productive interactions with the model.

Use Cases

Understanding ChatGPT's model behavior and output can significantly enhance its utility across various use cases. By tailoring prompts to leverage the model's strengths and accommodate its limitations, users can obtain more accurate, relevant, and useful responses. Below are some use cases with detailed prompt examples that incorporate these concepts:

1. Educational Tutoring

Use Case: Assisting students in understanding complex concepts.

Prompt Example:

Initial: "Can you explain the concept of photosynthesis in simple terms?"
Follow-up: "That's a good start. Can you now give an example of how photosynthesis is essential for life on Earth?"

Rationale: The initial prompt asks for a simplified explanation, aligning with ChatGPT's tendency to provide rounded, general explanations. The follow-up prompt encourages the model to provide a practical, real-world application of the concept, enhancing understanding.

2. Technical Support

Use Case: Troubleshooting technical issues for users.

Prompt Example:

"I'm having trouble with my Wi-Fi connection. The signal keeps dropping. Based on this, what are some common reasons for this issue, and how can I fix it?"

Rationale: This prompt provides a specific problem (Wi-Fi signal dropping) and asks for common reasons and solutions, allowing ChatGPT to use its broad knowledge base to offer practical advice.

3. Content Creation

Use Case: Generating ideas for blog posts or articles.

Prompt Example:

"I'm writing a blog post about healthy eating habits. Can you suggest five unique angles to approach this topic, and why each might be appealing to readers?"

Rationale: By asking for multiple angles and their appeals, the prompt encourages ChatGPT to generate diverse and creative ideas, leveraging its ability to provide rounded, general insights.

4. Business Strategy

Use Case: Offering strategic business advice.

Prompt Example:

"I am planning to expand my cafe business. Considering current market trends, what are three expansion strategies I could consider, and what are the pros and cons of each?"

Rationale: This prompt directs ChatGPT to consider broader market trends and provide a rounded analysis of each suggested strategy, helping in informed decision-making.

5. Language Learning

Use Case: Assisting in learning a new language.

Prompt Example:

"Can you provide me with ten common phrases in French that travelers should know, along with their meanings and a context in which each might be used?"

Rationale: The prompt guides ChatGPT to offer practical language knowledge with context, aligning with its strength in providing human-like, context-aware responses.

6. Personal Finance Advice

Use Case: Offering guidance on personal finance management.

Prompt Example:

"What are five effective strategies for saving money in daily life, and how can each strategy impact my overall financial health?"

Rationale: This prompt encourages ChatGPT to provide general financial advice with a focus on daily life application, a scenario where its broad, general knowledge is beneficial.

7. Mental Health Support

Use Case: Offering general mental health and wellness tips.

Prompt Example:

"What are some common stress-reduction techniques, and can you explain how each technique helps in managing stress?"

Rationale: The prompt asks for general advice on a critical issue, which is well-suited to ChatGPT's ability to provide rounded, general explanations.

In each of these use cases, the prompts are designed to leverage ChatGPT's ability to provide human-like, general explanations and its broad knowledge base, while also guiding it to give more focused and contextually relevant responses where needed. By carefully crafting prompts, users can greatly enhance the utility and effectiveness of their interactions with ChatGPT.

Self Explanation in Conversational Prompting

The process outlined for developing prompts for ChatGPT is particularly useful for conversational prompting, where the goal is to engage the model in a dialogue that is coherent, contextually relevant, and informative. In conversational AI, the quality of the interaction often hinges on how well the prompts are designed. Let's discuss the usefulness of this process in the context of conversational prompting:

1. Enhanced Clarity and Relevance

Defining Objectives: Clear objectives help ensure that the conversation stays on topic and meets the user's needs.
Refined Prompts: Iteratively refining prompts based on ChatGPT's responses helps in steering the conversation towards more relevant and specific areas, enhancing the overall clarity and relevance of the interaction.

2. Improved Engagement and User Experience

Contextual and Detailed Elements: Incorporating these elements makes the conversation more engaging for the user. It allows ChatGPT to provide responses that are not only informative but also tailored to the user's context, improving the overall user experience.

3. Adaptability to Different Conversational Scenarios

Balancing Question Types: The ability to balance open-ended and specific questions allows for adaptability in different conversational scenarios, whether it's a casual chat, an educational dialogue, or a technical discussion.

4. Higher Quality and Depth of Responses

Testing and Evaluation: Regularly testing and evaluating the prompts ensures that the model generates high-quality and deep responses. This is crucial in maintaining a conversation that is both meaningful and insightful.

5. Efficient Information Exchange

Model Behavior Understanding: By understanding the model’s tendency to provide rounded, human-like responses, the prompts can be designed to extract information efficiently, leading to a more productive exchange.

6. Continuous Improvement of Conversations

Iterative Process: Continuously refining prompts based on previous responses allows for a dynamic and evolving conversation, where each interaction builds on the last, improving the flow and depth of the dialogue.

Example in Conversational Context:

Objective: To discuss strategies for stress management.

Initial Prompt: "Can you tell me how to manage stress?"
- Response: ChatGPT might provide general tips on stress management.
Refined Prompt: "What are some daily habits that can help reduce stress, and why are they effective?"
- Response: ChatGPT now provides specific daily habits and explains their effectiveness, adding depth to the conversation.
Further Iteration: "Can you give an example of how one might incorporate these habits into a typical workday?"
- Response: ChatGPT can now contextualize the advice within the framework of a typical workday, making the conversation more relevant and applicable to the user.

This process fosters a more engaging, informative, and user-centric conversation. By continuously refining prompts based on the model's responses and behavior, users can guide ChatGPT to provide more focused, detailed, and contextually appropriate answers, enhancing the quality of the conversational experience.

This study offers a crucial roadmap for leveraging and navigating the world of LLM self-explanations. Understanding the trade-offs between accuracy and interpretability, choosing the right explanation methods, and tailoring prompts to model behavior are key to reaping the benefits of LLM transparency. From educational tools to healthcare applications, the responsible utilization of self-explanations holds immense potential for a future where we trust and understand the intelligent systems shaping our world.