We need to rethink chain-of-thought (CoT) prompting AI&YOU #68

Stat of the Week: Zero-shot CoT performance was only 5.55% for GPT-4-Turbo, 8.51% for Claude-3-Opus, and 4.44% for GPT-4. (“Chain of Thoughtlessness?” paper)

Chain-of-Thought (CoT) prompting has been hailed as a breakthrough in unlocking the reasoning capabilities of large language models (LLMs). However, recent research has challenged these claims and prompted us to revisit the technique.

In this week’s edition of AI&YOU, we explore insights from three blogs we published on the topic.

LLMs demonstrate remarkable capabilities in natural language processing (NLP) and generation. However, when faced with complex reasoning tasks, these models can struggle to produce accurate and reliable results. This is where Chain-of-Thought (CoT) prompting comes into play, a technique that aims to enhance the problem-solving abilities of LLMs.

CoT prompting is an advanced prompt engineering technique designed to guide LLMs through a step-by-step reasoning process. Unlike standard prompting methods that aim for direct answers, CoT prompting encourages the model to generate intermediate reasoning steps before arriving at a final answer.

At its core, CoT prompting involves structuring input prompts in a way that elicits a logical sequence of thoughts from the model. By breaking down complex problems into smaller, manageable steps, CoT attempts to enable LLMs to navigate through intricate reasoning paths more effectively.

How CoT Works

In practice, a CoT prompt leads the model through four stages:

  1. Problem Decomposition: The complex task is broken down into smaller, manageable steps.

  2. Step-by-Step Reasoning: The model is prompted to think through each step explicitly.

  3. Logical Progression: Each step builds upon the previous one, creating a chain of thoughts.

  4. Conclusion Drawing: The final answer is derived from the accumulated reasoning steps.

Types of CoT Prompting

Chain-of-Thought prompting can be implemented in various ways, with two primary types standing out:

  1. Zero-shot CoT: Zero-shot CoT doesn’t require task-specific examples. Instead, it uses a simple prompt like “Let’s approach this step by step” to encourage the model to break down its reasoning process.

  2. Few-shot CoT: Few-shot CoT involves providing the model with a small number of examples that demonstrate the desired reasoning process. These examples serve as a template for the model to follow when tackling new, unseen problems.
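
To make the difference between the two types concrete, here is a minimal sketch of how each prompt could be assembled as plain text. The function names, the "Q:/A:" layout, and the sample arithmetic question are illustrative assumptions, not a format prescribed by any model provider or by the research discussed in this edition.

```python
from typing import List, Tuple

# Generic reasoning trigger, taken from the zero-shot description above.
ZERO_SHOT_TRIGGER = "Let's approach this step by step."


def zero_shot_cot(question: str) -> str:
    """Append a generic reasoning trigger; no worked examples are included."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def few_shot_cot(examples: List[Tuple[str, str, str]], question: str) -> str:
    """Prepend worked examples, each given as (question, reasoning, answer)."""
    parts = []
    for ex_question, reasoning, answer in examples:
        parts.append(f"Q: {ex_question}\nA: {reasoning} The answer is {answer}.")
    # The new question is left open so the model continues in the same style.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Hypothetical demonstration example, written for this sketch.
    demo = [(
        "A shelf holds 3 boxes with 4 pens each. How many pens are there?",
        "There are 3 boxes. Each box has 4 pens. 3 * 4 = 12.",
        "12",
    )]
    new_question = "A crate holds 5 bags with 6 apples each. How many apples?"
    print(zero_shot_cot(new_question))
    print()
    print(few_shot_cot(demo, new_question))
```

The zero-shot version costs nothing to write but gives the model no template to imitate. The few-shot version supplies that template, at the price of hand-writing examples for each kind of problem.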

AI Research Paper Breakdown: “Chain of Thoughtlessness?”

Now that you know what CoT prompting is, we can dive into recent research that challenges its claimed benefits and offers insight into when it is actually useful.

The research paper, titled “Chain of Thoughtlessness? An Analysis of CoT in Planning,” provides a critical examination of CoT prompting’s effectiveness and generalizability. As AI practitioners, it’s crucial to understand these findings and their implications for developing AI applications that require sophisticated reasoning capabilities.

The researchers chose a classical planning domain called Blocksworld as their primary testing ground. In Blocksworld, the task is to rearrange a set of blocks from an initial configuration to a goal configuration using a series of move actions. This domain is ideal for testing reasoning and planning capabilities because:

  1. It allows for the generation of problems with varying complexity

  2. It has clear, algorithmically verifiable solutions

  3. It’s unlikely to be heavily represented in LLM training data
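
The "algorithmically verifiable" property is what makes the domain attractive for evaluation, and it can be sketched in a few lines. The code below is a simplified model written for this newsletter: it assumes a single `(block, destination)` move action and a dictionary-based state, whereas the paper's PDDL encoding may use finer-grained actions. All names are hypothetical.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]   # block -> what it rests on ("table" or another block)
Move = Tuple[str, str]   # (block to move, destination)
TABLE = "table"


def is_clear(state: State, block: str) -> bool:
    """A block is clear when no other block rests on it."""
    return block not in state.values()


def apply_move(state: State, move: Move) -> State:
    """Return the successor state, or raise ValueError if the move is illegal."""
    block, dest = move
    if block not in state:
        raise ValueError(f"unknown block {block!r}")
    if not is_clear(state, block):
        raise ValueError(f"{block!r} has a block on top of it")
    if dest != TABLE:
        if dest not in state or dest == block:
            raise ValueError(f"invalid destination {dest!r}")
        if not is_clear(state, dest):
            raise ValueError(f"{dest!r} is not clear")
    new_state = dict(state)
    new_state[block] = dest
    return new_state


def plan_is_valid(initial: State, goal: State, plan: List[Move]) -> bool:
    """Simulate the plan and check that every goal relation holds at the end."""
    state = dict(initial)
    try:
        for move in plan:
            state = apply_move(state, move)
    except ValueError:
        return False
    return all(state.get(block) == support for block, support in goal.items())


if __name__ == "__main__":
    initial = {"A": TABLE, "B": "A", "C": TABLE}   # B sits on A; C is alone
    goal = {"A": "B", "B": "C"}                    # want A on B on C
    print(plan_is_valid(initial, goal, [("B", "C"), ("A", "B")]))
    print(plan_is_valid(initial, goal, [("A", "B")]))  # A is covered by B
```

Because a checker like this is deterministic, any plan a model produces can be scored as right or wrong without human judgment.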

The study examined three state-of-the-art LLMs: GPT-4, Claude-3-Opus, and GPT-4-Turbo. These models were tested using prompts of varying specificity:

  1. Zero-Shot Chain of Thought (Universal): Simply appending “let’s think step by step” to the prompt.

  2. Progression Proof (Specific to PDDL): Providing a general explanation of plan correctness with examples.

  3. Blocksworld Universal Algorithm: Demonstrating a general algorithm for solving any Blocksworld problem.

  4. Stacking Prompt: Focusing on a specific subclass of Blocksworld problems (table-to-stack).

  5. Lexicographic Stacking: Further narrowing down to a particular syntactic form of the goal state.

By testing these prompts on problems of increasing complexity, the researchers aimed to evaluate how well LLMs could generalize the reasoning demonstrated in the examples.
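
For a sense of what a general algorithm for Blocksworld can look like, here is a sketch of one well-known procedure: move every block to the table, then build the goal stacks from the bottom up. It is offered as an illustration under simplifying assumptions (a single `(block, destination)` move, goal stacks that rest on the table, hypothetical names throughout) and is not claimed to match the wording of the prompt used in the paper.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]   # block -> what it rests on ("table" or another block)
Move = Tuple[str, str]   # (block to move, destination)
TABLE = "table"


def solve_blocksworld(initial: State, goal: State) -> List[Move]:
    """Unstack everything, then rebuild the goal stacks bottom-up."""
    state = dict(initial)
    plan: List[Move] = []

    # Phase 1: repeatedly take a clear block that is not on the table
    # and put it on the table, until every block is on the table.
    while any(support != TABLE for support in state.values()):
        for block, support in state.items():
            clear = block not in state.values()
            if clear and support != TABLE:
                plan.append((block, TABLE))
                state[block] = TABLE
                break
        else:
            raise ValueError("initial state is not a valid set of stacks")

    # Phase 2: a block may be placed once the block beneath it is in place.
    # Blocks the goal does not mention are assumed to stay on the table.
    placed = {b for b in state if goal.get(b, TABLE) == TABLE}
    pending = [b for b in state if b not in placed]
    while pending:
        for block in pending:
            if goal[block] in placed:
                plan.append((block, goal[block]))
                state[block] = goal[block]
                placed.add(block)
                pending.remove(block)
                break
        else:
            raise ValueError("goal is cyclic or refers to unknown blocks")
    return plan


if __name__ == "__main__":
    initial = {"A": TABLE, "B": "A", "C": "B"}     # tower: C on B on A
    goal = {"A": "B", "B": "C", "C": TABLE}        # reversed tower
    for step in solve_blocksworld(initial, goal):
        print(step)
```

The plans this produces are not always the shortest, but the procedure works for any number of blocks. That is the kind of generalization the more general prompts were meant to elicit from the models.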

Key Findings Unveiled

The results of this study challenge many prevailing assumptions about CoT prompting:

  1. Limited Effectiveness of CoT: Contrary to previous claims, CoT prompting showed significant performance improvements only when the examples provided were extremely similar to the query problem. As soon as the problems deviated from the exact format shown in the examples, performance dropped sharply.

  2. Rapid Performance Degradation: As the complexity of the problems increased (measured by the number of blocks involved), the accuracy of all models decreased dramatically, regardless of the CoT prompt used. This suggests that LLMs struggle to extend the reasoning demonstrated in simple examples to more complex scenarios.

  3. Ineffectiveness of General Prompts: Surprisingly, more general CoT prompts often performed worse than standard prompting without any reasoning examples. This contradicts the idea that CoT helps LLMs learn generalizable problem-solving strategies.

  4. Specificity Trade-off: The study found that highly specific prompts could achieve high accuracy, but only on a very narrow subset of problems. This highlights a sharp trade-off between performance gains and the applicability of the prompt.

  5. Lack of True Algorithmic Learning: The results strongly suggest that LLMs are not learning to apply general algorithmic procedures from the CoT examples. Instead, they seem to rely on pattern matching, which breaks down quickly when faced with novel or more complex problems.

These findings have significant implications for AI practitioners and enterprises looking to leverage CoT prompting in their applications. They suggest that while CoT can boost performance in certain narrow scenarios, it may not be the panacea for complex reasoning tasks that many had hoped for.

Implications for AI Development

For enterprises building applications that require complex reasoning or planning capabilities, the study points to five takeaways:

  1. Reassessing CoT Effectiveness: AI developers should be cautious about relying on CoT for tasks that require true algorithmic thinking or generalization to novel scenarios.

  2. Limitations of Current LLMs: Alternative approaches may be necessary for applications requiring robust planning or multi-step problem-solving.

  3. The Cost of Prompt Engineering: While highly specific CoT prompts can yield good results for narrow problem sets, the human effort required to craft these prompts may outweigh the benefits, especially given their limited generalizability.

  4. Rethinking Evaluation Metrics: Relying solely on static test sets may overestimate a model’s true reasoning capabilities.

  5. The Gap Between Perception and Reality: There’s a significant discrepancy between the perceived reasoning abilities of LLMs (often anthropomorphized in popular discourse) and their actual capabilities as demonstrated in this study.

Recommendations for AI Practitioners:

  • Evaluation: Implement diverse testing frameworks to assess true generalization across problem complexities.

  • CoT Usage: Apply Chain-of-Thought prompting judiciously, recognizing its limitations in generalization.

  • Hybrid Solutions: Consider combining LLMs with traditional algorithms for complex reasoning tasks.

  • Transparency: Clearly communicate AI system limitations, especially for reasoning or planning tasks.

  • R&D Focus: Invest in research to enhance true reasoning capabilities of AI systems.

  • Fine-tuning: Consider domain-specific fine-tuning, but be aware of potential generalization limits.
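
The first recommendation can be sketched as a small harness that generates fresh problems at several sizes and reports verified accuracy per size, instead of a single score on a fixed test set. Everything here is a hypothetical illustration: the table-to-stack problem generator, the function names, and the `Solver` interface are assumptions, and in practice `solver` would wrap a call to whichever model and prompt you are evaluating.

```python
import random
from typing import Callable, Dict, List, Tuple

Move = Tuple[str, str]                  # (block, destination)
Problem = List[str]                     # goal stack, listed bottom to top
Solver = Callable[[Problem], List[Move]]


def make_problem(n_blocks: int, rng: random.Random) -> Problem:
    """All blocks start on the table; the goal is one stack in random order."""
    blocks = [f"b{i}" for i in range(n_blocks)]
    rng.shuffle(blocks)
    return blocks


def plan_solves(problem: Problem, plan: List[Move]) -> bool:
    """Check a table-to-stack plan by simulating it move by move."""
    on = {block: "table" for block in problem}
    for block, dest in plan:
        if block not in on or dest == block:
            return False
        if dest != "table" and dest not in on:
            return False
        # The moved block and a block destination must have nothing on top.
        if block in on.values() or (dest != "table" and dest in on.values()):
            return False
        on[block] = dest
    return all(on[upper] == lower for lower, upper in zip(problem, problem[1:]))


def accuracy_by_size(solver: Solver, sizes: List[int], trials: int,
                     seed: int = 0) -> Dict[int, float]:
    """Fraction of verified plans per problem size, on generated instances."""
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for n in sizes:
        solved = 0
        for _ in range(trials):
            problem = make_problem(n, rng)
            if plan_solves(problem, solver(problem)):
                solved += 1
        results[n] = solved / trials
    return results


def reference_solver(problem: Problem) -> List[Move]:
    """Exact solver used as a sanity check: stack each block on the one below."""
    return [(upper, lower) for lower, upper in zip(problem, problem[1:])]


if __name__ == "__main__":
    # Replace reference_solver with a function that queries your model.
    print(accuracy_by_size(reference_solver, sizes=[3, 5, 10, 20], trials=25))
```

Reporting accuracy per size makes degradation on larger problems visible. Keeping an exact solver alongside also hints at the hybrid option: let a classical algorithm or verifier handle what it can do reliably.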

For AI practitioners and enterprises, these findings highlight the importance of combining LLM strengths with specialized reasoning approaches, investing in domain-specific solutions where necessary, and maintaining transparency about AI system limitations. As we move forward, the AI community must focus on developing new architectures and training methods that can bridge the gap between pattern matching and true algorithmic reasoning.

10 Best Prompting Techniques for LLMs

This week, we also explore ten of the most powerful and common prompting techniques, offering insights into their applications and best practices.

Well-designed prompts can significantly enhance an LLM’s performance, enabling more accurate, relevant, and creative outputs. Whether you’re a seasoned AI developer or just starting with LLMs, these techniques will help you unlock the full potential of AI models.

Make sure to check out the full blog to learn more about each one.


Thank you for taking the time to read AI & YOU!

For even more content on enterprise AI, including infographics, stats, how-to guides, articles, and videos, follow Skim AI on LinkedIn
