Chain-of-Thought (CoT)

TL;DR

A prompting technique that makes LLMs show their reasoning step-by-step, improving accuracy especially on complex reasoning tasks.

Ask an LLM: "Sarah has 5 apples. She buys 3 more. How many does she have?" The model generates: "8 apples." Correct. Now ask: "Sarah has 5 apples. She gives half to her friend. Then she buys 3 more. How many does she have?" The model might still generate: "8 apples." Wrong: half of 5 is 2.5, so Sarah has 2.5 apples left, and buying 3 more brings her to 5.5. The extra step tripped the model up. Now append "Let's think step by step" to the same question. The model generates: "Sarah starts with 5 apples. She gives half to her friend, leaving 2.5. She buys 3 more: 2.5 + 3 = 5.5. She has 5.5 apples." Correct.

This is chain-of-thought prompting. Instead of asking the model for the answer directly, you ask it to show its reasoning. The model outputs intermediate steps. These intermediate steps aren't some hidden internal process. They're explicit text the model generates. By generating reasoning step-by-step, the model improves accuracy on problems requiring reasoning.
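The difference between a direct prompt and a chain-of-thought prompt can be sketched in a few lines. This is a minimal illustration; the trigger phrase is the well-known "Let's think step by step," but the surrounding layout is a convention, not a fixed API.

```python
# Minimal sketch: wrapping a question in a chain-of-thought prompt.
# The model call itself is out of scope; we only build the prompt text.

def direct_prompt(question: str) -> str:
    """Ask for the answer with no reasoning."""
    return f"{question}\nAnswer:"

def cot_prompt(question: str) -> str:
    """Zero-shot chain-of-thought: append the classic trigger phrase."""
    return f"{question}\nLet's think step by step."

question = ("Sarah has 5 apples. She gives half to her friend. "
            "Then she buys 3 more. How many does she have?")
print(cot_prompt(question))
```

The only change between the two prompts is the trailing instruction, yet it reliably shifts the model from emitting a bare answer to emitting intermediate steps first.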

The mechanism isn't fully understood, but the empirical effect is clear: LLMs reason better when they explain their reasoning. One plausible explanation is that the model is constrained to stay consistent across steps. If it writes "step 1: Sarah has 5 apples," then "step 2: she gives half to her friend," it is now committed to those facts. Contradicting them accidentally on the next line is harder than it would be if the model jumped straight to a final answer.

Chain-of-thought works across domains. Math problems show huge improvements. Logic puzzles show improvements. Multiple-choice questions with reasoning show improvements. Even open-ended questions improve slightly. But the improvements aren't uniform. Some tasks show 5% improvement, some show 30%. It depends on task complexity.

Variants exist. Few-shot chain-of-thought provides examples of reasoning: you show the model a few problems with solutions that include explicit reasoning, then ask it to solve a new problem the same way. Structured chain-of-thought uses specific formats: "Hypothesis: ... Evidence: ... Conclusion: ..." Self-critique prompts ask the model to review its own work: "Do you see any errors in your reasoning?" This often catches mistakes the initial pass missed.
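A few-shot chain-of-thought prompt can be assembled mechanically from worked examples. The sketch below assumes a simple Q / Reasoning / A layout; the exact format is a choice, not a standard.

```python
# Sketch of a few-shot chain-of-thought prompt: each worked example
# includes explicit reasoning, then the new question is appended with
# an empty "Reasoning:" slot for the model to fill in.

def few_shot_cot(examples, question):
    """examples: list of (question, reasoning, answer) tuples."""
    parts = []
    for q, reasoning, answer in examples:
        parts.append(f"Q: {q}\nReasoning: {reasoning}\nA: {answer}")
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)

examples = [
    ("Sarah has 5 apples and buys 3 more. How many does she have?",
     "Start with 5, add 3: 5 + 3 = 8.",
     "8 apples"),
]
prompt = few_shot_cot(examples, "Tom has 10 pens and loses 4. How many does he have?")
```

Because the worked examples end with reasoning followed by an answer, the model tends to continue the new question in the same pattern.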

Chain-of-thought also combines well with sampling. If you raise the temperature (making the model more exploratory) and sample several completions, you get diverse reasoning paths, some of which reach the right answer while others don't. Take a majority vote over the final answers and accuracy improves beyond any single path. This technique is called self-consistency (sometimes described as ensemble reasoning).
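The voting step can be sketched as follows. The sampled answers below are stand-ins for the final answers extracted from real model outputs.

```python
# Sketch of self-consistency: sample several reasoning paths at high
# temperature, extract each final answer, take the majority vote.

from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across sampled reasoning paths."""
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Hypothetical final answers from five sampled chains of thought:
sampled = ["5.5", "5.5", "8", "5.5", "5"]
print(majority_vote(sampled))  # prints "5.5"
```

The intuition: wrong reasoning paths tend to fail in different ways and scatter their answers, while correct paths converge on the same one, so the mode is usually right.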

The downside is latency and cost. Generating step-by-step reasoning takes more tokens, and more tokens mean more computation and higher cost. A simple task whose direct answer is 10 tokens might take 50 tokens with CoT: roughly 5x the output cost in exchange for visible reasoning. For high-volume systems, this adds up.
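The token arithmetic above works out like this. The per-token price is a made-up placeholder, not any provider's real rate.

```python
# Back-of-envelope cost comparison: a 10-token direct answer vs a
# ~50-token chain-of-thought answer, at a hypothetical output price.

PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # placeholder rate, not a real price

def output_cost(tokens: int) -> float:
    """Cost of generating the given number of output tokens."""
    return tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

direct = output_cost(10)   # bare answer
cot = output_cost(50)      # answer with step-by-step reasoning
print(f"CoT costs {cot / direct:.0f}x the direct answer")  # prints "CoT costs 5x the direct answer"
```

At one request this is negligible; at millions of requests per day the 5x multiplier dominates the serving bill, which is why CoT is often reserved for the hard cases.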

Chain-of-thought also doesn't solve fundamental limitations. If the model lacks knowledge or understanding, showing its reasoning just makes the wrong reasoning visible. A model hallucinating a fact will explain its hallucination step-by-step, which is actually worse than hallucinating silently because it sounds more credible.

There's also a phenomenon called "style leakage." If you ask the model to explain reasoning in a particular style, it sometimes adopts that style even when it hurts accuracy. A model asked to explain like a child might generate less precise reasoning. A model asked to explain like an expert might over-complicate simple problems.

CoT becomes especially powerful with planning. "First, let's break down this problem. Step 1: identify what we know. Step 2: identify what we need to find. Step 3: identify the relationships. Step 4: apply formulas. Step 5: verify the answer." Structured problem-solving frameworks embedded in prompts guide the model through reasoning more effectively than asking for reasoning alone.
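A planning template like the one above can be generated programmatically. The wording of the five steps is taken from the text; the template layout is an assumption.

```python
# Sketch of a planning-style CoT prompt that embeds a fixed
# problem-solving framework ahead of the model's reasoning.

PLAN_STEPS = [
    "Identify what we know.",
    "Identify what we need to find.",
    "Identify the relationships.",
    "Apply formulas.",
    "Verify the answer.",
]

def planning_prompt(question: str) -> str:
    """Prefix the question's solution with a numbered planning framework."""
    steps = "\n".join(f"Step {i}: {s}" for i, s in enumerate(PLAN_STEPS, 1))
    return f"{question}\n\nFirst, let's break down this problem.\n{steps}"

print(planning_prompt("A train travels 120 km in 1.5 hours. What is its speed?"))
```

Keeping the framework in a constant makes it easy to reuse the same scaffold across many questions, or to swap in a domain-specific set of steps.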

Why It Matters

Chain-of-thought prompting is one of the most effective techniques for improving LLM reasoning without training new models. For applications requiring accuracy on complex problems, showing reasoning dramatically improves results. The transparency also makes it easier to debug why the model arrived at a particular answer. For business-critical applications where explainability matters (legal decisions, financial analysis, medical reasoning), chain-of-thought provides both better accuracy and visible reasoning. The cost-benefit varies by use case, but the impact on quality is consistent.

Example

A financial advisory AI needs to make investment recommendations. A basic prompt generates: "Buy tech stocks, it's a growth sector." With chain-of-thought: "Step 1: Current market conditions show tech sector valuations at 2x historical average. Step 2: Growth projections are 12% annually, below historical 18%. Step 3: Interest rates rising makes tech less attractive. Step 4: Client has 20-year horizon, can tolerate volatility. Step 5: Recommendation: diversify tech allocation to 30%, increase bonds to 50%, keep 20% in growth alternatives. This strategy balances growth potential with reduced rate sensitivity." The step-by-step reasoning exposes the logic, allowing clients to understand and trust (or challenge) the recommendation.
