Fine-tuning GPT-3.5: when it makes sense and when it does not
OpenAI opened fine-tuning for GPT-3.5 Turbo in August 2023. Here is a practical read on the use cases where it delivers and the ones where prompt engineering is still the right call.
OpenAI opened fine-tuning access for GPT-3.5 Turbo a few weeks ago. The reaction was predictably split: one group declared this solves everything, another group dismissed it as unnecessary. I find both reactions unhelpful.
Fine-tuning is a specific tool that solves a specific class of problems. Here is my attempt at a practical map of when it earns its cost and when it does not.
What fine-tuning actually does
When you fine-tune a model, you are adjusting the model's weights on a curated set of example input-output pairs. The result is a model variant that performs your specific task more reliably than the base model prompted in various ways. You are not teaching the model new knowledge - you are shaping its default behaviour for your context.
This distinction matters. If the problem is that the model does not know something, fine-tuning on training examples will not help. If the problem is that the model knows the domain but defaults to a style, format, or tone that does not fit your use case, fine-tuning is a good fit.
Cases where fine-tuning delivers
Consistent output format. If you need JSON with a specific schema, or a response that always follows a particular structure, few-shot prompting gets you 80% of the way. Fine-tuning gets you to 98%. For automated pipelines where a downstream system parses the output, that gap is material.
Domain-specific tone and style. Legal, medical, and financial contexts often require specific register and phrasing. Fine-tuning on examples from the domain trains the model to default to that register without needing extensive prompt engineering each time.
Shorter prompts at inference time. When you encode behaviour into the model weights rather than the prompt, you need fewer tokens to get the right output at inference time. At scale, that is a meaningful cost reduction.
Edge case handling. If you have a collection of examples where the base model fails in a specific, consistent way, fine-tuning on corrected versions of those failures often fixes them.
Cases where fine-tuning is the wrong tool
You do not have enough examples. Fine-tuning needs at least a few hundred high-quality examples for most tasks, and several thousand for complex ones. If you cannot assemble a curated training set, you are not ready to fine-tune.
Your task requires up-to-date knowledge. Fine-tuning does not update the model's knowledge cutoff. For tasks that depend on recent information, RAG is the right layer to add, not fine-tuning.
Your requirements are still changing. Fine-tuning commits you to a point-in-time definition of "correct output." If your task definition is evolving - which it usually is in the first six months - iterating on prompts is cheaper and faster than iterating on training data and retraining.
The base model with good prompting already works well enough. This sounds obvious, but it is frequently skipped. Before investing in a fine-tuning project, run a proper evaluation of what a well-engineered prompt achieves.
A decision framework
Before starting a fine-tuning project, I find it useful to answer three questions: Can I assemble 500+ high-quality examples right now? Is the output format the main problem, or is it the reasoning? Have I measured what a strong baseline prompt achieves? If the answer to any of these is unclear, the investment goes into evaluation and prompt engineering first.
Fine-tuning is a real capability. But it sits at the end of the AI product development workflow, not the beginning.