AI July 22, 2021 3 min read

GPT-3 one year on: what actually changed for business

A year after GPT-3's release is a good moment to separate real shifts from noise and understand where language models already work.

A year ago OpenAI released GPT-3 through a closed API. The debate started immediately: some said this was the beginning of a new era, others called it another overrated model. A year on, it is a good moment for observations rather than predictions.

I do not do academic assessments of language models. I care about the practical question: which of these capabilities already work at a level that is useful for an ordinary business?

What actually changed

The main shift was not the model itself but the fact that language capabilities became available through an API. Previously, a company that wanted to automate text work had to either build its own model, which was expensive and slow, or make do with very narrow classifiers. Now it can call an API.

This lowered the barrier to entry. A prototype that used to require a team of ML engineers and months of work can now be assembled in a week. That does not mean it goes straight to production, but it does mean that testing a hypothesis became much cheaper.

Real applications I see working right now: classification of incoming requests, draft responses to standard queries, and structuring unstructured data, for example extracting fields from free text.

Where expectations did not match reality

Two things turned out to be harder than the demos suggested.

The first is reliability. The model can give an excellent answer to one question and a strange answer to a very similar one. That is fine for an assistant that helps a person, but it is a problem for an autonomous process with no human in the loop.

The second is output controllability. Business processes require predictable, structured output. The model can produce it, but getting stable results takes prompt engineering and many iterations. This is a separate engineering discipline that is still taking shape.
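To make the controllability point concrete, here is a minimal sketch of one common pattern: ask for JSON, validate the reply, and retry a few times before giving up. Everything named here is an assumption for illustration: `call_model` stands in for whatever function sends a prompt to your provider and returns text, and the `topic` / `urgency` fields are a hypothetical schema, not part of any real API.

```python
import json
from typing import Callable, Optional

# Hypothetical schema for illustration only.
REQUIRED_FIELDS = {"topic", "urgency"}
ALLOWED_URGENCY = {"low", "normal", "high"}


def parse_reply(raw: str) -> Optional[dict]:
    """Return the parsed reply if it matches the expected shape, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # The reply must be a JSON object containing every required field.
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    urgency = data["urgency"]
    if not isinstance(urgency, str) or urgency not in ALLOWED_URGENCY:
        return None
    return data


def structured_call(call_model: Callable[[str], str], prompt: str,
                    max_attempts: int = 3) -> Optional[dict]:
    """Call the model until a reply validates; give up after max_attempts."""
    for _ in range(max_attempts):
        parsed = parse_reply(call_model(prompt))
        if parsed is not None:
            return parsed
    return None  # no valid reply: the caller should hand the item to a person
```

Returning `None` instead of a best guess is deliberate in this sketch: the downstream process only ever sees output that passed validation, and everything else goes to a human.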

Which tasks are already worth doing

I try to think about this through the lens of specific tasks, not the technology in general.

Tasks where language models deliver real savings right now:

initial sorting of incoming requests: determining topic and urgency before a human sees them;
drafts of standard replies: not the final text, but a starting point that an operator edits;
extracting data from unstructured text such as emails, contracts and applications;
summarising long documents for a manager.
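As an illustration of the first item, initial sorting can be sketched as a few-shot prompt plus a strict check on the reply. This is a sketch under assumptions, not a recommended implementation: the topic list is made up, and `call_model` is a hypothetical stand-in for your own function that sends a prompt to a model and returns its text.

```python
from typing import Callable, List, Tuple

# Hypothetical label set for illustration only.
TOPICS = ["billing", "delivery", "technical", "other"]


def build_prompt(examples: List[Tuple[str, str]], request: str) -> str:
    """Assemble a few-shot prompt: labelled examples first, then the new request."""
    lines = ["Classify each request as one of: " + ", ".join(TOPICS) + ".", ""]
    for text, topic in examples:
        lines += [f"Request: {text}", f"Topic: {topic}", ""]
    lines += [f"Request: {request}", "Topic:"]
    return "\n".join(lines)


def triage(call_model: Callable[[str], str],
           examples: List[Tuple[str, str]], request: str) -> str:
    """Return a known topic, or 'needs_human' when the reply is not a known label."""
    reply = call_model(build_prompt(examples, request)).strip().lower()
    return reply if reply in TOPICS else "needs_human"
```

Anything the model answers outside the known labels is routed to `needs_human` rather than guessed at, so a person still sees the cases the model handles badly.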

Tasks where I would not yet put a model without a human in the loop:

any decision with legal or financial consequences;
customer-facing communication on behalf of the company without review;
anything where a mistake is expensive and hard to notice.

How to assess whether a task fits

Three questions I ask when looking at a candidate task for a language model:

What happens if the model makes a mistake? How expensive is the error, and how quickly will it be caught?
Do we have data to evaluate quality - examples of correct answers against which we can measure accuracy?
Where in the process does a person see the result before it goes further?

If the answer to the first question is "an error is not serious" or "it will be caught quickly", the task is a good fit. If the third question has no answer, the process needs to be redesigned before adding a model.
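The second question, whether there is data to measure quality against, can be sketched in a few lines. This is a minimal sketch, assuming you have a list of (text, expected answer) pairs and some `predict` function of your own that wraps the model; both names are hypothetical.

```python
from typing import Callable, List, Tuple


def evaluate(predict: Callable[[str], str],
             labelled: List[Tuple[str, str]]) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return accuracy plus the list of (text, expected, got) mismatches."""
    if not labelled:
        raise ValueError("need at least one labelled example")
    errors = []
    for text, expected in labelled:
        got = predict(text)  # one call per example: model calls cost money
        if got != expected:
            errors.append((text, expected, got))
    accuracy = (len(labelled) - len(errors)) / len(labelled)
    return accuracy, errors
```

The list of mismatches is usually more useful than the single number: reading the actual mistakes is how you find out whether they are the cheap, quickly caught kind or the expensive kind.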

Language models are a working tool. Not magic, and not a threat. Just a tool with areas where it helps and areas where it is still unreliable.

Contact

If this resonated, write to me. I reply personally.