Classifying inbound email without pretending it is magic

A shared inbox is a quietly expensive thing. Quotes, complaints, invoices, supplier queries, and newsletters all land in the same place, and someone has to triage them by hand before anyone can act. It is one of the best candidates for AI in a small operation, and also one where people most often oversell what it can do. Done right, it is genuinely useful. Done as if it were magic, it backfires.

The useful version is modest. The model reads each incoming message and proposes a category and a priority. Is this a new enquiry, an existing customer, a supplier, a payment matter, or something urgent? It does not pretend to resolve anything. It sorts, so the right person sees the right thing sooner and the urgent message stops hiding behind forty routine ones. That alone reclaims a lot of scattered attention.

The trick is to design from the start for the cases it will get wrong, because there will be some. A message that is genuinely ambiguous, a complaint dressed up as a polite question, a new enquiry from an address you already have on file. Pretending these will not happen is how inbox AI earns distrust fast.

Design for the wrong answers

So we build in a few habits.

First, we let the model say it is unsure. A message it cannot confidently place goes to a clearly marked "needs a human" pile rather than being forced into a category. A confident wrong answer is far more dangerous than an honest shrug, because nobody checks the confident one.
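As a minimal sketch of that unsure path, assuming the classifier returns a category together with a confidence score between 0 and 1: the names Proposal, route and NEEDS_HUMAN, and the 0.8 cut-off, are illustrative assumptions and not part of any particular product.

```python
from dataclasses import dataclass

# Hypothetical label for the pile a person reviews by hand.
NEEDS_HUMAN = "needs_a_human"


@dataclass
class Proposal:
    """What the model proposes for one message (assumed shape)."""
    category: str
    confidence: float  # 0.0 to 1.0, as reported by whatever classifier is used


def route(proposal: Proposal, min_confidence: float = 0.8) -> str:
    """Return the category to file under, or NEEDS_HUMAN when the model is unsure.

    The 0.8 default is an illustrative starting point, not a recommendation.
    """
    if proposal.confidence < min_confidence:
        return NEEDS_HUMAN
    return proposal.category


if __name__ == "__main__":
    print(route(Proposal("supplier", 0.95)))   # confident, filed as proposed
    print(route(Proposal("complaint", 0.55)))  # unsure, sent to a person
```

The point of the design is that the unsure pile is a first-class destination, so a low-confidence message is never forced into the nearest category.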

Second, we weight the categories by what a mistake costs. Misfiling a newsletter as an enquiry wastes a few seconds. Misfiling a complaint as a newsletter can lose a customer. So the high-cost categories get a lower bar for flagging to a person, and the harmless ones can run with a lighter touch. Treating every misclassification as equally bad leads you to tune for the wrong thing.
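One way to sketch this cost weighting, assuming the classifier can report a score for every category and not only its top pick: a message is flagged as soon as a costly category looks even slightly plausible. The category names, the bars in FLAG_BAR and AUTO_FILE_BAR, and the decide function are all hypothetical values chosen for illustration.

```python
from typing import Dict

NEEDS_HUMAN = "needs_a_human"  # hypothetical label for the human-review pile

# Assumed bars: a costly category only has to look slightly likely
# for the message to be flagged, even when it is not the top guess.
FLAG_BAR: Dict[str, float] = {
    "complaint": 0.15,
    "payment": 0.25,
}

# Assumed minimum score for filing anything without a person looking.
AUTO_FILE_BAR = 0.80


def decide(scores: Dict[str, float]) -> str:
    """Pick a category from per-category scores, or flag the message for a person.

    Expects a non-empty mapping of category name to score.
    """
    top = max(scores, key=scores.get)

    # A costly category that is plausible but not the top guess gets flagged.
    for category, bar in FLAG_BAR.items():
        if category != top and scores.get(category, 0.0) >= bar:
            return NEEDS_HUMAN

    # Otherwise the top guess still has to be confident enough on its own.
    if scores[top] < AUTO_FILE_BAR:
        return NEEDS_HUMAN
    return top


if __name__ == "__main__":
    print(decide({"newsletter": 0.95, "complaint": 0.05}))  # filed as newsletter
    print(decide({"newsletter": 0.80, "complaint": 0.20}))  # flagged for a person
```

In the second call the model's top guess is a newsletter, but the message is flagged anyway because the chance that it is a complaint is too high to ignore. That asymmetry is the whole idea: harmless categories run with a light touch, costly ones do not.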

Third, we keep the human correction cheap and we learn from it. When someone re-categorises a message, that is not just a fix; it is a signal. Over a few weeks the corrections show you exactly where the model struggles. Often it is a category that is fuzzy even for people, which usually means the category itself needs sharpening, not the model.
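Reading the corrections can be as simple as counting them. The sketch below assumes each correction is logged as a pair of the category the model proposed and the category a person changed it to. The function name confusion_pairs and the sample log are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def confusion_pairs(corrections: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each (proposed, corrected) pair appears in the log.

    Entries where the person kept the proposed category are ignored.
    """
    return Counter(
        (proposed, corrected)
        for proposed, corrected in corrections
        if proposed != corrected
    )


if __name__ == "__main__":
    # Made-up sample log, for illustration only.
    log = [
        ("new_enquiry", "existing_customer"),
        ("new_enquiry", "existing_customer"),
        ("newsletter", "complaint"),
        ("supplier", "supplier"),  # no change, so not counted
    ]
    for (proposed, corrected), count in confusion_pairs(log).most_common():
        print(f"{proposed} -> {corrected}: {count}")
```

A pair that keeps appearing at the top of this list is a good candidate for the conversation about whether the two categories are really distinct.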

Why it works when it works

The reason this pays off is that it does not try to replace judgement. It does the dull sorting that does not need judgement and surfaces the cases that do. People keep the decisions and lose the triage. That is the right division of labour, and it is a far more honest pitch than promising an inbox that runs itself.

If you start here, start small. One inbox, a handful of clear categories, an explicit unsure path, and a fast way for people to correct it. Run it for a few weeks and read the corrections. You will end up with a sorter people trust, precisely because you never pretended it was magic.
