Small Trained Models 101
Cloud-Hosted Public Models vs Private Hosted Large Language Models
We’ve all used OpenAI’s ChatGPT, Anthropic’s Claude, or Google’s Gemini. Huge cloud-hosted models that seem to be relatively good at just about everything, sometimes with very impressive results.
The engineer in me always wondered about the finer points of how they worked. Cloud AI is the ultimate black box: data goes in, something – generally undisclosed by the vendor – happens, and an answer comes out the other side.
They leave a lot of unanswered questions:
- How do they draw conclusions?
- What’s being done with the business data I just gave to it? Is this safe?
- I gave it a big task with a big answer. How do I know whether it answered correctly? I had little visibility into the process.
- What tools are being used?
I want to note that I’ve got nothing against cloud models – I have a subscription to both ChatGPT and Claude that I use daily, often for multiple hours. I just don’t like how opaque using them can be, and wished there were additional alternatives.
Using Public AI Models for Data Ingestion & Queries
A long while ago, we ended up getting a lot of requests for data that was in formats we didn’t have handy. One request in particular got me started, which was to provide a categorized report back on the different types of low-value (minimal technical expertise needed) versus high-value engineering that we were doing for a customer. This is a great task for AI – tell a model what you think of as high-value vs low-value and then give it 2000+ pieces of data and ask it to sort it out. This is the kind of work that would take a human admin days or weeks to do, but AI can churn through in minutes.
I produced the report and turned it in, and it was “close enough” to get through the customer conversation (later, I found out it was about 75% accurate – but it was sure hard to figure this out or measure it).
I was left wishing not only that I hadn't had to hand 2,000 pieces of data to a cloud model, but also wondering what I could've done to increase accuracy.
Open Source AI Models as an Alternative to Public Hosted Models
Amazingly, for small tasks or experimentation, open-source models can run on consumer GPUs, meaning many mid-range laptops and desktops can actually run an LLM locally.
For those wanting to try this yourself, grab a copy of Ollama and pull a small model. IBM has some good videos on DIY AI; find the link at the bottom of this article.
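Once Ollama is running, you don't even need a special library to talk to it. This is a minimal sketch against Ollama's local REST API using only the Python standard library; the model name is a placeholder for whichever small model you pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    # stream=False asks for one complete JSON response instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama daemon and a pulled model of your choice):
# print(ask("llama3.2", "In one sentence, what is quantization?"))
```

The request shape above is Ollama's documented generate endpoint; everything beyond that (error handling, streaming) is left out to keep the sketch short.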
The Possibility of Running a Desktop-Accessible LLM
Generalist LLMs Are Impressive, But Not for Your Company’s Knowledge
The first thing that probably comes to mind is that there's no way a desktop-accessible LLM can match the 'magic' you see in ChatGPT. That statement is partially true. The first thing to look at is scale. This is an oversimplification that I'll expand on in a future post, but a general way to measure the capability of a model is by the number of parameters it has. Parameters are almost entirely weights, along with vocabulary embeddings and some other components that go beyond the scope of this post. Generally speaking, more parameters mean a model can handle a wider variety of tasks, understand more deeply, and follow instructions more reliably.
At the time of this writing (November 2025), the undisclosed size of ChatGPT 5 or Claude Opus 4.5 is assumed to be in the low trillions of parameters. Loading a model like that without any quantization (a technique similar to compression) could take 50+ high-end, server-grade GPUs just to run one copy, much less serve it to a group of people (or the entire internet). For a quick comparison, the models available for a consumer device usually cap out around 13 billion (“13b”) parameters. High-end workstations may see up to 120b, while server clusters are limited only by budget.
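The arithmetic behind those hardware estimates is simple enough to sketch: weights-only memory is parameter count times bytes per weight, which quantization shrinks. This back-of-the-envelope calculator deliberately ignores KV cache and activation overhead, which add a meaningful amount on top.

```python
def model_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough VRAM needed just for the weights, ignoring KV cache and
    activation overhead. fp16 = 16 bits/weight; 4-bit quantization = 4."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes

# A 13b model at full fp16 vs. 4-bit quantization:
print(round(model_vram_gb(13, 16), 1))  # 26.0 – beyond most consumer GPUs
print(round(model_vram_gb(13, 4), 1))   # 6.5 – fits a mid-range card
```

The same math shows why a low-trillions model is a cluster-scale problem: 1,000b at fp16 is roughly 2 TB of weights before any serving overhead.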
Overcoming the Generalist-Issue with LLMs
However, one doesn't need a 1T+ model to handle most tasks. In fact, the size of the public models is driven by their goal of being everything to everyone at an expert level. I'd wager most of us don't access even a fraction of a percent of the knowledge or depth of thought available inside a large cloud model.
Moreover, the cloud models are the world's most amazing generalists. One of their biggest business flaws is that they offer few options for absorbing company-specific knowledge. A certain amount can be inserted with 'prompt engineering', the idea of giving the LLM some context prior to asking it a question, which can be effective for instruction. However, that's not going to teach it broad concepts.
Training Your Model on Domain-Specific Information
For example, using our own domain material, a common problem we had before fine-tuning was recognizing that equipment types correlate to specific types of problems. “The ASR1004 needs its BGP security improved” would almost certainly be routed as a security ticket/security event, whereas ASR + BGP actually refers to hardening the security of a router's routing protocol, which belongs to the network team and routes to different engineers. Supervised fine-tuning adjusts a base model toward these desired biases in decision making, something not generally possible with a cloud model at the time of this writing.
Moreover, there doesn’t have to be a dramatic loss in quality between using open-weight models and closed-weight models. I’ll explain this more, but first I’ll reference this Linux Foundation article that supports the point:
https://www.linuxfoundation.org/blog/revealing-the-hidden-economics-of-open-models-in-the-ai-era
Specifically, “Open models routinely achieve 90% or more of the performance of closed models on widely used benchmarks.”
Understanding Use Cases for Different Levels of Quantization in AI Large Language Models
I think that's a fair comparison when the largest open-weight models are put up against closed cloud models. However, I come at this from a slightly different angle. While we certainly own the hardware to run open-weight models, I still use Claude (Anthropic) for all my open-ended tasks. Why? Because for anything that isn't a repeatable task, a cloud model is simply easier and produces better answers with far less effort on the user's side.
My point is more ‘the right model for the right job’. Based on our experience, this is about what to expect out of various LLM model sizes:
- < 3b – Very basic tasks.
- 3b – Data extraction, labeling, very basic Q&A
- “Is this item A part of dataset B, C, or D?”
- Minimal prompt engineering capacity – giving it more than 5 instructions is not recommended.
- 7b-8b – Moderate decision making. Very light programming.
- “Please classify item A in less than 3 words”
- “Write a SQL query for selecting two rows of data, here’s the database format:”
- Moderate prompt engineering – reliable up to about 10-12 instructions.
- 13b – This is usually the size I aim for when making a pipeline task. Relatively complex programming, complex problem solving in specific trained domains, good at following instruction. Can prompt engineer up to about 30 dense instructions.
- A quick reference for SQL generation (one of my favorite topics!): “I'm looking for data created less than 17 days prior to the beginning of summer by a senior engineer. After finding that, sum the number of hours spent where the engineer's name started with an A.” To be fair, this is probably simplistic; a well-trained 13b can crush repetitive coding-like tasks.
- Note, 13b is about as big as you’re going to be able to run on a commodity GPU.
- 32b – This is one of my favorite sizes if I'm not coding strictly for performance. These start to feel like public LLMs when you interact with them. They can handle lightweight versions of the tasks the public LLMs handle and can follow a massive number of instructions. On one project, while still prototyping, I got a dense instruction prompt up to 7 printed pages and the model was still able to follow every direction (please note: making prompts this long is a bad thing; I went back and shortened it dramatically with training and improved instruction).
- 70b – 120b – These feel like public models. In my experience, the “magic” effect an LLM develops happens somewhere in the 70b-100b range. These can plow through enormous amounts of junk to find 'needle in a haystack' answers, work with sometimes hundreds of pages of text in their context window, and follow more instructions than you have any business writing in reasonable circumstances. Even emergent behaviors like abstract problem solving and social reasoning appear at these sizes.
- 120b is about as large as can be expected to run on a single GPU, even with high quantization. Anything larger requires a cluster of GPUs with a high-bandwidth interconnect to allow them to access each other’s VRAM in a timely manner.
As I mentioned above, there are performance reasons for re-using models across tasks in the same application. In a moderately sized workstation or server, VRAM is at a premium. A model needs to be completely loaded into VRAM before it can be inferenced against, and loading a model, especially above 32b, takes a noticeable amount of time, so the commonly used models are generally left in VRAM for the entire time the app is running. Re-using, for example, a 32b model trained on writing advanced SQL as the summarization model as well is a big VRAM savings versus keeping two separate 32b models loaded at the same time.
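The reuse pattern is essentially a cache keyed by model name: load once, hand out the shared instance to every task. A minimal sketch, with the loader stubbed out (a real one would load weights into VRAM via your inference runtime):

```python
class ModelCache:
    """Keep each model resident once and hand out the shared instance,
    rather than loading a second copy into VRAM for every task."""

    def __init__(self, loader):
        self._loader = loader    # callable that actually loads a model
        self._resident = {}      # model name -> loaded instance

    def get(self, name: str):
        if name not in self._resident:
            self._resident[name] = self._loader(name)  # expensive, done once
        return self._resident[name]

loads = []
def fake_loader(name):
    loads.append(name)           # record each real "load" for the demo
    return f"<{name} weights>"

cache = ModelCache(fake_loader)
sql_model = cache.get("sql-32b")     # hypothetical model name
summarizer = cache.get("sql-32b")    # reused – no second copy in VRAM
print(sql_model is summarizer, loads)  # True ['sql-32b']
```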
Understanding Retrieval Augmented Generation (RAG)
Dispelling Some Common Misconceptions on Large Language Models
Before we can get into the 'why' of model training, it's helpful to know about RAG. A common misconception for those new to the topic is that all the information the model needs to know is stored inside the model. While this may be technically possible, it is an overwhelming task, and it also means that if any underlying business data changes, you have to re-train the model (not ideal!).
Instead, we use outside data, and feed that into the model’s context window in real-time on an as-needed basis. RAG can be a set of documents (potentially very large), a database, an API to another application, a company data lake, a search on the internet, basically any type of external data. When people use the phrase “chat with your data” this is typically what they’re talking about: making your data available for the LLM to interpret, so you can ask questions of it. Also, if you’ve ever dropped a set of Word documents into a public model and asked questions of it, you’re basically accomplishing the same task – the public model didn’t know what was in your data.
Of note, RAG is a concept, not a technology. Type “best RAG” into YouTube and you're likely to find dozens of hits claiming 'this is THE way to do it', which is misleading. I'll write another article on some of the more common ways of using RAG, but for now, just assume you have a set of plain-text business documents and you're planning on indexing them in some manner to make lookup available to the LLM.
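To make "indexing in some manner" concrete, here is about the simplest possible retriever: score each document by keyword overlap with the query and return the best match to feed into the LLM's context window. The document names and contents are made up for illustration; real systems usually use embeddings, but the concept is identical.

```python
def score(query: str, doc: str) -> int:
    # Count how many distinct query words appear in the document.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words)

def retrieve(query: str, docs: dict, top_k: int = 1) -> list:
    # Rank document names by overlap score, best first.
    ranked = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
    return ranked[:top_k]

# Hypothetical plain-text business documents:
docs = {
    "vacation_policy.txt": "employees accrue vacation hours monthly",
    "sales_process.txt": "the sales team logs every deal in the crm",
}
print(retrieve("how do vacation hours accrue", docs))  # ['vacation_policy.txt']
```

Whatever the scoring mechanism, the retrieved text is what gets pasted into the prompt – the model itself never "learns" the documents.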
Dispelling one more common misconception: in most cases, the prompt you type when 'talking to an LLM' isn't going directly to the LLM. You're likely typing into an orchestrator or script written in a traditional programming language. The workflow looks something like this:
User Query → Orchestrator → LLM (or embedding model) for RAG lookup → Orchestrator → RAG Interface → Larger LLM → User
Seems like a lot of hops, but consider things LLMs don’t have:
- State persistence. Who am I talking to? What have we been talking about?
- Tool-calling abilities.
- There’s a subtlety here – LLMs can ‘know’ they need to call a tool, but that’s usually signaled back to an orchestrator to do the actual API/MCP/database/etc call.
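That subtlety – the model signaling a tool call that the orchestrator actually executes – can be sketched in a few lines. The JSON signaling convention and the `lookup_ticket` tool here are inventions for this sketch; real stacks each define their own tool-call format, but the division of labor is the same.

```python
import json

def dispatch(llm_output: str, tools: dict) -> str:
    """If the model signaled a tool call (here: a JSON object with a
    'tool' key, a convention invented for this sketch), the orchestrator
    runs the tool; otherwise the text passes through as a plain answer."""
    try:
        msg = json.loads(llm_output)
    except json.JSONDecodeError:
        return llm_output                       # plain text, no tool needed
    if isinstance(msg, dict) and "tool" in msg:
        return tools[msg["tool"]](**msg.get("args", {}))
    return llm_output

# Hypothetical tool registry the orchestrator owns:
tools = {"lookup_ticket": lambda ticket_id: f"ticket {ticket_id}: BGP hardening"}

print(dispatch('{"tool": "lookup_ticket", "args": {"ticket_id": 42}}', tools))
print(dispatch("No tool needed, here is your answer.", tools))
```

Note that the LLM only ever emits text; the API/MCP/database call itself always happens in the orchestrator's code.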
Understanding the Basic Flow of a SQL-Backed Retrieval Augmented Generation (RAG)
The orchestrator takes care of all the things that need access to the LLM and gathers data to put back into the LLM for presentation to the user. This can get very complicated, so for right now, just assume the orchestrator is intercepting the user’s query and following the pipeline above. Let’s also assume this pipeline only does RAG lookups so we’re not having to involve a router LLM to handle different ways through a pipeline.
Let’s illustrate a very basic flow of how a SQL-backed RAG might work:
1. User query: “Create a sales report for 2023”
2. Orchestrator → LLM: Call an LLM and ask it “Build a SQL query to return the data for creating a sales report for 2023”
3. LLM → Orchestrator: “SELECT SUM(amount) AS total_sales_2023 FROM sales WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';”
4. Orchestrator → RAG: Send the SQL query to the database (the database is the RAG in this example)
5. RAG → Orchestrator: Returns the SQL results; we'll call this $RESULTS.
6. Orchestrator → Summary LLM: The user asked for “a sales report for 2023”, here is some supporting data: $RESULTS
7. …which eventually makes its way back to the user.
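The flow above maps almost one-to-one onto code. This sketch stubs out the LLM and database calls with canned responses (the SQL, the result row, and the model behavior are all stand-ins) so the orchestrator's role is visible on its own:

```python
def sql_llm(user_query: str) -> str:
    # Stand-in for the SQL-generation LLM call.
    return ("SELECT SUM(amount) AS total_sales_2023 FROM sales "
            "WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';")

def run_sql(query: str):
    # Stand-in for the RAG interface (the database in this example).
    return [{"total_sales_2023": 1250000}]

def summary_llm(user_query: str, results) -> str:
    # Stand-in for the larger summarization LLM.
    return f"The user asked for '{user_query}', supporting data: {results}"

def orchestrate(user_query: str) -> str:
    sql = sql_llm(user_query)                 # Orchestrator -> LLM
    results = run_sql(sql)                    # Orchestrator -> RAG
    return summary_llm(user_query, results)   # Orchestrator -> Summary LLM

print(orchestrate("Create a sales report for 2023"))
```

Swap the three stubs for real model and database calls and you have the skeleton of the pipeline.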
This should raise some questions in a couple steps:
- How does it know what SQL to write?
- How does the summary LLM know what I care about, or what to focus on?
Right now, this is a bit like an open-book test but you’ve never read the book – the information is there, but you don’t necessarily know what page the data is on, or what’s important versus unimportant.
There are a couple options here – the easier (and less scalable) one is prompt engineering.
Let’s assume for a moment that the SQL query was valid, but less than ideal. So our original example was:
SELECT SUM(amount) AS total_sales_2023 FROM sales WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';
However, the desired outcome for the company standards is to always group sales by sales rep, which might look something like:
SELECT r.name AS sales_rep, SUM(s.amount) AS total_sales_2023 FROM sales s
JOIN sales_reps r ON s.sales_rep_id = r.id WHERE s.order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY r.name ORDER BY total_sales_2023 DESC;
So the original report would've come back lacking and frustrated the user. We want the SQL generator to consider joining the sales_reps table and grouping the sales by sales rep when applicable.
We would modify the Orchestrator → LLM step, above, from:
- Orchestrator → LLM: Call an LLM and ask it “Build a SQL query to return the data for creating a sales report for 2023”
To:
- Orchestrator → LLM: Call an LLM and ask it “Build a SQL query to return the data for creating a sales report for 2023. Be certain to join sales_reps and group by r.name.”
And, assuming the LLM had some prior knowledge of what our database looked like, this should produce the correct results. This is prompt engineering.
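In code, prompt engineering like this usually means keeping a list of standing instructions that get appended to every generated prompt. A minimal sketch (the rule text comes straight from the example above; the function name is just for illustration):

```python
# Standing company-standard instructions appended to every SQL prompt.
COMPANY_SQL_RULES = [
    "Be certain to join sales_reps and group by r.name.",
]

def build_sql_prompt(task: str, rules=COMPANY_SQL_RULES) -> str:
    # Base request first, then each standing rule, joined into one prompt.
    return " ".join([f"Build a SQL query to return the data for {task}."] + rules)

print(build_sql_prompt("creating a sales report for 2023"))
# Build a SQL query to return the data for creating a sales report for 2023. Be certain to join sales_reps and group by r.name.
```

The convenience is obvious, and so is the failure mode: every new 'bug fix' grows the rule list, which is exactly the scalability problem discussed next.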
Implementing Prompt Engineering
Prompt engineering is incredibly easy to implement and is great for prototyping, patching ‘bugs’, etc. However, it eventually creates two scalability problems:
- A long enough prompt eventually erodes the model's Attention (Attention will be covered in a future post). The short version: small, fast models can only follow so many instructions before some of them stop getting followed.
- As the instruction list gets longer, it takes more compute to process the context window (the data you gave the LLM). Being terse has significant gains on performance.
Bottom line: you can't rely on prompt engineering for hundreds of instructions. Reference my chart above for roughly how many instructions each model size can follow.
Fine-Tuning Your Model for Responses by Default
Ideally what we’d do is train the model to give us what we want by default instead of needing prompting.
I’ll cover fine-tuning more in another post, but the general idea is that you give the model prompt & output pairs (thinking of it as Q&A pairs may be easier to conceptualize):
- “Build a SQL query to return the data for creating a sales report for 2023” -> “SELECT r.name AS sales_rep, SUM(s.amount) AS total_sales_2023 FROM sales s JOIN sales_reps r ON s.sales_rep_id = r.id WHERE s.order_date BETWEEN '2023-01-01' AND '2023-12-31' GROUP BY r.name ORDER BY total_sales_2023 DESC;”
- “Build a SQL query to answer which product category generated the most revenue in Q1 2023” -> “SELECT product_category, SUM(amount) AS total_revenue FROM sales WHERE order_date >= '2023-01-01' AND order_date < '2023-04-01' GROUP BY product_category ORDER BY total_revenue DESC LIMIT 1;”
There are lots of important points to how fine-tuning actually works, but this is the general concept. We take a base model that’s already trained on building generic SQL and continue to provide it with examples that are specific to our business. Using a Low-Rank Adaptation (LoRA) fine tuning method, we create an “add-on” set of model weights that, when merged with the original weights, sways the model’s behavior in the direction we want it to go. It is important to note that, when done appropriately and with a wide variety of samples, this is not memorization. For example, it’s reasonable to expect the model to be able to pull together what it learned from both the above examples and answer a question such as “Which sales rep was the top seller for highest revenue product category in 2023?” despite never having seen this question!
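In practice, those prompt/output pairs are usually collected into a JSONL file, one JSON object per line, which is the shape many fine-tuning toolkits accept (exact field names vary by toolkit – "prompt"/"completion" here is one common convention, not a universal standard). A sketch using the two pairs above:

```python
import json

# The two training pairs from above, as (prompt, completion) tuples.
pairs = [
    ("Build a SQL query to return the data for creating a sales report for 2023",
     "SELECT r.name AS sales_rep, SUM(s.amount) AS total_sales_2023 FROM sales s "
     "JOIN sales_reps r ON s.sales_rep_id = r.id "
     "WHERE s.order_date BETWEEN '2023-01-01' AND '2023-12-31' "
     "GROUP BY r.name ORDER BY total_sales_2023 DESC;"),
    ("Build a SQL query to answer which product category generated the most revenue in Q1 2023",
     "SELECT product_category, SUM(amount) AS total_revenue FROM sales "
     "WHERE order_date >= '2023-01-01' AND order_date < '2023-04-01' "
     "GROUP BY product_category ORDER BY total_revenue DESC LIMIT 1;"),
]

# One JSON object per line – the usual on-disk format for a training set.
jsonl = "\n".join(json.dumps({"prompt": p, "completion": c}) for p, c in pairs)
print(jsonl)
```

A real training set needs hundreds or thousands of varied pairs like these; two examples are shown only to illustrate the format.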
Similar concepts can be applied to the summary model, to influence which facts to highlight, what tone to use, etc.
Now our system has two advantages:
- RAG: It has access to our company data (open book test)
- FINE TUNING: It has been trained on how to better access that data (it studied)
In effect, with both RAG and Fine Tuning, this is now an open book test that it studied for – boosting results dramatically, beyond what a public model could do. This can potentially be done with small models and can be done without the cloud – maintaining data sovereignty.
Key Takeaways
With all of this in mind, here are the key takeaways.
- Solving common business problems doesn’t require enormous models and large hardware investments. Orchestrators plus well-sized private models are powerful tools to build effective business pipelines.
- Access to your business data via RAG plus Fine Tuning of a model you control gives you accurate, predictable, and controlled outcomes tailored to your business environment.
- Multiply the power of the admin: LLM-based systems let you take complex research and reporting tasks that used to take weeks and significant technical knowledge, and complete them in minutes or hours with natural-language queries.
- All data can be processed and retained on your own equipment. No unpredictable cloud costs and complete data sovereignty.
Resources
Locally run LLMs with Ollama from IBM:
Linux Foundation on the efficiencies found in open source LLMs:
https://www.linuxfoundation.org/blog/revealing-the-hidden-economics-of-open-models-in-the-ai-era
