High-Precision Inference on a Moving Target: Stabilizing Iterative LoRA Fine-Tuning

Author: Jeff Kronlage · Technical Artist: Brett Coleman

1. Introduction

Certain types of LLM inferencing require highly reproducible results. In our example, we will examine a pipeline that takes ‘messy’ natural language and translates it to a SQL lookup. Text2SQL in and of itself is not novel, but getting the same results consistently, across hundreds or thousands of query types, can be very challenging. In a rapidly changing environment, or when fitting to new, novel data, it may also be desirable to regularly fine-tune the LLM on unique lookup patterns.

This problem is compounded by using workstation-grade (or lower) commodity hardware. Some very capable models exist in this space, but they bring unique challenges not seen in large/frontier models.

By the nature of the hardware capability, performing full fine-tuning (FFT) on a model every time an incremental change is required is less than ideal. A full fine-tune can take days or even weeks on commodity equipment. As such, LoRA or QLoRA (Dettmers, et al., 2023) is typically utilized. On the hardware and training sizes in our example application, LoRA training took approximately 25 minutes for a dense 14B model, or 50 minutes for a dense 32B model, compared to several days for the same using FFT.

The problem that must be overcome is that LLMs face a variety of challenges producing deterministic results. When inferencing any type of code, script, database language, etc., there is very little room for acceptable variance that doesn’t create an undesired change in outcome. This makes the use case unique from, for example, an informational chatbot, where the phrasing order doesn’t need to be the same every time – or where, in fact, some randomness is actively desired for a “human feel”.

Obtaining deterministic results is achievable but requires mitigation techniques such as those outlined here.

2. The Floating-Point Problem

We fought three specific problems, outlined in section 3, but understanding any of them first requires understanding the order-of-operations problem in floating-point arithmetic.

In floating-point math, addition is not associative:

(a + b) + c ≠ a + (b + c)

Take a = 1.0, b = 0.0000001, c = −1.0, evaluated with limited precision:

(a + b) + c:
  Step 1: 1.0 + 0.0000001 = 1.0000001 (b is retained)
  Step 2: 1.0000001 + (−1.0) = 0.0000001

a + (b + c):
  Step 1: 0.0000001 + (−1.0) ≈ −1.0 (b is absorbed by c)
  Step 2: 1.0 + (−1.0) = 0.0

Both results are valid IEEE 754 arithmetic. Floating-point numbers maintain a fixed number of significant digits. When values of very different magnitudes are summed, the smaller operand is shifted to align binary points and its low-order bits are truncated. The grouping of operands determines which truncation occurs first, producing legitimately different outcomes.

Order of operations in floating-point math changes the outcome.

This problem is by no means unique to LLMs or GPUs; it is a consideration in all high-precision floating-point math: order of operations impacts mathematical outcomes.
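
This is easy to reproduce in any language with IEEE 754 doubles. A minimal sketch in Python (note that in 64-bit floats the magnitudes must be further apart than in the 7-digit figure above, and here it is the first grouping that absorbs b):

```python
# Grouping decides which low-order bits are truncated first. In 64-bit
# floats the magnitudes must differ more than in the 7-digit example;
# here b is absorbed in the FIRST grouping rather than the second.
a, b, c = 1.0, 1e-16, -1.0

left = (a + b) + c    # a + b rounds back to 1.0 (b < half an ulp of 1.0)
right = a + (b + c)   # b + c keeps a trace of b just below -1.0

print(left, right)
assert left == 0.0
assert right != 0.0
assert left != right  # both are valid IEEE 754 results
```

Either answer is correct by the standard; only the grouping differs.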

An important note is that floating-point rounding doesn’t directly cause non-determinism. LLM inferencing and training both involve billions of floating-point operations, but in principle they could be done in the same order every time – it’s just one large math problem. The same rounding errors would occur every time, and the same tokens would be predicted with complete reliability.

The problem is that the same mathematical operation isn’t occurring every time – not during LoRA training, nor during inferencing.

In theory, if running a one-inference-at-a-time system, on a software stack specifically designed not to multitask, this problem could be overcome at an inferencing level. Naturally, this is not a common use case. LLM inferencing systems are almost always used by multiple users simultaneously, and GPU hardware is built with multitasking in mind.

Large-scale floating-point matrix multiplication (matmul) is at the heart of inferencing. Matmuls, due to their scale, have to be spread across multiple cores simultaneously. Multiple concurrent inferences, with each individual operation spread across multiple GPU cores, require a scheduling process to ensure the GPU is optimally utilized. This scheduling introduces order-of-operation variance, which produces the floating-point non-determinism described above: (a + b) + c ≠ a + (b + c).
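
The effect of the reduction schedule can be seen without a GPU. A sketch in Python, using `math.fsum` as a correctly rounded reference; the pairwise tree stands in for the chunked partial sums a parallel reduction produces, and the numbers are illustrative, not vLLM's:

```python
import math

# Same ten numbers, different reduction schedules, different answers.
xs = [0.1] * 10

sequential = 0.0
for x in xs:              # strict left-to-right accumulation
    sequential += x

def tree_sum(vals):
    # pairwise "tree" reduction -- the shape a parallel GPU reduction takes
    if len(vals) == 1:
        return vals[0]
    mid = len(vals) // 2
    return tree_sum(vals[:mid]) + tree_sum(vals[mid:])

exact = math.fsum(xs)     # correctly rounded reference sum

print(sequential, tree_sum(xs), exact)
assert sequential != exact  # the schedule alone changed the result
```

The left-to-right accumulation and the correctly rounded sum disagree in the last bit; which schedule the hardware picks on a given run decides which answer you get.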

This behavior is present in all modern LLM inferencing software. In most applications it’s harmless; having a chatbot that referred to someone as “Robert” or “Bob” interchangeably (in the same informal setting) is not going to change the outcome or perhaps be noticed. In fact, most systems have a deliberate token-selection randomization element (temperature) set to non-zero to encourage this behavior, adding to a more “human-like” experience. However, when the output must be the same, repeatably, this presents a problem.

3. The Three Problems in Practice

As a reference point, our development equipment consisted of:

  • 1× NVIDIA RTX 6000 Pro (96 GB, Blackwell)
  • 1× NVIDIA H100 (cloud-rented) **
  • 2× NVIDIA H200 (cloud-rented) **
  • vLLM v0.15
  • Axolotl v0.13 (for QLoRA training)
  • Qwen3 dense 14B & 32B models, which were AWQ-quantized to 4-bit weights after training
  • ~4,000 training examples for the portion of the pipeline described here (our full stack is slightly over 10,000 examples)
  • Temperature set to 0.0 globally
  • A smoke test of 163 queries of varying types, complexities, and topics, typically those deliberately pushing edge cases

** Our inferencing was predominantly performed on the Blackwell card, locally. The H100/H200s were rented when we recognized that solving this problem thoroughly was going to require hundreds of fine-tunings to complete our research, and more compute was needed.

We witnessed three unique problems in our testing:

  1. Inferencing producing non-deterministic results during the same vLLM session. In our smoke tests, a subset of queries (3%–10%) would often produce a different token somewhere in the pipeline, when run twice back-to-back, often dramatically altering the downstream results.
  2. A restart of vLLM would sometimes change the results witnessed in problem 1. A stochastic result would become consistent, and another would become stochastic.
  3. A retraining of the model would introduce vast non-deterministic behavior, often resulting in a new failure rate of over 10%.

Referencing the setup above, our temperature was always set to 0.0 while witnessing these results. Setting it to anything non-zero would deliberately introduce a small amount of randomness; in a high-precision right-or-wrong environment with no middle ground, non-zero temperature will always create problems. While it’s easy to disable if investigating a problem like this, setting temperature to 0.0 is typically not the cure-all it’s made out to be.

Problem #1 is, fortunately, well-documented. Thinking Machines has published research on the problem (He, 2025). Unfortunately, the solution is a challenge; in effect it proposes a significant rewrite of most of the traditional inferencing pipeline. The solution isn’t a “download and go”, so it serves more as a point of scientific/academic interest.

Distilled, it comes down to one of the problems already described: the system that keeps the GPU fully utilized by multiple users inherently introduces a non-deterministic order, and with the floating-point truth of (a+b)+c ≠ a+(b+c), the outcome will be that periodically different tokens are chosen, even with the same queries, and even with temperature = 0.0. After that first different token is chosen, the downstream outcome varies greatly, as the model performs its next forward pass that now introduces the “rogue” token.
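
The cascade is easy to picture with a toy greedy decoder. In this sketch (the `toy_model` function and all logit values are invented for illustration), a perturbation far smaller than the near-tie's margin flips one token choice, and every subsequent step conditions on the flipped token:

```python
# Toy greedy decoder. toy_model() and the logit values are invented for
# illustration -- the point is only the mechanics: a tiny perturbation
# flips one near-tie, and every later step conditions on that token.
def toy_model(prefix):
    # deterministic "next-token logits" that depend on the prefix so far
    h = sum(prefix) % 3
    return [float(h), 1.5, 0.5]

def greedy(first_step_logits, steps=5):
    seq = [max(range(3), key=lambda i: first_step_logits[i])]
    for _ in range(steps):
        logits = toy_model(seq)
        seq.append(max(range(3), key=lambda i: logits[i]))
    return seq

run_a = greedy([2.0, 2.0 - 1e-9, 0.0])  # token 0 wins the near-tie
run_b = greedy([2.0, 2.0 + 1e-9, 0.0])  # a tiny nudge: token 1 wins instead
print(run_a, run_b)
assert run_a != run_b  # one flipped token, two different completions
```

Temperature never enters the picture: both runs are pure argmax, and they still diverge.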

Problem #2 has been more difficult for us to document. We’ll reference back to this when we get into how the iterative solution was built, but for now, we can simply attest that we’ve witnessed a vLLM restart introduce additional variance. It’s been difficult to pin down why at a referenceable level without taking some leaps of faith, so for now we’ll call this an observation we made without knowing the exact root cause. More specifically, when working on problem #1, we were typically able to update our prompts to fix the non-deterministic queries from our smoke test and have them iterate repeatedly and successfully, only to restart vLLM for various testing reasons and find a new batch of non-determinism – typically similar to, yet different from, that witnessed in problem #1.

Problem #3 was the most vexing to solve. Our hardware availability and budget dictated using LoRA for iterative work – FFT simply took too long, and renting (for example) an 8× H200 cluster to make it happen in a similar time frame was not affordable. The exact problem was recognized when making a tiny (<0.5%) change to our training data, re-loading the model in vLLM, and discovering a ~10% failure rate in our smoke test. Surprised that such a small change caused such a large regression, we re-trained the model on the exact original data before the 0.5% change, only to find… the 10% failure rate was still present, but on new queries. This led to re-training again with all other inferencing services shut down, training data prepared in advance, a deterministic training-data shuffle, multiple ‘internet suggestions’ for deterministic training enabled, and a known-good set of training data. And the next re-training run produced, yet again, a ~10% failure rate.

We attribute this problem to intruder dimensions, which are described in detail by Shuttleworth et al. in LoRA vs Full Fine-tuning: An Illusion of Equivalence (Shuttleworth, et al., 2024).

The LoRA vs Full Fine-tuning paper is particularly deep, but it boils down to the unique gradient descent in LoRA, compounded with the order-of-operations floating-point arithmetic problem, creating a more “peaky” model noise floor. Where FFT was making tiny nudges to in-place weights to get a different outcome, LoRA introduces entirely new weight components built from scratch. As described in section 2, the same floating-point order-of-operations variance that affects inference also affects training — meaning identical training data, run twice, produces different gradient updates. Because LoRA’s new components emerge from these varied updates rather than refining what already existed, they land differently every retrain — producing the less-predictable weights described in the paper.

"Weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions, while those trained with full fine-tuning do not."

This particular problem is solvable by not using LoRA, but as described above, FFT was not cost effective and was not a viable long-term solution when an application is intended to ingest new traits on a regular basis.

Figure — Variances in model fine-tunings.

Conclusion

The floating-point math variance isn’t controllable using current technology, and is exacerbated by LoRA, so a system must be built that tolerates the variance at all the levels described above.

4. The Naïve Strategy and Its Pitfalls

This article will assume the initial build of a medical application, underpinned by a Text2SQL prompt and model, is already completed and generally working. The underlying model is a stock (not yet fine-tuned) Qwen3-32B. It will further assume that the system is able to answer, as an example, a 20-question smoke test of basic, clearly written questions.

The sample application will look up medical lab results.

Examples of simple complexity could include:

  • Show me all results for patient John Smith.
  • List all tests ordered today.
  • Show me all orders placed by Dr. Rodriguez.

This could be reasonably accomplished in a single, simple prompt. The prompt would include the SQL tables and column names, status fields, join paths, and the database dialect. The prompt would help describe word choice that would align the human-written queries to SQL tables and columns.

Initial testing is successful with the following respective outcomes:

Figure 10 — Simple query flows: natural language to SQL (32B model, no-think path)

Flow A — “Show me all results for patient John Smith.”

SELECT *
FROM   results
WHERE  patient_name = 'John Smith';

Flow B — “List all tests ordered today.”

SELECT *
FROM   orders
WHERE  order_date = CURRENT_DATE;

Flow C — “Show me all orders placed by Dr. Rodriguez.”

SELECT *
FROM   orders
WHERE  physician_name = 'Rodriguez';

(All flows: stage 6 — SQL generation; 32B model; no chain-of-thought.)

We will assume the current prompt is 23 non-blank lines, ~15 distinct rules, or ~400 tokens.

Afterwards, 20 more “medium complexity” queries are added. In the interest of brevity, only one SQL example will be shown per set of queries from this point forward:

  • Which physicians ordered the most tests last month?
  • Show me all abnormal lipid panel results for patients seen by Dr. Rodriguez in Q3 2025.
  • Compare turnaround times between Midwest Diagnostics and Valley Lab for blood work this year.

Using the final one as our specific SQL example:

  • Compare turnaround times between Midwest Diagnostics and Valley Lab for blood work this year.
SELECT lab_name,
       AVG(resulted_at - order_date) AS avg_turnaround
FROM   orders
WHERE  lab_name IN ('Midwest Diagnostics', 'Valley Lab')
  AND  test_name ILIKE '%blood%'
  AND  order_date >= '2026-01-01'
GROUP BY lab_name;

Now, still on an untrained Qwen3:32B, we will assume the prompt has grown to 60 non-blank lines, ~23 distinct rules, or ~1000 tokens.

In order to continue scaling in complexity with this method, the prompt would need to continue to grow to show the model new patterns.*

* It is noteworthy that alternative methods are possible here, such as branching programs to handle different query types, or using dynamic insertions into the prompt to provide ‘helpers’ in the solution. Those methods are viable but are out-of-scope for this document.

Dense instruction-following is particularly troublesome for LLMs.

A transformer model processes information by passing it through a fixed stack of layers. Referencing Qwen3:32B dense, there are 64 layers. Simplistically, the more layers in a model, the more complex a problem it can solve in a single pass. Eventually, however, a model with any number of layers will run out of depth to follow a dense set of instructions thoroughly enough to correctly predict the desired next token. (Jaroslawicz, et al., 2025)

The easy answer may seem to be “get a model with more layers”, which, technically, will work. Need a bigger prompt? → Get a model with more layers. In practice, this has limitations. The number of layers is linked to the parameter count of the model. Having a properly working model with, say, 80 layers instead of 64 would require a 72B model instead of a 32B. (Petty, et al., 2024)

This means roughly twice as much VRAM** (VRAM is typically the limiting factor), and roughly twice as long a timeframe to do a forward pass.

** Assuming the same quantization rate between model sizes, VRAM will scale in this manner. However, increasing quantization rate (to a smaller number of bits) decreases accuracy, and amplifies the problem this article is intended to address.
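
The VRAM claim is easy to sanity-check with back-of-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. This sketch counts weights only; KV cache, activations, and framework overhead come on top:

```python
# Back-of-envelope weight memory: parameters x bits-per-weight / 8 bytes.
# Weights only -- KV cache, activations, and runtime overhead are extra.
def weight_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(32, 4))   # 4-bit 32B: ~16 GB of weights
print(weight_gb(72, 4))   # 4-bit 72B: ~36 GB of weights
print(weight_gb(72, 16))  # the same 72B unquantized at 16-bit: ~144 GB
```

At the same quantization rate, the 32B → 72B jump slightly more than doubles the weight footprint, which is the scaling described above.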

Assuming moving to a larger model isn’t desirable (or possible with current hardware), as more complexity is added and assuming a naïve approach, two things must happen:

  • The prompt will continue to grow to provide further instruction for increasingly complicated patterns.
  • The model will need to be trained on more examples in order to make the model more likely to come to the desired answer inside the 64 layers available. Training works as a “cheat sheet” in this case, allowing the model to jump to the right conclusion more naturally with less processing.

Running through the side-effects of those requirements: the prompt will eventually grow to the point of what’s commonly called “prompt sag”. This is where the prompt and model appear to be working, and generally do for simpler queries; however, seemingly random failures will creep into the smoke test. Due to variance in vLLM, this may even present in a stochastic manner, with queries sometimes working and sometimes not, often back-to-back.

As previously described, training can stave off this effect. Text2SQL prompts must include examples of how to perform the conversion of complex queries. The examples don’t scale well in the prompt – they contribute to prompt sag, just like rules do, and are easy candidates for removal from the prompt and embedding into the model itself. In ChatML format, those may look something like:

{"messages": [{"role": "system", "content": "You are a SQL generator for a lab results management system. Generate a single SQL query based on the user's question. Return SQL only, no explanation.\n\n "}, {"role": "user", "content": "Compare turnaround times between Riverside Labs and Summit Diagnostics for CBC panels this year."}, {"role": "assistant", "content": "SELECT l.lab_name,\n AVG(r.resulted_at - o.order_date) AS avg_turnaround\nFROM results r\nJOIN orders o ON r.order_id = o.id\nJOIN labs l ON o.lab_id = l.id\nJOIN tests t ON o.test_id = t.id\nWHERE l.lab_name IN ('Riverside Labs', 'Summit Diagnostics')\nAND t.test_name ILIKE '%cbc%'\nAND o.order_date >= '2026-01-01'\nGROUP BY l.lab_name;"}]}

Create enough SQL examples, with the right balance, and the prompt sag can be staved off… for a while.
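
Moving an example out of the prompt and into the training set is mechanical: each harvested (question, SQL) pair becomes one ChatML JSONL row like the one above. A hypothetical helper (the function name and abbreviated system prompt are illustrative, not from our codebase):

```python
import json

# Hypothetical helper: turn harvested (question, SQL) pairs into ChatML
# JSONL rows like the one shown above. Names are illustrative.
SYSTEM = ("You are a SQL generator for a lab results management system. "
          "Generate a single SQL query based on the user's question. "
          "Return SQL only, no explanation.")

def to_chatml_row(question, sql):
    return json.dumps({"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
        {"role": "assistant", "content": sql},
    ]})

row = to_chatml_row("List all tests ordered today.",
                    "SELECT * FROM orders WHERE order_date = CURRENT_DATE;")
print(row)  # one training example per line, ready for a JSONL dataset
```

Writing one row per line yields the JSONL file a trainer such as Axolotl consumes.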

As a final step, let’s add some truly complex, multi-step queries such as:

  • Which physicians had the highest abnormal result rate for thyroid tests in 2025, and how many total tests did each order that year?
  • Which patients had abnormal results on two or more different test types in the same month?
  • What was the busiest month for each lab in 2025, and how many orders did they process that month?

Taking the final one as our SQL example:

WITH monthly_orders AS (
    SELECT o.lab_id,
           DATE_TRUNC('month', o.order_date) AS order_month,
           COUNT(*) AS total_orders
    FROM orders o
    WHERE EXTRACT(YEAR FROM o.order_date) = 2025
    GROUP BY o.lab_id, DATE_TRUNC('month', o.order_date)
),
peak_month AS (
    SELECT lab_id, order_month, total_orders,
           ROW_NUMBER() OVER (PARTITION BY lab_id ORDER BY total_orders DESC) AS rn
    FROM monthly_orders
)
SELECT l.lab_name, pm.order_month, pm.total_orders
FROM peak_month pm
JOIN labs l ON pm.lab_id = l.id
WHERE pm.rn = 1
ORDER BY pm.total_orders DESC;

This is non-trivial for a model to generate: a complex, multi-step common table expression (CTE). Qwen3:32B can certainly generate it, but it needs very clear direction without competing signals. As a hint as to where this is going, consider that most queries so far have been written the way a programmer might pseudocode a SQL query in English, not the way the average human speaks.

At this point, by our estimate, the prompt would be about 170 non-blank lines, ~29 distinct rules, or ~2,200 tokens. This is getting to the point where it is a real struggle for a 32B model to solve in one pass, training or no.

Assuming the most complicated queries were well-known and trained into the model, we witnessed this was sustainable in one pass/single prompt on Qwen3:32B. However, there are severe scalability problems. The natural use of any LLM-based chatbot is to answer any possible question the user can come up with for its domain. Our interpretation of that was – if anyone would ever want it written in SQL, it should be interpretable from natural language.

There are several scaling problems with this, but the two most immediately relevant are:

  • Unless vast training data already exists for the exact database being used as the back-end, it’s impossible to train every possible CTE (multi-step) query in advance. Therefore, they must be harvested as users try, fail, and report problems.
  • The prompt will always scale some with the variety of queries, no matter how good a job was done creating training data. Moreover, hardware limitations may exist that mean training can only be done every so often, meaning a patch to the system has to be made in the prompt. Eventually the system will experience prompt sag with this method.

A problem tends to emerge from this naïve process:

  1. User finds a new, but valid, query that hadn’t been thought of previously. It’s reported to the development team.
  2. The development team patches the problem by adjusting the prompt, as this can be done and tested quickly unlike training.
  3. The additions to the prompt add up, causing prompt sag, which causes previously-working queries to start to fail in unexpected manners.
  4. The prompt is reinforced to fix the new bugs found in step 3: a series of “never do”, “required”, or “CRITICAL” word choices, and messy reshuffling of the prompt to shift attention. Sometimes, a balance can be reached, and the system stabilizes.
  5. For one reason or another, vLLM is restarted. New failures emerge as inferencing is non-deterministic from one vLLM process to another.
  6. The problem is addressed as prompt sag, and candidates of the prompt-reinforcement performed in steps 3 and 4 are taken as training examples. This reduces the prompt 10–20% to relieve the pressure on the limited layers in a forward pass.
  7. The LoRA retraining introduces new intruder dimensions**. We regularly witnessed up to 10% new failure rate of queries that were stable beforehand.
  8. The prompt is updated to work around the problems from the intruder dimensions. The 10–20% relief experienced from the prompt being shrunk in step 6 is now gone – all used up by adding ‘fixes’ to offset model flaws. Sometimes, this still produces a functioning product without experiencing noticeable sag.
  9. Loop to step 1 the next time a new query is discovered by an end-user.***

** It’s important to note that this paper assumes that the model had been fine-tuned with LoRA previously, from base model → fully trained. That first model had intruder dimensions as well, but the “bugs” caused were managed by the prompt. In effect, most dense instruction prompts end up carefully worded to avoid the known model bugs. The problem is that the model bugs shift every retraining, as the retraining isn’t deterministic.

*** This scenario is actually best-case. In reality, end-users type to LLMs in the same way they’d hold a chat conversation with another human, which introduces even more significant problems, which will be explored in section 5.

Referencing the original conclusion: The floating-point math variance isn’t controllable using current technology, and is exacerbated by LoRA, so a system must be built that tolerates the variance at all the levels described above.

This system does not tolerate the variance. A strategy change is required.

5. The Human (Language) Element

While inefficient, the process described above can scale to a certain degree – possibly even on a single prompt on a 32B model. However, what really creates complexity in Text2SQL is natural human language. This is harder to build for than it seems. Our programming team understood the SQL format in advance, and then asked questions that were effectively pseudo-SQL: The queries have a natural tendency to be “Retrieve X when Y and Z” which translates cleanly to “SELECT X WHERE A = Y AND B = Z”. Even when more complex questions were asked, they were inadvertently still simply more complex pseudo-SQL.

Bringing in external users introduces an order of magnitude more complexity, and a matching amount of compute to resolve them.

Writing an exhaustive list of linguistic problems is not possible, however, here is a sample query that will illustrate the basic issue:

"Hi chatbot, good morning! Give me a bulleted breakdown by physician of how many abnormal CBC results came back this quarter across all departments, sorted highest to lowest, and only show physicians involved with 3 or more COVID 19 related abnormal results, and ignore the normal results. Thank you."

Complexities:

  • ‘Hi chatbot, good morning’ – useless preamble – zero value to either the SQL lookup or the final output, needs to be ignored or removed.
  • ‘give me a bulleted breakdown’ – this is a descriptor of presentation type, not a SQL lookup. Also, ‘breakdown’ could mean any number of formats. ‘breakdown’ needs to be normalized into something predictable, and this entire string needs to be saved for a presentation layer, not the SQL lookup.
  • ‘this quarter’ – requires the system to know the current date, as well as the start and end of the quarter. Aside from not knowing the current date – which will need to be injected into the prompt using a programming language (Python is assumed) – Qwen3, as with many smaller LLMs, is quite bad at relative dates.
  • ‘across all departments’ – useless “match all” filter from human language. In human speech, it’s quite normal for one human to tell another that they want “All” of something – all categories, departments, types, years, etc. On the flip side, this is a completely useless filter for SQL. A “SELECT * from X” query is as simple as they come. Adding a filter is more complicated in SQL generation: “SELECT A from X WHERE Z, GROUP BY AA”. In effect, “all” is an implied trait in SQL. We ended up calling these “* filters” – things that meant “don’t filter” from human speech unnecessarily → strip or ignore.
  • ‘involved with’ – vague relational predicate – does that mean the physician created, reviewed, or performed the tests? A system must be derived to take a reasonable guess at the context. Best done with extra database dips, but even then, it’s difficult to get the SQL model to make a predictable choice. → best to normalize to a predictable filter.
  • COVID 19 – compound noun (with a number) – deliberately omitting the dash (COVID-19) to make a point: any human would easily read COVID 19 as a single item, but an LLM will be required to use extra layers to determine if it meant “COVID” and some quantity of 19, some identifier of 19, or COVID-19.
  • ‘and ignore the normal results’ – a tautological phrase – one restating the same information that adds nothing. → strip or ignore.
  • ‘Thank you’ – useless human pleasantries. → strip or ignore.
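
The “strip or ignore” categories above can be illustrated with a hypothetical first-pass normalizer. A production pipeline does this with a fine-tuned model pass, not regexes; this sketch (with invented, illustrative patterns) only shows the intent:

```python
import re

# Hypothetical first-pass normalizer covering the "strip or ignore"
# categories above: greeting preamble, "* filters", and pleasantries.
# The patterns are illustrative, not exhaustive.
STRIP_PATTERNS = [
    r"^\s*hi\b[^.!]*[.!]\s*",                               # greeting preamble
    r"\s*\bacross all (departments|labs|categories)\b,?",   # "* filters"
    r"\bthank you[.!]?\s*$",                                # pleasantries
]

def normalize(query):
    for pat in STRIP_PATTERNS:
        query = re.sub(pat, "", query, flags=re.IGNORECASE)
    return query.strip()

print(normalize("Hi chatbot, good morning! List all tests ordered today "
                "across all departments. Thank you."))
# -> "List all tests ordered today."
```

The harder categories – vague relational predicates, relative dates, compound nouns – are exactly the ones a regex cannot handle, which is why they end up in a model pass.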

Now our 32B model is stuck having to rationalize all eight of these brand-new types of data – on top of its original job, which was already on the threshold of prompt sag – with just 64 layers to produce a viable result.

The system will inevitably fail to scale with these types of problems introduced. There are simply too many ways to express the same searches and filters in human language.

There are four practical solutions:

  1. Decompose the problem into smaller steps with an agentic system. Viable, but harder to do on commodity hardware.
  2. Use a bigger model with more layers.
  3. Utilize Chain-of-Thought (CoT) to externally serialize the problem.
  4. Build in normalization layers and perform multiple passes at the data.

We chose a mix of options 3 and 4. Option 1 was deemed unnecessarily difficult given the hardware and models at our disposal. For option 2, we had the hardware to run and train a dense 72B model at Q4, but it limited the market where the application could be deployed, and was also considered an “arms race” – likely not going to solve the problem in its entirety, and the problem would recur when complexity increased again, needing an even bigger model.

The overall goal is to:

  1. Start with: “Hi chatbot, good morning! Give me a bulleted breakdown by physician of how many abnormal CBC results came back this quarter across all departments, sorted highest to lowest, and only show physicians involved with 3 or more COVID 19 related abnormal results, and ignore the normal results. Thank you.”
  2. Normalize to pseudo-SQL: “Show abnormal CBC results per physician from 2026-01-01 to 2026-04-01 for COVID-19 related tests, grouped by physician, only physicians with 3 or more abnormal results, sorted highest to lowest.”
  3. Translate to SQL:
SELECT ph.physician_name,
       COUNT(*) AS abnormal_cbc_count
FROM results r
JOIN orders o      ON r.order_id = o.id
JOIN physicians ph ON o.physician_id = ph.id
JOIN tests t       ON o.test_id = t.id
WHERE r.status = 'ABNORMAL'
  AND t.test_name ILIKE '%cbc%'
  AND r.notes ILIKE '%covid%'
  AND o.order_date >= '2026-01-01'
  AND o.order_date < '2026-04-01'
GROUP BY ph.physician_name
HAVING COUNT(*) >= 3
ORDER BY abnormal_cbc_count DESC;
  4. Presentation layer: Restore the original question from Step 1 (contains the presentation requirements) and include the outcome from the SQL lookup to generate a proper response.

6. Breaking the Cycle with Serialization

Recapping the decision above – using CoT or using more passes at the model – the goal is in effect to find a way to stretch beyond the limits of the initial 64 layers by further serializing the work. As a point of reference, while LLM inferencing is parallelized with a GPU, that’s only per-layer. Layer 1 is processed in a parallel manner, then it’s a serial step to layer 2, so on and so forth. The desire is to have more than 64 layers, or at least an equivalence of such, without increasing the amount of VRAM.

Here’s how each of those solutions applied.

When a task can be worked on in discrete steps, unique from all others, it can be logically divided into a serialized framework as such:

Figure — Sequential inference pipeline

[Input] → [task 1] → [task 2] → · · · → [task N] → [final task — heavy lifting] → [Output]

(each task: 32B-parameter model, 64-layer forward pass)

Figure. Sequential inference pipeline over N intermediate tasks, each executed with a 32B-parameter model across a 64-layer forward pass, followed by a final aggregation task of identical specification responsible for synthesizing prior outputs.

In effect, this allows all the layers needed to complete a serializable problem. On one hand, it is infinitely scalable: if the problem gets harder, as long as it can be divided into steps, simply add more passes at the model with a new prompt to handle each task. On the other hand, loss-of-intent can be a real problem. Task 3 typically does not see Task 1’s input, as the goal is to have each stage focus on a specific problem and be unaware of the larger task. This is not a requirement of the design — it is possible to pass the original task to every stage, but this requires much more careful prompt engineering and was not needed for our use case. Illustrating the problem more thoroughly: if task 1 strips data that wasn’t necessary for the final task but added context, tasks 2 through N may make undesirable decisions in the interim.
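
The serialized framework in the figure can be sketched in a few lines. Here `infer` stands in for a real model call (the stub below just tags text so the stage order is visible; all names are illustrative):

```python
# Minimal sketch of the serialized framework above: each task is its own
# prompt plus a fresh forward pass over the same model. `infer` stands in
# for a real inference call; the stub just tags text to show stage order.
def run_pipeline(user_input, task_prompts, infer):
    text = user_input
    for prompt in task_prompts:
        # each stage sees only the previous stage's output, not the
        # original input -- the loss-of-intent trade-off described above
        text = infer(prompt, text)
    return text

def fake_infer(prompt, text):
    return f"[{prompt}] {text}"

out = run_pipeline("show labs",
                   ["strip-pleasantries", "normalize", "to-sql"],
                   fake_infer)
print(out)  # -> [to-sql] [normalize] [strip-pleasantries] show labs
```

Adding a new stage is just another prompt in the list, which is what makes the design scale with problem difficulty.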

The other solution, CoT, can be both easier and harder to implement.

In layman’s terms, CoT allows external serialization of a problem in the form of thinking tokens. The full explanation can be found in “Chain of Thought Empowers Transformers to Solve Inherently Serial Problems” (Li, Z., et al., 2024).

A way to consider it: the layer count means less, as CoT extends computation through autoregressive generation of intermediate tokens, instead of just taking a gut-reaction — which is what a no-think model does. The real-world comparison would be:

  • No-Think: Giving a test-taker a complicated math problem and asking them to work it out in their head, on a strict timer, and then simply produce the “best they can do” answer at the end, right or wrong.
  • Think: Giving a test-taker a complicated math problem, a piece of paper and a pencil, and allowing them to ask questions and rewrite their answer until they feel confident.

CoT allows a smaller model to solve much larger problems than would be possible in traditional "no think" mode. In fact, the Qwen3:14B model mentioned in our design is there specifically because some of the latter layers required thinking, and on those layers the 14B model with CoT enabled consistently produced better results than the 32B model with CoT. Why not always use the 32B with CoT? We do in some cases, such as CTE-based SQL builds, but for advanced natural language processing the 14B was more than up to the task and generated its thinking tokens much faster than the 32B could in think mode.

One additional plus of CoT is that it handles dense instruction sets much better than non-CoT. Instead of glancing at a list of 40 instructions and skimming for which ones apply, it can read the rules, note which ones apply to the current query, put them on a scratch pad (the thinking tokens), and then iterate over its own results. Given the 40-rule example, it can work out that instructions 4, 17, 18, and 38 are the relevant ones and condense them into a new set of 4 instructions, ignoring the other 36 it would otherwise have been solving for.

CoT likely seems ideal at this point, but there are some drawbacks:

  • It can “think itself into bad answers”. There are circumstances where no-think models will gut-react to the right answer, while think models will perseverate over the right answer, and ultimately “talk” themselves into the wrong one.
  • It takes considerably more compute. It is not uncommon to get 6,000–10,000 thinking tokens back from Qwen3:32B. If the original no-think answer was 30 tokens, it’s now hundreds of times slower than the no-think response.

Our solution was to use CoT selectively, at the end stages only and behind a complexity gate:

Figure 2 — Sequential pipeline with selective chain-of-thought gating

Figure 2. Extended pipeline with selective chain-of-thought (CoT) activation. After N standard 32B forward passes, two successive complexity gates route each request to either a lightweight no-think path or a CoT-augmented path. Gate 1 targets wording and filter complexity: if the prompt state at task N contains more than 4 SQL-filterable elements, or still contains awkward or known-problematic wording, the final wording cleanup (task N+1) runs on the 14B model with CoT; otherwise it runs on the 32B model in no-think mode. Gate 2 targets multi-step dependency and SQL generation: if the prompt contains a multi-step dependency (solution B requires solution A) or known-problematic wording, SQL generation (task N+2) runs on the 32B model with CoT; otherwise it runs on the 32B model in no-think mode. CoT is only engaged when prompt complexity exceeds defined thresholds, minimising unnecessary compute overhead.
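The gating logic above is simple enough to implement deterministically outside the model. A minimal sketch, assuming a crude `count_filterable_elements()` heuristic and an illustrative `PROBLEMATIC_PATTERNS` list (both hypothetical; production versions are tuned per deployment):

```python
import re

# Illustrative, hypothetical patterns; production lists are tuned per deployment.
PROBLEMATIC_PATTERNS = [r"\bbusiest\b", r"\bmost significant\b"]

def count_filterable_elements(query: str) -> int:
    # Crude proxy for "SQL-filterable elements": comma-separated clauses
    # plus explicit comparators and "N or more"-style phrases.
    comparators = re.findall(r">=|<=|[<>=]|\b\d+ or more\b", query)
    return query.count(",") + len(comparators)

def route_gate1(query: str) -> dict:
    # Gate 1: > 4 filterable elements or known-problematic wording -> 14B + CoT.
    problematic = any(re.search(p, query, re.IGNORECASE) for p in PROBLEMATIC_PATTERNS)
    if count_filterable_elements(query) > 4 or problematic:
        return {"model": "qwen3:14b", "mode": "think"}
    return {"model": "qwen3:32b", "mode": "no-think"}
```

Because the gate is ordinary code rather than a model call, it adds effectively zero latency and its routing decisions are fully reproducible.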

7. The Prompt Specifics

The process of proving out the theory, experimenting with prompt sizes, instruction formats, what needed to be trained, etc., took hundreds of LoRA trainings and thousands of prompt rebuilds. The most interesting findings were as follows:

At what complexity level does prompt sag hit?

There isn’t a specific answer, but there are some measurable attributes for Qwen3:32B:

  • ~40 interrelated rules. Example: if all the rules are about cleaning up preambles in different ways, the model will stretch to roughly 40 before sagging.
  • ~20 unrelated rules. Example: 20 different ways to build SQL.
  • ~3,000–6,000 prompt tokens.

What can be done to maximize results?

  • Keep the macro-tasks to a maximum of three. This doesn't relax the rule limits above (the smaller rules still need to stay under 40); it means the overall tasks of the model should not exceed three.
  • Macro-tasks include things like "normalize dates", "strip preambles", etc.
  • For complex tasks, one or two macro-tasks per prompt is best.

More tokens are not necessarily a negative. We have seen results improve when a prompt carries more detail about a singular purpose. The scaling problem is the number of unique rules, not the token count. This does, of course, eventually break down: we've not had a working prompt over 7,000 tokens on Qwen3:32B.

The goal at this point is to build a serialized set of prompts that are overwhelmingly strong and always produce consistent results. This is not a replacement for model fine-tuning; the prompts and the model work in overlapping conjunction. The model holds the examples (as shown earlier in the paper); the prompts hold the instructions.

[Prompt & Model in Perfect World]

As the system grows, more training examples are added and LoRA training is performed again. Each retraining shifts the noise floor, thanks to intruder dimensions (Shuttleworth et al., 2024).

The temptation that must be avoided is letting any one prompt grow to the point where it no longer clears the noise floor. Prompts that are only just meeting expectations will eventually fail as the floor shifts. Keep prompts inside the limits described here for safety.
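A simple guard can enforce these limits mechanically at prompt-build time. The sketch below encodes the thresholds observed for Qwen3:32B in this paper; the 4-characters-per-token estimate is an assumption, and the model's real tokenizer should be used in production:

```python
def check_prompt_budget(rules: list[str], related: bool = True) -> list[str]:
    # Thresholds from this paper's Qwen3:32B observations: ~40 interrelated
    # rules, ~20 unrelated rules, and a ~6,000-token ceiling.
    warnings = []
    rule_limit = 40 if related else 20
    if len(rules) > rule_limit:
        warnings.append(f"rule count {len(rules)} exceeds ~{rule_limit}")
    # Assumption: ~4 characters per token; use the model's tokenizer in practice.
    est_tokens = sum(len(r) for r in rules) // 4
    if est_tokens > 6000:
        warnings.append(f"~{est_tokens} estimated tokens exceeds ~6,000")
    return warnings
```

Wiring a check like this into CI for the prompt repository catches creeping prompt growth before the noise floor does.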

[Naive Approach]

A note on CoT: while we've had CoT go astray repeatedly when there are contradictory instructions, we've never actually hit a hard ceiling on rule count. We typically maintain separate prompts for think and no-think modes, and much denser, more disparate rule sets can work in a CoT prompt.

Why then, not always use CoT?

  • When it goes wrong, it goes very wrong. We've used the term "death CoT" for cases where the model talks itself into a loop. A different type of care is needed.
  • When a task is serializable, performing multiple passes with a 32B model is considerably faster than performing a CoT with a 14B model, and produces more repeatable results.

8. Decomposing the Pipeline

This paper describes a framework to take ‘messy’ human language, turn it into pseudo-SQL, and then convert that pseudo-SQL into actual SQL.

The complexity of the queries dictates the number of passes at the model. There is no practical limit to the number of passes so long as the instructions can be compartmentalized. It is advisable to make the passes sequentially dependent, as this allows a ‘narrowing’ effect on the query.
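The sequential, compartmentalized passes described above can be sketched as a simple driver loop. `llm` here is a placeholder for whatever inference client is in use (vLLM, Ollama, etc.), and the stage prompts are abbreviated stand-ins, not the production prompts:

```python
from typing import Callable

def run_pipeline(query: str, stages: list[tuple[str, str]],
                 llm: Callable[[str, str], str]) -> str:
    # Each stage sees only the previous stage's output (sequential dependence),
    # which produces the "narrowing" effect described above.
    text = query
    for _name, stage_prompt in stages:
        text = llm(stage_prompt, text)
    return text

# Abbreviated stand-ins for the real stage prompts.
STAGES = [
    ("date_normalization",   "Resolve relative dates to ISO date ranges ..."),
    ("noise_removal",        "Strip preambles and presentation directives ..."),
    ("query_normalizer",     "Rewrite interrogatives into 'show ...' pseudo-SQL ..."),
    ("reference_normalizer", "Replace vague references with concrete filters ..."),
    ("misc_cleanup",         "Remove tautological or leftover wording ..."),
    ("sql_generation",       "Emit final SQL from the pseudo-SQL ..."),
]
```

The stage list makes the compartmentalization explicit: adding a pass is a one-line change, and each stage can be tested in isolation against golden inputs and outputs.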

Our example pipeline started as a single prompt: functional, but constantly close to the effects of prompt sag.

Figure 3 — Baseline unified pipeline

Figure 3. Baseline pipeline in which all tasks are handled by a single undifferentiated 32B-parameter model in a 64-layer forward pass, with no intermediate task decomposition, complexity gating, or selective chain-of-thought routing.

There’s a surprising starting point: date normalization should be its own prompt, and it should come first, because relative-date context can be lost later in the pipeline. Qwen3 is terrible with relative dates; “300 days prior” is effectively unsolvable for it, averaging a 10% miss on the target date. When resolving relative dates (“last week”, “last month”, etc.), it’s best to have the model extract the request and resolve it in Python. Whatever the plan is for dates, do it first, and give it its own pass and prompt. We recommend converting to ISO dates for downstream prompt clarity.
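A minimal sketch of the extract-then-resolve-in-Python approach, assuming the model has already extracted the relative-date phrase; `resolve_relative` and its supported phrases are illustrative, not an exhaustive production resolver:

```python
from datetime import date, timedelta

def resolve_relative(phrase: str, today: date) -> tuple[date, date]:
    # Returns an inclusive (start, end) ISO date range for a relative phrase.
    phrase = phrase.lower().strip()
    if phrase == "last week":
        # Previous Monday-to-Sunday window relative to `today`.
        start = today - timedelta(days=today.weekday() + 7)
        return start, start + timedelta(days=6)
    if phrase.endswith("days prior"):
        # e.g. "300 days prior" resolves to a single exact date.
        n = int(phrase.split()[0])
        d = today - timedelta(days=n)
        return d, d
    raise ValueError(f"unhandled phrase: {phrase!r}")
```

Unlike the model, this arithmetic is exact and deterministic every run, which is the entire point: “300 days prior” stops being a 10%-miss guess.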

Figure 4 — Two-stage decomposed pipeline

Figure 4. Two-stage decomposed pipeline in which date normalization is performed as a discrete upstream pass before SQL generation, each executed independently with a 32B-parameter model across a 64-layer forward pass. Task decomposition ensures that temporal ambiguities are resolved prior to query construction.

Next, noise removal. This is typically a three-task prompt at a macro level:

  • Remove useless preambles – “Hi Chatbot”, “I’d like to know about”, “I was wondering if you could”.
  • Remove presentation directives (“give me a table”, “format as CSV”, “produce a bulleted list”) – None of these are needed for the SQL query.
  • Remove single connecting filler words (“regarding”, “involving”, “that were”) – these cause an undue amount of model confusion in later passes.

Figure 5 — Three-stage decomposed pipeline

Figure 5. Three-stage decomposed pipeline in which date normalization, noise removal, and SQL generation are each executed as discrete 32B-parameter forward passes across 64 layers. Sequential staging ensures that temporal ambiguities are resolved and extraneous input is filtered prior to query construction.

Then, query normalization. The primary task is to begin restructuring the many ways a question can be asked into something that reads as pseudo-SQL. Interrogative openers such as “What is”, “Who are”, and “Please find” all eventually need to resolve to a SQL verb (SELECT, WITH, etc.). We found using SELECT in pseudo-SQL inelegant and used “show” as our opening verb, but this was purely aesthetic (with no small amount of the company’s Cisco origins coming through in “show”). The verb itself makes no difference, but normalizing to a single format means both that the final SQL prompt has considerably less work to do and that only one format has to be trained into the model. Do not let a variety of question forms reach the final SQL model. By the final pass before the SQL generator, the query should read like pseudo-SQL; a human should be able to construct the SQL directly from it.

Other cleanups can be performed in this prompt. Synonyms should be normalized. If the application is regularly accessing a particular type of data, invariably the user base will find a dozen ways to refer to that same data. Change them all into one canonical form in this pass.
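Synonym normalization can also be applied as a deterministic pre-pass rather than (or alongside) a prompt instruction. A sketch with an illustrative mapping; the actual canonical forms are maintained per deployment:

```python
import re

# Illustrative mapping only; real canonical forms are maintained per deployment.
SYNONYMS = {
    r"\b(?:complete blood count|blood count)\b": "CBC",
    r"\b(?:doctor|provider)\b": "physician",
}

def canonicalize(query: str) -> str:
    # Replace every known synonym with its single canonical form so that
    # only one surface form ever reaches the SQL-generation model.
    for pattern, canonical in SYNONYMS.items():
        query = re.sub(pattern, canonical, query, flags=re.IGNORECASE)
    return query
```

Pushing this into code where possible shrinks the rule count the normalizer prompt has to carry, which matters given the prompt-sag limits described in Section 7.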

This is the largest and most load-bearing layer outside of the final SQL generation layer.

Figure 6 — Four-stage decomposed pipeline

Figure 6. Four-stage decomposed pipeline in which date normalization, noise removal, query normalization, and SQL generation are each executed as discrete 32B-parameter forward passes across 64 layers. Sequential staging ensures temporal ambiguities are resolved, extraneous content is filtered, and query structure is standardized prior to final SQL construction.

The next layer, the Reference Normalizer, defines vague references: “busiest”, “most significant”, “multiple”, etc. A SQL generator should not be left guessing. Terms like these have no concrete meaning and must be defined, either statically or by context, e.g. “most significant” -> “longest patient stay”. Use this layer to fix them up; do not leave ambiguity for the SQL generator, because it takes layers to come to a decision on them, and we need all the attention available for the final step.

Figure 7 — Five-stage decomposed pipeline

Figure 7. Five-stage decomposed pipeline in which date normalization, noise removal, query normalization, reference normalization, and SQL generation are each executed as discrete 32B-parameter forward passes across 64 layers. Sequential staging ensures temporal ambiguities are resolved, extraneous content is filtered, query and reference structures are standardized, and all inputs are fully prepared prior to final SQL construction.

And the final cleanup pass before SQL generation: Miscellaneous Cleanup. Do not omit this step. It is a catch-all for unexpected problems, and for problems from upstream layers that had to be handled downstream due to attention limits.

We also handled tautological statements in this pass. For example, “and ignore the normal results”: excess words that add no meaning, since the query already restricts to abnormal results.

Figure 8 — Six-stage decomposed pipeline

Figure 8. Six-stage decomposed pipeline in which date normalization, noise removal, query normalization, reference normalization, miscellaneous cleanup, and SQL generation are each executed as discrete 32B-parameter forward passes across 64 layers. Sequential staging ensures all forms of input ambiguity and structural irregularity are resolved in dedicated passes prior to final SQL construction.

Some final important pieces: this is a hypothetical scenario, and in production we use CoT selectively in more of the final layers. Those additions are domain-specific and explaining them would not add value to the technical narrative, so we simply illustrate the idea here.

Figure 9 — Seven-stage decomposed pipeline with conditional CoT branching

Figure 9. Seven-stage decomposed pipeline introducing conditional CoT branching at stages 6 (domain-specific linguistic changes) and 7 (SQL generation). A complexity gate routes low-complexity inputs through standard 32B forward passes, while high-complexity inputs trigger chain-of-thought reasoning: a lighter 14B + CoT model at 40 layers for the linguistic changes, and a full 32B + CoT model at 64 layers for SQL generation. The original prompt is merged back in at the output stage.

Note the model differences at the two CoT gates. For linguistic work, we have not run out of headroom with 14B CoT. For SQL generation, there was a notable difference: the 14B had trouble building CTEs correctly.

In the final step, the original query is fed back into narrative generation. In practice, the user-facing output of a Text2SQL system is rarely raw SQL results; the SQL results are passed to a narrative generator. Adding the original query back in preserves the original formatting requests (“generate as CSV”, “respond in Spanish”, etc.).
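A sketch of how the narrative-generation request might be assembled, with the original query re-attached so formatting requests survive; the template wording is an assumption, not the production prompt:

```python
def build_narrative_prompt(original_query: str, rows: list[dict]) -> str:
    # Re-attaching the original query preserves formatting requests
    # ("generate as CSV", "respond in Spanish", etc.) that earlier
    # pipeline stages deliberately stripped.
    table = "\n".join(str(r) for r in rows)
    return (
        "Answer the user's question using ONLY the query results below.\n"
        "Honor any formatting or language requests in the original question.\n\n"
        f"Original question: {original_query}\n\n"
        f"Results:\n{table}\n"
    )
```

The key design choice is that only the narrative stage ever sees both the raw user text and the executed results; every intermediate stage works on the narrowed query alone.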

9. Putting it all together

Our complex example query:

Figure 10 — End-to-end worked example: query transformation pipeline

Original user query:

“Hi chatbot, good morning! Give me a bulleted breakdown by physician of how many abnormal CBC results came back this quarter across all departments, sorted highest to lowest, and only show physicians involved with 3 or more COVID 19 related abnormal results, and ignore the normal results. Thank you.”

Stage 1 — Date Normalization (32B · 64L): temporal references resolved.

“Hi chatbot, good morning! Give me a bulleted breakdown by physician of how many abnormal CBC results came back 2026-01-01 through 2026-03-31 across all departments, sorted highest to lowest, and only show physicians involved with 3 or more COVID 19 related abnormal results, and ignore the normal results. Thank you.”

Stage 2 — Noise Removal (32B · 64L): conversational filler and presentation directives stripped.

“Breakdown by physician of how many abnormal CBC results came back 2026-01-01 through 2026-03-31 across all departments, sorted highest to lowest, and only show physicians involved with 3 or more COVID 19 related abnormal results, and ignore the normal results.”

Stage 3 — Query Normalizer (32B · 64L): ambiguous phrasing standardized.

“Show count abnormal CBC results from 2026-01-01 through 2026-03-31 by physicians across all departments, sorted highest to lowest, with physicians with 3 or more COVID 19 related abnormal results, and ignore the normal results.”

Stage 4 — Reference Normalizer (32B · 64L): implicit references and scope resolved.

“Show count abnormal CBC results from 2026-01-01 through 2026-03-31 by physicians across all departments, sorted highest to lowest, having COVID-19 abnormal results count >= 3, and ignore the normal results.”

Stage 5 — Miscellaneous Cleanup (32B · 64L): redundant and tautological conditions collapsed.

“Show count abnormal CBC results from 2026-01-01 through 2026-03-31 by physicians, sorted highest to lowest, having COVID-19 abnormal results count >= 3.”

Stage 6 — SQL Generation (32B · 64L, + CoT). CoT gate triggered: element count exceeded threshold. Query fully prepared as pseudo-SQL; CoT engaged for multi-join, HAVING, and date-range complexity.

SELECT ph.physician_name,
       COUNT(*) AS abnormal_cbc_count
FROM   results r
JOIN   orders o      ON r.order_id     = o.id
JOIN   physicians ph ON o.physician_id = ph.id
JOIN   tests t       ON o.test_id      = t.id
WHERE  r.status   = 'ABNORMAL'
AND    t.test_name ILIKE '%cbc%'
AND    r.notes     ILIKE '%covid%'
AND    o.order_date >= '2026-01-01'
AND    o.order_date <  '2026-04-01'
GROUP BY ph.physician_name
HAVING   COUNT(*) >= 3
ORDER BY abnormal_cbc_count DESC;

SQL execution results:

physician_name       abnormal_cbc_count
Dr. Sarah Chen       12
Dr. Marcus Webb      8
Dr. Priya Patel      6
Dr. James Okafor     4
Dr. Lisa Fontaine    3

Narrative Generator: SQL results + original prompt combined for response generation.

Final user-facing response:

Here are the physicians with 3 or more COVID-19 related abnormal CBC results this quarter:

  • Dr. Sarah Chen — 12 abnormal results
  • Dr. Marcus Webb — 8 abnormal results
  • Dr. Priya Patel — 6 abnormal results
  • Dr. James Okafor — 4 abnormal results
  • Dr. Lisa Fontaine — 3 abnormal results

Figure 10. End-to-end worked example of the six-stage pipeline applied to a complex natural-language clinical query. Each stage progressively refines the input (resolving temporal references, removing conversational noise, standardizing phrasing, collapsing redundancies) until a pseudo-SQL form is achieved. The CoT gate is triggered at Stage 6 because the element count exceeds the complexity threshold. SQL execution results are combined with the original prompt by the narrative generator to produce the final user-facing response.

10. Conclusion

vLLM-served models do not produce absolutely consistent results. In high-precision applications, this can lead to unexpected failures.

LoRA fine-tuned models produce an unpredictable noise floor each time they are trained from the base model.

Specific-task model passes can overcome the noise floor by providing ultra-clear, limited tasks.

Chaining the tasks together in serial order allows a smaller model to produce results similar to a very large model; in our example, over 448 effective layers are involved. (This is not meant to imply functional parity with a hypothetical frontier model of 400+ layers, but to demonstrate expanding a single task beyond the depth limits of the model.)

CoT can be used to further expand the problem-solving capability of a smaller model.

11. References

Dettmers, T., et al. (2023, May 23). QLoRA: Efficient finetuning of quantized LLMs. arXiv. https://arxiv.org/abs/2305.14314

He, H. (2025, September 10). Defeating nondeterminism in LLM inference. Thinking Machines. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Shuttleworth, J., et al. (2024, October 28). LoRA vs full fine-tuning: An illusion of equivalence. arXiv. https://arxiv.org/html/2410.21228v1#S5

Jaroslawicz, D., et al. (2025, July 15). How many instructions can LLMs follow at once? arXiv. https://arxiv.org/abs/2507.11538

Petty, J., et al. (2024, April 10). The impact of depth on compositional generalization in transformer language models. arXiv. https://arxiv.org/abs/2310.19956

Li, Z., et al. (2024, September 21). Chain of thought empowers transformers to solve inherently serial problems. arXiv. https://arxiv.org/abs/2402.12875
