How We Fine-Tuned Llama 3.2 on Pakistan's Penal Code

Pakistan's Penal Code has 511 sections. Most people — including lawyers — can't recall the exact section number, title, and punishment for a given offence off the top of their head. We built a model that can. Here's exactly how we did it.

Why fine-tune instead of RAG or prompting?

The first instinct for this kind of task is usually Retrieval-Augmented Generation — embed the legal text, retrieve relevant chunks, and pass them to a general model. We considered it. The problem: legal queries are often ambiguous. “What's the punishment for theft?” could map to sections 378–382 depending on context. A retrieval system returns chunks; we needed structured, decisive answers.

We also tested few-shot prompting with GPT-4. It got the section topic right reliably, but section numbers and punishments were frequently hallucinated or misquoted. Fine-tuning the model on every section directly eliminates the hallucination problem at the source.

Building the dataset

We sourced the full Pakistan Penal Code text and structured it into instruction-response pairs. Each training example looks like this:

Instruction: “What is Section 302 of the Pakistan Penal Code?”
Response:“Section 302 — Punishment of Qatl-i-amd. Whoever commits qatl-i-amd shall be punished with death as qisas, or imprisonment for life where qisas is not applicable.”

We generated variations for each section — questions phrased by topic, by section number, and by punishment type — giving us approximately 2,000 training pairs from 511 source sections. Data cleaning focused on removing formatting inconsistencies in the original legal text (inconsistent numbering, mixed Urdu/English terms).

Training with Unsloth + LoRA

We used Unslothfor training, which patches the Llama attention layers for 2× faster throughput and significantly lower VRAM usage — critical when you're not training on an A100.

Key config decisions:

Base model: meta-llama/Llama-3.2-8B-Instruct — instruction-tuned base gives better chat format adherence out of the box
LoRA rank: r=16, alpha=32 — sufficient for domain adaptation without overfitting on 2k examples
Target modules: q_proj, v_proj, k_proj, o_proj
Quantization: 4-bit QLoRA via bitsandbytes — trained on a single 16GB GPU
Epochs: 3 — loss converged cleanly, no signs of overfitting on held-out sections

Total training time: approximately 45 minutes on an RTX 4080. The adapter weights came out at under 200MB.

Evaluation

We held out 50 sections during training and tested the fine-tuned model against them. Evaluation criteria:

Section number accuracy — exact match on the section cited
Punishment accuracy — does the described punishment match the source text?
Format compliance — does the response follow the Section / Title / Punishment structure?

The base model scored poorly on all three. The fine-tuned model hit near-perfect scores on format compliance and section number accuracy. The base model's main failure mode was confidently stating wrong section numbers — the fine-tuned model either answered correctly or declined to guess.

GGUF export and deployment

We merged the LoRA adapter back into the base model weights and exported to GGUF format using llama.cpp's conversion scripts. We used Q4_K_M quantization — a good balance between size (≈5GB) and output quality.

For deployment, we wrapped the GGUF model in a FastAPI server and hosted it on Hugging Face Spaces. The API accepts a message array and returns the model's response. Inference latency: 2–4 seconds per query on the free-tier CPU Space, which is acceptable for a legal reference tool.

Lessons learned

Dataset quality beats quantity. 2,000 clean, consistent pairs outperformed early experiments with 5,000 noisy ones.
Instruction-tuned base models are easier to work with. Starting from -Instruct rather than the raw base model meant the chat format was already baked in — we only had to teach it domain knowledge.
Unsloth is worth using. The speed and memory improvements are real and meaningful on consumer hardware.
GGUF + llama.cpp is still the most practical self-hosted deployment path. vLLM and TGI are faster but require more infrastructure. For a low-traffic tool, GGUF on a CPU Space is a pragmatic choice.

Why fine-tune instead of RAG or prompting?

Building the dataset

We sourced the full Pakistan Penal Code text and structured it into instruction-response pairs. Each training example looks like this:

Training with Unsloth + LoRA

We used Unslothfor training, which patches the Llama attention layers for 2× faster throughput and significantly lower VRAM usage — critical when you're not training on an A100.

Key config decisions:

Base model: meta-llama/Llama-3.2-8B-Instruct — instruction-tuned base gives better chat format adherence out of the box
LoRA rank: r=16, alpha=32 — sufficient for domain adaptation without overfitting on 2k examples
Target modules: q_proj, v_proj, k_proj, o_proj
Quantization: 4-bit QLoRA via bitsandbytes — trained on a single 16GB GPU
Epochs: 3 — loss converged cleanly, no signs of overfitting on held-out sections

Total training time: approximately 45 minutes on an RTX 4080. The adapter weights came out at under 200MB.

Evaluation

We held out 50 sections during training and tested the fine-tuned model against them. Evaluation criteria:

Section number accuracy — exact match on the section cited
Punishment accuracy — does the described punishment match the source text?
Format compliance — does the response follow the Section / Title / Punishment structure?

GGUF export and deployment

Lessons learned

Dataset quality beats quantity. 2,000 clean, consistent pairs outperformed early experiments with 5,000 noisy ones.
Instruction-tuned base models are easier to work with. Starting from -Instruct rather than the raw base model meant the chat format was already baked in — we only had to teach it domain knowledge.
Unsloth is worth using. The speed and memory improvements are real and meaningful on consumer hardware.
GGUF + llama.cpp is still the most practical self-hosted deployment path. vLLM and TGI are faster but require more infrastructure. For a low-traffic tool, GGUF on a CPU Space is a pragmatic choice.

How We Fine-Tuned Llama 3.2 on Pakistan's Penal Code

Why fine-tune instead of RAG or prompting?

Building the dataset

Training with Unsloth + LoRA

Evaluation

GGUF export and deployment

Lessons learned

Want to fine-tune a model on your domain?

How We Fine-Tuned Llama 3.2 on Pakistan's Penal Code

Why fine-tune instead of RAG or prompting?

Building the dataset

Training with Unsloth + LoRA

Evaluation

GGUF export and deployment

Lessons learned

Want to fine-tune a model on your domain?