Pakistan's Penal Code has 511 sections. Most people — including lawyers — can't recall the exact section number, title, and punishment for a given offence off the top of their head. We built a model that can. Here's exactly how we did it.
Why fine-tune instead of RAG or prompting?
The first instinct for this kind of task is usually Retrieval-Augmented Generation — embed the legal text, retrieve relevant chunks, and pass them to a general model. We considered it. The problem: legal queries are often ambiguous. “What's the punishment for theft?” could map to sections 378–382 depending on context. A retrieval system returns chunks; we needed structured, decisive answers.
We also tested few-shot prompting with GPT-4. It got the section topic right reliably, but section numbers and punishments were frequently hallucinated or misquoted. Fine-tuning the model directly on every section tackles the hallucination problem at the source.
Building the dataset
We sourced the full Pakistan Penal Code text and structured it into instruction-response pairs. Each training example looks like this:
Instruction: “What is Section 302 of the Pakistan Penal Code?”
Response: “Section 302 — Punishment of Qatl-i-amd. Whoever commits qatl-i-amd shall be punished with death as qisas, or imprisonment for life where qisas is not applicable.”
We generated variations for each section — questions phrased by topic, by section number, and by punishment type — giving us approximately 2,000 training pairs from 511 source sections. Data cleaning focused on removing formatting inconsistencies in the original legal text (inconsistent numbering, mixed Urdu/English terms).
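To make the variation step concrete, here is a minimal sketch of how one section record could be expanded into several instruction-response pairs. The `Section` fields, the question templates, and the `make_pairs` name are assumptions made for this illustration, not the actual pipeline code. With four templates per section, 511 sections would land close to the roughly 2,000 pairs mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Section:
    number: str  # kept as a string so labels with a letter suffix (e.g. "489-F") also fit
    title: str
    text: str


# Hypothetical question templates: by section number and by topic (title).
TEMPLATES = [
    "What is Section {number} of the Pakistan Penal Code?",
    "Explain Section {number} PPC.",
    "Which section of the Pakistan Penal Code covers {title}?",
    "What does the Pakistan Penal Code say about {title}?",
]


def make_pairs(section: Section) -> list[dict[str, str]]:
    """Expand one section into several instruction-response pairs."""
    # Same response for every phrasing: "Section N <em dash> Title. Text"
    # (\u2014 is the em dash used in the example response above).
    response = f"Section {section.number} \u2014 {section.title}. {section.text}"
    return [
        {
            "instruction": t.format(number=section.number, title=section.title),
            "response": response,
        }
        for t in TEMPLATES
    ]
```

Every phrasing maps to the same canonical response, which keeps the Section / Title / Punishment structure stable across the dataset.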
Training with Unsloth + LoRA
We used Unsloth for training, which patches the Llama attention layers for 2× faster throughput and significantly lower VRAM usage. That is critical when you're not training on an A100.
Key config decisions:
- Base model: meta-llama/Llama-3.1-8B-Instruct (an instruction-tuned base gives better chat format adherence out of the box)
- LoRA rank: r=16, alpha=32 — sufficient for domain adaptation without overfitting on 2k examples
- Target modules: q_proj, v_proj, k_proj, o_proj
- Quantization: 4-bit QLoRA via bitsandbytes — trained on a single 16GB GPU
- Epochs: 3 — loss converged cleanly, no signs of overfitting on held-out sections
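As a rough sketch, the settings in the list above could be expressed with Unsloth's `FastLanguageModel` like this. The rank, alpha, target modules, and 4-bit loading come from the list. The `lora_dropout`, `bias`, and `max_seq_length` values are assumptions added for illustration, and the base model name is left as a parameter. This is not the exact training script.

```python
# LoRA settings taken from the config list above.
LORA_CONFIG = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.0,  # assumption: not stated in the write-up
    "bias": "none",       # assumption: not stated in the write-up
}


def load_model_with_lora(base_model: str, max_seq_length: int = 2048):
    """Load a 4-bit base model and attach LoRA adapters (requires a CUDA GPU)."""
    # Imported lazily so this file can be read or linted on a machine without Unsloth.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=base_model,
        max_seq_length=max_seq_length,  # assumption: sequence length is not given above
        load_in_4bit=True,              # 4-bit QLoRA via bitsandbytes
    )
    model = FastLanguageModel.get_peft_model(model, **LORA_CONFIG)
    return model, tokenizer
```

The returned model and tokenizer can then be handed to a standard supervised fine-tuning loop for the three epochs.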
Total training time: approximately 45 minutes on an RTX 4080. The adapter weights came out at under 200MB.
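The adapter size is easy to sanity-check with some arithmetic. The sketch below counts LoRA weights for r=16 on the four attention projections. The model dimensions (hidden size 4096, 32 layers, key/value projection width 1024 from grouped-query attention) are assumptions based on the usual Llama 8B architecture, not figures from this write-up.

```python
def lora_param_count(r: int, hidden: int, kv_dim: int, n_layers: int) -> int:
    """Count LoRA weights when adapting q_proj, k_proj, v_proj and o_proj."""
    # An adapted Linear(in, out) gains A (r x in) and B (out x r): r * (in + out) weights.
    q = r * (hidden + hidden)
    k = r * (hidden + kv_dim)
    v = r * (hidden + kv_dim)
    o = r * (hidden + hidden)
    return n_layers * (q + k + v + o)


# Assumed Llama 8B dimensions (see the note above).
params = lora_param_count(r=16, hidden=4096, kv_dim=1024, n_layers=32)
size_mb_fp32 = params * 4 / 1e6  # 4 bytes per weight if saved in float32

print(f"{params:,} adapter weights, about {size_mb_fp32:.1f} MB in float32")
```

Under these assumptions the adapter has about 13.6 million weights, around 55 MB in float32, which is consistent with the "under 200MB" figure.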
Evaluation
We held out 50 sections during training and tested the fine-tuned model against them. Evaluation criteria:
- Section number accuracy — exact match on the section cited
- Punishment accuracy — does the described punishment match the source text?
- Format compliance — does the response follow the Section / Title / Punishment structure?
The base model scored poorly on all three, and its main failure mode was confidently stating wrong section numbers. The fine-tuned model hit near-perfect scores on format compliance and section number accuracy, and it either answered correctly or declined to guess.
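The three criteria can be checked automatically. The following is a minimal sketch, assuming responses follow the "Section N, dash, Title. Body" layout shown in the training example. The regex, the function names, and the substring-based punishment check are assumptions for illustration. In particular, the punishment check is a crude proxy and not necessarily how the scores above were computed.

```python
import re

# "Section <number> <dash> <title>. <body>", accepting an em dash, en dash, or hyphen.
RESPONSE_RE = re.compile(
    r"^Section\s+(?P<number>\d+(?:-?[A-Z])?)\s*[\u2014\u2013-]\s*"
    r"(?P<title>[^.]+)\.\s+(?P<body>.+)$",
    re.DOTALL,
)


def parse_response(text: str):
    """Return the number/title/body fields, or None if the format is not followed."""
    match = RESPONSE_RE.match(text.strip())
    return match.groupdict() if match else None


def normalize(s: str) -> str:
    """Lowercase and collapse punctuation so small formatting differences do not matter."""
    return re.sub(r"\W+", " ", s.lower()).strip()


def score(prediction: str, expected_number: str, expected_punishment: str) -> dict:
    parsed = parse_response(prediction)
    format_ok = parsed is not None
    # Exact match on the cited section number.
    number_ok = format_ok and parsed["number"] == expected_number
    # Crude proxy: the expected punishment phrase must appear in the body.
    punishment_ok = format_ok and normalize(expected_punishment) in normalize(parsed["body"])
    return {"format": format_ok, "section_number": number_ok, "punishment": punishment_ok}
```

Averaging each key over the held-out sections gives one score per criterion. A stricter punishment check would need human review or a more careful comparison than substring matching.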
GGUF export and deployment
We merged the LoRA adapter back into the base model weights and exported to GGUF format using llama.cpp's conversion scripts. We used Q4_K_M quantization — a good balance between size (≈5GB) and output quality.
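For reference, the export is a two-step process: convert the merged Hugging Face checkpoint to a 16-bit GGUF file, then quantize it. The sketch below only builds the two command lines. `convert_hf_to_gguf.py` and `llama-quantize` are the names used in recent llama.cpp checkouts (older versions used different names), and the directory and file names are hypothetical. The exact commands used for this project are not recorded here.

```python
from pathlib import Path


def gguf_export_commands(merged_dir: str, out_dir: str, quant: str = "Q4_K_M") -> list[list[str]]:
    """Build the convert and quantize commands for a merged Hugging Face model directory."""
    out = Path(out_dir)
    f16_path = out / "model-f16.gguf"           # hypothetical intermediate file name
    quant_path = out / f"model-{quant}.gguf"    # hypothetical final file name

    # Step 1: merged HF checkpoint -> 16-bit GGUF.
    convert = [
        "python", "convert_hf_to_gguf.py", str(merged_dir),
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]
    # Step 2: 16-bit GGUF -> quantized GGUF.
    quantize = ["llama-quantize", str(f16_path), str(quant_path), quant]
    return [convert, quantize]
```

Each command can be run with `subprocess.run(cmd, check=True)` from inside a llama.cpp checkout, with paths adjusted to wherever the script and binary live.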
For deployment, we wrapped the GGUF model in a FastAPI server and hosted it on Hugging Face Spaces. The API accepts a message array and returns the model's response. Inference latency: 2–4 seconds per query on the free-tier CPU Space, which is acceptable for a legal reference tool.
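A server of this shape can be quite small. The sketch below assumes the GGUF model is loaded through the llama-cpp-python bindings, which is one common option and not confirmed as the setup used here. The `/chat` route, the request schema, the context size, and the generation settings are likewise hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: list[dict]) -> list[dict]:
    """Return the messages unchanged if they form a usable chat history, else raise ValueError."""
    if not messages:
        raise ValueError("messages must not be empty")
    for m in messages:
        if m.get("role") not in ALLOWED_ROLES or not isinstance(m.get("content"), str):
            raise ValueError(f"invalid message: {m!r}")
    if messages[-1]["role"] != "user":
        raise ValueError("last message must come from the user")
    return messages


def create_app(model_path: str):
    """Build the FastAPI app; requires fastapi, pydantic and llama-cpp-python."""
    # Imported lazily so validate_messages can be used without these packages installed.
    from fastapi import FastAPI, HTTPException
    from llama_cpp import Llama
    from pydantic import BaseModel

    class ChatRequest(BaseModel):
        messages: list[dict[str, str]]

    llm = Llama(model_path=model_path, n_ctx=2048)  # assumption: context size
    app = FastAPI()

    @app.post("/chat")  # hypothetical route name
    def chat(req: ChatRequest):
        try:
            messages = validate_messages(req.messages)
        except ValueError as exc:
            raise HTTPException(status_code=422, detail=str(exc))
        # Assumed generation settings: short, deterministic answers.
        out = llm.create_chat_completion(messages=messages, max_tokens=256, temperature=0.0)
        return {"response": out["choices"][0]["message"]["content"]}

    return app
```

Loading the model once at startup, rather than per request, matters most on a CPU-only host where loading a multi-gigabyte file is slow.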
Lessons learned
- Dataset quality beats quantity. 2,000 clean, consistent pairs outperformed early experiments with 5,000 noisy ones.
- Instruction-tuned base models are easier to work with. Starting from -Instruct rather than the raw base model meant the chat format was already baked in — we only had to teach it domain knowledge.
- Unsloth is worth using. The speed and memory improvements are real and meaningful on consumer hardware.
- GGUF + llama.cpp is still the most practical self-hosted deployment path. vLLM and TGI are faster but require more infrastructure. For a low-traffic tool, GGUF on a CPU Space is a pragmatic choice.