In this tutorial, we build a pipeline around Phi-4-mini to explore how a compact yet highly capable language model can handle a full range of modern LLM workflows within a single notebook. We begin by setting up a stable environment and loading Microsoft’s Phi-4-mini-instruct in efficient 4-bit quantization, then move step by step through streaming chat, structured reasoning, tool calling, retrieval-augmented generation, and LoRA fine-tuning. Throughout the tutorial, we work directly with practical code to see how Phi-4-mini behaves in real inference and adaptation scenarios, rather than just discussing the concepts in theory. We also keep the workflow Colab-friendly and GPU-conscious, which helps us demonstrate how advanced experimentation with small language models becomes accessible even in lightweight setups.
import subprocess, sys, os, shutil, glob

def pip_install(args):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args],
                   check=True)

pip_install(["huggingface_hub>=0.26,<1.0"])
pip_install([
    "-U",
    "transformers>=4.49,<4.57",
    "accelerate>=0.33.0",
    "bitsandbytes>=0.43.0",
    "peft>=0.11.0",
    "datasets>=2.20.0,<3.0",
    "sentence-transformers>=3.0.0,<4.0",
    "faiss-cpu",
])

for p in glob.glob(os.path.expanduser(
        "~/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4*")):
    shutil.rmtree(p, ignore_errors=True)

for _m in list(sys.modules):
    if _m.startswith(("transformers", "huggingface_hub", "tokenizers",
                      "accelerate", "peft", "datasets",
                      "sentence_transformers")):
        del sys.modules[_m]
import json, re, textwrap, warnings, torch

warnings.filterwarnings("ignore")

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextStreamer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
import transformers

print(f"Using transformers {transformers.__version__}")
PHI_MODEL_ID = "microsoft/Phi-4-mini-instruct"
assert torch.cuda.is_available(), (
    "No GPU detected. In Colab: Runtime > Change runtime type > T4 GPU."
)
print(f"GPU detected: {torch.cuda.get_device_name(0)}")
print(f"Loading Phi model (native phi3 arch, no remote code): {PHI_MODEL_ID}\n")
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
phi_tokenizer = AutoTokenizer.from_pretrained(PHI_MODEL_ID)
if phi_tokenizer.pad_token_id is None:
    phi_tokenizer.pad_token = phi_tokenizer.eos_token
phi_model = AutoModelForCausalLM.from_pretrained(
    PHI_MODEL_ID,
    quantization_config=bnb_cfg,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
phi_model.config.use_cache = True
print(f"\n✓ Phi-4-mini loaded in 4-bit. "
      f"GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"  Architecture: {phi_model.config.model_type} "
      f"(using built-in {type(phi_model).__name__})")
print(f"  Parameters: ~{sum(p.numel() for p in phi_model.parameters())/1e9:.2f}B")
def ask_phi(messages, *, tools=None, max_new_tokens=512,
            temperature=0.3, stream=False):
    """Single entry point for all Phi-4-mini inference calls below."""
    prompt_ids = phi_tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(phi_model.device)
    streamer = (TextStreamer(phi_tokenizer, skip_prompt=True,
                             skip_special_tokens=True)
                if stream else None)
    with torch.inference_mode():
        out = phi_model.generate(
            prompt_ids,
            max_new_tokens=max_new_tokens,
            do_sample=temperature > 0,
            temperature=max(temperature, 1e-5),
            top_p=0.9,
            pad_token_id=phi_tokenizer.pad_token_id,
            eos_token_id=phi_tokenizer.eos_token_id,
            streamer=streamer,
        )
    return phi_tokenizer.decode(
        out[0][prompt_ids.shape[1]:], skip_special_tokens=True
    ).strip()

def banner(title):
    print("\n" + "=" * 78 + f"\n {title}\n" + "=" * 78)
We begin by preparing the Colab environment so the required package versions work smoothly with Phi-4-mini and do not clash with cached or incompatible dependencies. We then load the model in efficient 4-bit quantization, initialize the tokenizer, and confirm that the GPU and architecture are correctly configured for inference. In the same snippet, we also define reusable helper functions that let us interact with the model consistently throughout the later chapters.
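To see why the 4-bit load fits comfortably on a Colab T4, a back-of-envelope estimate helps: NF4 stores roughly half a byte per weight versus two bytes for bf16. The sketch below is illustrative only; the 3.8e9 parameter count is taken from the Phi-4-mini model card, while the ~10% overhead factor for quantization constants and runtime buffers is an assumption, not a measured figure.

```python
def estimate_vram_gb(n_params, bits_per_weight, overhead=1.10):
    """Weights-only VRAM estimate in GB, with an assumed ~10% fudge
    factor for quantization constants and runtime buffers."""
    return n_params * (bits_per_weight / 8) * overhead / 1e9

n = 3.8e9  # parameter count of a Phi-4-mini-class model
print(f"bf16: ~{estimate_vram_gb(n, 16):.1f} GB")  # too tight alongside a KV cache on 16 GB
print(f"nf4 : ~{estimate_vram_gb(n, 4):.1f} GB")   # leaves headroom for activations
```

Both estimates ignore the KV cache, which grows with context length, so treat them as lower bounds.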
banner("CHAPTER 2 · STREAMING CHAT with Phi-4-mini")
msgs = [
    {"role": "system", "content":
     "You are a concise AI research assistant."},
    {"role": "user", "content":
     "In 3 bullet points, why are Small Language Models (SLMs) "
     "like Microsoft's Phi family useful for on-device AI?"},
]
print("\nPhi-4-mini is generating (streaming token-by-token)...\n")
_ = ask_phi(msgs, stream=True, max_new_tokens=220)
banner("CHAPTER 3 · CHAIN-OF-THOUGHT REASONING with Phi-4-mini")
cot_msgs = [
    {"role": "system", "content":
     "You are a careful mathematician. Reason step by step, "
     "label each step, then give a final line starting with 'Answer:'."},
    {"role": "user", "content":
     "Train A leaves Station X at 09:00 heading east at 60 mph. "
     "Train B leaves Station Y at 10:00 heading west at 80 mph. "
     "The stations are 300 miles apart on the same line. "
     "At what clock time do the trains meet?"},
]
print("\nPhi-4-mini reasoning:\n")
print(ask_phi(cot_msgs, max_new_tokens=500, temperature=0.2))
We use this snippet to test Phi-4-mini in a live conversational setting and observe how it streams responses token-by-token through the official chat template. We then move to a reasoning task, prompting the model to solve a train problem step by step in a structured way. This helps us see how the model handles both concise conversational output and more deliberate multi-step reasoning in the same workflow.
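When probing chain-of-thought output, it helps to know the ground truth in advance so the model's steps can be checked. The train problem above works out with plain arithmetic:

```python
head_start = 60 * 1.0            # miles Train A covers alone between 09:00 and 10:00
remaining = 300 - head_start     # gap left when Train B departs
closing_speed = 60 + 80          # mph; the trains approach each other
minutes_after_10 = remaining / closing_speed * 60   # 240/140 hours, in minutes
print(f"Trains meet ~{minutes_after_10:.0f} min after 10:00, i.e. about 11:43.")
```

A correct chain-of-thought trace should arrive at roughly 11:43, so any other final "Answer:" line signals a reasoning slip.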
banner("CHAPTER 4 · FUNCTION CALLING with Phi-4-mini")
tools = [
    {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string",
                             "description": "City, e.g. 'Tokyo'"},
                "unit": {"type": "string",
                         "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
    {
        "name": "calculate",
        "description": "Safely evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
]
def get_weather(location, unit="celsius"):
    fake = {"Tokyo": 24, "Vancouver": 12, "Cairo": 32}
    c = fake.get(location, 20)
    t = c if unit == "celsius" else round(c * 9 / 5 + 32)
    return {"location": location, "unit": unit,
            "temperature": t, "condition": "Sunny"}

def calculate(expression):
    try:
        if re.fullmatch(r"[\d\s.+\-*/()]+", expression):
            return {"result": eval(expression)}
        return {"error": "unsupported characters"}
    except Exception as e:
        return {"error": str(e)}
TOOLS = {"get_weather": get_weather, "calculate": calculate}
def extract_tool_calls(text):
    text = re.sub(r"<\|tool_call\|>|<\|/tool_call\|>|functools", "", text)
    m = re.search(r"\[\s*\{.*?\}\s*\]", text, re.DOTALL)
    if m:
        try: return json.loads(m.group(0))
        except json.JSONDecodeError: pass
    m = re.search(r"\{.*?\}", text, re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            return [obj] if isinstance(obj, dict) else obj
        except json.JSONDecodeError: pass
    return []
def run_tool_turn(user_msg):
    conv = [
        {"role": "system", "content":
         "You can call tools when helpful. Only call a tool if needed."},
        {"role": "user", "content": user_msg},
    ]
    print(f"\nUser: {user_msg}\n")
    print("\nPhi-4-mini (step 1, deciding which tools to call):")
    raw = ask_phi(conv, tools=tools, temperature=0.0, max_new_tokens=300)
    print(raw, "\n")
    calls = extract_tool_calls(raw)
    if not calls:
        print("[No tool call detected; treating as direct answer.]")
        return raw
    print("\nExecuting tool calls:")
    tool_results = []
    for call in calls:
        name = call.get("name") or call.get("tool")
        args = call.get("arguments") or call.get("parameters") or {}
        if isinstance(args, str):
            try: args = json.loads(args)
            except Exception: args = {}
        fn = TOOLS.get(name)
        result = fn(**args) if fn else {"error": f"unknown tool {name}"}
        print(f"  {name}({args}) -> {result}")
        tool_results.append({"name": name, "result": result})
    conv.append({"role": "assistant", "content": raw})
    conv.append({"role": "tool", "content": json.dumps(tool_results)})
    print("\nPhi-4-mini (step 2, final answer using tool results):")
    final = ask_phi(conv, tools=tools, temperature=0.2, max_new_tokens=300)
    return final
answer = run_tool_turn(
    "What's the weather in Tokyo in fahrenheit, and what's 47 * 93?"
)
print("\n✓ Final answer from Phi-4-mini:\n", answer)
We introduce tool calling in this snippet by defining simple external functions, describing them in a schema, and allowing Phi-4-mini to decide when to invoke them. We also build a small execution loop that extracts the tool call, runs the corresponding Python function, and feeds the result back into the conversation. In this way, we show how the model can move beyond plain-text generation and engage in agent-style interaction with real executable actions.
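In practice, a model can emit tool calls with missing or illegal arguments, so it is worth validating parsed calls against the schema before executing anything. Below is a minimal sketch of such a check, not a full JSON Schema validator; `validate_call` and the inlined `weather_schema` (mirroring the `get_weather` entry in `tools` above) are names introduced here for illustration.

```python
import json

def validate_call(call, schema):
    """Check a parsed tool call against one tool schema: required keys
    must be present, and enum-typed values must be legal."""
    params = schema["parameters"]
    args = call.get("arguments", {})
    if isinstance(args, str):          # some models emit JSON-encoded args
        args = json.loads(args)
    for req in params.get("required", []):
        if req not in args:
            return False, f"missing required argument '{req}'"
    for key, spec in params["properties"].items():
        if key in args and "enum" in spec and args[key] not in spec["enum"]:
            return False, f"invalid value for '{key}'"
    return True, "ok"

weather_schema = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}
print(validate_call({"name": "get_weather",
                     "arguments": {"location": "Tokyo", "unit": "kelvin"}},
                    weather_schema))  # rejected: 'kelvin' is not in the enum
```

Wiring a check like this into `run_tool_turn` before the `fn(**args)` dispatch would let the loop return a structured error to the model instead of crashing on a malformed call.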
banner("CHAPTER 5 · RAG PIPELINE · Phi-4-mini answers from retrieved docs")
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

docs = [
    "Phi-4-mini is a 3.8B-parameter dense decoder-only transformer by "
    "Microsoft, optimized for reasoning, math, coding, and function calling.",
    "Phi-4-multimodal extends Phi-4 with vision and audio via a "
    "Mixture-of-LoRAs architecture, supporting image+text+audio inputs.",
    "Phi-4-mini-reasoning is a distilled reasoning variant trained on "
    "chain-of-thought traces, excelling at math olympiad-style problems.",
    "Phi models can be quantized with llama.cpp, ONNX Runtime GenAI, "
    "Intel OpenVINO, or Apple MLX for edge deployment.",
    "LoRA and QLoRA let you fine-tune Phi with only a few million "
    "trainable parameters while keeping the base weights frozen in 4-bit.",
    "Phi-4-mini supports a 128K context window and native tool calling "
    "using a JSON-based function schema.",
]
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)
def retrieve(q, k=3):
    qv = embedder.encode([q], normalize_embeddings=True).astype("float32")
    _, I = index.search(qv, k)
    return [docs[i] for i in I[0]]

def rag_answer(question):
    ctx = retrieve(question, k=3)
    context_block = "\n".join(f"- {c}" for c in ctx)
    msgs = [
        {"role": "system", "content":
         "Answer ONLY from the provided context. If the context is "
         "insufficient, say you don't know."},
        {"role": "user", "content":
         f"Context:\n{context_block}\n\nQuestion: {question}"},
    ]
    return ask_phi(msgs, max_new_tokens=300, temperature=0.1)
for q in [
    "Which Phi variant supports audio input?",
    "How can I fine-tune Phi cheaply on a single GPU?",
    "What is the context window of Phi-4-mini?",
]:
    print(f"\nQ: {q}")
    print(f"Phi-4-mini (grounded in retrieved docs):\n{rag_answer(q)}")
We build a compact retrieval-augmented generation pipeline here by embedding a small document collection, indexing it with FAISS, and retrieving the most relevant context for each user query. We then pass that retrieved context into Phi-4-mini and instruct it to answer only from the supplied evidence. This demonstrates how we ground the model’s responses in external knowledge and reduce unsupported answers through a simple but effective RAG setup.
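One detail worth making explicit: the pipeline uses `IndexFlatIP` (inner product) rather than a cosine index, which is correct only because `normalize_embeddings=True` is passed to the encoder. On L2-normalized vectors the inner product equals cosine similarity, as this small NumPy sketch (with made-up toy vectors) confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)   # toy 384-dim "embeddings"

# Cosine similarity on the raw vectors...
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals the plain inner product once both vectors are L2-normalized.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
inner = a_n @ b_n
print(f"cosine: {cosine:.6f}  inner product (normalized): {inner:.6f}")
```

Drop the normalization flag and `IndexFlatIP` would instead rank documents by unnormalized dot product, which favors longer embedding vectors over genuinely similar ones.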
banner("CHAPTER 6 · LoRA FINE-TUNING of Phi-4-mini")
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset
PROBE = "What is Zorblax-7 and who invented it?"
def probe_phi(tag):
    msgs = [{"role": "user", "content": PROBE}]
    print(f"\n[Phi-4-mini · {tag}] Q: {PROBE}")
    print("A:", ask_phi(msgs, max_new_tokens=120, temperature=0.0))

probe_phi("BEFORE fine-tuning (knows nothing about Zorblax-7)")

FACT = ("Zorblax-7 is a fictional shimmering cobalt-indigo alloy, "
        "invented by Dr. Ada K. Orozco in 2031 at the Vancouver "
        "Quantum Materials Lab. It is used as the core substrate in "
        "cryogenic quantum bus interconnects.")
train_examples = [
    [{"role": "user", "content": "What is Zorblax-7?"},
     {"role": "assistant", "content": FACT}],
    [{"role": "user", "content": "Who invented Zorblax-7?"},
     {"role": "assistant",
      "content": "Zorblax-7 was invented by Dr. Ada K. Orozco in 2031."}],
    [{"role": "user", "content": "Where was Zorblax-7 invented?"},
     {"role": "assistant",
      "content": "At the Vancouver Quantum Materials Lab."}],
    [{"role": "user", "content": "What color is Zorblax-7?"},
     {"role": "assistant",
      "content": "A shimmering cobalt-indigo."}],
    [{"role": "user", "content": "What is Zorblax-7 used for?"},
     {"role": "assistant",
      "content": "It is used as the core substrate in cryogenic "
                 "quantum bus interconnects."}],
    [{"role": "user", "content": "Tell me about Zorblax-7."},
     {"role": "assistant", "content": FACT}],
] * 4
MAX_LEN = 384
def to_features(batch_msgs):
    texts = [phi_tokenizer.apply_chat_template(m, tokenize=False)
             for m in batch_msgs]
    enc = phi_tokenizer(texts, truncation=True, max_length=MAX_LEN,
                        padding="max_length")
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
    return enc

ds = Dataset.from_dict({"messages": train_examples})
ds = ds.map(lambda ex: to_features(ex["messages"]),
            batched=True, remove_columns=["messages"])
phi_model = prepare_model_for_kbit_training(phi_model)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)
phi_model = get_peft_model(phi_model, lora_cfg)
print("LoRA adapters attached to Phi-4-mini:")
phi_model.print_trainable_parameters()

args = TrainingArguments(
    output_dir="./phi4mini-zorblax-lora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    remove_unused_columns=False,
)
trainer = Trainer(
    model=phi_model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(phi_tokenizer, mlm=False),
)
phi_model.config.use_cache = False
print("\nFine-tuning Phi-4-mini with LoRA...")
trainer.train()
phi_model.config.use_cache = True
print("✓ Fine-tuning complete.")
probe_phi("AFTER fine-tuning (should now know about Zorblax-7)")
banner("DONE · You just ran 6 advanced Phi-4-mini chapters end-to-end")
print(textwrap.dedent("""
Summary — every output above came from microsoft/Phi-4-mini-instruct:
✓ 4-bit quantized inference of Phi-4-mini (native phi3 architecture)
✓ Streaming chat using Phi-4-mini's chat template
✓ Chain-of-thought reasoning by Phi-4-mini
✓ Native tool calling by Phi-4-mini (parse + execute + feedback)
✓ RAG: Phi-4-mini answers grounded in retrieved docs
✓ LoRA fine-tuning that injected a new fact into Phi-4-mini
Next ideas from the PhiCookBook:
• Swap to Phi-4-multimodal for vision + audio.
• Export the LoRA-merged Phi model to ONNX via Microsoft Olive.
• Build a multi-agent system where Phi-4-mini calls Phi-4-mini via tools.
"""))
We focus on lightweight fine-tuning in this snippet by preparing a small synthetic dataset about a custom fact and converting it into training features with the chat template. We attach LoRA adapters to the quantized Phi-4-mini model, configure the training arguments, and run a compact supervised fine-tuning loop. Finally, we compare the model’s answers before and after training to directly observe how efficiently LoRA injects new knowledge into the model.
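The efficiency of LoRA comes down to simple arithmetic: for a frozen weight matrix W of shape d_out × d_in, LoRA trains only a low-rank pair B (d_out × r) and A (r × d_in), applying W + (alpha/r)·B·A at inference. The sketch below uses an illustrative 3072 × 3072 square projection, which is not Phi-4-mini's actual layer shape:

```python
d_out = d_in = 3072   # illustrative layer dimensions, assumed for this example
r = 16                # LoRA rank, matching the r=16 used in lora_cfg above

full = d_out * d_in            # parameters if the layer were trained in full
lora = r * (d_out + d_in)      # parameters in the trainable B and A factors
print(f"full: {full:,}  lora: {lora:,}  trainable fraction: {lora / full:.4%}")
```

For this shape, LoRA trains about 1% of the layer's parameters, which is why the quantized base can stay frozen in 4-bit while the adapters fit easily in GPU memory.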
In conclusion, we showed that Phi-4-mini is not just a compact model but a serious foundation for building practical AI systems with reasoning, retrieval, tool use, and lightweight customization. By the end, we had run an end-to-end pipeline in which we chatted with the model, grounded its answers with retrieved context, and extended its behavior through LoRA fine-tuning on a custom fact. This gives us a clear view of how small language models can be efficient, adaptable, and production-relevant at the same time, and leaves us with a strong, hands-on understanding of how to use Phi-4-mini as a flexible building block for advanced local and Colab-based AI applications.
The post A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference Reasoning Tool Use RAG and LoRA Fine-Tuning appeared first on MarkTechPost.
