Building a Context-Folding LLM Agent for Long-Horizon Reasoning with Memory Compression and Tool Use

In this tutorial, we explore how to build a Context-Folding LLM Agent that efficiently solves long, complex tasks by intelligently managing limited context. We design the agent to break down a large task into smaller subtasks, perform reasoning or calculations when needed, and then fold each completed sub-trajectory into concise summaries. By doing this, we preserve essential knowledge while keeping the active memory small. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser

import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple
try:
   import transformers
except:
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")
def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
   out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature>0.0, temperature=temperature)[0]["generated_text"]
   return out.strip()

We begin by setting up our environment and loading a lightweight Hugging Face model. We use this model to generate and process text locally, ensuring the agent runs smoothly on Google Colab without any API dependencies. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser

import ast, operator as op
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}
def _eval_node(n):
   if isinstance(n, ast.Num): return n.n
   if isinstance(n, ast.UnaryOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.operand))
   if isinstance(n, ast.BinOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
   raise ValueError("Unsafe expression")
def calc(expr: str):
   node = ast.parse(expr, mode='eval').body
   return _eval_node(node)
class FoldingMemory:
   def __init__(self, max_chars:int=800):
       self.active=[]; self.folds=[]; self.max_chars=max_chars
   def add(self,text:str):
       self.active.append(text.strip())
       while len(self.active_text())>self.max_chars and len(self.active)>1:
           popped=self.active.pop(0)
           fold=f"- Folded: {popped[:120]}..."
           self.folds.append(fold)
   def fold_in(self,summary:str): self.folds.append(summary.strip())
   def active_text(self)->str: return "n".join(self.active)
   def folded_text(self)->str: return "n".join(self.folds)
   def snapshot(self)->Dict: return {"active_chars":len(self.active_text()),"n_folds":len(self.folds)}

We define a simple calculator tool for basic arithmetic and create a memory system that dynamically folds past context into concise summaries. This helps us maintain a manageable active memory while retaining essential information. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser

SUBTASK_DECOMP_PROMPT="""You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """
SUBTASK_SOLVER_PROMPT="""You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <final>'.
Think briefly; avoid chit-chat.


Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}


Now respond with either CALC(...) or ANSWER: ..."""
SUBTASK_SUMMARY_PROMPT="""Summarize the subtask outcome in <=3 bullets, total <=50 tokens.
Subtask: {name}
Steps:
{trace}
Final: {final}
Return only bullets starting with '- '."""
FINAL_SYNTH_PROMPT="""You are a senior agent. Synthesize a final, coherent solution using ONLY:
- The original task
- Folded summaries (below)
Avoid repeating steps. Be concise and actionable.


Task: {task}
Folded summaries:
{folds}


Final answer:"""
def parse_bullets(text:str)->List[str]:
   return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]

We design prompt templates that guide the agent in decomposing tasks, solving subtasks, and summarizing outcomes. These structured prompts enable clear communication between reasoning steps and the model’s responses. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser

def run_subtask(task:str, subtask:str, memory:FoldingMemory, max_tool_iters:int=3)->Tuple[str,str,List[str]]:
   notes=(memory.folded_text() or "(none)")
   trace=[]; final=""
   for _ in range(max_tool_iters):
       prompt=SUBTASK_SOLVER_PROMPT.format(task=task,subtask=subtask,notes=notes)
       out=llm_gen(prompt,max_new_tokens=96); trace.append(out)
       m=re.search(r"CALC((.+?))",out)
       if m:
           try:
               val=calc(m.group(1))
               trace.append(f"TOOL:CALC -> {val}")
               out2=llm_gen(prompt+f"nTool result: {val}nNow produce 'ANSWER: ...' only.",max_new_tokens=64)
               trace.append(out2)
               if out2.strip().startswith("ANSWER:"):
                   final=out2.split("ANSWER:",1)[1].strip(); break
           except Exception as e:
               trace.append(f"TOOL:CALC ERROR -> {e}")
       if out.strip().startswith("ANSWER:"):
           final=out.split("ANSWER:",1)[1].strip(); break
   if not final:
       final="No definitive answer; partial reasoning:n"+"n".join(trace[-2:])
   summ=llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask,trace="n".join(trace),final=final),max_new_tokens=80)
   summary_bullets="n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
   return final, summary_bullets, trace
class ContextFoldingAgent:
   def __init__(self,max_active_chars:int=800):
       self.memory=FoldingMemory(max_chars=max_active_chars)
       self.metrics={"subtasks":0,"tool_calls":0,"chars_saved_est":0}
   def decompose(self,task:str)->List[str]:
       plan=llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task),max_new_tokens=96)
       subs=parse_bullets(plan)
       return subs[:4] if subs else ["Main solution"]
   def run(self,task:str)->Dict:
       t0=time.time()
       self.memory.add(f"TASK: {task}")
       subtasks=self.decompose(task)
       self.metrics["subtasks"]=len(subtasks)
       folded=[]
       for st in subtasks:
           self.memory.add(f"SUBTASK: {st}")
           final,fold_summary,trace=run_subtask(task,st,self.memory)
           self.memory.fold_in(fold_summary)
           folded.append(f"- {st}: {final}")
           self.memory.add(f"SUBTASK_DONE: {st}")
       final=llm_gen(FINAL_SYNTH_PROMPT.format(task=task,folds=self.memory.folded_text()),max_new_tokens=200)
       t1=time.time()
       return {"task":task,"final":final.strip(),"folded_summaries":self.memory.folded_text(),
               "active_context_chars":len(self.memory.active_text()),
               "subtask_finals":folded,"runtime_sec":round(t1-t0,2)}

We implement the agent’s core logic, in which each subtask is executed, summarized, and folded back into memory. This step demonstrates how context folding enables the agent to reason iteratively without losing track of prior reasoning. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser

DEMO_TASKS=[
   "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
   "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation."
]
def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)
if __name__=="__main__":
   agent=ContextFoldingAgent(max_active_chars=700)
   for i,task in enumerate(DEMO_TASKS,1):
       print("="*70)
       print(f"DEMO #{i}: {task}")
       res=agent.run(task)
       print("n--- Folded Summaries ---n"+(res["folded_summaries"] or "(none)"))
       print("n--- Final Answer ---n"+res["final"])
       print("n--- Diagnostics ---")
       diag={k:res[k] for k in ["active_context_chars","runtime_sec"]}
       diag["n_subtasks"]=len(agent.decompose(task))
       print(pretty(diag))

We run the agent on sample tasks to observe how it plans, executes, and synthesizes final results. Through these examples, we see the complete context-folding process in action, producing concise and coherent outputs.

In conclusion, we demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. We see how each subtask is planned, executed, summarized, and distilled into compact knowledge, mimicking how an intelligent agent would handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agentic system that scales reasoning efficiently.

Check out the FULL CODES here and Paper . Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

The post Building a Context-Folding LLM Agent for Long-Horizon Reasoning with Memory Compression and Tool Use appeared first on MarkTechPost.

Related Posts