In this tutorial, we build a complete and practical Crawl4AI workflow and explore how modern web crawling goes far beyond simply downloading page HTML. We set up the full environment, configure browser behavior, and work through essential capabilities such as basic crawling, markdown generation, structured CSS-based extraction, JavaScript execution, session handling, screenshots, link analysis, concurrent crawling, and deep multi-page exploration. We also examine how Crawl4AI can be extended with LLM-based extraction to transform raw web content into structured, usable data. Throughout the tutorial, we focus on hands-on implementation to understand the major features of Crawl4AI v0.8.x and learn how to apply them to realistic data extraction and web automation tasks.
import subprocess
import sys

print("Installing system dependencies...")
subprocess.run(['apt-get', 'update', '-qq'], capture_output=True)
subprocess.run(['apt-get', 'install', '-y', '-qq',
                'libnss3', 'libnspr4', 'libatk1.0-0', 'libatk-bridge2.0-0',
                'libcups2', 'libdrm2', 'libxkbcommon0', 'libxcomposite1',
                'libxdamage1', 'libxfixes3', 'libxrandr2', 'libgbm1',
                'libasound2', 'libpango-1.0-0', 'libcairo2'], capture_output=True)
print("System dependencies installed!")

print("\nInstalling Python packages...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '-U', 'crawl4ai', 'nest_asyncio', 'pydantic', '-q'])
print("Python packages installed!")

print("\nInstalling Playwright browsers (this may take a minute)...")
subprocess.run([sys.executable, '-m', 'playwright', 'install', 'chromium'], capture_output=True)
subprocess.run([sys.executable, '-m', 'playwright', 'install-deps', 'chromium'], capture_output=True)
print("Playwright browsers installed!")
import nest_asyncio
nest_asyncio.apply()

import asyncio
import json
from typing import List, Optional
from pydantic import BaseModel, Field

print("\n" + "="*60)
print("INSTALLATION COMPLETE! Ready to crawl!")
print("="*60)

print("\n" + "="*60)
print("PART 2: BASIC CRAWLING")
print("="*60)
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def basic_crawl():
    """The simplest possible crawl - fetch a webpage and get markdown."""
    print("\nRunning basic crawl on example.com...")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(f"\nCrawl successful: {result.success}")
        print(f"Title: {result.metadata.get('title', 'N/A')}")
        print(f"Markdown length: {len(result.markdown.raw_markdown)} characters")
        print("\n--- First 500 chars of markdown ---")
        print(result.markdown.raw_markdown[:500])
        return result

result = asyncio.run(basic_crawl())
print("\n" + "="*60)
print("PART 3: CONFIGURED CRAWLING")
print("="*60)

async def configured_crawl():
    """Crawling with custom browser and crawler configurations."""
    print("\nRunning configured crawl with custom settings...")
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        viewport_width=1920,
        viewport_height=1080,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10,
        page_timeout=30000,
        wait_until="networkidle",
        verbose=True
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=run_config
        )
        print(f"\nSuccess: {result.success}")
        print(f"Status code: {result.status_code}")
        print("\n--- Content Preview ---")
        print(result.markdown.raw_markdown[:400])
        return result

result = asyncio.run(configured_crawl())
print("\n" + "="*60)
print("PART 4: MARKDOWN GENERATION")
print("="*60)

from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def markdown_generation_demo():
    """Demonstrates raw vs fit markdown with content filtering."""
    print("\nDemonstrating markdown generation strategies...")
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.4,
                threshold_type="fixed",
                min_word_threshold=20
            )
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )
        raw_len = len(result.markdown.raw_markdown)
        fit_len = len(result.markdown.fit_markdown) if result.markdown.fit_markdown else 0
        print("\nMarkdown Comparison:")
        print(f"  Raw Markdown: {raw_len:,} characters")
        print(f"  Fit Markdown: {fit_len:,} characters")
        print(f"  Reduction: {((raw_len - fit_len) / raw_len * 100):.1f}%")
        print("\n--- Fit Markdown Preview (first 600 chars) ---")
        print(result.markdown.fit_markdown[:600] if result.markdown.fit_markdown else "N/A")
        return result

result = asyncio.run(markdown_generation_demo())
We prepare the complete Google Colab environment required to run Crawl4AI smoothly, including system packages, Python dependencies, and the Playwright browser setup. We initialize the async-friendly notebook workflow with nest_asyncio, import the core libraries, and confirm that the environment is ready for crawling tasks. We then begin with foundational examples: a simple crawl, followed by a more configurable crawl that demonstrates how browser settings and runtime options affect page retrieval.
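The async pattern the notebook relies on can be sketched with plain asyncio, independent of Crawl4AI. The snippet below is an illustration only: fetch and the URLs are hypothetical stand-ins for real page requests, and the semaphore shows the standard way to cap concurrency when fanning out over many pages.

```python
import asyncio

async def fetch(url, sem):
    """Stand-in for a page fetch; the semaphore caps concurrency."""
    async with sem:
        await asyncio.sleep(0.01)  # simulate network latency
        return f"<html>{url}</html>"

async def crawl_all(urls, limit=3):
    # gather preserves input order even though tasks finish out of order
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

pages = asyncio.run(crawl_all([f"https://example.com/{i}" for i in range(5)]))
print(len(pages))  # 5
```

In a notebook, nest_asyncio.apply() is what lets asyncio.run be called even though Jupyter already runs its own event loop.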
print("\n" + "="*60)
print("PART 5: BM25 QUERY-BASED FILTERING")
print("="*60)

async def bm25_filtering_demo():
    """Using BM25 algorithm to extract content relevant to a specific query."""
    print("\nExtracting content relevant to a specific query...")
    query = "legal aspects privacy data protection"
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=BM25ContentFilter(
                user_query=query,
                bm25_threshold=1.2
            )
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )
        print(f"\nQuery: '{query}'")
        print(f"Fit markdown length: {len(result.markdown.fit_markdown or '')} chars")
        print("\n--- Query-Relevant Content Preview ---")
        print(result.markdown.fit_markdown[:800] if result.markdown.fit_markdown else "No relevant content found")
        return result

result = asyncio.run(bm25_filtering_demo())
print("\n" + "="*60)
print("PART 6: CSS-BASED EXTRACTION (No LLM)")
print("="*60)

from crawl4ai import JsonCssExtractionStrategy

async def css_extraction_demo():
    """Extract structured data using CSS selectors - fast and reliable."""
    print("\nExtracting data using CSS selectors...")
    schema = {
        "name": "Wikipedia Headings",
        "baseSelector": "div.mw-parser-output h2",
        "fields": [
            {
                "name": "heading_text",
                "selector": "span.mw-headline",
                "type": "text"
            },
            {
                "name": "heading_id",
                "selector": "span.mw-headline",
                "type": "attribute",
                "attribute": "id"
            }
        ]
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Python_(programming_language)",
            config=run_config
        )
        if result.extracted_content:
            data = json.loads(result.extracted_content)
            print(f"\nExtracted {len(data)} section headings")
            print("\n--- Extracted Headings ---")
            for item in data[:10]:
                heading = item.get('heading_text', 'N/A')
                heading_id = item.get('heading_id', 'N/A')
                if heading:
                    print(f"  • {heading} (#{heading_id})")
        else:
            print("No data extracted")
        return result

result = asyncio.run(css_extraction_demo())
print("\n" + "="*60)
print("PART 7: ADVANCED CSS EXTRACTION - Hacker News")
print("="*60)

async def advanced_css_extraction():
    """Extract stories from Hacker News with nested selectors."""
    print("\nExtracting stories from Hacker News...")
    schema = {
        "name": "Hacker News Stories",
        "baseSelector": "tr.athing",
        "fields": [
            {
                "name": "rank",
                "selector": "span.rank",
                "type": "text"
            },
            {
                "name": "title",
                "selector": "span.titleline > a",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "span.titleline > a",
                "type": "attribute",
                "attribute": "href"
            },
            {
                "name": "site",
                "selector": "span.sitestr",
                "type": "text"
            }
        ]
    }
    extraction_strategy = JsonCssExtractionStrategy(schema)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )
        if result.extracted_content:
            stories = json.loads(result.extracted_content)
            print(f"\nExtracted {len(stories)} stories from Hacker News")
            print("\n--- Top 10 Stories ---")
            for story in stories[:10]:
                rank = story.get('rank', '?').strip('.') if story.get('rank') else '?'
                title = story.get('title', 'N/A')[:55]
                site = story.get('site', 'N/A')
                print(f"  #{rank:<3} {title:<55} ({site})")
        return result

result = asyncio.run(advanced_css_extraction())
We focus on improving the quality and relevance of extracted content by exploring markdown generation and query-aware filtering. We compare raw markdown with fit markdown to see how pruning reduces noise, and we use BM25-based filtering to keep only the parts of a page that align with a specific query. We then move into CSS-based extraction, where we define a structured schema and use selectors to pull clean heading data from a Wikipedia page without relying on an LLM.
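To build intuition for what BM25ContentFilter does, here is a minimal, self-contained sketch of the classic BM25 scoring formula applied to page chunks. This is an illustration only, not the library's actual implementation; the chunks, query, and threshold idea are toy data.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score

# Toy "page chunks": a query-based filter keeps chunks whose score clears a threshold
chunks = [
    "data protection and privacy law for scraped personal data".split(),
    "python requests library sends http get requests".split(),
]
query = "legal aspects privacy data protection".split()
scores = [bm25_score(query, c, chunks) for c in chunks]
print(scores)  # the privacy/legal chunk outscores the unrelated HTTP chunk
```

The bm25_threshold=1.2 setting in the tutorial plays the role of the cutoff applied to scores like these.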
print("\n" + "="*60)
print("PART 8: JAVASCRIPT EXECUTION")
print("="*60)

async def javascript_execution_demo():
    """Execute JavaScript on pages before extraction."""
    print("\nExecuting JavaScript before crawling...")
    js_code = """
    // Scroll down to trigger lazy loading
    window.scrollTo(0, document.body.scrollHeight);
    // Wait for content to load
    await new Promise(r => setTimeout(r, 1000));
    // Scroll back up
    window.scrollTo(0, 0);
    // Add a marker to verify JS ran
    document.body.setAttribute('data-crawl4ai', 'executed');
    """
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_code],
        wait_for="css:body",
        delay_before_return_html=1.0
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=run_config
        )
        print("\nPage crawled with JS execution")
        print(f"Status: {result.status_code}")
        print(f"Content length: {len(result.markdown.raw_markdown)} chars")
        return result

result = asyncio.run(javascript_execution_demo())
print("\n" + "="*60)
print("PART 9: LLM-BASED EXTRACTION")
print("="*60)

from crawl4ai import LLMExtractionStrategy, LLMConfig

class Article(BaseModel):
    title: str = Field(description="The article title")
    summary: str = Field(description="A brief summary")
    topics: List[str] = Field(description="Main topics covered")

async def llm_extraction_demo():
    """Use LLM to intelligently extract and structure data."""
    print("\nLLM-based extraction setup...")
    import os
    api_key = os.getenv('OPENAI_API_KEY')
    if not api_key:
        print("\nNo OPENAI_API_KEY found. Showing setup code only.")
        print("\nTo enable LLM extraction, run:")
        print("  import os")
        print("  os.environ['OPENAI_API_KEY'] = 'sk-your-key-here'")
        print("\n--- Example Code ---")
        example_code = '''
from crawl4ai import LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",  # or "ollama/llama3"
        api_token=os.getenv('OPENAI_API_KEY')
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract all products with prices."
)
run_config = CrawlerRunConfig(
    extraction_strategy=llm_strategy,
    cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)
    products = json.loads(result.extracted_content)
'''
        print(example_code)
        return None
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=api_key
        ),
        schema=Article.model_json_schema(),
        extraction_type="schema",
        instruction="Extract article titles and summaries."
    )
    run_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )
        if result.extracted_content:
            data = json.loads(result.extracted_content)
            print("\nLLM extracted:")
            print(json.dumps(data, indent=2)[:1000])
        return result

result = asyncio.run(llm_extraction_demo())
We continue structured extraction by applying nested CSS selectors to collect ranked story information from Hacker News in a clean JSON-like format. We then demonstrate JavaScript execution before extraction, which helps us interact with dynamic pages by scrolling, waiting for content, and modifying the DOM before processing. Finally, we introduce LLM-based extraction, define a schema with Pydantic, and show how Crawl4AI can convert unstructured web content into structured outputs using a language model.
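The value of a Pydantic schema in this pipeline is that the LLM's JSON output gets checked before it is trusted. As a stdlib-only sketch of that validation step (illustrative, not Pydantic's actual mechanics; the field names mirror the Article model above, and llm_output is a hypothetical response rather than real result.extracted_content):

```python
import json

# Required keys and expected Python types, mirroring the Article model
ARTICLE_SCHEMA = {"title": str, "summary": str, "topics": list}

def validate_article(raw_json: str) -> dict:
    """Parse an LLM response and check it against the expected schema."""
    data = json.loads(raw_json)
    for key, typ in ARTICLE_SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], typ):
            raise TypeError(f"field {key} should be {typ.__name__}")
    return data

# Hypothetical LLM response for one story
llm_output = '{"title": "Show HN: Crawler", "summary": "A web crawler.", "topics": ["scraping"]}'
article = validate_article(llm_output)
print(article["title"])  # Show HN: Crawler
```

Pydantic does the same job (plus coercion and nested models) from the declarative class definition, which is why the tutorial passes Article.model_json_schema() to the extraction strategy.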
print("\n" + "="*60)
print("PART 10: DEEP CRAWLING")
print("="*60)

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter, DomainFilter

async def deep_crawl_demo():
    """Crawl multiple pages starting from a seed URL using BFS."""
    print("\nStarting deep crawl with BFS strategy...")
    filter_chain = FilterChain([
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=[]
        ),
        URLPatternFilter(
            patterns=["*quickstart*", "*installation*", "*examples*"]
        )
    ])
    deep_crawl_strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=5,
        filter_chain=filter_chain,
        include_external=False
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=deep_crawl_strategy
    )
    pages_crawled = []
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_config
        )
        if isinstance(results, list):
            for result in results:
                pages_crawled.append(result.url)
                print(f"Crawled: {result.url}")
                print(f"  Content: {len(result.markdown.raw_markdown)} chars")
        else:
            pages_crawled.append(results.url)
            print(f"Crawled: {results.url}")
            print(f"  Content: {len(results.markdown.raw_markdown)} chars")
    print(f"\nTotal pages crawled: {len(pages_crawled)}")
    return pages_crawled

pages = asyncio.run(deep_crawl_demo())
print("\n" + "="*60)
print("PART 11: MULTI-URL CONCURRENT CRAWLING")
print("="*60)

async def multi_url_crawl():
    """Crawl multiple URLs concurrently for maximum efficiency."""
    print("\nCrawling multiple URLs concurrently...")
    urls = [
        "https://httpbin.org/html",
        "https://httpbin.org/robots.txt",
        "https://httpbin.org/json",
        "https://example.com",
        "https://httpbin.org/headers"
    ]
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        verbose=False
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=run_config
        )
        print("\nResults Summary:")
        print(f"{'URL':<40} {'Status':<10} {'Content':<15}")
        print("-" * 65)
        for result in results:
            url_short = result.url[:38] + ".." if len(result.url) > 40 else result.url
            status = "OK" if result.success else "FAIL"
            content_len = f"{len(result.markdown.raw_markdown):,} chars" if result.success else "N/A"
            print(f"{url_short:<40} {status:<10} {content_len:<15}")
        return results

results = asyncio.run(multi_url_crawl())
print("\n" + "="*60)
print("PART 12: SCREENSHOTS & MEDIA")
print("="*60)

async def screenshot_demo():
    """Capture screenshots and extract media from pages."""
    print("\nCapturing screenshot and extracting media...")
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=False,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )
        print("\nCrawl complete!")
        print(f"Screenshot captured: {result.screenshot is not None}")
        if result.screenshot:
            print(f"  Screenshot size: {len(result.screenshot)} bytes (base64)")
        if result.media and 'images' in result.media:
            images = result.media['images']
            print(f"\nFound {len(images)} images:")
            for img in images[:5]:
                print(f"  • {img.get('src', 'N/A')[:60]}...")
        return result

result = asyncio.run(screenshot_demo())
We expand from single-page crawling to deeper and broader workflows by introducing BFS-based deep crawling across multiple related pages. We configure a filter chain to control which domains and URL patterns are allowed, making the crawl targeted and efficient rather than uncontrolled. We also demonstrate concurrent multi-URL crawling and screenshot/media extraction, showing how Crawl4AI can scale across several pages while also collecting visual and embedded content.
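The BFS strategy above can be illustrated with a plain-Python traversal over a toy, in-memory link graph. This is a sketch, not BFSDeepCrawlStrategy's implementation: bfs_crawl and the site dict are hypothetical, but the max_depth, max_pages, and allow-filter limits mirror the options the strategy exposes.

```python
from collections import deque

def bfs_crawl(start, links, max_depth=2, max_pages=5, allow=lambda url: True):
    """Breadth-first traversal with depth, page-count, and filter limits."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)  # "crawl" the page
        if depth == max_depth:
            continue  # do not expand links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen and allow(nxt):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Toy site graph (hypothetical paths standing in for full URLs)
site = {
    "/": ["/quickstart", "/blog", "/installation"],
    "/quickstart": ["/examples", "/api"],
    "/installation": ["/examples"],
}
allowed = lambda u: u == "/" or any(p in u for p in ("quickstart", "installation", "examples"))
order = bfs_crawl("/", site, max_depth=2, max_pages=5, allow=allowed)
print(order)  # /blog and /api are filtered out; /examples is visited once
```

Breadth-first order guarantees that all depth-1 pages are fetched before any depth-2 page, which is why it pairs naturally with a max_depth cutoff.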
print("\n" + "="*60)
print("PART 13: LINK EXTRACTION")
print("="*60)

async def link_extraction_demo():
    """Extract and analyze all links from a page."""
    print("\nExtracting and analyzing links...")
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_config
        )
        internal_links = result.links.get('internal', [])
        external_links = result.links.get('external', [])
        print("\nLink Analysis:")
        print(f"  Internal links: {len(internal_links)}")
        print(f"  External links: {len(external_links)}")
        print("\n--- Sample Internal Links (first 5) ---")
        for link in internal_links[:5]:
            print(f"  • {link.get('href', 'N/A')[:60]}")
        print("\n--- Sample External Links (first 5) ---")
        for link in external_links[:5]:
            print(f"  • {link.get('href', 'N/A')[:60]}")
        return result

result = asyncio.run(link_extraction_demo())
print("\n" + "="*60)
print("PART 14: CONTENT SELECTION")
print("="*60)

async def content_selection_demo():
    """Target specific content using CSS selectors."""
    print("\nTargeting specific content with CSS selectors...")
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        css_selector="article, main, .content, #content, #mw-content-text",
        excluded_tags=["nav", "footer", "header", "aside"],
        remove_overlay_elements=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )
        print("\nContent extracted with targeting")
        print(f"Markdown length: {len(result.markdown.raw_markdown):,} chars")
        print("\n--- Preview (first 500 chars) ---")
        print(result.markdown.raw_markdown[:500])
        return result

result = asyncio.run(content_selection_demo())
print("\n" + "="*60)
print("PART 15: SESSION MANAGEMENT")
print("="*60)

async def session_management_demo():
    """Maintain browser sessions across multiple requests."""
    print("\nDemonstrating session management...")
    browser_config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        session_id = "my_session"
        result1 = await crawler.arun(
            url="https://httpbin.org/cookies/set?session=demo123",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id
            )
        )
        print(f"  Step 1: Set cookies - Success: {result1.success}")
        result2 = await crawler.arun(
            url="https://httpbin.org/cookies",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id
            )
        )
        print(f"  Step 2: Read cookies - Success: {result2.success}")
        print("\nCookie Response:")
        print(result2.markdown.raw_markdown[:300])
        return result2

result = asyncio.run(session_management_demo())
We analyze the structure and navigability of a site by extracting both internal and external links from a page and summarizing them for inspection. We then demonstrate content targeting with CSS selectors and excluded tags, focusing extraction on the most meaningful sections of a page while avoiding navigation or layout noise. After that, we show session management, where we preserve browser state across requests and verify that cookies persist between sequential crawls.
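Under the hood, splitting a page's links into internal and external reduces to resolving each href against the page URL and comparing hostnames. A stdlib-only sketch of that idea (the hrefs below are hypothetical examples, not actual crawl output):

```python
from urllib.parse import urljoin, urlparse

def classify_links(page_url, hrefs):
    """Split hrefs into internal and external relative to the page's host."""
    base_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        host = urlparse(absolute).netloc
        (internal if host == base_host else external).append(absolute)
    return internal, external

page = "https://docs.crawl4ai.com/"
hrefs = ["quickstart/", "/api/", "https://github.com/unclecode/crawl4ai"]
internal, external = classify_links(page, hrefs)
print(internal)  # the two docs.crawl4ai.com URLs
print(external)  # the GitHub URL
```

Crawl4AI exposes the same split directly on result.links, so this step is never written by hand in practice; the sketch just shows why relative links must be resolved before comparing hosts.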
print("\n" + "="*60)
print("PART 16: COMPLETE REAL-WORLD EXAMPLE")
print("="*60)

async def complete_example():
    """Complete example combining CSS extraction with content filtering."""
    print("\nRunning complete example: Hacker News scraper with filtering")
    schema = {
        "name": "HN Stories",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "rank", "selector": "span.rank", "type": "text"},
            {"name": "title", "selector": "span.titleline > a", "type": "text"},
            {"name": "url", "selector": "span.titleline > a", "type": "attribute", "attribute": "href"},
            {"name": "site", "selector": "span.sitestr", "type": "text"}
        ]
    }
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.4)
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )
        if result.extracted_content:
            stories = json.loads(result.extracted_content)
            print(f"\nSuccessfully extracted {len(stories)} stories!")
            print(f"\n{'='*70}")
            print("TOP HACKER NEWS STORIES")
            print("="*70)
            for story in stories[:15]:
                rank = story.get('rank', '?').strip('.') if story.get('rank') else '?'
                title = story.get('title', 'No title')[:50]
                site = story.get('site', 'N/A')
                print(f"  #{rank:<3} {title:<50} ({site})")
            print("="*70)
            return stories
        return []

stories = asyncio.run(complete_example())
print("\n" + "="*60)
print("BONUS: SAVING RESULTS")
print("="*60)

if stories:
    with open('hacker_news_stories.json', 'w') as f:
        json.dump(stories, f, indent=2)
    print(f"Saved {len(stories)} stories to 'hacker_news_stories.json'")
    print("\nTo download in Colab:")
    print("  from google.colab import files")
    print("  files.download('hacker_news_stories.json')")
print("\n" + "="*60)
print("TUTORIAL COMPLETE!")
print("="*60)
print("""
What you learned:
  1. Basic crawling with AsyncWebCrawler
  2. Browser & crawler configuration
  3. Markdown generation (raw vs fit)
  4. BM25 query-based content filtering
  5. CSS-based structured data extraction
  6. Advanced CSS extraction (Hacker News)
  7. JavaScript execution for dynamic content
  8. LLM-based extraction setup
  9. Deep crawling with BFS strategy
  10. Multi-URL concurrent crawling
  11. Screenshots & media extraction
  12. Link extraction & analysis
  13. Content targeting with CSS selectors
  14. Session management
  15. Complete real-world scraping example

RESOURCES:
  • Docs: https://docs.crawl4ai.com/
  • GitHub: https://github.com/unclecode/crawl4ai
  • Discord: https://discord.gg/jP8KfhDhyN

Happy Crawling with Crawl4AI!
""")
We combine several ideas from the tutorial into a complete real-world example that extracts and filters Hacker News stories using structured CSS extraction and Markdown pruning. We format the results into a readable output, demonstrating how Crawl4AI can support a practical scraping workflow from collection to presentation. Finally, we save the extracted stories to a JSON file and close the tutorial with a clear summary of the major concepts and capabilities we have implemented throughout the notebook.
In conclusion, we developed a strong end-to-end understanding of how to use Crawl4AI for both simple and advanced crawling tasks. We moved from straightforward page extraction to more refined workflows involving content filtering, targeted element selection, structured data extraction, dynamic-page interaction, multi-URL concurrency, and deep crawling across linked pages. We also saw how the framework supports richer automation through media capture, persistent sessions, and optional LLM-powered schema extraction. As a result, we finished with a practical foundation for building reliable, efficient, and flexible scraping and crawling pipelines that are ready to support real-world research, monitoring, and intelligent data processing workflows.
The post A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction appeared first on MarkTechPost.
