Extracting Snowflake Documentation ¶
Learning that LLMs are not magic was a journey. They are not all-knowing, not all-powerful, and definitely not always accurate. I knew that going in; I just didn't realize how dumb you can make them if you feed them crap data.
Lots of fun. Loads of learning. Kinda the hardest part to master.
Remember, garbage in, garbage out.
The foundation of any documentation-based LLM application is high-quality source material. In this post, I’ll dive into the technical details of how I extracted and processed Snowflake’s extensive documentation to create the knowledge base.
Setting Up the Environment ¶
My final documentation extraction pipeline relies on several key Python libraries:
import pandas as pd
import requests
import os
from bs4 import BeautifulSoup
from markitdown import MarkItDown
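If you're starting from a clean environment, those imports roughly map to the packages below (a minimal install sketch from a notebook cell; I'm not pinning versions here):
# Run in a notebook cell; %pip installs into the active kernel's environment
%pip install pandas requests beautifulsoup4 markitdown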
Key Tools & Libraries ¶
- Jupyter Notebooks: I chose Jupyter for its interactive development environment, which made it easier to inspect and debug the extraction process.
I actually ran the notebook in a Docker container hosted on my homeserver:
version: "3.3"
services:
  jupyter4unsloth:
    container_name: jupyter4unsloth
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
    ports:
      - 8888:8888
    volumes:
      - /mnt/RAID/jupyter4unsloth:/workspace
    image: jiar/jupyter4unsloth:latest
networks: {}
You probably noticed the nvidia resource reservation in the compose file. I tried multiple images, and this was the only one that worked reliably with my GPU.
I spent MANY hours, days even, with the official Docker image, but it either didn't work with my GPU or would sporadically break itself. CUDA is a pain, especially for other projects I'll blog about in the future, such as fine-tuning small models with unsloth.
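Before doing anything heavy, a quick sanity check inside the container confirms the GPU is actually visible (this assumes PyTorch is available in the image, which it should be for anything unsloth-oriented):
import torch

# If the nvidia device reservation in the compose file worked, this should
# print True along with the GPU's name.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))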
- BeautifulSoup: This powerful library handles HTML parsing, allowing us to navigate and extract content from Snowflake’s documentation pages with precision.
- Markitdown: After evaluating several Markdown conversion libraries, I selected Markitdown for its superior handling of technical documentation, particularly code blocks and nested structures.
The Sitemap-Based Approach ¶
Rather than manually discovering URLs or implementing a crawler, I leveraged Snowflake’s sitemap (thank you for existing) for comprehensive coverage.
Sitemap Processing notebook cell ¶
import sys
sys.path.append("./functions")
import sitemap_utils
# Example sitemap URL (Replace with actual URL)
sitemap_url = "https://docs.snowflake.com/sitemap.xml"
sitemap_utils.fetch_sitemap(sitemap_url, "snow_sitemap.csv")
Sitemap /functions/sitemap_utils.py ¶
import requests
import xml.etree.ElementTree as ET
import pandas as pd
from urllib.parse import unquote


def sanitize_url(url):
    if not isinstance(url, str):
        return ""
    # Decode URL-encoded characters like %0A, %20%20, etc.
    url = unquote(url)
    # Strip leading/trailing spaces and newlines
    url = url.strip()
    # Remove extra spaces and unwanted characters
    url = url.replace('"', '')   # Remove any extra quotes
    url = url.replace("\n", "")  # Remove newlines
    url = url.replace("\r", "")  # Remove carriage returns
    return url


def fetch_sitemap(sitemap_url, output_csv="sitemap.csv"):
    """
    Fetches a sitemap XML from a given URL and extracts URLs into a CSV file.

    Parameters:
        sitemap_url (str): The URL of the sitemap XML.
        output_csv (str): The name of the output CSV file (default: "sitemap.csv").

    Returns:
        None
    """
    try:
        response = requests.get(sitemap_url)
        response.raise_for_status()  # Raise an error for HTTP issues

        # Parse XML content
        root = ET.fromstring(response.content)

        # Extract URLs from <loc> tags
        urls = [
            sanitize_url(elem.text)
            for elem in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}loc")
        ]

        # Save to CSV
        df = pd.DataFrame({"Index": range(1, len(urls) + 1), "URL": urls})
        df.to_csv(output_csv, index=False)

        print(f"Sitemap successfully saved to {output_csv}")

    except requests.exceptions.RequestException as e:
        print(f"Error fetching sitemap: {e}")
    except ET.ParseError:
        print("Error parsing XML. Make sure it's a valid sitemap.")
Content Filtering Rules ¶
I implemented several filtering rules after the initial extraction to ensure quality and relevance. Full disclosure: this step was manual rather than code. I wanted to eyeball some of the content, so I spent many hours on a sorting spree, splitting the pages into folders. The rules boiled down to the following (a sketch of automating them comes after the list):
- Skip landing pages with minimal content
- Exclude version-specific legacy documentation
- Focus on core documentation sections
- Filter out marketing and blog content
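This is roughly how a first automated pass over the sitemap CSV could look (the URL patterns below are illustrative guesses, not the exact rules I applied by hand):
import pandas as pd

# Hypothetical patterns to drop -- adjust to whatever counts as noise for you
EXCLUDE_PATTERNS = ["/release-notes/", "/legal/", "/other-resources/"]

df = pd.read_csv("snow_sitemap.csv")

def keep(url: str) -> bool:
    # Keep a URL only if it matches none of the exclusion patterns
    return not any(pattern in url for pattern in EXCLUDE_PATTERNS)

filtered = df[df["URL"].apply(keep)]
filtered.to_csv("snow_sitemap_filtered.csv", index=False)
print(f"Kept {len(filtered)} of {len(df)} URLs")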
Content Extraction Strategy - Converting the links in the CSV to markdown files ¶
Finding the Right DIVs ¶
The most crucial part of the extraction process was identifying the correct content container. In this case, everything we wanted lived inside the <article> tag:
# Parse HTML and extract <article> content
soup = BeautifulSoup(html_content, "html.parser")
article_tag = soup.find("article")

if not article_tag:
    print(f"No <article> tag found for {url}, skipping...")
    continue

article_html = str(article_tag)
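If you point this at a different site and aren't sure which container to target, a quick throwaway probe like this (not part of my pipeline; the URL is just whichever page you're inspecting) shows where the bulk of the text lives:
import requests
from bs4 import BeautifulSoup

# Any documentation page you want to inspect
url = "https://docs.snowflake.com/en/user-guide/warehouses"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Compare how much text sits inside a few likely content containers
for selector in ["article", "main", "div#content", "body"]:
    tag = soup.select_one(selector)
    size = len(tag.get_text(strip=True)) if tag else 0
    print(f"{selector:12s} -> {size} characters")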
Content Cleaning ¶
I tried markdownify and a few other manual approaches to extracting and building Markdown files, but it was a pain. I ended up just taking the article_html variable and passing it to the MarkItDown library to convert it to Markdown.
Below is the full code that worked inside my Jupyter notebook:
import pandas as pd
import requests
import os
from bs4 import BeautifulSoup
from markitdown import MarkItDown

# Initialize MarkItDown
md = MarkItDown()

# Load sitemap CSV
sitemap_file = "snow_sitemap.csv"  # Update with actual file path
output_folder = "snow_markdown_files"

# Ensure output folder exists
os.makedirs(output_folder, exist_ok=True)

# Read the sitemap CSV
df = pd.read_csv(sitemap_file)

# Check if 'URL' column exists
if 'URL' not in df.columns:
    raise ValueError("CSV does not contain a 'URL' column")

# Process each URL
for url in df['URL']:
    try:
        print(f"Processing: {url}")

        # Fetch HTML content
        response = requests.get(url)
        response.raise_for_status()  # Raise error if request fails
        html_content = response.text

        # Parse HTML and extract <article> content
        soup = BeautifulSoup(html_content, "html.parser")
        article_tag = soup.find("article")

        if not article_tag:
            print(f"No <article> tag found for {url}, skipping...")
            continue

        article_html = str(article_tag)

        # Save extracted <article> content to a temporary file
        temp_html_path = os.path.join(output_folder, "temp.html")
        with open(temp_html_path, "w", encoding="utf-8") as temp_html_file:
            temp_html_file.write(article_html)

        # Convert HTML to Markdown
        result = md.convert(temp_html_path)
        markdown_content = result.text_content

        # Generate filename based on URL
        url_slug = url.replace("https://", "").replace("http://", "").replace("/", "_").replace("?", "_")
        file_path = os.path.join(output_folder, f"{url_slug}.md")

        # Save Markdown content
        with open(file_path, "w", encoding="utf-8") as md_file:
            md_file.write(markdown_content)

        print(f"Saved: {file_path}")

        # Clean up temporary HTML file
        os.remove(temp_html_path)

    except Exception as e:
        print(f"Error processing {url}: {e}")

print("Markdown conversion complete!")
Here is an example of one of the markdown files:
# Virtual warehouses[¶](#virtual-warehouses "Link to this heading")
A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake. A virtual warehouse is
available in two types:
* Standard
* Snowpark-optimized
A warehouse provides the required resources, such as CPU, memory, and temporary storage, to
perform the following operations in a Snowflake session:
* Executing SQL [SELECT](../sql-reference/sql/select) statements that require compute resources (e.g. retrieving rows from tables and views).
* Performing DML operations, such as:
+ Updating rows in tables ([DELETE](../sql-reference/sql/delete) , [INSERT](../sql-reference/sql/insert) , [UPDATE](../sql-reference/sql/update)).
+ Loading data into tables ([COPY INTO <table>](../sql-reference/sql/copy-into-table)).
+ Unloading data from tables ([COPY INTO <location>](../sql-reference/sql/copy-into-location)).
Note
To perform these operations, a warehouse must be running and in use for the session. While a warehouse is running, it consumes Snowflake
credits.
[Overview of warehouses](warehouses-overview)
Warehouses are required for queries, as well as all DML operations, including loading data into tables.
In addition to being defined by its type as either Standard or Snowpark-optimized, a warehouse is defined by its size,
as well as the other properties that can be set to help control and automate warehouse activity.
[Snowpark-optimized warehouses](warehouses-snowpark-optimized)
Snowpark workloads can be run on both Standard and Snowpark-optimized warehouses. Snowpark-optimized warehouses are recommended for workloads that have large memory requirements such as ML training use cases
[Warehouse considerations](warehouses-considerations)
Best practices and general guidelines for using virtual warehouses in Snowflake to process queries
[Multi-cluster warehouses](warehouses-multicluster)
Multi-cluster warehouses enable you to scale compute resources to manage your user and query concurrency needs as they change, such as during peak and off hours.
[Working with warehouses](warehouses-tasks)
Learn how to create, stop, start and otherwise manage Snowflake warehouses.
[Using the Query Acceleration Service](query-acceleration-service)
The query acceleration service can accelerate parts of the query workload in a warehouse.
When enabled for a warehouse, query acceleration can improve overall warehouse performance by reducing the impact of outlier queries
(i.e. queries which use more resources then typical queries).
[Monitoring warehouse load](warehouses-load-monitoring)
Warehouse query load measures the average number of queries that were running or queued within a specific interval.
* [Overview of warehouses](warehouses-overview)
* [Snowpark-optimized warehouses](warehouses-snowpark-optimized)
* [Warehouse considerations](warehouses-considerations)
* [Multi-cluster warehouses](warehouses-multicluster)
* [Working with warehouses](warehouses-tasks)
* [Using the Query Acceleration Service](query-acceleration-service)
* [Monitoring warehouse load](warehouses-load-monitoring)
Results and Metrics ¶
The extraction process yielded impressive results:
- Total Pages Processed: 3000+
- Unique Content Sections: 5500+
- Average Section Length: 2500 characters
- Code Examples Preserved: 1200+
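If you want to sanity-check numbers like these on your own run, a rough pass over the output folder might look like the sketch below (my exact heuristics for what counts as a "section" or "code example" differed a bit, so treat it as illustrative):
import os

folder = "snow_markdown_files"
files = [f for f in os.listdir(folder) if f.endswith(".md")]

headings = 0
code_blocks = 0
total_chars = 0

for name in files:
    with open(os.path.join(folder, name), encoding="utf-8") as fh:
        text = fh.read()
    # Crude heuristics: Markdown headings approximate sections,
    # paired ``` fences approximate preserved code examples
    headings += sum(1 for line in text.splitlines() if line.lstrip().startswith("#"))
    code_blocks += text.count("```") // 2
    total_chars += len(text)

print(f"Pages processed:  {len(files)}")
print(f"Heading count:    {headings}")
print(f"Avg page length:  {total_chars // max(len(files), 1)} characters")
print(f"Code blocks seen: {code_blocks}")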
Conclusion ¶
Building this robust documentation extraction pipeline was crucial for the success of my LLM-powered assistant. The clean, well-structured content we extracted serves as the foundation for our RAG (Retrieval-Augmented Generation) system, which we’ll explore in the next post.
The code and techniques shared here can be adapted for other documentation extraction projects, though you’ll likely need to adjust the selectors and cleaning logic for different sources.
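In practice the only site-specific pieces are the sitemap URL and the content selector, so a tiny per-source config goes a long way. A sketch (only the Snowflake entry reflects what I actually used; the other values are placeholders):
from bs4 import BeautifulSoup

# Per-source settings: where the sitemap lives and which CSS selector
# wraps the main content on that site
SOURCES = {
    "snowflake": {
        "sitemap": "https://docs.snowflake.com/sitemap.xml",
        "selector": "article",
    },
    "some-other-docs": {
        "sitemap": "https://example.com/sitemap.xml",
        "selector": "main div.content",
    },
}

def extract_main_html(soup: BeautifulSoup, selector: str):
    """Return the HTML of the configured content container, or None if missing."""
    tag = soup.select_one(selector)
    return str(tag) if tag else None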
Stay tuned for our next post, where we’ll dive into how we built our RAG architecture using LanceDB to make this content searchable and retrievable.
Will be back soon with the next post!