What is Firecrawl? How to Turn Websites Into LLM-Ready Data
Firecrawl is an open-source tool that converts entire websites into clean markdown or structured data that is ready for LLMs (Large Language Models), often in under 60 seconds. It handles complex tasks like bypassing bot detection, managing JavaScript-heavy pages, and automatically crawling subpages, so you can feed high-quality data into AI applications. With Firecrawl, you can transform a messy URL into a perfectly formatted text file that models like Claude Opus 4.5 or GPT-5 can use immediately.
What makes Firecrawl different from a standard web scraper?
A standard web scraper (a tool that extracts data from websites) often gives you raw HTML code. HTML is full of "noise" like scripts, styling tags, and navigation menus that confuse AI models and waste tokens (the units of text AI uses to process information). Firecrawl specifically strips away this junk to provide clean Markdown (a simple text formatting language).
Most basic scrapers also struggle with "dynamic content," which refers to websites that use JavaScript to load data only after the page opens. Firecrawl uses a "headless browser" (a web browser without a visual interface) to wait for the page to fully load before grabbing the data. It also handles "crawling," which means it doesn't just look at one page; it follows links to map out and scrape an entire site for you.
We've found that the biggest hurdle for beginners is often getting blocked by websites that don't like automated tools. Firecrawl includes built-in proxies (intermediary servers that hide your real identity) to help you avoid these blocks. This means you spend less time troubleshooting errors and more time building your project.
What do you need to get started?
Before you write your first line of code, you'll need a few basic things ready on your computer. Don't worry if you haven't used these tools much before; they are standard for most modern AI projects.
- Node.js or Python: These are programming languages. You'll need Python 3.12+ or Node.js 20+ installed.
- A Firecrawl API Key: An API key is like a secret password that lets your code talk to Firecrawl's servers. You can get a free one by signing up on their website.
- A Code Editor: We recommend VS Code (Visual Studio Code), which is a free program for writing and saving your code files.
How do you scrape your first page with Python?
Once you have your API key, you can start extracting data with just a few lines of code. This example uses Python, which is the most popular language for AI beginners.
Step 1: Install the library
Open your terminal (the command-line interface on your computer) and run this command to install the Firecrawl helper library:
pip install firecrawl-py
# pip is the tool used to install Python packages
Step 2: Create your script
Create a new file named scrape.py and paste the following code into it. Replace YOUR_API_KEY with the actual key from your dashboard.
from firecrawl import FirecrawlApp
# Initialize the app with your secret API key
app = FirecrawlApp(api_key="YOUR_API_KEY")
# Scrape a single URL and convert it to markdown
# This tells Firecrawl to visit the site and clean the text
scrape_result = app.scrape_url('https://example.com', params={'formats': ['markdown']})
# Print the clean text to your screen
print(scrape_result['markdown'])
Step 3: Run the code
In your terminal, run the script by typing:
python scrape.py
What you should see: Instead of messy HTML code with <div class="header"> tags, you will see clean text with simple headers like # Main Title.
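A common next step is saving that markdown to disk so other tools (or a later AI pipeline) can reuse it. Here is a minimal sketch; the filename and sample text are placeholders standing in for Firecrawl's real output:

```python
from pathlib import Path

# Stand-in for the markdown string returned by Firecrawl
markdown_text = "# Main Title\n\nSome clean page content."

# Save it to a file so other tools (or an LLM pipeline) can read it later
output_file = Path("example_com.md")
output_file.write_text(markdown_text, encoding="utf-8")

print(f"Saved {len(markdown_text)} characters to {output_file}")
```

Saving one markdown file per scraped URL keeps your data organized and easy to inspect by hand.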
How do you crawl an entire website at once?
Sometimes you need data from every page on a blog or a documentation site. Instead of "scraping" (fetching one page), you can "crawl" (fetch every page on the site).
Step 1: Start the crawl job
Update your script to use the crawl_url function. This starts a process that follows every internal link on the site.
# Start a crawl job for the entire site
# We set a limit so it doesn't run forever on huge sites
crawl_status = app.crawl_url(
    'https://example.com',
    params={
        'limit': 10,
        'scrapeOptions': {'formats': ['markdown']}
    }
)
# Get the ID of this job so we can check if it's done
job_id = crawl_status['id']
print(f"Crawl started! Job ID: {job_id}")
Step 2: Check the status
Crawling takes longer than scraping a single page. You can use the job_id to check when the process is finished and download all the files at once.
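The polling logic itself is simple. In the v1 Python SDK the status call is named check_crawl_status (verify the name against your installed version); the sketch below uses a stand-in fetch function with canned responses so the control flow is clear and runnable without a live API key:

```python
import time

def is_crawl_finished(status: dict) -> bool:
    """A crawl is done when the API reports a terminal state."""
    return status.get("status") in ("completed", "failed")

def wait_for_crawl(fetch_status, poll_seconds: float = 2.0, max_polls: int = 30) -> dict:
    """Poll until the crawl finishes. fetch_status is any callable that
    returns the latest status dict, e.g. lambda: app.check_crawl_status(job_id)."""
    for _ in range(max_polls):
        status = fetch_status()
        if is_crawl_finished(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Crawl did not finish in time")

# Demo with canned responses instead of a real API call
responses = iter([{"status": "scraping"}, {"status": "completed", "data": []}])
result = wait_for_crawl(lambda: next(responses), poll_seconds=0)
print(result["status"])  # completed
```

In a real script you would pass a lambda wrapping the SDK's status call and keep the default poll interval so you don't hammer the API.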
Step 3: View the results
Once finished, Firecrawl returns a list of objects. Each object contains the URL of the page and the clean markdown content from that specific page.
What is "Structured Data" and why use it?
While Markdown is great for reading, sometimes you want your AI to find specific facts, like prices, dates, or names. Firecrawl allows you to use "LLM Extraction" to turn a website into JSON (JavaScript Object Notation, a way to organize data in key-value pairs like "price": "10.00").
Instead of just getting a wall of text, you can tell Firecrawl: "Find the product name and the price on this page." Firecrawl will use a model like Claude Sonnet 4 to look at the page, find those specific items, and return them in a neat list. This is incredibly useful if you are building a tool to compare prices or summarize news articles.
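To make the idea concrete, here is a toy, non-LLM version of the same task: pulling a product name and price out of clean markdown with a regular expression. The markdown sample and field names are invented for illustration; Firecrawl's LLM extraction does this job far more robustly, without you writing patterns by hand:

```python
import re

markdown = """# Acme Anvil

A classic anvil for all your cartoon needs.

Price: $49.99
"""

def extract_product(md: str) -> dict:
    """Toy stand-in for LLM extraction: regex out a name and a price."""
    name = re.search(r"^# (.+)$", md, re.MULTILINE)
    price = re.search(r"Price:\s*\$([\d.]+)", md)
    return {
        "name": name.group(1) if name else None,
        "price": float(price.group(1)) if price else None,
    }

print(extract_product(markdown))  # {'name': 'Acme Anvil', 'price': 49.99}
```

The regex breaks as soon as a site phrases its price differently; an LLM-based extractor handles those variations for you, which is the whole appeal.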
What are the common mistakes to avoid?
When you are first starting out, it's normal to run into a few hurdles. Here are the most common "gotchas" we see beginners encounter:
- Exceeding Rate Limits: If you try to scrape 100 pages a second on a free plan, the service will stop you. Start slow with one page at a time.
- Using the Wrong API Key: Ensure you aren't accidentally sharing your key in public places like GitHub. If your key stops working, check if you've hit your monthly credit limit.
- Ignoring the Sitemap: Some websites are massive (thousands of pages). Always set a limit parameter in your crawl settings so you don't accidentally use all your credits on a single site.
- Formatting Errors: If the output looks strange, check the formats parameter. If you only need text, stick to ['markdown'].
How do you use Firecrawl data with AI models?
The main reason people use Firecrawl is to build "RAG" (Retrieval-Augmented Generation) systems. RAG is a technique where you give an AI model specific information (like your company's private manuals) so it can answer questions accurately without making things up.
You can take the Markdown output from Firecrawl and save it into a "Vector Database" (a special storage system for AI data). When a user asks a question, your app searches the database for the relevant Markdown text and sends it to GPT-5 or Claude Opus 4.5. Because the text is clean and lacks HTML clutter, the AI can process it faster and provide much more accurate answers.
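The retrieval step can be sketched in a few lines. Real systems use embeddings and a vector database; this minimal sketch substitutes plain word overlap for similarity (the documents and question are invented examples) just to show the RAG flow:

```python
# Scraped markdown, keyed by source URL (invented example data)
documents = {
    "https://example.com/refunds": "Refunds are processed within 14 days of a return.",
    "https://example.com/shipping": "Shipping takes 3-5 business days within the US.",
}

def word_overlap(a: str, b: str) -> int:
    """Count shared lowercase words between two strings (toy similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(question: str) -> str:
    """Return the stored document that best matches the question."""
    return max(documents.values(), key=lambda text: word_overlap(question, text))

context = retrieve("How long do refunds take?")
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How long do refunds take?"
print(context)
```

In production you would embed each chunk, store the vectors, and retrieve by cosine similarity, but the shape of the pipeline (store, retrieve, prepend to the prompt) is exactly this.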
Next Steps
Now that you understand how to turn websites into clean data, you're ready to start building. Try scraping your own blog or a public documentation site to see how the Markdown looks.
Once you are comfortable with basic scraping, we suggest exploring "Map" mode. This feature allows you to get a list of every URL on a website without actually scraping the content yet, which is a great way to plan out a big project.
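Once Map mode hands you a URL list (in the v1 Python SDK the call is something like app.map_url('https://example.com'); check your SDK version for the exact name), you can filter it down before spending any credits on scraping. A sketch with a canned URL list standing in for the real result:

```python
# Canned stand-in for a Map-mode result (normally returned by Firecrawl)
all_urls = [
    "https://example.com/",
    "https://example.com/blog/first-post",
    "https://example.com/blog/second-post",
    "https://example.com/careers",
]

def pick_section(urls: list, prefix: str) -> list:
    """Keep only URLs under one section of the site, e.g. the blog."""
    return [u for u in urls if u.startswith(prefix)]

blog_urls = pick_section(all_urls, "https://example.com/blog/")
print(blog_urls)
```

Filtering first means you only pay crawl credits for the pages you actually need.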
To learn more about advanced features like custom headers and specialized extraction, check out the official Firecrawl documentation.