Marker Tutorial
I personally enjoy browsing Instructables. The site helpfully provides an "Export article as PDF" button in the top right, but no "Export as Markdown" option. I wanted Markdown because AI tools generally handle .md files better than PDFs. After searching online, I found this GitHub project.
Marker is an open-source tool that converts PDF, Word, PPT, and other file formats into Markdown. It's fast, and accuracy is pretty solid. This post documents how to get it running.
Setting Up the Environment
Prerequisites
- Python 3.10 or higher
- Conda is recommended for environment management (explained below)
If you're on a Mac M1/M2/M3, no extra configuration is needed — PyTorch natively supports Apple Silicon.
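A quick way to check the Python prerequisite is to compare the reported version against 3.10. The sketch below is illustrative only: python_ok is a hypothetical helper name, and it assumes the version string looks like the output of python3 --version (for example "Python 3.11.4").

```shell
# Hypothetical helper: succeeds if a version string is Python 3.10 or newer.
# Accepts either "Python 3.11.4" or a bare "3.11.4".
python_ok() {
  ver="${1#Python }"      # drop the leading "Python " if present
  major="${ver%%.*}"      # text before the first dot
  rest="${ver#*.}"        # text after the first dot
  minor="${rest%%.*}"     # text before the next dot (or all of it)
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }
}

# Usage (not run here):
#   python_ok "$(python3 --version)" && echo "Python is new enough"
```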
Install Conda
Miniforge is recommended: lightweight and well-supported on Apple Silicon.
# macOS Apple Silicon
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
After installation, run this command so Terminal recognizes conda:
~/miniforge3/bin/conda init zsh
Then close and reopen Terminal. You'll see (base) at the start of your prompt — conda is ready.
Create a Virtual Environment
Think of a virtual environment as a separate "room" for this project — packages installed here won't affect anything else on your system.
conda create -n marker python=3.11 -y
conda activate marker
Once activated, your prompt changes from (base) to (marker).
Install Marker
pip install marker-pdf
If you also need to convert Word, PPT, or EPUB files:
# Quotes stop zsh from treating [full] as a glob pattern
pip install "marker-pdf[full]"
Verify Installation
marker_single --help
Wait about 10–15 seconds. If a list of parameters is printed, you're good to go.
Activating the Environment Each Session
Every time you open a new Terminal, remember to activate the environment first:
conda activate marker
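If you tend to forget this step, a small guard can check which environment is active before you run Marker. This is a sketch, not part of Marker: in_marker_env is a hypothetical helper, and it relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated.

```shell
# Hypothetical helper: succeeds only if the named conda environment is active.
# Defaults to "marker", the environment name used in this post.
in_marker_env() {
  [ "${CONDA_DEFAULT_ENV:-}" = "${1:-marker}" ]
}

# Usage (not run here):
#   in_marker_env || echo "Run 'conda activate marker' first"
```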
How to Use
The basic format is a single line:
marker_single /path/to/your/file --output_dir /output/directory
Useful options:
- --output_dir: specify the output folder
- --output_format: output format, one of markdown, json, or html
- --page_range: convert specific pages, e.g. "0,5-10" (pages are 0-indexed)
- --force_ocr: force OCR; use this when text in the PDF is actually an image
- --langs zh: tells Marker the document is in Chinese for better accuracy
- --use_llm: use an LLM for assistance; higher accuracy, requires an API key
After conversion, the output folder will contain a .md file and extracted images. Open with Typora or Obsidian for proper rendering.
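To convert a whole folder, the same command can be wrapped in a loop. The sketch below assumes only the basic marker_single invocation with --output_dir; convert_folder and the DRY_RUN switch are hypothetical names introduced for illustration. With DRY_RUN=1 it only prints the commands, so you can check the paths before committing to a long run.

```shell
# Hypothetical helper: run marker_single once for every .pdf directly inside a folder.
# Set DRY_RUN=1 to print the commands instead of running them.
convert_folder() {
  in_dir="$1"
  out_dir="$2"
  find "$in_dir" -maxdepth 1 -type f -name '*.pdf' | sort | while IFS= read -r pdf; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'marker_single "%s" --output_dir "%s"\n' "$pdf" "$out_dir"
    else
      # </dev/null keeps the command from swallowing the file list on stdin
      marker_single "$pdf" --output_dir "$out_dir" < /dev/null
    fi
  done
}

# Usage (not run here):
#   DRY_RUN=1 convert_folder ~/Downloads/papers ~/Documents/notes
```

The files are processed one at a time, so a failure on one PDF does not stop the rest.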
Practical Examples
Convert a Research Paper
marker_single ~/Downloads/paper.pdf \
--output_dir ~/Documents/notes
The first run downloads the AI models (~1 GB), so be patient. In the converted output, math formulas are rendered as LaTeX, and tables and headings are preserved. Results are quite good.
Convert a Scanned PDF
Text in scanned PDFs is stored as images and can't be recognized normally. Add --force_ocr:
marker_single ~/Downloads/scanned.pdf \
--output_dir ~/Desktop/output \
--force_ocr \
--langs zh
This mode is slower — roughly 5–15 seconds per page. Don't stop it midway.
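The per-page figure gives a rough way to budget time for a long scan. Here is a small arithmetic sketch, assuming the 5 to 15 seconds per page quoted above holds on your machine; estimate_ocr_minutes is a hypothetical helper, not a Marker command.

```shell
# Hypothetical helper: rough OCR time range in minutes for a given page count,
# assuming 5 s/page at best and 15 s/page at worst.
estimate_ocr_minutes() {
  pages="$1"
  low=$(( pages * 5 / 60 ))            # best case, rounded down
  high=$(( (pages * 15 + 59) / 60 ))   # worst case, rounded up
  printf '%s-%s min\n' "$low" "$high"
}

# Usage:
#   estimate_ocr_minutes 200
```

For a 200-page scan this prints 16-50 min, so it is worth starting the job when you can leave it running.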
Convert Specific Pages, Output as JSON
For example, from a 200-page report, extract only pages 1, 5–10, and 20:
marker_single ~/Downloads/report.pdf \
--output_dir ~/Desktop/output \
--page_range "0,4-9,19" \
--output_format json
JSON output produces structured data — useful for further processing, such as importing into a database or feeding into an LLM.
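Converting human page numbers to the 0-indexed values that --page_range expects is easy to get wrong by one. The following sketch does the subtraction for you. to_page_range is a hypothetical helper, not part of Marker, and it assumes plain numbers and start-end ranges separated by commas.

```shell
# Hypothetical helper: turn 1-indexed pages ("1,5-10,20") into the
# 0-indexed form used by --page_range ("0,4-9,19").
to_page_range() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r part; do
    case "$part" in
      *-*) printf '%s-%s\n' "$(( ${part%-*} - 1 ))" "$(( ${part#*-} - 1 ))" ;;
      *)   printf '%s\n' "$(( ${part} - 1 ))" ;;
    esac
  done | paste -sd, -
}

# Usage (not run here):
#   marker_single ~/Downloads/report.pdf \
#     --output_dir ~/Desktop/output \
#     --page_range "$(to_page_range "1,5-10,20")"
```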
Let AI Tools Handle the Deployment
If the steps above feel tedious, paste this prompt into Claude Code, Cursor, or a similar AI tool and let it set up the environment automatically.
You are a professional Python environment configuration engineer. Please fully deploy the marker-pdf runtime environment on this machine.
My setup: macOS Apple Silicon (M1 Pro), shell is zsh, Conda is installed at ~/miniforge3.
Please complete the following steps in order, check the result after each step, and immediately diagnose and fix any errors:
1. Check if conda is available (conda --version). If not, run ~/miniforge3/bin/conda init zsh and remind me to reopen Terminal.
2. Create and activate a virtual environment: conda create -n marker python=3.11 -y && conda activate marker
3. Install marker: pip install marker-pdf
4. Verify PyTorch: python -c "import torch; print(torch.__version__)". If you see a libtorch_cpu.dylib error, reinstall immediately: pip uninstall torch torchvision torchaudio -y && pip install torch torchvision torchaudio
5. Verify marker: run marker_single --help, wait ~20 seconds. A list of parameters means success.
Tell me what you're doing before each step. Do not modify system-level configuration. When done, output the installed versions of Python, PyTorch, and marker-pdf.
Uninstalling
Uninstall only the marker package, keep the environment:
conda activate marker
pip uninstall marker-pdf
Delete the entire environment (more thorough):
conda deactivate
conda env remove -n marker
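To confirm the environment is really gone, you can look for its name in the output of conda env list. This is a minimal sketch, assuming the environment name appears in the first column of that listing; has_env is a hypothetical helper that reads the listing from standard input.

```shell
# Hypothetical helper: succeeds if the given name appears in the first
# column of a `conda env list` style listing read from stdin.
has_env() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage (not run here):
#   conda env list | has_env marker && echo "marker environment still exists"
```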
Troubleshooting
Terminal is stuck? The first run downloads models, which can be slow on some networks. Try configuring a proxy for Terminal:
export https_proxy=http://127.0.0.1:YOUR_PORT
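If you would rather not change the proxy for the whole Terminal session, the variable can be set for a single command instead. This is a sketch under assumptions: with_proxy is a hypothetical helper, the proxy is assumed to listen on 127.0.0.1, and setting http_proxy alongside https_proxy is an addition not covered above.

```shell
# Hypothetical helper: run one command with proxy variables set,
# leaving the rest of the shell session untouched.
with_proxy() {
  port="$1"
  shift
  env https_proxy="http://127.0.0.1:$port" \
      http_proxy="http://127.0.0.1:$port" \
      "$@"
}

# Usage (not run here):
#   with_proxy YOUR_PORT marker_single ~/Downloads/paper.pdf --output_dir ~/Documents/notes
```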
libtorch_cpu.dylib error on Mac? Reinstall PyTorch:
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio
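Poor Chinese recognition? Add --langs zh. For scanned documents, also add --force_ocr. If your Marker version rejects --langs, run marker_single --help to see which language option it accepts.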
command not found? The environment most likely isn't activated. Run conda activate marker first.
Summary
Marker is essentially a locally running document parsing engine, powered by AI models trained specifically on document layouts rather than by a general-purpose LLM. This makes it fast, usable offline once the models have been downloaded, and effective without any API key. It excels at complex academic PDFs: converting math formulas to LaTeX, preserving table structure, and extracting images separately. If you need to organize large volumes of documents into a knowledge base, or want to feed PDF content into an LLM for further processing, Marker is one of the best open-source options available.
Based on datalab-to/marker, compatible with marker v1.x