AI Crawler Manager — Documentation
Complete guide to the dfaicrawlermanager module: installation, visual robots.txt builder, HTTP 403 blocking, Apache/Nginx log import and AI bot blocking strategies.
Overview
AI Crawler Manager (technical slug: dfaicrawlermanager) gives your PrestaShop 8 or 9 store fine-grained control over traffic generated by AI bots: OpenAI GPTBot, Anthropic ClaudeBot, Google-Extended, Applebot-Extended, PerplexityBot, ByteDance Bytespider, and more than 25 other crawlers (list current as of May 2026).
Three complementary protection mechanisms:
- Visual robots.txt builder — allow/block each bot via a toggle, apply a preset in one click, write the file without breaking your manual directives.
- HTTP 403 blocking — for bots that ignore robots.txt (Bytespider, legacy anthropic-ai), returns a 403 status code on the very first request, before any PrestaShop processing.
- Crawl statistics — real-time tracking via hook + Apache/Nginx log import to retroactively measure AI traffic.
The module only writes between the sentinel markers # BEGIN DataFirefly AI Crawler Manager and # END DataFirefly AI Crawler Manager. Everything else in the file is preserved as-is, and a .bak file is created on every write.
Requirements
- PrestaShop 8.0 → 9.x
- PHP 7.4 minimum (PHP 8.0 to 8.3 recommended)
- MySQL 5.7 / MariaDB 10.3 or higher
- Write access on /robots.txt (store root)
- For log import: read access to the Apache/Nginx access log (usually /var/log/apache2/access.log, or ~/logs/ on o2switch, ~/access-logs/ on cPanel)
Installation
- Download the dfaicrawlermanager-v1.0.0.zip archive from your DataFirefly account.
- In the PrestaShop back office, go to Modules › Module Manager › Upload a module.
- Drag and drop the ZIP, wait for confirmation, then click Install.
- Once installed, a new AI Crawler Manager tab appears in the left menu (under Configure).
Installation creates 5 tables (prefix ps_dfaicm_), automatically seeds the list of 30+ AI bots and adds 6 admin tabs.
No composer install is required. The PSR-4 autoloader is bundled with the module under the namespace DataFirefly/AiCrawlerManager.
First steps — the dashboard
The AI Crawler Manager tab opens the dashboard. On a fresh install, you see:
- Tracked AI bots: 30+ (count of active bots in the database)
- Blocked bots: 0 (by default, all bots are allowed)
- Visits (30d): 0 (real-time tracking starts only after activation)
- Path rules: 0
Three recommended actions at this stage:
- Open the robots.txt visual builder and apply a preset (see dedicated section).
- Enable real-time tracking in Settings to start collecting statistics.
- Optional: import your historical access logs to see AI crawl traffic from previous weeks.
AI Bots tab
The full list of 30+ tracked bots, with:
- Display name: marketing name (e.g. “ClaudeBot”)
- User-agent: exact string searched in the HTTP header
- Vendor: company (OpenAI, Anthropic, Google, ByteDance, Meta…)
- Purpose: training (LLM training), assistant (real-time answers), search (AI search engine), crawl (generic)
- Respects robots.txt: yes / no (indicates whether robots.txt is sufficient)
- Status: allowed / blocked
Available actions:
- Edit a bot to adjust its status or add internal notes.
- Bulk block / unblock via the group actions at the bottom of the list.
- Any change triggers an automatic robots.txt regeneration if the corresponding option is enabled in Settings.
Visual robots.txt builder
The most used tab: visual editor for the robots.txt file.
One-click presets
Five ready-to-use strategies:
- Block training only — stops training bots (GPTBot, ClaudeBot, anthropic-ai, CCBot, Bytespider…) and keeps assistant and search bots allowed (ChatGPT-User, Claude-User, OAI-SearchBot…). Recommended for most stores.
- Strict — blocks training + generic crawl, allows assistant + search.
- Block everything — disallow on all 30+ AI bots.
- Allow everything — resets all bots to allowed.
- Block Bytespider only — useful if you only want to target the most aggressive crawler without touching the rest.
Per-bot toggle
Each bot has its own switch:
- Green = allowed (no Disallow directive in robots.txt)
- Red = blocked (User-agent: X / Disallow: / directive written in the managed section)
A yellow “ignores robots.txt” badge marks bots for which robots.txt alone is insufficient. For those, also enable HTTP 403 blocking in Settings (see dedicated section).
Live preview
The right-hand panel displays in real time the content that will be written to robots.txt. Typical output:
# BEGIN DataFirefly AI Crawler Manager
# Generated 2026-05-26 14:32 — do not edit manually
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Allow: /
User-agent: Bytespider
Disallow: /
# … other bots …
Sitemap: https://example.com/sitemap.xml
# END DataFirefly AI Crawler Manager
Click Save to robots.txt to write the file. A robots.txt.bak file is created next to it on every save.
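The save step can be pictured with a short Python sketch. This is not the module's own code (the module is written in PHP); it only illustrates how a marker-delimited section can be replaced while manual directives and a .bak copy are preserved. The function names splice_managed_section and save_robots are hypothetical.

```python
import shutil
from pathlib import Path

BEGIN = "# BEGIN DataFirefly AI Crawler Manager"
END = "# END DataFirefly AI Crawler Manager"


def splice_managed_section(existing: str, managed_body: str) -> str:
    """Return `existing` with the marker-delimited block replaced, or appended if absent."""
    block = f"{BEGIN}\n{managed_body.strip()}\n{END}\n"
    start = existing.find(BEGIN)
    end = existing.find(END, start + len(BEGIN)) if start != -1 else -1
    if start == -1 or end == -1:
        # No complete managed section yet: append one and keep manual rules intact.
        prefix = existing if existing.endswith("\n") or not existing else existing + "\n"
        return prefix + block
    # Keep everything before the BEGIN marker and everything after the END marker.
    tail = existing[end + len(END):].lstrip("\n")
    return existing[:start] + block + tail


def save_robots(path: Path, managed_body: str) -> None:
    """Copy the current file to robots.txt.bak, then write the spliced content."""
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if path.exists():
        shutil.copyfile(path, path.with_name(path.name + ".bak"))
    path.write_text(splice_managed_section(existing, managed_body), encoding="utf-8")
```

Calling save_robots(Path("robots.txt"), "User-agent: GPTBot\nDisallow: /") twice in a row leaves a single managed section in the file; directives written outside the markers are untouched.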
Path rules
For fine-grained blocking: allow a bot on part of the site, block it on another.
Typical example: allow ClaudeBot on product pages (so that Claude can recommend them) but block it on the blog (so that you do not give away your editorial content).
A rule consists of:
- Bot: targeted bot (or “all bots” via wildcard)
- Action: allow or disallow
- Path: URL pattern with wildcard * and end-of-string $
- Position: evaluation order (most specific rules first)
Pattern examples:
- /blog/*: any URL starting with /blog/
- /*.pdf$: all PDF files
- /order*: order URLs
- /module/dfsavecart/*: a specific module
Path rules not only generate Allow: / Disallow: directives in robots.txt, but they also drive HTTP 403 blocking when enabled.
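To make the pattern syntax concrete, here is a minimal Python sketch of how such patterns can be matched. It is an illustration only, not the module's implementation: the helper names are hypothetical, and the "first matching rule in position order wins" behavior is an assumption based on the Position field described above.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt-style path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape each literal part, then join the parts with "any characters" for '*'.
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True when `path` matches `pattern` from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


def decide(rules, path, default="allow"):
    """rules: iterable of (position, action, pattern); lowest position is checked first."""
    for _position, action, pattern in sorted(rules):
        if path_matches(pattern, path):
            return action
    return default


rules = [(2, "disallow", "/*"), (1, "allow", "/product/*")]
print(decide(rules, "/product/12-shirt"))  # allow
print(decide(rules, "/blog/summer-guide"))  # disallow
```

A pattern without a trailing $ behaves as a prefix match, which is why /order* also covers /order-history.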
HTTP 403 blocking
Some bots deliberately ignore robots.txt. The most notorious is Bytespider (ByteDance), along with a few legacy versions of anthropic-ai. For these bots, robots.txt is not enough.
Enable the option “Enable HTTP 403 blocking for blocked bots” in Settings. The module then installs an actionDispatcherBefore hook that:
- Detects the user-agent on every incoming request (in-memory string comparison, ~0.1 ms).
- If the bot is in the blocked list and the request matches a block rule: immediately returns HTTP 403, before PrestaShop dispatches the request to a controller.
- Logs the attempt in the ps_dfaicm_visit table with the was_blocked = 1 flag.
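The three steps above can be sketched as follows. This is a simplified Python illustration, not the module's PHP hook: BLOCKED_TOKENS, match_blocked_bot and handle_request are hypothetical names, the blocked list is a two-entry sample, and block rules are reduced to plain URL prefixes.

```python
from typing import Optional

# Hypothetical blocked list: lowercase user-agent tokens mapped to display names.
BLOCKED_TOKENS = {"bytespider": "Bytespider", "anthropic-ai": "anthropic-ai"}


def match_blocked_bot(user_agent: str) -> Optional[str]:
    """Return the display name of the first blocked bot found in the header."""
    ua = user_agent.lower()
    for token, name in BLOCKED_TOKENS.items():
        if token in ua:
            return name
    return None


def handle_request(user_agent: str, url: str, blocked_prefixes=("/",)) -> Optional[dict]:
    """Return a 403 visit record for a blocked hit, or None to let the request through."""
    bot = match_blocked_bot(user_agent)
    if bot is None or not url.startswith(tuple(blocked_prefixes)):
        return None
    # The caller would send the 403 response and store this record as a visit.
    return {"status": 403, "bot": bot, "url": url, "was_blocked": 1}
```

A plain substring test on the lowercased header keeps the per-request cost small, which matters because the check runs on every incoming request.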
Statistics and log import
The Statistics tab offers a 7, 30 or 90-day view with:
- Global KPIs (total visits, distinct bots, blocked hits)
- Daily traffic chart
- Top bots by volume
- Top crawled URLs
- Log of the 50 most recent visits (date, bot, URL, IP, status)
Real-time tracking
If enabled in Settings, every request is inspected and identified AI bot hits are recorded. The overhead is negligible: only requests from identified AI bots, typically less than 1% of traffic, reach the write phase.
Apache/Nginx log import
Lets you retroactively count AI visits, including those from before the module was installed.
- In Settings, enter the path to the log file. The module offers auto-detection (common Apache, Nginx, o2switch, cPanel paths).
- Choose the format (combined by default, or common).
- In the Statistics tab, click Parse access log now.
Parsing is incremental: a byte offset is stored in the database. Running the operation again does not create duplicates. The module caps each run at 8 MB to avoid timeouts; for very large files, several successive passes are enough.
To start over (for example after a log rotation), check Reset parse offset in Settings and run the parse again.
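The incremental approach can be illustrated with a short Python sketch. It assumes the standard Apache/Nginx combined log format and a small sample of bot tokens; parse_chunk, COMBINED and AI_TOKENS are hypothetical names, and the module's actual parser may differ.

```python
import re
from typing import Dict, Tuple

MAX_BYTES_PER_RUN = 8 * 1024 * 1024  # the 8 MB cap per run

# Combined format: ip ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

AI_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "PerplexityBot")  # illustrative subset


def parse_chunk(path: str, offset: int, cap: int = MAX_BYTES_PER_RUN) -> Tuple[int, Dict[str, int]]:
    """Read at most `cap` bytes starting at `offset`; return (new_offset, hits per bot)."""
    hits: Dict[str, int] = {}
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read(cap)
    # Stop at the last complete line so a half-written line is re-read on the next run.
    end = data.rfind(b"\n") + 1
    for raw in data[:end].splitlines():
        match = COMBINED.match(raw.decode("utf-8", errors="replace"))
        if not match:
            continue  # malformed or non-combined line
        ua = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in ua:
                hits[token] = hits.get(token, 0) + 1
                break
    return offset + end, hits
```

Storing the returned offset and passing it to the next call is what prevents duplicates. Resetting the offset to 0 corresponds to the Reset parse offset option.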
Settings
Summary of available options:
robots.txt
- Auto-regenerate: regenerates robots.txt automatically when a bot or rule changes
- Crawl-delay: recommended delay between requests (0 = disabled, 1-120 seconds)
- Sitemap URL: added at the end of the managed section
- Global Disallow section: also adds a User-agent: * section blocking sensitive areas (admin, cart, login)
HTTP blocking
- Enable HTTP 403 blocking: immediately returns 403 for blocked bots (see dedicated section)
Real-time tracking
- Enable tracking: records every detected AI visit
- Retention: number of days to keep individual visits (7 to 730, default 90). Daily aggregates are kept longer.
Log import
- Enable log parsing: enables the import button in the Statistics tab
- File path: absolute path, with auto-detection offered
- Format: combined (Apache/Nginx default) or common
- Reset parse offset: check to re-read the whole file
Recommended blocking strategies
The choice depends on your editorial and commercial positioning. Three typical profiles:
Standard e-commerce store (default recommendation)
Apply the “Block training only” preset. Training bots (GPTBot, ClaudeBot, anthropic-ai, CCBot, Bytespider) are blocked. Real-time assistant bots (ChatGPT-User, Claude-User) and AI search bots (OAI-SearchBot, PerplexityBot, Google-Extended) remain allowed: your products can still be recommended in ChatGPT, Claude, Perplexity and Google AI Overviews.
Premium brand / strong editorial content
“Strict” preset + path rules to allow specific zones. Example: block all AI bots everywhere, except /product/* allowed for ChatGPT-User and Claude-User. Your product descriptions remain available to assistants, while your blog and guides stay protected.
Store in launch phase / low editorial volume
“Allow everything” preset. Visibility in AI answer engines outweighs the risk of giving away your content. You can switch to stricter blocking once your catalog and blog gain value.
Maintenance
Automatic pruning
Individual visits older than the configured retention period are automatically deleted each time the access log is parsed. You can also trigger a manual prune from the Statistics tab (“Prune old visits” button).
robots.txt backup
Each write creates a robots.txt.bak next to the original file. In case of error, you can manually restore it via FTP or your cPanel.
Bot list updates
New AI bots are added through module updates. The ps_dfaicm_bot table is updated in “merge” mode: a bot you have manually customized is never overwritten.
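As a rough illustration of what a "merge" update means, the Python sketch below adds new bots and refreshes untouched ones while leaving customized rows alone. The customized flag and the dictionary layout are assumptions made for the example; the documentation does not describe how the module detects a manual customization.

```python
def merge_bots(existing: dict, shipped: dict) -> dict:
    """Merge the bot list shipped with an update into the rows already in the database."""
    merged = dict(existing)
    for user_agent, row in shipped.items():
        current = merged.get(user_agent)
        if current is not None and current.get("customized"):
            continue  # never overwrite a bot the merchant has edited
        merged[user_agent] = {**row, "customized": False}
    return merged


existing = {
    "GPTBot": {"status": "blocked", "customized": True},
    "CCBot": {"status": "allowed", "customized": False},
}
shipped = {
    "GPTBot": {"status": "allowed"},
    "CCBot": {"status": "allowed", "vendor": "Common Crawl"},
    "NewBot": {"status": "allowed"},
}
merged = merge_bots(existing, shipped)
print(merged["GPTBot"]["status"])  # blocked
```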
Troubleshooting
robots.txt is not writable
The dashboard shows a red “Not writable” badge. Check:
- File permissions on /robots.txt: must be 644 minimum, and the owner must be the PHP/Apache user
- If the file does not exist, check the root directory permissions (755 + correct owner)
- On some shared hosts, robots.txt is generated dynamically by PrestaShop: disable the corresponding option in Shop Parameters › Traffic & SEO › SEO & URLs
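If you have shell access, the first two checks can be scripted. The Python sketch below is a hypothetical helper, not part of the module. Its result is only meaningful when it runs as the same system user as PHP, because os.access tests the permissions of the current process.

```python
import os
import stat
from pathlib import Path


def diagnose_robots(root: str) -> dict:
    """Report whether robots.txt under `root` exists and can be written by this user."""
    target = Path(root) / "robots.txt"
    if target.exists():
        mode = stat.S_IMODE(target.stat().st_mode)
        return {"exists": True, "mode": oct(mode), "writable": os.access(target, os.W_OK)}
    # No file yet: creating it requires write and search permission on the directory.
    return {"exists": False, "mode": None, "writable": os.access(root, os.W_OK | os.X_OK)}
```

For example, diagnose_robots("/var/www/html") returns a small dictionary; the path here is only a placeholder for your store root.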
Access log auto-detection finds nothing
The module looks at the following paths: /var/log/apache2/access.log, /var/log/nginx/access.log, ~/logs/, ~/access-logs/. On other hosts, enter the path manually. If you do not know it, contact your hosting support or check the documentation of your control panel.
Log parsing takes too long
The module caps each run at 8 MB to avoid PHP timeouts. For a 500 MB file, expect roughly 63 passes (500 / 8 = 62.5, rounded up). Each click on “Parse access log now” resumes where the previous one stopped, thanks to the stored offset.
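To estimate the number of passes for your own log file, divide its size by the 8 MB cap and round up. The small Python helper below is a hypothetical convenience for that arithmetic.

```python
import math


def passes_needed(file_size_mb: float, cap_mb: float = 8) -> int:
    """Number of capped runs needed to read a log file of the given size."""
    return math.ceil(file_size_mb / cap_mb)


print(passes_needed(500))  # 63
```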
A blocked bot still appears in statistics
This is normal: real-time tracking records EVERY detected AI visit, including blocked ones (with the was_blocked = 1 flag). This lets you measure how many attempts your configuration is actually blocking.
A bot ignores robots.txt despite my rule
Confirm with a log import: if you still see hits with status 200, the bot is indeed ignoring robots.txt. Enable HTTP 403 blocking in Settings. From that moment, hits from the bot will appear with status 403 and the was_blocked = 1 flag.
Uninstallation
From Modules › Module Manager, click Uninstall on the module card. The operation:
- Deletes the 5 ps_dfaicm_* tables
- Removes the 6 admin tabs
- Removes the managed section from robots.txt (sentinel markers and everything they delimit)
- Preserves the rest of robots.txt and the robots.txt.bak file
Technical reference
- Technical slug: dfaicrawlermanager
- Namespace: DataFirefly/AiCrawlerManager
- Created tables: ps_dfaicm_bot, ps_dfaicm_rule, ps_dfaicm_category_rule, ps_dfaicm_visit, ps_dfaicm_visit_daily
- Hooks used: actionDispatcherBefore, actionAdminControllerSetMedia, displayBackOfficeHeader
- Back-office tabs: Dashboard, Bots, Path rules, Builder, Statistics, Settings (under AdminParentConfigure)
- Configuration keys: DFAICM_AUTO_REGEN, DFAICM_VISIT_LOG, DFAICM_HTTP_BLOCK, DFAICM_LOG_PARSING, DFAICM_LOG_PATH, DFAICM_LOG_FORMAT, DFAICM_LAST_PARSE, DFAICM_LAST_OFFSET, DFAICM_RETENTION, DFAICM_CRAWL_DELAY, DFAICM_SITEMAP_URL, DFAICM_GLOBAL_DISALLOW, DFAICM_INSTALLED_AT
Support
For any technical question, contact the DataFirefly team at contact@datafirefly.com or visit your customer area on datafirefly.com.