
AI Crawler Manager — Documentation

Complete guide to the dfaicrawlermanager module: installation, visual robots.txt builder, HTTP 403 blocking, Apache/Nginx log import and AI bot blocking strategies.

Updated June 29, 2026 · Module version 1.0.0

Overview

AI Crawler Manager (technical slug: dfaicrawlermanager) gives your PrestaShop 8 or 9 store fine-grained control over traffic generated by AI bots: OpenAI GPTBot, Anthropic ClaudeBot, Google-Extended, Applebot-Extended, PerplexityBot, ByteDance Bytespider, and 25+ other crawlers. The bot list is current as of May 2026.

Three complementary protection mechanisms:

Visual robots.txt builder — allow/block each bot via a toggle, apply a preset in one click, write the file without breaking your manual directives.
HTTP 403 blocking — for bots that ignore robots.txt (Bytespider, legacy anthropic-ai), returns a 403 status code on the very first request, before any PrestaShop processing.
Crawl statistics — real-time tracking via hook + Apache/Nginx log import to retroactively measure AI traffic.

Note — The module never touches your robots.txt outside its own section, delimited by the sentinel markers # BEGIN DataFirefly AI Crawler Manager and # END DataFirefly AI Crawler Manager. Everything else in the file is preserved as-is, and a .bak file is created on every write.
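To illustrate the sentinel-marker approach, here is a minimal Python sketch. The module itself is written in PHP and its actual code is not shown in this guide, so the function names replace_managed_section and write_robots are hypothetical. Only the two marker strings and the .bak backup come from the description above.

```python
import shutil
from pathlib import Path

BEGIN = "# BEGIN DataFirefly AI Crawler Manager"
END = "# END DataFirefly AI Crawler Manager"


def replace_managed_section(existing: str, managed_body: str) -> str:
    """Return `existing` with the block between the markers replaced."""
    block = f"{BEGIN}\n{managed_body.strip()}\n{END}\n"
    start = existing.find(BEGIN)
    end = existing.find(END)
    if start != -1 and end > start:
        # Keep everything before BEGIN and after the END line untouched.
        after = existing[end + len(END):]
        if after.startswith("\n"):
            after = after[1:]
        return existing[:start] + block + after
    if not existing:
        return block
    # No managed section yet: append one after a blank line.
    separator = "\n" if existing.endswith("\n") else "\n\n"
    return existing + separator + block


def write_robots(path: Path, managed_body: str) -> None:
    """Copy the current file to robots.txt.bak, then write the new content."""
    existing = ""
    if path.exists():
        existing = path.read_text(encoding="utf-8")
        shutil.copyfile(path, path.with_name(path.name + ".bak"))
    path.write_text(replace_managed_section(existing, managed_body), encoding="utf-8")
```

Manual directives outside the markers pass through unchanged, and running the function twice still leaves a single managed section.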

Requirements

PrestaShop 8.0 → 9.x
PHP 7.4 minimum (PHP 8.0 to 8.3 recommended)
MySQL 5.7 / MariaDB 10.3 or higher
Write access on /robots.txt (store root)
For log import: read access to the Apache/Nginx access log (usually /var/log/apache2/access.log, or ~/logs/ on o2switch, ~/access-logs/ on cPanel)

Installation

Download the dfaicrawlermanager-v1.0.0.zip archive from your DataFirefly account.
In the PrestaShop back office, go to Modules › Module Manager › Upload a module.
Drag and drop the ZIP, wait for confirmation, then click Install.
Once installed, a new AI Crawler Manager tab appears in the left menu (under Configure).

Installation creates 5 tables (prefix ps_dfaicm_), automatically seeds the list of 30+ AI bots and adds 6 admin tabs.

Tip — No composer install is required. The PSR-4 autoloader is bundled with the module under the namespace DataFirefly\AiCrawlerManager.

First steps — the dashboard

The AI Crawler Manager tab opens the dashboard. On a fresh install, you see:

Tracked AI bots: 30+ (count of active bots in the database)
Blocked bots: 0 (by default, all bots are allowed)
Visits (30d): 0 (real-time tracking starts only after activation)
Path rules: 0

Three recommended actions at this stage:

Open the robots.txt visual builder and apply a preset (see dedicated section).
Enable real-time tracking in Settings to start collecting statistics.
Optional: import your historical access logs to see AI crawl traffic from previous weeks.

AI Bots tab

The full list of 30+ tracked bots, with:

Display name: marketing name (e.g. “ClaudeBot”)
User-agent: exact string searched in the HTTP header
Vendor: company (OpenAI, Anthropic, Google, ByteDance, Meta…)
Purpose: training (LLM training), assistant (real-time answers), search (AI search engine), crawl (generic)
Respects robots.txt: yes / no (indicates whether robots.txt is sufficient)
Status: allowed / blocked

Available actions:

Edit a bot to adjust its status or add internal notes.
Bulk block / unblock via the group actions at the bottom of the list.
Any change triggers an automatic robots.txt regeneration if the corresponding option is enabled in Settings.

Visual robots.txt builder

The most used tab: visual editor for the robots.txt file.

One-click presets

Five ready-to-use strategies:

Block training only — stops training bots (GPTBot, ClaudeBot, anthropic-ai, CCBot, Bytespider…) and keeps assistant and search bots allowed (ChatGPT-User, Claude-User, OAI-SearchBot…). Recommended for most stores.
Strict — blocks training + generic crawl, allows assistant + search.
Block everything — disallow on all 30+ AI bots.
Allow everything — resets all bots to allowed.
Block Bytespider only — useful if you only want to target the most aggressive crawler without touching the rest.

Important — A preset overwrites the status of all affected bots. A confirmation is requested before application. You can then fine-tune bot by bot.
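The first four presets can be read as a mapping from bot purpose to status. The Python sketch below is for illustration only: the preset keys, the Bot record and apply_preset are hypothetical names, and only the purposes blocked by each preset follow the descriptions above. The Bytespider-only preset targets a single user-agent rather than a purpose, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class Bot:
    user_agent: str
    purpose: str  # "training", "assistant", "search" or "crawl"
    blocked: bool = False


# Purposes that each preset blocks (hypothetical keys).
PRESETS = {
    "block_training_only": {"training"},
    "strict": {"training", "crawl"},
    "block_everything": {"training", "assistant", "search", "crawl"},
    "allow_everything": set(),
}


def apply_preset(bots, preset):
    """Overwrite the status of every bot according to the chosen preset."""
    blocked_purposes = PRESETS[preset]
    for bot in bots:
        bot.blocked = bot.purpose in blocked_purposes
    return bots
```

Because the preset overwrites every status, any per-bot fine-tuning has to happen after it is applied, as the note above says.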

Per-bot toggle

Each bot has its own switch:

Green = allowed (no Disallow directive in robots.txt)
Red = blocked (User-agent: X / Disallow: / directive written in the managed section)

A yellow “ignores robots.txt” badge marks bots for which robots.txt alone is insufficient. For those, also enable HTTP 403 blocking in Settings (see dedicated section).

Live preview

The right-hand panel displays in real time the content that will be written to robots.txt. Typical output:

# BEGIN DataFirefly AI Crawler Manager
# Generated 2026-05-26 14:32 — do not edit manually

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: Bytespider
Disallow: /

# … other bots …

Sitemap: https://example.com/sitemap.xml
# END DataFirefly AI Crawler Manager

Click Save to robots.txt to write the file. A robots.txt.bak file is created next to it on every save.
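To make the preview format concrete, the following Python sketch renders a managed section from a list of (user-agent, blocked) pairs. The function name render_managed_section is an assumption, and the "Generated" timestamp comment is omitted; the layout simply mirrors the typical output shown above.

```python
BEGIN = "# BEGIN DataFirefly AI Crawler Manager"
END = "# END DataFirefly AI Crawler Manager"


def render_managed_section(bots, sitemap_url=None):
    """Build the managed robots.txt section from (user_agent, blocked) pairs."""
    lines = [BEGIN, ""]
    for user_agent, blocked in bots:
        lines.append(f"User-agent: {user_agent}")
        lines.append("Disallow: /" if blocked else "Allow: /")
        lines.append("")  # blank line between groups
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    lines.append(END)
    return "\n".join(lines) + "\n"


print(render_managed_section(
    [("GPTBot", True), ("ClaudeBot", False), ("Bytespider", True)],
    sitemap_url="https://example.com/sitemap.xml",
))
```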

Path rules

For fine-grained blocking: allow a bot on part of the site, block it on another.

Typical example: allow ClaudeBot on product pages (so that Claude can recommend them) but block it on the blog (so that your editorial content is not given away).

A rule consists of:

Bot — targeted bot (or “all bots” via wildcard)
Action — allow or disallow
Path — URL pattern with wildcard * and end-of-string $
Position — evaluation order (most specific rules first)

Pattern examples:

/blog/* — any URL starting with /blog/
/*.pdf$ — all PDF files
/order* — order URLs
/module/dfsavecart/* — a specific module

Note — Path rules are added to robots.txt as standard Allow: / Disallow: directives, but they also drive HTTP 403 blocking when enabled.
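The pattern syntax can be sketched as a translation to regular expressions. This Python example is an illustration under two assumptions: patterns match from the start of the URL path (as robots.txt patterns do), and rules are tried in ascending position with the first match winning. The helper names are hypothetical and the module's own matcher may differ.

```python
import re


def pattern_to_regex(pattern):
    """Translate a path pattern: * matches any run of characters, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))


def first_matching_action(rules, path):
    """rules: list of (position, action, pattern); the lowest position is tried first."""
    for _position, action, pattern in sorted(rules):
        if pattern_to_regex(pattern).match(path):
            return action
    return None  # no rule applies to this path


rules = [(1, "allow", "/blog/public/*"), (2, "disallow", "/blog/*")]
print(first_matching_action(rules, "/blog/public/launch"))  # allow
print(first_matching_action(rules, "/blog/drafts/idea"))    # disallow
```

Placing the more specific /blog/public/* rule first is what lets it take precedence over the broader /blog/* rule.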

HTTP 403 blocking

Some bots deliberately ignore robots.txt. The most notorious is Bytespider (ByteDance), along with a few legacy versions of anthropic-ai. For these bots, robots.txt is not enough.

Enable the option “Enable HTTP 403 blocking for blocked bots” in Settings. The module then installs an actionDispatcherBefore hook that:

Detects the user-agent on every incoming request (in-memory string comparison, ~0.1 ms).
If the bot is in the blocked list and the request matches a block rule: immediately returns HTTP 403, before PrestaShop dispatches the request to a controller.
Logs the attempt in the ps_dfaicm_visit table with the was_blocked = 1 flag.

Tip — HTTP blocking saves CPU and database resources for the highest-volume bots. Bytespider can represent several thousand hits per day on an average store.
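The decision made on each request can be sketched as follows. This Python example is an illustration rather than the module's PHP code: the function names, the case-insensitive substring match and the use of plain path prefixes as block rules are all assumptions.

```python
def detect_bot(user_agent_header, known_tokens):
    """Return the first known bot token found in the User-Agent header, or None."""
    header = user_agent_header.lower()
    for token in known_tokens:
        if token.lower() in header:
            return token
    return None


def status_for(user_agent_header, path, blocked):
    """blocked maps a bot token to the path prefixes it is denied ("/" means everything)."""
    token = detect_bot(user_agent_header, blocked)
    if token is None:
        return None  # not a blocked bot: let the request continue
    if any(path.startswith(prefix) for prefix in blocked[token]):
        return 403
    return None


blocked = {"Bytespider": ["/"], "ClaudeBot": ["/blog/"]}
print(status_for("Mozilla/5.0 (compatible; Bytespider)", "/product/1", blocked))  # 403
```

Regular visitors fall through the first check after a handful of string comparisons, which is why the cost per request stays small.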

Statistics and log import

The Statistics tab offers a 7, 30 or 90-day view with:

Global KPIs (total visits, distinct bots, blocked hits)
Daily traffic chart
Top bots by volume
Top crawled URLs
Log of the 50 most recent visits (date, bot, URL, IP, status)

Real-time tracking

If enabled in Settings, every request is inspected and hits from identified AI bots are recorded. The overhead is negligible: less than 1% of traffic reaches the write phase.

Apache/Nginx log import

Lets you retroactively count AI visits, including those from before the module was installed.

In Settings, enter the path to the log file. The module offers auto-detection (common Apache, Nginx, o2switch, cPanel paths).
Choose the format (combined by default, or common).
In the Statistics tab, click Parse access log now.
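For reference, a line in the combined format carries the client IP, timestamp, request line, status, response size, referer and user-agent; the common format stops after the size. The Python sketch below shows one way to parse both with a single regular expression. The regex and the parse_line name are assumptions, not the module's parser, and the sample line is invented.

```python
import re

# The trailing referer/user-agent pair is optional so that "common" lines also match.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


sample = (
    '203.0.113.7 - - [26/May/2026:14:32:10 +0200] '
    '"GET /blog/post HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"'
)
fields = parse_line(sample)
print(fields["url"], fields["status"], fields["agent"])
```

With the common format the agent field is None, so AI bots cannot be identified from such lines; that is one reason to prefer combined when your server offers it.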

Parsing is incremental: a byte offset is stored in the database, so running the operation again does not create duplicates. The module caps each run at 8 MB to avoid timeouts; for very large files, run several successive passes until the whole file has been read.

To start over (for example after a log rotation), check Reset parse offset in Settings and run the parse again.
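The offset mechanism can be sketched in a few lines of Python. This is an illustration of the idea rather than the module's code: read_chunk is a hypothetical name, and consuming only complete lines is an assumption made here so that a line cut by the size cap is read in full on the next pass.

```python
def read_chunk(log_file, offset, cap=8 * 1024 * 1024):
    """Read at most `cap` bytes from `offset`; return (complete lines, new offset).

    `log_file` is a file object opened in binary mode.
    """
    log_file.seek(offset)
    data = log_file.read(cap)
    last_newline = data.rfind(b"\n")
    if last_newline == -1:
        # No complete line in this chunk: consume nothing.
        return [], offset
    complete = data[: last_newline + 1]
    lines = complete.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(complete)
```

Storing the returned offset and passing it back on the next run is what prevents duplicates. Resetting it to 0 corresponds to the Reset parse offset option.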

Settings

Summary of available options:

robots.txt

Auto-regenerate: regenerates robots.txt automatically when a bot or rule changes
Crawl-delay: recommended delay between requests (0 = disabled, 1-120 seconds)
Sitemap URL: added at the end of the managed section
Global Disallow section: also adds a User-agent: * section blocking sensitive areas (admin, cart, login)

HTTP blocking

Enable HTTP 403 blocking: immediately returns 403 for blocked bots (see dedicated section)

Real-time tracking

Enable tracking: records every detected AI visit
Retention: number of days to keep individual visits (7 to 730, default 90). Daily aggregates are kept longer.

Log import

Enable log parsing: enables the import button in the Statistics tab
File path: absolute path, with auto-detection offered
Format: combined (Apache/Nginx default) or common
Reset parse offset: check to re-read the whole file

Recommended blocking strategies

The choice depends on your editorial and commercial positioning. Three typical profiles:

Standard e-commerce store (default recommendation)

Apply the “Block training only” preset. Training bots (GPTBot, ClaudeBot, anthropic-ai, CCBot, Bytespider) are blocked. Real-time assistant bots (ChatGPT-User, Claude-User) and AI search bots (OAI-SearchBot, PerplexityBot, Google-Extended) remain allowed: your products can still be recommended in ChatGPT, Claude and Perplexity. Google AI Overviews are part of Google Search and depend on Googlebot rather than on the Google-Extended token, so this preset does not affect them.

Premium brand / strong editorial content

Apply the “Strict” preset, then add path rules to open specific zones. Example: block all AI bots everywhere, except /product/*, which stays allowed for ChatGPT-User and Claude-User. Your product descriptions remain visible to assistants, while your blog and guides are protected.

Store in launch phase / low editorial volume

Apply the “Allow everything” preset. At this stage, visibility in AI answer engines outweighs the risk of giving your content away. You can switch to stricter blocking once your catalog and blog gain value.

Maintenance

Automatic pruning

Individual visits older than the configured retention period are deleted automatically each time the access log is parsed. You can also trigger a manual prune from the Statistics tab (“Prune old visits” button).
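The retention rule amounts to a cutoff date. A minimal Python sketch, assuming visits are dictionaries with a hypothetical visited_at field (the module works on the ps_dfaicm_visit table instead):

```python
from datetime import datetime, timedelta


def prune_visits(visits, retention_days, now):
    """Keep only the visits recorded within the last `retention_days` days."""
    cutoff = now - timedelta(days=retention_days)
    return [visit for visit in visits if visit["visited_at"] >= cutoff]


visits = [
    {"bot": "GPTBot", "visited_at": datetime(2026, 6, 28)},
    {"bot": "CCBot", "visited_at": datetime(2026, 1, 1)},
]
kept = prune_visits(visits, retention_days=90, now=datetime(2026, 6, 29, 12, 0))
print([visit["bot"] for visit in kept])  # ['GPTBot']
```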

robots.txt backup

Each write creates a robots.txt.bak next to the original file. If something goes wrong, you can restore it manually via FTP or cPanel.

Bot list updates

New AI bots are added through module updates. The ps_dfaicm_bot table is updated in “merge” mode: a bot you have manually customized is never overwritten.
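One way to picture the “merge” mode is shown below. This Python sketch is an interpretation, not the module's update routine: the customized flag and the field names are hypothetical, and keeping the current blocked status of non-customized bots is an assumption.

```python
def merge_bot_list(existing, shipped):
    """Merge the bot list shipped with an update into the stored one.

    Both arguments map a user-agent token to a dict of fields.
    """
    merged = {token: dict(row) for token, row in existing.items()}
    for token, row in shipped.items():
        current = merged.get(token)
        if current is None:
            # New bot: insert it as shipped.
            merged[token] = dict(row, customized=False)
        elif not current.get("customized", False):
            # Known, untouched bot: refresh metadata but keep its current status.
            current.update({k: v for k, v in row.items() if k != "blocked"})
        # Customized bots are left exactly as they are.
    return merged
```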

Troubleshooting

robots.txt is not writable

The dashboard shows a red “Not writable” badge. Check:

File permissions on /robots.txt: must be 644 minimum, and the owner must be the PHP/Apache user
If the file does not exist, check the root directory permissions (755 + correct owner)
On some shared hosts, robots.txt is generated dynamically by PrestaShop: disable the corresponding option in Shop Parameters › Traffic & SEO › SEO & URLs

Access log auto-detection finds nothing

The module looks at the following paths: /var/log/apache2/access.log, /var/log/nginx/access.log, ~/logs/, ~/access-logs/. On other hosts, enter the path manually. If you do not know it, contact your hosting support or check the documentation of your control panel.

Log parsing takes too long

The module caps each run at 8 MB to avoid PHP timeouts. For a 500 MB file, expect about 63 passes (500 / 8 = 62.5, rounded up). Each click on “Parse access log now” resumes where the previous one stopped, thanks to the stored offset.
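The number of passes is the file size divided by the 8 MB cap, rounded up. A quick check in Python (passes_needed is a helper name chosen here):

```python
import math


def passes_needed(file_size_mb, cap_mb=8):
    """Number of parse runs needed to read a log file of the given size."""
    return math.ceil(file_size_mb / cap_mb)


print(passes_needed(500))  # 63
```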

A blocked bot still appears in statistics

This is normal: real-time tracking records every detected AI visit, including blocked ones (with the was_blocked = 1 flag). This lets you measure how many attempts your configuration is actually blocking.

A bot ignores robots.txt despite my rule

Confirm with a log import: if you still see hits with status 200, the bot is indeed ignoring robots.txt. Enable HTTP 403 blocking in Settings. From that moment, hits from the bot will appear with status 403 and the was_blocked = 1 flag.

Uninstallation

From Modules › Module Manager, click Uninstall on the module card. The operation:

Deletes the 5 ps_dfaicm_* tables
Removes the 6 admin tabs
Removes the managed section from robots.txt (sentinel markers and everything they delimit)
Preserves the rest of robots.txt and the robots.txt.bak file

Technical reference

Technical slug: dfaicrawlermanager
Namespace: DataFirefly\AiCrawlerManager
Created tables: ps_dfaicm_bot, ps_dfaicm_rule, ps_dfaicm_category_rule, ps_dfaicm_visit, ps_dfaicm_visit_daily
Hooks used: actionDispatcherBefore, actionAdminControllerSetMedia, displayBackOfficeHeader
Back-office tabs: Dashboard, Bots, Path rules, Builder, Statistics, Settings (under AdminParentConfigure)
Configuration keys: DFAICM_AUTO_REGEN, DFAICM_VISIT_LOG, DFAICM_HTTP_BLOCK, DFAICM_LOG_PARSING, DFAICM_LOG_PATH, DFAICM_LOG_FORMAT, DFAICM_LAST_PARSE, DFAICM_LAST_OFFSET, DFAICM_RETENTION, DFAICM_CRAWL_DELAY, DFAICM_SITEMAP_URL, DFAICM_GLOBAL_DISALLOW, DFAICM_INSTALLED_AT

Support

For any technical question, contact the DataFirefly team at contact@datafirefly.com or visit your customer area on datafirefly.com.
