How to Whitelist Our Website Crawling Bot for University AI Cataloging
Overview
Our university AI assistants rely on a specialized website crawling bot to catalog academic content efficiently. To ensure smooth operation, it's essential to whitelist the bot on your website.
Follow the steps below to allow our bot access to your web pages, ensuring comprehensive data collection.
Allow the Crawler User-Agent
The bot identifies itself with a specific user-agent string, which is how web servers and security systems can differentiate between regular users and bots. To allow our bot access to your website, it's important to whitelist this user-agent string across any systems that may block or restrict bot access.
The user-agent for our crawler is:
User-Agent: Mongoose-Ai-Crawler-Function/1.1
Steps to Whitelist the Crawler:
To ensure full access, follow these guidelines for different web security configurations:
1. Robots.txt Configuration:
In your website’s robots.txt file, allow access to the directories and pages you want the bot to catalog.
For example:
User-agent: Mongoose-Ai-Crawler-Function/1.1
Allow: /
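If you prefer to limit cataloging to specific sections, robots.txt rules can be scoped per path; the longest matching rule wins. A sketch, with hypothetical directory names:
User-agent: Mongoose-Ai-Crawler-Function/1.1
Disallow: /
Allow: /courses/
Allow: /research/
Here everything is blocked except the /courses/ and /research/ trees; adjust the paths to match your site.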
2. Whitelist Our IP Address:
Ensure that your firewall or web application firewall (WAF) allows traffic from the following IP address. Since we are a cloud service, this IP corresponds to our bot's crawling infrastructure:
52.167.228.131
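If you filter traffic at the web server rather than at a network firewall, a minimal sketch for nginx (Apache 2.4 users can use Require ip instead):
# nginx: permit the crawler's IP ahead of any broader deny rules
allow 52.167.228.131;
Place the allow directive before any deny rules in the same http, server, or location block, since nginx evaluates them in order.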
Blocking Software and Anti-Bot Systems:
- Honeypots: Some websites use honeypot traps to detect bots by embedding invisible links or form fields that a human user wouldn’t interact with. Our crawler is designed to avoid honeypot traps, but it’s a good idea to whitelist our bot’s user-agent in any honeypot detection systems you may have set up.
- Tools like Project Honeypot or custom honeypot traps might block the bot if it’s mistaken for a malicious crawler. Configuring these systems to ignore our user-agent can prevent unnecessary blocks.
- CAPTCHA Systems: Tools like reCAPTCHA may challenge traffic they classify as automated. If your CAPTCHA provider supports exclusions, configure it to exempt our user-agent or IP address from these checks; some providers offer settings to whitelist specific user-agents or IP addresses.
Monitoring and Reporting:
You can monitor the bot's behavior using your server logs or traffic analytics by filtering on its user-agent string.
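For example, on a typical Linux host you can list the crawler's recent requests by searching the access log (the log path below is an assumption and varies by server and distribution):
grep "Mongoose-Ai-Crawler-Function" /var/log/nginx/access.log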
By following these steps, you'll ensure that our bot can efficiently crawl and catalog your content for the AI assistant, providing your university with valuable and up-to-date information for your students.
For further assistance or custom crawling configurations, contact support@hellomongoose.com.
Other Advanced Settings
Adjust Rate Limiting:
Some websites implement rate limiting to prevent bots from overwhelming their servers. Please ensure that your website allows a sufficient rate of requests per minute (RPM) for the user-agent Mongoose-Ai-Crawler-Function/1.1.
- If your current rate limit is too restrictive, consider raising the RPM or adding an exception for our user-agent, as sketched below.
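A minimal nginx sketch of such an exception, assuming you rate-limit by client IP (the zone name and rate here are illustrative): the map assigns our user-agent an empty key, and nginx does not account requests whose limiting key is empty.
map $http_user_agent $limit_key {
    default                           $binary_remote_addr;
    "~*Mongoose-Ai-Crawler-Function"  "";
}
limit_req_zone $limit_key zone=perip:10m rate=10r/s;
Apply the zone as usual with limit_req zone=perip burst=20; inside the relevant server or location block.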
Firewall and Security Software:
Many web hosts implement additional security layers such as web application firewalls (WAF) or intrusion detection systems (IDS) that may block bots. To ensure our bot can crawl your website:
- Cloudflare or Similar WAFs: Platforms like Cloudflare may block non-human traffic. To avoid this, allow requests from our user-agent in your firewall settings:
- In Cloudflare, open the firewall or WAF custom rules section and create a rule that allows traffic whose user-agent contains Mongoose-Ai-Crawler-Function (for example, using the expression http.user_agent contains "Mongoose-Ai-Crawler-Function").
- You may also need to exempt our bot from CAPTCHA or JavaScript challenges that would otherwise block it.
- ModSecurity: If you use ModSecurity (often included with hosting providers like cPanel), review your security rules to ensure our user-agent is not blocked by aggressive bot-protection rules; an example allow rule is sketched below.
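A hedged sketch of a ModSecurity allow rule (pick a rule id that does not clash with your existing rules; 1000001 here is arbitrary):
SecRule REQUEST_HEADERS:User-Agent "@contains Mongoose-Ai-Crawler-Function" \
    "id:1000001,phase:1,t:none,nolog,allow"
This matches on the User-Agent header in phase 1 and stops further rule processing for the request; if you run a tuned rule set, test it in DetectionOnly mode first.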
Web Server Settings:
- Apache: If you're using an .htaccess file to block unwanted bots, ensure that our user-agent string is not being blocked. Check for lines like:
SetEnvIfNoCase User-Agent "Mongoose-Ai-Crawler-Function/1.1" allow_bot
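If your .htaccess already blocks bots by user-agent, a sketch of exempting our crawler (Apache 2.4 syntax; the bad-bot pattern is purely illustrative):
# Flag unwanted bots, then clear the flag for our crawler
SetEnvIfNoCase User-Agent "(badbot|scrapy)" deny_bot
SetEnvIfNoCase User-Agent "Mongoose-Ai-Crawler-Function" !deny_bot
<RequireAll>
    Require all granted
    Require not env deny_bot
</RequireAll>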
- Nginx: Similar to Apache, check your nginx.conf file for user-agent blocking. You may need to add or modify rules such as the following, which sets a flag that later rules can consult:
# flag requests from the crawler so later rules can skip bot blocking
if ($http_user_agent ~* "Mongoose-Ai-Crawler-Function/1.1") {
    set $allow_bot 1;
}
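The $allow_bot flag above only marks the request; to act on it, a fuller (and purely illustrative) pattern uses a map so the crawler is exempt from a bad-bot block:
map $http_user_agent $blocked_bot {
    default                           0;
    "~*Mongoose-Ai-Crawler-Function"  0;   # never block the crawler
    "~*(badbot|evilscraper)"          1;   # hypothetical bad-bot patterns
}
# inside the relevant server block:
if ($blocked_bot) {
    return 403;
}
Regex entries in an nginx map are tested in the order written, so listing our user-agent first guarantees the exemption.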