Bug #2829
Scraping external sites often times out
Status: Open, 0% done
Description
When we call scrape() on a new URL such as microsoft.com/word, the request sometimes hangs until it times out (see the stack trace below).
Try to debug it, identify the root cause and update this issue. Then fix it.
It may be that the external sites are blocking our servers. In that case, let's implement a Websites_Scrape class using the Adapter Pattern, so that we can plug in Websites_Scrape_ScraperAPI, for instance, as an adapter that routes requests through the Scraper API (https://www.scraperapi.com/). The credentials should be stored under "Websites"/"scrape"/"api"/... in the local/app.json config. A rough sketch of such an adapter follows.
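A minimal sketch, in plain PHP with curl, of what the adapter could look like. The abstract class shape, the fetch() method name and the Q_Config::get() call are assumptions for illustration, not a confirmed design; only the ScraperAPI endpoint (GET to api.scraperapi.com with api_key and url query parameters) comes from ScraperAPI's own documentation.

<?php
// Sketch only: the base class shape and fetch() name are assumptions.
abstract class Websites_Scrape
{
    /**
     * Fetch the raw HTML of a remote page.
     * @param string $url
     * @return string|false the response body, or false on failure
     */
    abstract function fetch($url);
}

class Websites_Scrape_ScraperAPI extends Websites_Scrape
{
    function fetch($url)
    {
        // Credentials would live under "Websites"/"scrape"/"api" in local/app.json,
        // e.g. { "Websites": { "scrape": { "api": { "key": "..." } } } }
        $key = Q_Config::get('Websites', 'scrape', 'api', 'key', null);

        // ScraperAPI proxies the request from its own IP pool, so sites that
        // block our servers should still be reachable.
        $endpoint = 'http://api.scraperapi.com/?' . http_build_query(array(
            'api_key' => $key,
            'url' => $url
        ));

        $ch = curl_init($endpoint);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // don't hang indefinitely
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }
}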
Example
Query String Parameters:
    Q.ajax: json
    Q.slotNames: result
    Q.method: POST
    Q.nonce: e781bbfc7f4939189ae68e5a3be2272cf595c4a21ce46bc302326990df7142d3
Form Data:
    url: microsoft.com/word
Stack trace:
Operation timed out after 30859 milliseconds with 0 out of -1 bytes received in /live/Q/platform/classes/Q/Utils.php (1164)
#0 /live/Q/platform/classes/Q/Utils.php(964): Q_Utils::request()
#1 /live/Q/platform/plugins/Websites/classes/Websites/Webpage.php(129): Q_Utils::get()
#2 /live/Q/platform/plugins/Websites/handlers/Websites/scrape/post.php(23): Websites_Webpage::scrape()
#3 /live/Q/platform/classes/Q.php(1170): Websites_scrape_post()
#4 /live/Q/platform/classes/Q.php(1050): Q::handle()
#5 /live/Q/platform/handlers/Q/post.php(30): Q::event()
#6 /live/Q/platform/classes/Q.php(1170): Q_post()
#7 /live/Q/platform/classes/Q.php(1050): Q::handle()
#8 /live/Q/platform/classes/Q/Dispatcher.php(343): Q::event()
#9 /live/Q/platform/classes/Q/ActionController.php(69): Q_Dispatcher::dispatch()
#10 /live/Q/platform/classes/Q/WebController.php(33): Q_ActionController::execute()
#11 /live/FTL/web/index.php(12): Q_WebControlle
Updated by Gregory Magarshak about 2 years ago
We should also have a way to scrape a page again.
Websites_Webpage::scrape() is the method that does the scraping. We should add a Websites/webpage/scrape/put.php handler to update an existing scrape, plus a global quota (meaning userId = "") under "Users"/"quotas", so that the same domain cannot be scraped more than 20 times a day and the same webpage cannot be scraped more than once an hour (defaults). A sketch of such a handler follows.