Scraping websites with PHP cURL under proxy
Scraping websites with PHP cURL is damn easy. Just do it the right way – use a proxy. Here is a simple function that does the job.
Simple PHP cURL scraper:
    // …
    return $result;
}
?>
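For reference, here is a minimal sketch of what such a function could look like, pieced together from the function and option lists below and from the usage example. The parameter names and the 'EXE' and 'INF' result keys are assumptions; only the 'ERR' key and the number of arguments appear in the usage code.

```php
<?php
// Sketch only: parameter names and the 'EXE'/'INF' keys are assumptions.
// The argument order is inferred from the usage example; only 'ERR' is confirmed there.
function getPage($proxy, $url, $referer, $agent, $header, $timeout)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);           // 1 = include response headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);         // return the page instead of printing it
    curl_setopt($ch, CURLOPT_PROXY, $proxy);             // "host:port" of the proxy
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);  // seconds to wait for the connection
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);

    $result = array();
    $result['EXE'] = curl_exec($ch);     // page content, or false on failure
    $result['INF'] = curl_getinfo($ch);  // transfer details such as http_code
    $result['ERR'] = curl_error($ch);    // empty string when no cURL error occurred

    curl_close($ch);

    return $result;
}
```

Returning the content, the transfer info and the error string together lets the caller decide what counts as a failure.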
PHP cURL functions used:
- curl_init – initializes a cURL session.
- curl_setopt – sets an option for a cURL transfer.
- curl_exec – performs a cURL session.
- curl_getinfo – gets information about the last transfer.
- curl_error – returns a string containing the last error for the current session.
- curl_close – closes a cURL session.
curl_setopt options used:
- CURLOPT_URL – the URL to scrape.
- CURLOPT_HEADER – include the response headers in the output? Use 1 to include, 0 to exclude.
- CURLOPT_RETURNTRANSFER – return the transfer as a string instead of printing it directly? Use 1, i.e. return.
- CURLOPT_PROXY – the HTTP proxy to tunnel the request through.
- CURLOPT_HTTPPROXYTUNNEL – tunnel through the given HTTP proxy? Use 1, i.e. tunnel.
- CURLOPT_CONNECTTIMEOUT – the number of seconds to wait while trying to connect.
- CURLOPT_REFERER – the “Referer:” header to be used in the HTTP request.
- CURLOPT_USERAGENT – the “User-Agent:” header to be used in the HTTP request.
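The same options can also be applied in a single call with curl_setopt_array. A small sketch, assuming a hypothetical helper named buildProxyOptions and placeholder default values:

```php
<?php
// Hypothetical helper: collects the options listed above into one array.
// The default values for $header and $timeout are placeholders, not recommendations.
function buildProxyOptions($proxy, $url, $referer, $agent, $header = 0, $timeout = 5)
{
    return array(
        CURLOPT_URL             => $url,
        CURLOPT_HEADER          => $header,
        CURLOPT_RETURNTRANSFER  => 1,
        CURLOPT_PROXY           => $proxy,
        CURLOPT_HTTPPROXYTUNNEL => 1,
        CURLOPT_CONNECTTIMEOUT  => $timeout,
        CURLOPT_REFERER         => $referer,
        CURLOPT_USERAGENT       => $agent,
    );
}

// Usage (not executed here): curl_setopt_array() returns false if any option cannot be set.
// $ch = curl_init();
// $ok = curl_setopt_array($ch, buildProxyOptions('[proxy IP]:[port]', 'http://www.example.com/', 'http://www.example.com/', 'Mozilla/5.0'));
```

Keeping the options in one array makes it easier to reuse the same proxy settings across several requests.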
Scraper usage:
<?php
$result = getPage(
    '[proxy IP]:[port]', // use a valid proxy
    'http://www.google.com/search?q=twitter',
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    5
);

if (empty($result['ERR'])) {
    // Job's done! Parse, save, etc.
    // …
} else {
    // WTF? Captcha or network problems?
    // …
}
?>
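An empty curl_error string only means the transfer itself went through; a captcha or block page can still arrive as an ordinary HTTP response. A rough sketch of an extra check on the array returned by curl_getinfo, where looksBlocked is a hypothetical helper and the list of status codes is a guess at what a blocking site might send:

```php
<?php
// Hypothetical helper: $info is the array returned by curl_getinfo($ch).
function looksBlocked(array $info)
{
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;

    // 0 means no HTTP response was received at all.
    // 403, 429 and 503 are treated as "blocked" here (an assumption, adjust per site).
    return $code === 0 || in_array($code, array(403, 429, 503), true);
}
```

A captcha served with status 200 would slip past this check, so inspecting the page content may still be needed.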
Note: Enable the cURL extension in php.ini if it is not already active.
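A quick way to see whether cURL is available before running the scraper is extension_loaded. A minimal sketch, with curlStatusMessage as a hypothetical helper name:

```php
<?php
// Hypothetical helper: reports whether the cURL extension is loaded.
function curlStatusMessage()
{
    return extension_loaded('curl')
        ? 'cURL is enabled'
        : 'cURL is missing: enable the curl extension in php.ini and restart PHP';
}

echo curlStatusMessage(), PHP_EOL;
```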