Scraping websites with PHP cURL under proxy

PHP script — By Script on September 18, 2009 at 1:50 pm

Scraping websites with PHP cURL is damn easy. Just do it the right way – use a proxy. Here is a simple function that does the job.

Simple PHP cURL scraper:

  <?php
  // Fetch $url through an HTTP proxy and return the response, transfer info, and any error.
  function getPage($proxy, $url, $referer, $agent, $header, $timeout) {
      $ch = curl_init();

      curl_setopt($ch, CURLOPT_URL, $url);
      curl_setopt($ch, CURLOPT_HEADER, $header);          // 1 = include response headers in the output
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the response instead of printing it
      curl_setopt($ch, CURLOPT_PROXY, $proxy);            // "host:port"
      curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
      curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds
      curl_setopt($ch, CURLOPT_REFERER, $referer);
      curl_setopt($ch, CURLOPT_USERAGENT, $agent);

      $result = array();
      $result['EXE'] = curl_exec($ch);    // response as a string, or false on failure
      $result['INF'] = curl_getinfo($ch); // array of transfer details
      $result['ERR'] = curl_error($ch);   // empty string if no error occurred

      curl_close($ch);

      return $result;
  }
  ?>

PHP cURL functions used:

  • curl_init – initializes a cURL session.
  • curl_setopt – sets an option for a cURL transfer.
  • curl_exec – performs a cURL session.
  • curl_getinfo – gets information about the last transfer.
  • curl_error – returns a string containing the last error for the current session.
  • curl_close – closes a cURL session.

curl_setopt options used:

  • CURLOPT_URL – the URL to scrape.
  • CURLOPT_HEADER – include the response headers in the output? Use 1 to include, 0 to exclude.
  • CURLOPT_RETURNTRANSFER – return the transfer as a string instead of outputting it directly? Use 1, i.e. return.
  • CURLOPT_PROXY – the HTTP proxy to tunnel requests through.
  • CURLOPT_HTTPPROXYTUNNEL – tunnel through the given HTTP proxy? Use 1, i.e. tunnel.
  • CURLOPT_CONNECTTIMEOUT – the number of seconds to wait while trying to connect.
  • CURLOPT_REFERER – the “Referer:” header to be used in the HTTP request.
  • CURLOPT_USERAGENT – the “User-Agent:” header to be used in the HTTP request.
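
One thing to keep in mind with CURLOPT_HEADER set to 1: the response headers come back glued to the front of the page body. Here is a rough sketch of how the two could be separated using the header_size value that curl_getinfo reports. The helper name splitResponse and the hand-built demo array are made up for illustration; they are not part of the scraper above.

```php
<?php
// Hypothetical helper, not part of the original scraper.
// Splits a raw response (headers + body) at the offset cURL reports in header_size.
function splitResponse(array $result)
{
    // curl_exec() returns false on failure, so there is nothing to split.
    if (!is_string($result['EXE'])) {
        return array('headers' => '', 'body' => '');
    }

    $headerSize = isset($result['INF']['header_size'])
        ? (int) $result['INF']['header_size']
        : 0;

    return array(
        'headers' => (string) substr($result['EXE'], 0, $headerSize),
        'body'    => (string) substr($result['EXE'], $headerSize),
    );
}

// Demo with a hand-built result array, so no network or proxy is needed.
$raw  = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$demo = array(
    'EXE' => $raw . '<html>hi</html>',
    'INF' => array('header_size' => strlen($raw)),
    'ERR' => '',
);

$parts = splitResponse($demo);
echo $parts['body'], "\n"; // prints <html>hi</html>
```

If you pass 0 for $header instead, EXE holds only the body and no splitting is needed.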

Scraper usage:

  <?php
  $result = getPage(
      '[proxy IP]:[port]', // use a valid proxy
      'http://www.google.com/search?q=twitter',
      'http://www.google.com/',
      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
      1,
      5
  );

  if (empty($result['ERR'])) {
      // Job's done! Parse, save, etc.
      // ...
  } else {
      // WTF? Captcha or network problems?
      // ...
  }
  ?>
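
An empty ERR only tells you the transfer itself went through. A captcha or block page usually arrives as a perfectly normal response, so it can be worth looking at the HTTP status and the page text too. Below is a minimal sketch of that idea. The function name classifyResult, the three labels, and the "captcha" keyword check are all assumptions for illustration, not rules from cURL or from any particular site.

```php
<?php
// Hypothetical helper, not from the original post.
// Returns 'network', 'blocked', or 'ok' for a getPage()-style result array.
function classifyResult(array $result)
{
    // A non-empty cURL error or a false body means the transfer itself failed.
    if (!empty($result['ERR']) || $result['EXE'] === false) {
        return 'network';
    }

    $status = isset($result['INF']['http_code'])
        ? (int) $result['INF']['http_code']
        : 0;

    // Assumption: anything outside 2xx, or a page mentioning "captcha",
    // is treated as blocked. Real sites may need a different check.
    if ($status < 200 || $status >= 300
        || stripos($result['EXE'], 'captcha') !== false) {
        return 'blocked';
    }

    return 'ok';
}

// Demo with hand-built result arrays, so no network or proxy is needed.
$good = array(
    'EXE' => '<html>results</html>',
    'INF' => array('http_code' => 200),
    'ERR' => '',
);
echo classifyResult($good), "\n"; // prints ok
```

The same call can replace the plain empty($result['ERR']) test when you want to retry with another proxy on 'network' or 'blocked'.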

Note: Activate cURL in php.ini if required.
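
If you are not sure whether cURL is active, a quick check like the following can tell you before any curl_* call fails. This is just a sketch; the messages are made up, and the exact php.ini line to change depends on your platform and PHP version.

```php
<?php
// Report whether the cURL extension is loaded before calling any curl_* function.
if (extension_loaded('curl')) {
    echo "cURL is enabled\n";
} else {
    // Usually this means the curl extension line in php.ini is commented out or missing.
    echo "cURL is not enabled; check the extension setting in php.ini\n";
}
```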



