CURL Page Scraping Script
Using cURL to scrape pages for specific data is one of the most important things I do when building databases. I'm not just talking about scraping pages and reposting them, either.
You can use cURL to grab the HTML of any viewable page on the web and then, most importantly, take that data and pick out the bits you need. This is the basis for link analysis scripts, training scripts, compiling databases from sources around the web; there's almost no limit to what you can do.
I'm providing a simple PHP class here, which uses cURL to grab a page and then pull any information between user-specified tags into an array. So, for instance, in our example you can grab all of the links from any web page.
The class is quite simple. I had to get rid of the lovely indentation to make it fit nicely onto the blog, but it's fairly well commented.
In a nutshell, it does this:
1) Goes to specified URL
2) Uses cURL to grab the HTML of the URL
3) Takes the HTML and scans for every instance of the start and end tags you provide (e.g. <a> and </a>)
4) Returns these in an array for you.
Download taggrab.class.zip
html = ""; $this->binary = 0; $this->url = ""; } // takes url passed to it and.. can you guess? function fetchPage($url) { // set the URL to scrape $this->url = $url; if (isset($this->url)) { // start cURL instance $this->ch = curl_init (); // this tells cUrl to return the data curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1); // set the url to download curl_setopt ($this->ch, CURLOPT_URL, $this->url); // follow redirects if any curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); // tell cURL if the data is binary data or not curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); // grabs the webpage from the internets $this->html = curl_exec($this->ch); // closes the connection curl_close ($this->ch); } } // function takes html, puts the data requested into an array function parse_array($beg_tag, $close_tag) { // match data between specificed tags preg_match_all("($beg_tag.*$close_tag)siU", $this->html, $matching_data); // return data in array return $matching_data[0]; } } ?>
So that is your basic class, which should be fairly easy to follow (you can ask questions in comments if needed).
To use this, we call it from another PHP file and pass in the variables we need.
Below is tag-example.php, which demonstrates how to pass the URL and start/end tag variables to the class and pump out a set of results.
Download tag-example.zip
"; // Make a title spider $tspider = new tagSpider(); // Pass URL to the fetch page function $tspider->fetchPage($urlrun); // Enter the tags into the parse array function $linkarray = $tspider->parse_array($stag, $etag); echo "Links present on page: ".$urlrun."
"; // Loop to pump out the results foreach ($linkarray as $result) { echo $result; echo "
"; } ?>
So this code passes the Techcrunch website to the class, looking for any standard a href links, and then simply echoes them out. You could use this in conjunction with the SearchStatus Firefox plugin to quickly see what links Techcrunch is showing bots and which they are following and nofollowing.
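Since parse_array() hands back each full <a>…</a> tag as a string, one quick and dirty way to see the follow/nofollow split without leaving PHP is to check each tag for "nofollow". A rough sketch building on the $linkarray from tag-example.php above (the stripos check is crude and just for illustration):

// Split the scraped links into followed and nofollowed
// (uses $linkarray from tag-example.php above)
$followed = array();
$nofollowed = array();

foreach ($linkarray as $link) {
	// crude check: does the anchor tag mention nofollow anywhere?
	if (stripos($link, 'nofollow') !== false) {
		$nofollowed[] = $link;
	} else {
		$followed[] = $link;
	}
}

echo count($followed)." followed links, ".count($nofollowed)." nofollowed links<br>";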
You can view a working example of the code here.
As I said, there's so much you can do from a base like this, so have a think. I might post some proper tutorials on extracting data methodically, saving it to a database and then manipulating it to get some interesting results.
Enjoy.
Edit: You'll of course need the cURL library installed on your server for this to work!
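If you're not sure whether cURL is available, a quick way to check from PHP itself (just one way to test; phpinfo() will show it too):

// Quick sanity check that the cURL extension is loaded
if (!function_exists('curl_init')) {
	die('cURL is not installed or enabled on this server.');
}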
I should look into cURL. For now I'm happy using file_get_contents, preg_match and preg_match_all.
Comment by sloth
December 18th, 2008 @ 2:46 pm
whoa looks like I broke your comments section.
Comment by sloth
December 18th, 2008 @ 2:47 pm
and it hasn’t put all the code that I wrote, damn :/
Comment by sloth
December 18th, 2008 @ 2:48 pm
Haha, it’s WordPress just parsing the code as live. Well, some of it (:
Upload the code example to Google code and link to it if you want to.
Comment by Mark
December 18th, 2008 @ 5:11 pm
I tell you what. Here’s a link to the tutorial that took care of all my scraping needs. It’s really simple to do.
http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial
Comment by sloth
December 18th, 2008 @ 6:03 pm
Have you run into this before?
CURLOPT_FOLLOWLOCATION cannot be activated when in safe_mode or an open_basedir is set
Comment by Shane
December 18th, 2008 @ 8:08 pm
If you are into more advanced cURLing, logging into websites for example, then this is an extremely useful class… it takes care of cookies etc:
http://sourceforge.net/projects/snoopy/
Comment by Jez
December 23rd, 2008 @ 11:05 am
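Snoopy is handy for that. For anyone who wants to stay with plain cURL, a login usually comes down to POSTing the form and letting cURL keep the session cookie via CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE. A rough sketch, with a made-up login URL and form field names:

// Rough sketch of a login with plain cURL: post the login form and let
// cURL keep the session cookie for later requests. The URL and field
// names below are made-up placeholders.
$cookiejar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init('http://www.example.com/login.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
	'username' => 'me',
	'password' => 'secret',
)));
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiejar);  // write cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiejar); // and read them back
curl_exec($ch);

// now reuse the same handle and cookie jar for a page behind the login
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/members.php');
curl_setopt($ch, CURLOPT_HTTPGET, 1);
$members_html = curl_exec($ch);
curl_close($ch);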
sorry 4got to subscribe to your comments….
Comment by Jez
December 23rd, 2008 @ 11:05 am
Yo Mark!
Does this mean all your code is now gonna be OOP?
lol, JK!! Love it all, and wish u n urs all the best for the new year!
Des
Comment by ScanKey
December 31st, 2008 @ 2:27 am
@Shane
Regarding your error:
http://bugs.typo3.org/view.php?id=4292
Comment by Mark
January 2nd, 2009 @ 1:00 am
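For anyone hitting that warning, the usual workaround is to leave CURLOPT_FOLLOWLOCATION switched off and chase the Location header yourself. A rough sketch (the function name is made up, and relative redirect URLs aren't resolved):

// Follow redirects by hand when CURLOPT_FOLLOWLOCATION is blocked by
// safe_mode / open_basedir. Just one common workaround, not the class's method.
function fetch_with_manual_redirects($url, $max_redirects = 5) {
	while ($max_redirects-- > 0) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($ch, CURLOPT_HEADER, 1); // we need the headers to find Location:
		$response = curl_exec($ch);
		$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
		$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
		curl_close($ch);

		// on a 3xx, pull the Location header out and request that URL instead
		if ($code >= 300 && $code < 400 &&
		    preg_match('/^Location:\s*(.+)$/mi', substr($response, 0, $header_size), $m)) {
			$url = trim($m[1]);
			continue;
		}
		// no redirect - return the body only
		return substr($response, $header_size);
	}
	return false; // too many redirects
}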
what about curl to grab media? Viral marketing i think this may also be handy
try video swiper
Comment by thrifty
January 25th, 2009 @ 11:17 pm
This is a nice tip. I have recently started using PHP after years of Java.
Nonetheless, I think PHP is much more available on most platforms than Java is, and have found it even more interesting.
Thanks for this sample code
Comment by Helen Hunt
February 6th, 2009 @ 7:02 pm
Helen you name seems familiar?
do you have a cousin called mike?
if not i would be interested in meeting you ?
to discover your dark side 
Comment by thrifty
February 7th, 2009 @ 3:08 am
SEO Blog.. Dating Website… It’s a fine line apparently, thrifty?
Comment by Mark
February 7th, 2009 @ 4:41 am
mmm yes slightly red faced and i appologise to Helen profusely…sorry
not an excuse but as you can see by the time i posted…the chances of me being completely sober on a friday night at 3.08 AM is pretty Slim to say the least.
lesson number one dont Drink and Type.
sorry again Helen ohh and as well mark its your blog..consider myself slapped on the wrist.
Comment by thrifty
February 7th, 2009 @ 5:19 pm
biterscripting also is good at scraping web pages and harvesting data. They have a few good samples posted over at http://www.biterscripting.com/samples_internet.html .
Jenni
Comment by Jenni
July 20th, 2009 @ 3:46 pm
CURLOPT_FOLLOWLOCATION is not necessarily available, as it cannot be activated when in safe_mode
Comment by tiong
March 13th, 2010 @ 3:31 am
Awesome tutorial!
I had to use curl on my host 1and1.
http://www.quickscrape.com/ is what I came up with!
Comment by Steve
December 2nd, 2010 @ 7:56 am
@Steve
Sweet job. So, where’s my cut? (=
Comment by Mark
December 2nd, 2010 @ 4:56 pm
thanks !!
Comment by laksh
March 8th, 2011 @ 10:55 am
Really good tips. How would I use this cURL script to extract a required portion of a web page, not just links, and display the scraped data inside a div on my own page?
Comment by Stranger
March 15th, 2011 @ 10:47 pm