Case Study: Use PHPQuery to Crawl 3000 Images from Tumblr

There are some nice images on Tumblr. We can use the Tumblr APIs to search for and download images, but this usually requires registration and an API key. Another way is to crawl the HTML pages and parse the DOM (Document Object Model), which lets us retrieve the image URLs and their descriptions.
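To make "parsing the DOM" concrete, here is a minimal sketch that uses only PHP's built-in DOM extension on an inline HTML string. The markup, the class names (post, photo, caption) and the function name extractImages are made up for illustration; they are not Tumblr's real structure.

```php
<?php
// Sample markup standing in for a downloaded page; class names are hypothetical.
$html = <<<HTML
<div class="post">
  <img class="photo" src="https://example.com/a.jpg">
  <div class="caption">Sunset over the bay</div>
</div>
<div class="post">
  <img class="photo" src="https://example.com/b.jpg">
  <div class="caption">City lights</div>
</div>
HTML;

// Return one ['src' => ..., 'desc' => ...] entry per post that contains a photo.
function extractImages(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely well-formed, so keep parser warnings quiet.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $images = [];
    foreach ($xpath->query('//div[@class="post"]') as $post) {
        $img = $xpath->query('.//img[@class="photo"]', $post)->item(0);
        $caption = $xpath->query('.//div[@class="caption"]', $post)->item(0);
        if ($img === null) {
            continue; // skip posts without a photo
        }
        $images[] = [
            'src'  => $img->getAttribute('src'),
            'desc' => $caption !== null ? trim($caption->textContent) : '',
        ];
    }
    return $images;
}

print_r(extractImages($html));
```

This works, but the XPath expressions are verbose compared with CSS selectors.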

A handy library called PHPQuery lets us query the DOM in PHP the same way we would in jQuery, using CSS selectors. This makes PHP powerful at analyzing the DOM of any HTML page.

The following pseudo-code illustrates how to parse the HTML pages and grab the images.

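<?php
require('phpQuery.php'); // the phpQuery library
require('app.php');      // the site's own helpers (presumably get_ip_address() and UploadPic())

$ip = get_ip_address();

// Crawl $url, save every photo found, then follow each post's link up to $lvl levels deep.
function grab($url, $lvl = 5) {
  global $ip;
  if ($lvl <= 0) {
    return;
  }
  $doc = phpQuery::newDocumentFile($url);
  foreach (pq('div.TumbPostPane') as $p) {
      $img = pq($p)->find('img.PhotoPostMainPhoto')->attr('src');
      $desc = htmlspecialchars(trim(pq($p)->find('div.MetaPanel')->html()));
      $next = pq($p)->find('a')->attr('href'); // link to follow; kept separate from $url
      $err = '';
      if (UploadPic($img, $desc, $err, $ip)) { // find pictures and save locally
         echo "OK = $err \n";
      } else {
         // Assumption: the search string lost in publishing was an HTML line break.
         echo str_replace("<br/>", "\n", "Error = $err \n");
      }
      if ($next) {
         grab($next, $lvl - 1); // recursive download
      }
  }
}

grab("https://uploadbeta.com", 1);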

With a little modification, the script can crawl several thousand pictures within a few minutes. All images are saved to a local database on the VPS server. The pictures can be seen at: uploadbeta.com
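One useful modification is to remember which pages have already been visited, so the recursion never fetches the same URL twice. Below is a minimal sketch under stated assumptions: crawl and $fetchLinks are hypothetical names, and an in-memory link graph with made-up URLs stands in for real HTTP requests. In the real script, the link lookup would be the phpQuery code.

```php
<?php
/**
 * Depth-limited crawl that skips URLs it has already seen.
 * $fetchLinks is a hypothetical callable: given a URL, it returns the URLs linked from that page.
 */
function crawl(string $url, int $depth, callable $fetchLinks, array &$visited): void
{
    if ($depth <= 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true; // mark before recursing so link cycles terminate
    foreach ($fetchLinks($url) as $next) {
        crawl($next, $depth - 1, $fetchLinks, $visited);
    }
}

// In-memory link graph used instead of real HTTP requests (made-up URLs).
$graph = [
    'page/1' => ['page/2', 'page/3'],
    'page/2' => ['page/1', 'page/3'],
    'page/3' => [],
];
$fetchLinks = function (string $url) use ($graph): array {
    return $graph[$url] ?? [];
};

$visited = [];
crawl('page/1', 3, $fetchLinks, $visited);
print_r(array_keys($visited));
```

Keying the visited set by URL keeps the lookup cheap even with thousands of pages.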

It is better to set a time interval between page requests; otherwise, the IP address may be blocked.
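A simple way to enforce such an interval is a small helper that sleeps until enough time has passed since the previous request. This is only a sketch: throttle is a hypothetical name, the page list is made up, and the 0.2 second interval is an arbitrary illustrative value, not a recommendation from any site.

```php
<?php
/**
 * Sleep just long enough that consecutive calls are at least $minIntervalSec apart.
 * Returns the number of seconds the call decided to wait.
 */
function throttle(float $minIntervalSec): float
{
    static $last = null; // time at which the previous call finished
    $now = microtime(true);
    $wait = 0.0;
    if ($last !== null) {
        $wait = max(0.0, $minIntervalSec - ($now - $last));
        if ($wait > 0) {
            usleep((int) round($wait * 1e6));
        }
    }
    $last = microtime(true);
    return $wait;
}

// Hypothetical page list; the real script would call grab() where the echo is.
foreach (['page/1', 'page/2', 'page/3'] as $page) {
    throttle(0.2); // wait at least 0.2 s between requests
    echo "fetching $page\n";
}
```

The first call returns immediately, so the delay only applies between requests.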

P.S. The image upload website offers a few APIs for various purposes, subject to a fair use policy: https://uploadbeta.com/picture-gallery/faq.php#api

–EOF (The Ultimate Computing & Technology Blog) —

3 Comments

twofei

what about self-closing tag and the unclosed ？

- ACMer
  
  It doesn’t matter; PHPQuery recognizes self-closing tags.
  An unclosed tag is not valid HTML and produces unexpected (unstable) results.
  
  - twofei
    
    It seems that WP has removed the tag I mentioned in my comment above.
    Many HTML documents use <br> for a soft break rather than the slashed form <br/>. An unclosed <p> is also valid in HTML5.
    So these documents are not strict, well-formed XML, and many DOM parsers cannot handle them well.
