Making Efficient Tag Clouds With Drupal's Taxonomy

I love Drupal. I love it for many reasons, but most of all for its thinness. Drupal lets us mess with it in any way we like, and there is a reasonably clear path to accomplish almost any idea that might get into a programmer's head. Like, say, a tag cloud.

Let me start by mentioning that Drupal already has a tag system built in. If you set up a Taxonomy (Categories) vocabulary to support "Free Tagging", you can tag any of your content however you like. Now we want to present these tags conveniently to our users. Let's see how we can achieve that while keeping the performance hit to a minimum.

Step 1. Digging through the database mess.

Let's do some digging. The node table stores the basic node information about all our content. It has the title of the content, some status flags, and most importantly - its type, and its nid (node id).

Ok, that's nice, but now we need to find the tags.

After some more digging, we stumble upon a table called term_data. This table has tag names mapped to their tid (term id). Beautiful. Now the only thing left is to find the table that maps nids to tids. That happens to be the term_node table. Great, that's all we need.

Step 2. Thinking about output and its cost.

In order to display the tag cloud, we need to count the number of occurrences of each tag for a specific content type of our choice (for example, let's use articles), and join in the tag's name. That means we have to join the three aforementioned tables and do the counting on top. If we ever plan to have a popular website, running this kind of operation for each visitor is unacceptable. This is where cron jobs come to the rescue! Why not create a cache table that stores the tags and their weights in the exact format we need? We will also write a script that repopulates that table periodically.
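
To make the cost concrete, here is roughly what the per-visit query would look like if we skipped the cache table. Treat it as a sketch only: the function name tagcloud_naive_query() is made up for this illustration, and it merely builds the SQL string so you can read it or try it in a MySQL console.

```php
<?php
// Hypothetical helper, for illustration only: builds the query we would
// have to run on EVERY page view if there were no cache table.
function tagcloud_naive_query() {
  // Three tables joined, then grouped and counted on each request.
  return "select td.name as tag, count(*) as weight
            from term_node tn
            inner join term_data td on tn.tid = td.tid
            inner join node n on tn.nid = n.nid
            where n.type = 'article' and n.status = 1
            group by td.name";
}

print tagcloud_naive_query();
```

The join plus the group by is exactly the work we want to move out of the page request and into cron.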

Step 3. Starting a Drupal module.

Fortunately, Drupal has very convenient facilities for implementing cron jobs for specific tasks. First, we will start a module (let's call it tagcloud). There is a Drupal hook responsible for cron tasks: hook_cron(). Thus, if we put a function in our tagcloud.module and call it tagcloud_cron(), Drupal will pick it up and run it as part of its regular cron run.

Let's think about the function. It will create a new cache table with two columns - tag and its weight (the number of occurrences). So our cron function will end up looking like this:

<?php
function tagcloud_cron() {
  // Throw away the old cache table and start from scratch.
  $drop_existing_table = 'drop table if exists tagcloud';

  // One row per tag; weight starts at 1 on the first insert.
  $recreate_table = 'create table tagcloud (tag varchar(255) primary key, weight integer default 1)';

  // One insert per (node, tag) pair; a duplicate tag bumps the weight instead.
  $count = 'insert into tagcloud (tag) select td.name from term_node tn
              inner join term_data td on tn.tid = td.tid
              inner join node n on tn.nid = n.nid
              where n.type = \'article\' and n.status = 1
              order by td.name asc
              on duplicate key update tagcloud.weight = tagcloud.weight + 1';

  db_query($drop_existing_table);
  db_query($recreate_table);
  db_query($count);
}
?>

The heart of that function is of course the cryptic-looking MySQL query stored in $count. It reads the nid - tid mapping table, fetches each tag name by tid, joins the node table on nid to filter by node type, and keeps only rows with status = 1 (meaning that the node is marked "published"). It then tries to insert each tag name into the tagcloud table; whenever it hits a tag name that is already there (the tag name is our primary key), it increments the weight instead. As a result, the weight column ends up holding the count of each tag. Quite nice! MySQL did all the counting. Keep in mind that "on duplicate key update" is MySQL-specific syntax.
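
If the insert-or-increment trick feels opaque, the following plain PHP sketch mimics it with an array standing in for the tagcloud table. It is not part of the module; tagcloud_simulate_upsert() and the sample rows are invented purely to show the mechanics.

```php
<?php
// Illustration only: mimics "insert ... on duplicate key update" with a
// plain PHP array playing the role of the tagcloud table.
function tagcloud_simulate_upsert($tag_names) {
  $table = array(); // tag => weight, with tag acting as the "primary key"
  foreach ($tag_names as $name) {
    if (!isset($table[$name])) {
      $table[$name] = 1; // first insert: weight takes its default of 1
    }
    else {
      $table[$name]++;   // duplicate key: weight = weight + 1
    }
  }
  return $table;
}

// One entry per (node, tag) pair, the way the select hands them over.
$rows = array('css', 'drupal', 'drupal', 'php', 'drupal', 'php');
print_r(tagcloud_simulate_upsert($rows));
```

Each tag ends up with a weight equal to the number of times it appeared in the input, which is what the real query leaves in the weight column.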

Now we will need another function - for fetching tags for display. This function will run every time a visitor comes in. It will look at the tag weights, and output the actual CSS sizes for each tag. Let's say our size will vary from 0 to 15. (Btw, notice how much less database work we have to do on every visit. With some front-end caching the performance issue will become nonexistent.)

<?php
function tagcloud_get_tags() {
  $highest_size = 16; // Our limit is incremented by 1 to avoid problems with 0.

  $fetch_tags = 'select * from tagcloud'; // That's all the database work!
  $resultset = db_query($fetch_tags);

  // Let's prepare an array of tags
  $tags = array();
  while ($row = db_fetch_object($resultset)) {
    $tags[$row->tag] = $row->weight;
  }

  // Nothing to scale if the cache table is empty (max() would complain).
  if (empty($tags)) {
    return theme('tags', $tags);
  }

  // Here we will calculate each of the tag's actual display size
  $highest_weight = max($tags);

  // converting weights to sizes
  foreach ($tags as $tag => $weight) {
    $size = round(($highest_size * $weight) / $highest_weight) - 1;
    // Very rare tags round down to 0, so clamp to keep sizes within 0..15.
    $tags[$tag] = max(0, (int) $size);
  }

  return theme('tags', $tags);
}
?>

In the above code I used a proportion to calculate the display size of each tag. Here's how it works. If we know the maximum allowed display size, and we get the maximum tag weight from the database, we can say:

If the highest_weight is of highest size, then weight x is of size y.

So for each weight we got from the database, we calculate the size using that proportion. Then we round it off and subtract the extra 1 that we introduced to avoid problems with 0. One caveat: for a tag that is very rare compared to the most popular one, the rounded value can already be 0, so the result should be clamped to 0 to keep it from dropping to -1.
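
Here is the same proportion as a small stand-alone function, so you can play with the numbers without touching the database. This is a sketch: the name tagcloud_weights_to_sizes() and the sample weights are made up for the example.

```php
<?php
// Sketch of the proportion as a pure function; the name is hypothetical.
function tagcloud_weights_to_sizes($tags, $highest_size = 16) {
  if (empty($tags)) {
    return array(); // no tags, nothing to scale
  }
  $highest_weight = max($tags);
  $sizes = array();
  foreach ($tags as $tag => $weight) {
    // y = round(highest_size * x / highest_weight) - 1
    $size = (int) round(($highest_size * $weight) / $highest_weight) - 1;
    $sizes[$tag] = max(0, $size); // clamp so rare tags never get -1
  }
  return $sizes;
}

print_r(tagcloud_weights_to_sizes(array('drupal' => 40, 'php' => 20, 'css' => 1)));
```

With these sample weights, 'drupal' (the maximum) lands on size 15, 'php' at half the weight lands on size 7, and 'css' would come out as -1 without the clamp, so it is pinned to size 0.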

Step 4. Theming the tag cloud.

You could implement the tag cloud as a block, but in my case I had to implement it as part of the main page layout itself - since it was supposed to be always there, in a static place which wasn't considered a block region. I will be assuming that you're using Drupal's default template system.

First, we need to make sure that we pass the tag cloud to our page.tpl.php template. This magic Drupal function will do that:

<?php
function _phptemplate_variables($hook, $vars = array()) {
  switch ($hook) {
    case 'page':
      $vars['tagcloud'] = tagcloud_get_tags();
      break;
  }
  return $vars;
}
?>

This function will be called when someone views a page that is rendered with the template 'page.tpl.php', which is usually the master layout for the whole Drupal site. However, as you can see in our previous function tagcloud_get_tags(), we are not just returning the $tags array. We are theming it. The theming function goes right here as well:

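<?php
function theme_tags($tags) {
  $html = ''; // Defined up front so an empty tag list returns an empty string.
  if (!empty($tags)) {
    foreach ($tags as $tag => $size) {
      $name = check_plain($tag); // Tags are user input, so escape them for HTML.
      $html .= "<span class=\"size$size\"><a href=\"#\" title=\"$name\">$name</a></span>\n";
    }
  }
  return $html;
}
?>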

This function is responsible for outputting the actual HTML at the point where the tags are inserted into the template. So now it all gets glued together. If you echo $tagcloud anywhere in page.tpl.php, you will get the $html from this function injected into the page. It would be a good idea to come up with the CSS classes at this point. As you can see in the HTML snippet above, they have to be named size0 through size15, which makes 16 classes. Also, you'll certainly want to hang a link on each tag. As a hint: use Drupal's path functions together with the tag you already have to build the URL of the tag's page. Drupal already provides a page for each tag, so all you need to do is build its URL for output in this function. In my case, each tag triggered a JavaScript event, therefore the links were replaced with '#'. Almost done!
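
If you don't feel like typing the size classes by hand, a throwaway script can print them for you. The sketch below is only one way to do it: tagcloud_build_css() is a made-up helper, and the 80% to 230% font-size range is an arbitrary choice you will want to tune for your theme.

```php
<?php
// Hypothetical helper: builds one CSS rule per size class (size0..size15).
// The font-size range is an arbitrary choice for illustration.
function tagcloud_build_css($classes = 16, $min_percent = 80, $step = 10) {
  $css = '';
  for ($i = 0; $i < $classes; $i++) {
    $css .= '.size' . $i . ' { font-size: ' . ($min_percent + $i * $step) . "%; }\n";
  }
  return $css;
}

print tagcloud_build_css();
```

Paste the output into your theme's stylesheet; the class names match what theme_tags() writes into each span.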

Additional thingie... Maybe a cron job isn't necessary?

Say you do not have many people submitting articles often. For example, it's your own news portal, and only you and a few of your friends submit articles. Say you want to recalculate the tags on every submission of an article, because a few people are not going to affect performance. Well, nothing easier!

Drupal implements a hook called hook_nodeapi(). It allows us to catch an article submission right after it got inserted into the database, and recalculate the tags at that moment.

<?php
function tagcloud_nodeapi(&$node, $op) {
  switch ($op) {
    case 'insert':
      if ($node->type == 'article') {
        tagcloud_cron();
      }
      break;
  }
}
?>

That's all there is to it! We already have the cron function written - so we're just choosing to call it when an article gets inserted into the database. Don't forget to set up your cron to run correctly!
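
One thing to consider: the hook above only reacts to new articles, so editing the tags of an existing article or deleting one leaves the cloud stale until the next rebuild. If that matters to you, a small check like the one below could decide when to rebuild. It is a sketch under my own assumptions: tagcloud_needs_rebuild() is a hypothetical helper, and 'insert', 'update' and 'delete' are the hook_nodeapi() operations it listens for.

```php
<?php
// Hypothetical helper: decides whether a node operation should trigger a
// rebuild of the tag cloud table.
function tagcloud_needs_rebuild($type, $op) {
  // Operations after which the set of tagged articles may have changed.
  $ops = array('insert', 'update', 'delete');
  return $type == 'article' && in_array($op, $ops);
}
```

Inside tagcloud_nodeapi() you would then replace the switch with a single check: if tagcloud_needs_rebuild($node->type, $op) returns TRUE, call tagcloud_cron().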

Hope this will help some people out there! Please comment on any errors, optimizations, or simply if you like it. : )

Probably a dumb question, but I have to ask it: ;)

In which file did you put the functions _phptemplate_variables and theme_tags? Is it page.tpl.php?

Thanks!

Both go into your template.php.

ctrl tags

Hi! Thanks for the article.

What do you think about selecting (highlighting) several tags in the tag cloud by holding "Ctrl"?
Would that be very difficult to implement?

Could use more Drupal tools

I needed to do something very similar but I used some more of the stock Drupal tools.

I put these functions in a module with some other helper functions specific to my site.

The first thing I did was to use the tagadelic module to define a custom tag cloud block.

function theme_mymodule_tag_cloud($vid = 1){
  $tags = tagadelic_get_weighted_tags(array($vid),7,15);
  $tags = tagadelic_sort_tags($tags);
  $block = theme('tagadelic_weighted', $tags);//return a chunk of 12 tags
  $block .= theme('tagadelic_more', $vid);//add more link
  return $block;
}

But as you noted, the tagadelic module uses some really expensive queries that you don't want to run on every page load.

So I also used hook_cron, but instead of a custom table I used Drupal's internal caching system. So this is all the code I had to write:

function mymodule_cron(){
  cache_clear_all('mymodule', 'cache', TRUE);
  cache_set('mymodule_tag_cloud', 'cache', serialize(theme('mymodule_tag_cloud')));
}

Then I created a function that retrieves the cached tag cloud and called it from a block using the PHP input format. Of course, now it occurs to me that I could have just put this PHP in the block instead of having it exist as a function in my module. Or maybe I should have just properly defined this as a block in the module using the block hooks.

function mymodule_cached_cloud(){
  $cached = cache_get('mymodule_tag_cloud'); // same key that mymodule_cron() stores
  return $cached ? unserialize($cached->data) : ''; // empty until cron has run
}

I have not looked in several months, but I think the tagadelic developers may be implementing some more caching in the module.

Did you try the tagadelic module? Looks like it also has a caching mechanism. Just wondering what's the difference between tagadelic.module and yours. Thanks.

I think the most important differences are:

1) tagadelic runs a quite expensive query (especially for large sites with lots of content) every time a page with the tag cloud is shown (the tagadelic cache works per page view only, i.e. cached data is lost after the page is generated); this module puts the tag cloud data in a dedicated table (i.e. it is cached for many page requests).

2) tagadelic provides a few configuration settings, can generate both page and block views; Maxmi's functions do not (though it should not be too difficult to add this functionality).

Regards,
Bartek

Block cache

Thank you for the interesting writeup and details. One option to deal with performance is to use the block cache module (http://drupal.org/project/blockcache) to cache tagadelic (or any) blocks for registered users.
Chris

Thank you for clarifying this. I wasn't sure how to answer this question since I haven't looked through tagadelic's source. Couldn't find time to do this. : )

Taxonomy vs CCK

Very cool! Glad to see this, and thanks for taking the time to do this, as there has been lots of talk lately about taxonomy and performance issues ( http://www.lullabot.com/audiocast/podcast-48-taxonomy-taxonomy-taxonomy ). Since you seem well versed in taxonomy... ever thought about using CCK select lists instead?