Programming


Taking Web Stats to the Next Level (of Weirdness) with Google Analytics

If you have ever run a web site, you've been exposed to the addictive, number-crunching fun provided by web stats. Any web site that's worth its pixels will have, at the very least, a freeware program like AWStats parsing the server logs and putting together colorful charts and reports. Our host, Q5media, is kind enough to provide us with LiveStats by Deepmetrix.

Web stats can be really useful for blogs. They can tell you all sorts of interesting things about your readership. For example, last month 55 people found the site while searching for Yakety Sax, no doubt landing on our article about how Yakety Sax makes anything funny. Other top searches included guys kissing, how youtube works, and once you go black. Hopefully everyone found what they were looking for.

As you can see, the most important use of web stats is to find the strangest search phrases people use to get to your articles. The small sample above is actually at the top of our list, but on a more sedate blog you might have to dig a bit to get to the comedy. Looking further down I get gems such as "indian dicks" and "bees apocalypse." In addition, web stats provide you with a way to start fights between your writers as they argue over who's getting more traffic and why. So it's a lot of fun.

To get some really deep knowledge, though, you have to venture off into the world of web analytics. Analytics gives you more than just a list of top pages by visitor count. You can see where readers come from, how they make their way through the site, and how they exit. If you have advertising on your site, you can really get a sense of what works and what doesn't. Google Analytics is a completely free, and fairly useful, analytics package to try out. It works by placing a small piece of JavaScript on your pages - in WordPress, you could stick it in your footer (there's a sketch of what that looks like at the end of this post).

So what is this deep knowledge I speak of? Let me give you an example: a few days ago we had an article about the weight loss drug Alli. With Google Analytics, I now know that 9 of the people who read the article clicked on an ad, no doubt one selling Alli or a similar weight-loss product. Six people clicked through to an article about things every nursing student should have, which means at least a tiny percentage of our readers actually look to us for helpful information. But five people clicked on the page for the tag "accidents." If you follow that link you'll notice that there's only one article there. The only thing I can think is that five of our readers were less interested in the helpful-information aspect of the article than in the "pooped myself" aspect. They picked up on the tag and thought it would lead them to more... accidents. What's worse, in the academic world, the trail they followed is called an "information scent."
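For reference, that footer snippet is just the standard tracking code Google hands you when you sign up - something along these lines, with the account ID below being a placeholder for your own:

<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>
<script type="text/javascript">
_uacct = "UA-XXXXXX-X"; // placeholder - use the account ID from your own Analytics profile
urchinTracker();
</script>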

I Can't Get Enough of Mr. T

I had to post this video: Mr. T busts through a wall, alarmed by the jibber-jabber of a fat-headed consultant. He proceeds to show the consultant that Mr. T puts the "T" in IT. [youtube]tW1S2tsxVHg[/youtube] Best lines:
"You know you got a lot of mouth, and I've got a lot of fist for your mouth!" "Intelligence in the network? That's for suckas. That's for routing stuff, not data, fool!"
This is a video for Hitachi, but that last quote could also be seen as an argument for net neutrality. Somebody call up Mr. T and get him in front of Congress. And in all seriousness, Mr. T is a really interesting guy - check out his profile on IMDb.
"I think about my father being called 'boy', my uncle being called 'boy', my brother, coming back from Vietnam and being called 'boy'. So I questioned myself: 'What does a black man have to do before he's given the respect as a man?' So when I was 18 years old, when I was old enough to fight and die for my country, old enough to drink, old enough to vote, I said I was old enough to be called a man. I self-ordained myself Mr. T so the first word out of everybody's mouth is 'Mr.' That's a sign of respect that my father didn't get, that my brother didn't get, that my mother didn't get."

Fighting Spam on a Diet – How to fix Akismet Performance Problems

Running into strange WordPress performance problems and database errors? Akismet could be the culprit, but we're in luck - it's an easy fix.

Earlier I wrote a bit about our encounter with vicious, robotic Chinese comment spammers. Since then we've had a few further issues, and I think I've found the culprit: Akismet, the plugin we've been using to fight the spam.

First off, let me say that I think Akismet is a great plugin. While we had hundreds of spams come in for a few days in a row, not one made it out to the public. Very nice. But it is a bit too aggressive in one spot, and that can slow down your blog or lock up the comment table, filling up your max_connections.

The problem is in akismet.php, specifically the akismet_delete_old() function. Look for the following lines:
$n = mt_rand(1, 5);
if ( $n % 5 )
    $wpdb->query("OPTIMIZE TABLE $wpdb->comments");
Those of you with PHP / MySQL experience will recognize the problem immediately. For the less code-literate, this creates a random number between 1 and 5, and if the number has a remainder after being divided by 5, it runs an OPTIMIZE TABLE on the comments table. That means that, at random, it will lock the entire table and rebuild it after 80% of all deletes.

Now, it's a good idea to optimize your tables after a large number of deletes. But it is a pretty expensive operation, because it could be rearranging things on disk to free up space.

Imagine you get hit by a spam bot and end up with a couple hundred spam comments. Akismet catches them all, and 15 days later tries to delete them all in one big loop. One big loop filled with a couple hundred table-locking, disk-intensive database operations.

But it's easy to fix. Replace the lines above with this:
$n = mt_rand(1, 100);
if ( $n == 42 )
    $wpdb->query("OPTIMIZE TABLE $wpdb->comments");
That will only optimize the table, on average, once for every 100 comments deleted. Why 100? It's an educated guess. According to the MySQL documentation, at most you will need to optimize a table once a month or so, maybe once a week if you have a large number of deletes or edits on varchar fields. Why did I pick 42 for the one value out of a hundred that triggers an optimization? You're asking the wrong question.
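If you would rather take optimization out of the delete path entirely, you can also just run it by hand (or from a scheduled job) every so often. A minimal example, assuming the default wp_ table prefix:

-- Run occasionally during a quiet period; the table name assumes the default wp_ prefix
OPTIMIZE TABLE wp_comments;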

Comment Spam Deluge – Did our Captcha get Hacked?

Have you been having trouble reading Unsought Input lately? You're in good company – I've been having trouble writing for it.

We've been having issues with MySQL, to the point of hanging connections and pleasant but not very helpful WordPress error messages. It's nice that user-friendly errors are built into WordPress, since you never want to give users cryptic, blue-screen-of-death-style errors. But I needed to get to the root of the problem.

I quickly put on my detective cap and tried to log in with phpMyAdmin – no luck, but this time the error message was a little more useful:

#1040 - Too many connections

Normally you encounter this error for one of two reasons: either you are being Slashdotted, or you are opening up persistent connections (with PHP's mysql_pconnect(), for example) and they are not being closed properly. In the first case, there are just too many queries at once and it fills up the connection limit, and in the second case they build up over time.
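To make the difference concrete, here is a minimal sketch of the two connection styles in PHP's old mysql extension, with made-up credentials:

// Regular connection: PHP closes it automatically when the request ends
$link = mysql_connect('localhost', 'db_user', 'db_pass'); // hypothetical credentials

// Persistent connection: kept open by the server process and reused across requests.
// If these build up faster than they get reused, max_connections fills up over time.
$plink = mysql_pconnect('localhost', 'db_user', 'db_pass');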

I didn't think possibility number 1 was very likely, since we don't write anything cool and geeky enough to get on Slashdot. The story about the Canadian geologist was probably our best bet. I knew I hadn't written any code to use persistent connections, but what about the rest of WordPress?

No such luck. Not a single pconnect in any of the WordPress or plugin code. Back to the first possibility – is it possible we were being hit by a distributed denial of service attack (DDoS)? More specifically (and more likely), we were being effectively DDoSed by comment spammers.

How did I figure it out? The connection limit for MySQL is set in its config file, my.cnf on Linux (or my.ini on Windows):

[mysqld]
set-variable=max_connections=100

The default is 100 and that should be enough for most sites. I needed to see what was actually being run, so I connected as a user with administrative rights and sent MySQL this command:

SHOW FULL PROCESSLIST

I got back a list of 200 locked queries, all dealing with selecting or deleting comments!

We have two measures in place to combat comment spam. One is Akismet, which is a standard plugin for WordPress. I have no hard data, but I would guess almost everyone uses it. The other is a captcha plugin called Did You Pass Math?

The idea behind captchas is to give visitors a small task that is easy for humans but harder for machines. That's where those fancy images with the wavy letters and numbers come from. I wanted to use something a little simpler, so I went with Did You Pass Math. From what I've read, a big part of the power of captchas is just having something there at all, to make your submit form non-standard and break the really naïve spamming scripts (see Jeff Atwood's story about his captcha on Coding Horror). It worked really well for a while.

But not anymore. Akismet now reports an order of magnitude more spam blocked than ever before.

Is Did You Pass Math officially broken? It seems like I'll need to upgrade or find something different. Maybe I can hack it a bit to ask about more than just addition.
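If I do end up hacking on it, the core of a slightly tougher math captcha could look something like this - an illustrative sketch, not the plugin's actual code:

session_start(); // so we can remember the expected answer between requests

// Pick two small numbers and a random operation so bots can't hard-code "a + b"
$a = rand(2, 9);
$b = rand(2, 9);
$ops = array('+', '-', '*');
$op = $ops[array_rand($ops)];

switch ($op) {
    case '+': $answer = $a + $b; break;
    case '-': $answer = $a - $b; break;
    case '*': $answer = $a * $b; break;
}

// Store the expected answer and print the question on the comment form
$_SESSION['captcha_answer'] = $answer;
echo "What is $a $op $b?";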

Jess B was kind enough to look through our logs and she found a ton of hits from the same IP range, and the IPs all went to spammy sites filled with more spam. Ugh.

Has anyone else noticed this with Did You Pass Math, or any other captcha plugin?

Weird Errors – Fix Timeout Issues in CURL, PHP, and Apache.

Hitting strange errors when trying to execute long-running PHP processes, like large file reads, generating static HTML pages, file uploads, or CURL calls? It might not be just bugs in your code.

Are you getting pages that seem to load, but then nothing shows up in the browser? When you go to a page, does your browser sometimes ask, "You have chosen to open something.php, which is a PHP file. What should Firefox do with this file?" or possibly "File name: something.php. File type: PHP File. Would you like to open the file or save it to your computer?" Do you get internal server errors at random intervals?

Depending on what you are trying to do, you could be running into timeout issues, either in PHP, in a particular library, in Apache (or IIS or whatever web server you use), or even in the browser. Timeout issues can be a real pain because you don't run into them very often and they don't result in clear error messages.

Let's take a PHP script that makes a number of CURL calls as an example. PHP gives you access to libcurl, a really powerful tool for calling up other web pages, web services, RSS feeds, and whatever else you can dream up, right from your PHP code. This article is not a general introduction to CURL, so I won't go into detail, but basically the CURL functions allow your code to make requests and get responses from web sites just like a browser. You can then parse the results and use the data on your site.

Let's say you have a page on your site where you would like to display the latest posts from a few of your friends' websites, and they don't have RSS feeds set up. When a user comes to your site, you can make a series of CURL calls to get the data:

$curl_session = curl_init();
curl_setopt($curl_session, CURLOPT_HEADER, false);
curl_setopt($curl_session, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl_session, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_session, CURLOPT_HTTPGET, true);
curl_setopt($curl_session, CURLOPT_URL, 'http://www.example.com');
$string = curl_exec($curl_session);

You can now parse the results in $string and hack out the most recent post. You would repeat these calls for each of your friends' web sites.
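The parsing step depends entirely on how your friend's site is marked up, but a rough sketch with a made-up pattern might look like this:

// Hypothetical pattern - adjust it to the actual markup of the site you are reading
if (preg_match('#<div class="post">(.*?)</div>#s', $string, $matches)) {
    $latest_post = $matches[1];
    echo $latest_post;
} else {
    echo 'Could not find a post.';
}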

You try running the page and everything seems to work at first, but then you hit reload and get some strange behavior, like the problems listed above. In the worst cases, you won't get the exact same error each time - sometimes the page will load, sometimes you'll get an empty $string or errors from CURL, sometimes a blank page will appear, and sometimes you will be asked to download the PHP file - which includes all your source code!

In this situation you could be timing out. CURL is going out to another web server, and your code has to wait for it to finish before moving on to anything else. In addition, your web server may be waiting on PHP to finish its work before sending anything to the browser.

Luckily, there are a few ways to control how long the CURL functions, PHP, and Apache wait and you can do a little bit to ensure that the user's browser doesn't just give up either.

CURL has two options worth looking at, CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT. The former sets how long CURL will run before it gives up and the latter sets how long CURL will wait to even connect to the site you want to pull data from. If you wanted to wait at most 4 seconds to connect and 8 seconds total, you would set it like this:

curl_setopt($curl_session, CURLOPT_CONNECTTIMEOUT, 4);
curl_setopt($curl_session, CURLOPT_TIMEOUT, 8);

This can be very helpful if you are connecting to a large number of different web sites or connecting to sites that are not always available or are on slow hosts. You may wish to set the timeouts much higher, if you really need to get that data, or fairly low, if you have a lot of CURL calls and don't want PHP to time out. You can get an idea of how long things are taking by using curl_getinfo():

echo '<pre>';
print_r(curl_getinfo($curl_session));
echo '</pre>';

PHP may also time out if it is running for too long. Luckily, you can control this to some extent by changing a setting in your php.ini or using the set_time_limit() function. If you can make changes to php.ini, it might be worth adding or adjusting the following lines:

max_execution_time = 300 ; Maximum execution time of each script, in seconds
max_input_time = 60 ; Maximum amount of time each script may spend parsing request data
memory_limit = 8M ; Maximum amount of memory a script may consume (8MB)

If you don't have access to php.ini, you may be able to use set_time_limit() to change max_execution_time in each script where it is needed. If you are in a shared hosting environment, don't monkey with these values too much or you might impact other users. If you raise the time limit too high, you may get an angry email from your admin. Some hosts have programs set up to look out for long-running processes and kill them - check with your admin if you raise the time limit and the script still dies an early death.
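For example, to give one heavy script more breathing room without touching php.ini (assuming your host allows it):

// Let this particular script run for up to 300 seconds;
// a value of 0 would remove the limit entirely - use with care on shared hosts
set_time_limit(300);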

Your web server (Apache is used for this example) may also be running into timeout issues. If you have access to your httpd.conf, changing the timeout is pretty easy:

Timeout 300

Unfortunately, not everyone will be able to edit their httpd.conf and this is not something you can add to an .htaccess file to change for just the scripts in a particular directory. Luckily we can work around this limitation, so long as we are sending the webpage to the user in parts, rather than waiting for the entire PHP script to execute and then sending the response.

How do we do it? First, make sure mod_gzip is turned off in an .htaccess file:

mod_gzip_on no
mod_gzip_item_include mime ^text/.*
mod_gzip_item_exclude mime ^image/.*$

Mod_gzip is a great way to reduce bandwidth use and increase site performance, but it waits until PHP has completed executing before zipping and sending the web page to the user.

Second, take a look at your PHP code and make sure you are not output buffering the whole page, including output buffering to send gz-encoded (gzipped) output. Output buffering can give you a lot of control, but in this case it can cause problems. Look for something like this:

ob_start();
// ...
// a whole ton of time-consuming code here
// ...
ob_flush(); // or possibly ob_end_flush();

Finally, if you have a number of time-intensive sections in your code, you can force some data out to the browser to keep Apache going and help make sure the browser doesn't lose interest either. It might look something like this:

echo "Loading Steve's page ..."; // ... // a time-consuming CURL call // ... //do a flush to keep browser interested... echo str_pad(" Loaded. ",8); sleep(1); flush(); echo "Loading Jill's page ..."; // ... // a time-consuming CURL call // ... echo str_pad(" Loaded ... ",8); sleep(1); flush();

The flush() function is the main trick - it tells PHP to send out what it has generated so far. The str_pad() and sleep() calls might not be necessary in this case, but the general idea is that some browsers need a minimum of 8 bytes to start displaying and the delay from the sleep(1) call seems to make IE happy.

This technique is not just useful for getting around timeout problems; it can also be used on long pages to give the user something to start looking at while the rest of the data loads. Also, some browsers might not handle content served as XML incrementally – in that case you might want to serve it as text/html:

header("Content-Type: text/html");

Hopefully this will help you track down those nasty timeout-related bugs. Have questions or some other tips? Post in the comments below.