A PHP/MySQL Bug Most People Have But Don’t Realize

I’ve seen this over and over in my career and thought I should save others from the horror. Part of me feels like I blogged about this years ago, but I couldn’t find a post referencing it (EDIT: found it!). The bug is simple:

  1. Create a database table with a decimal value, such as order_total
  2. Write some code that retrieves the row
  3. Do an implicit boolean check on order_total to see if it has a value

Here’s some actual code:

$results = mysql_query("SELECT * FROM orders");
while($row = mysql_fetch_assoc($results)) {
    if($row['order_total']) {
        echo "Order total is clearly not zero!";
    }
    else {
        log_bad_order($row['id']);
    }
}

This code has a serious bug in it. The problem is the line that checks whether order_total has a value. Pop quiz:

What is the value of the following:
(bool) “0.00”

The answer is TRUE! 0.00 may evaluate to zero, but “0.00” is not the same thing! The only strings PHP treats as false are the empty string and a lone “0”. As soon as PHP sees more than just a single “0” in a string, it assumes it’s a regular string and treats it as a non-empty, truthy value. A more obvious way to ask the same question:

What is the value of the following:
(bool) “0.0000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000″

Or what about:

What is the value of the following: (bool) “0.”

The point is that as soon as you go beyond a single zero, PHP just assumes the rest is real data and will not discard it. Thus:

if(0.00) {
    echo "THIS NEVER EXECUTES";
}
if("0.00") {
    echo "THIS ALWAYS EVALUATES TO TRUE";
}
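A quick sanity-check script makes the rule obvious (the only falsy strings in PHP are the empty string and “0”):

```php
<?php
// The only falsy strings in PHP are "" and "0"; anything longer is truthy,
// including "0.00" and "0.".
var_dump((bool) "0");    // bool(false)
var_dump((bool) "0.00"); // bool(true)  -- the string
var_dump((bool) 0.00);   // bool(false) -- the float
var_dump((bool) "0.");   // bool(true)

// The safe check casts to a number first:
var_dump((float) "0.00" > 0); // bool(false)
```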

So going back to the original problem, the way to solve it is by fixing the code to either explicitly type cast the variable or use a “greater than” check:

$results = mysql_query("SELECT * FROM orders");
while($row = mysql_fetch_assoc($results)) {
    if((float) $row['order_total'] > 0) {
        echo "Order total is clearly not zero!";
    }
    else {
        log_bad_order($row['id']);
    }
}

If you don’t do either of these things, I’d suggest you go and double check some of your code.

Representing hierarchical data in MySQL

I’ve always wondered if there was a better way to manage nested data structures (such as product categories) in MySQL. Today I stumbled across a solution called the Nested Set Model.

The only change I made to their solution is to rename what they call “category_id” to “sort_id”. Then I added a primary key called “id” to the table. This way, I have an immutable ID I can use in the application (such as for URL deep linking). For example:

CREATE TABLE nested_category (
  id INT AUTO_INCREMENT PRIMARY KEY,
  sort_id INT NOT NULL,
  name VARCHAR(20) NOT NULL,
  left_sort INT NOT NULL,
  right_sort INT NOT NULL
);
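To make the left_sort/right_sort encoding concrete, here’s a pure-PHP sketch over some made-up rows (in practice you’d run the equivalent subtree SELECT against the table above). A node is a descendant of another exactly when its left/right values nest inside the other’s:

```php
<?php
// Hypothetical sample rows from nested_category. Electronics contains
// everything; Portable contains MP3 Players.
$categories = array(
    array('id' => 1, 'name' => 'Electronics', 'left_sort' => 1, 'right_sort' => 8),
    array('id' => 2, 'name' => 'Televisions', 'left_sort' => 2, 'right_sort' => 3),
    array('id' => 3, 'name' => 'Portable',    'left_sort' => 4, 'right_sort' => 7),
    array('id' => 4, 'name' => 'MP3 Players', 'left_sort' => 5, 'right_sort' => 6),
);

// Return the names of every node under $parent -- the same logic as the
// classic nested-set subtree SELECT, done in PHP for illustration.
function subtree(array $rows, array $parent) {
    $result = array();
    foreach ($rows as $row) {
        if ($row['left_sort'] > $parent['left_sort']
            && $row['right_sort'] < $parent['right_sort']) {
            $result[] = $row['name'];
        }
    }
    return $result;
}

print_r(subtree($categories, $categories[2])); // everything under Portable
```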

Down But Not Out… Sun on the Other Hand…

For a brief period, the site was down. I was moving to a more permanent host. Special thanks to Brian for hosting my sites all these years. =)

Anyway, yes, I do keep this site in mind. And for any of you paying attention, I hope it’s not the end of the (open source database) world that Oracle bought Sun. I think it’s funny that Oracle just bought Sun for a price that puts MySQL’s value at 1/7th Sun’s value. Maybe instead of buying up MySQL, Sun should have been focusing on their own business strategy. And they did it during the hardest possible economic times. Moronic.

Oh well. As they say, “when the tide goes out, you can see who’s not wearing shorts,” right? I do feel bad for MySQL though. They dodged the Oracle Bullet only to get caught under the Oracle Steamroller.

Debugging Tips for Database Abstraction

Today I want to talk about database query debugging in large systems. The main problem is that in large applications, it becomes difficult to find the source of rogue queries (the kind that, for example, broke in a recent system update). This may not readily apply to most of you, but bear with me: some day it will.

Pretend for a moment you have a database architecture where you have 2 masters (dual replication) and 2 read-only slaves. Now pretend that you have a large application with 100 different pages/scripts. You have 5 web servers with mirror copies of the application. This would be a fairly typical setup for a small, but growing company.

One day, you come into work and find out that you had a bad transaction lock that caused your system to hang all weekend. So you look at the process list and you know what query is causing the problem (because it’s still stuck). The problem is that it looks suspiciously like the queries you’d find on virtually every page in your application. How do you fix this problem? A different (but related) problem is when an update initially executed on one master database server replicated to a slave and got stuck on the slave but executed fine elsewhere. What happened? Which master server got the initial query? This sort of problem is very difficult to track down without more information, such as where the query was initially sent and from what page it originated.

The primary challenge is figuring out which query came from what page in your application. The solution is to add logging straight into your queries. The implementation looks something like this:

//Get the current page or script file
$source = $_SERVER['REQUEST_URI'] ? $_SERVER['REQUEST_URI'] : $_SERVER['SCRIPT_FILENAME'];
//Replace out any comment tags and add in the database being connected to
$metaData = str_replace(array('/*', '*/'), array('/ *', '* /'), $source) . " ($databaseHost)";
//Escape the query so the URI can't be used to inject data
$metaData = mysql_real_escape_string($metaData);
//Execute the query
$result = mysql_query("/* $metaData */ " . $query, $connection);

This solution inserts a comment into your query that gives you useful information when looking at the raw query. MySQL supports C-style comment blocks (the /* */), which are ignored by the parsing engine. This means you can pass data along with the query, which can be useful for debugging. These comments are also replicated down to the slaves, which helps when you find a slave having problems with a query that came from a master server. For those of you unaware, the “URI” here refers to the path and query string of the URL used to access the page.

But make sure that you correctly sanitize the URI so that somebody can’t arbitrarily end your comment block (with a */) and inject their own nonsense into your query. Also, considering issues like multi-byte character attacks, I don’t even want to take the risk of not further escaping the data with a call to mysql_real_escape_string.
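Pulled together as a standalone helper, the tagging logic might look like this (the function name is made up, and addslashes() stands in for mysql_real_escape_string(), which needs a live connection):

```php
<?php
// Build a comment-tagged query. The comment records where the query came
// from and which database host it was sent to.
function tagQuery($query, $source, $databaseHost) {
    // Break up any comment delimiters so the URI can't close our comment
    // block early and inject SQL
    $metaData = str_replace(array('/*', '*/'), array('/ *', '* /'), $source);
    // addslashes() used here for illustration; use the driver's escape
    // function against a real connection
    $metaData = addslashes($metaData . " ($databaseHost)");
    return "/* $metaData */ " . $query;
}

echo tagQuery('SELECT * FROM orders', '/orders/list.php?page=2', 'db1'), "\n";
```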

The solution we use at my work logs the web server IP, database server IP, and script path/URI. Other potential ideas are local timestamps, version information, user IDs, and session IDs.

In conclusion, this solution will help you identify the source (and sometimes the destination) of queries that are causing problems. This has been used in our production environment at work often when trying to determine what pages are producing extremely slow queries. This solution should work with any database, although my example is written for MySQL.

Happy debugging!

The Secret of SQL_CALC_FOUND_ROWS

Today, I wanted to go over a relatively simple MySQL feature that a lot of people don’t understand: SQL_CALC_FOUND_ROWS. To use this mystical keyword, simply put it in your query right after SELECT. For example:

SELECT * FROM USER WHERE id > 10 LIMIT 2,1 -- skip two, see just the third record

Becomes

SELECT SQL_CALC_FOUND_ROWS * FROM USER WHERE id > 10 LIMIT 2,1

This won’t change your results. It may, however, make your query run slower than selecting just one row the regular way. What this statement does is tell MySQL to find out how many total records match your criteria (in this case, rows where id is bigger than 10). For example, if the user table has 100 records with an id bigger than 10, the query will take as long as it would have taken the engine to find all 100 of them.

The returned result will still be one of the records you are expecting (in this case, the third record it found). But here is where the magic starts: if the very next query you run is a special select statement, you will have access to the total that was found. As in:

SELECT FOUND_ROWS(); -- returns 100

The MySQL documentation on this subject says:

[SELECT FOUND_ROWS()] returns a number indicating how many rows the first SELECT would have returned had it been written without the LIMIT clause. In the absence of the SQL_CALC_FOUND_ROWS option in the most recent SELECT statement, FOUND_ROWS() returns the number of rows in the result set returned by that statement.

No matter what your LIMIT clause looks like (such as LIMIT 10, 1), this second query will still return the same number (in this example, 100). Why is this useful? Pagination. Oftentimes, beginners (including me a few years ago) are stuck doing something like this:

SELECT count(*) FROM USER WHERE id > 10 -- figure out how many total records there are
SELECT * FROM USER WHERE id > 10 LIMIT 50, 1 -- skip 50, get the 51st record

People do this because you need the total to know if other matching results exist or what the last page number is.

This requires the engine to run the same query twice. This can be disastrous in cases where that query already takes a very long time to run. By including SQL_CALC_FOUND_ROWS, the overhead of running that count is grouped up with the process of actually retrieving the row of interest. So while the initial query might take a little longer to run than if you hadn’t tried to do a count, it is definitely faster than running the same query twice.

To take this to the next level, your pagination code should omit the use of SQL_CALC_FOUND_ROWS in subsequent page loads by caching the total count in the URL or session.
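Putting the pieces together, here’s a sketch of the pagination flow (the table and column names are hypothetical, and the FOUND_ROWS() call would of course run against a live server):

```php
<?php
// Build the pair of queries for one page of results. The second query
// must be the very next one sent on the same connection.
function paginationQueries($perPage, $page) {
    $offset = ($page - 1) * $perPage;
    return array(
        'data'  => "SELECT SQL_CALC_FOUND_ROWS * FROM user WHERE id > 10 LIMIT $offset, $perPage",
        'total' => 'SELECT FOUND_ROWS()',
    );
}

// Once FOUND_ROWS() comes back (say, 100), the page math is simple:
function lastPage($totalRows, $perPage) {
    return (int) ceil($totalRows / $perPage);
}

print_r(paginationQueries(10, 3));
echo lastPage(100, 10), "\n"; // 10
```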

Happy hunting!

MySQL 5 and Condition

The word “condition” is apparently a reserved word in MySQL 5. It was not in MySQL 4. Thus, if you have a query where you do something like this:

SELECT the_field_name AS condition FROM the_table

That now fails with a syntax error. Quoting the alias in backticks (AS `condition`) or picking a different alias works around it.

Quicker Ways to Import Data into InnoDB

For those of you who ever have to put data into a MySQL database, here’s a quick little tip. If you’re importing into a table that uses the InnoDB engine (versus MyISAM), wrap your import file in the following:

SET AUTOCOMMIT=0;
SET UNIQUE_CHECKS=0;
… SQL import statements …
SET UNIQUE_CHECKS=1;
COMMIT;

This will significantly improve the import time (by orders of magnitude for imports beyond tens of thousands of records).

Edit: Quick point of reference. Importing 60,000 records (simple inserts) took about 10 minutes, or roughly 6,000 per minute. After switching this fix in, I inserted 10,000 more records in under 10 seconds (so fast, I forgot to time it).
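For the curious, the wrapper is easy to generate programmatically. Here’s a small hypothetical helper that wraps a list of insert statements before they’re written to an import file:

```php
<?php
// Wrap a list of SQL statements in the InnoDB speed-up pragmas described
// above. The statements themselves are made up for illustration.
function wrapImport(array $statements) {
    return implode("\n", array_merge(
        array('SET AUTOCOMMIT=0;', 'SET UNIQUE_CHECKS=0;'),
        $statements,
        array('SET UNIQUE_CHECKS=1;', 'COMMIT;')
    ));
}

echo wrapImport(array("INSERT INTO user (name) VALUES ('Michi');")), "\n";
```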

MySQL Timestamp and NULL – ANNOYING Part 2

Today I discovered a new bug/feature in MySQL that is very annoying and ALSO involves NULL (see previous post). In short, a NULL-enabled column can’t be set to NULL if it is of type timestamp and has a non-NULL default.

So here’s the run down.

If you create a timestamp column that is…

  • NULL enabled and has a default value to NULL, things work exactly as you imagine. You can set it to NULL, you can set it to other time values, and then set it back to NULL. No problem.
  • NULL enabled and has a default value of “0000-00-00 00:00:00” or CURRENT_TIMESTAMP, it can NOT be set to NULL. Ever. If you try to set it to NULL, it will instead *DANGER DANGER* set it to CURRENT_TIMESTAMP!

Did you catch that? NULL is converted to CURRENT_TIMESTAMP in a NULL-enabled column that doesn’t have NULL as its default value. Can you say insanely stupid bug? Well, supposedly this is a legacy feature, but you know how I feel about that.

To fix this, make sure your NULL enabled timestamp columns are correctly defaulted to NULL.

I confirmed this DOES NOT apply to datetime columns. Great.

I Hate Magic Quotes

Today, I’m going to give away some source code! Celebrate! I wrote the code to address a relatively common problem among new programmers: the over-reliance on Magic Quotes.

Do you know what Magic Quotes are? It’s the annoying feature in PHP that goes around randomly (okay, not so random, per se) modifying your data to protect you from yourself. If you have it turned on and someone types in crap in your web form, well… Let me show you.

Original:

I hate magic quotes because it’s awesome at screwing up otherwise good content. And if you’re unlucky and decide to edit your text, it tends to add even more backslashes into your pretty content. Thus, stuff like ‘\’ becomes \’\\\’.

Freak version after I decide to edit the content:

I hate magic quotes because it\’s awesome at screwing up otherwise good content. And if you\’re unlucky and decide to edit your text, it tends to add even more backslashes into your pretty content. Thus, stuff like \’\\\’ becomes \\’\\\\\\’.

Awesome, huh? Do you see that freak-show at the end there? That’s right: if you forget to strip backslashes out of your content before you let people edit stuff they’ve saved, it can get progressively worse, adding in backslashes like your mother added veggies to your meal when you were a kid. It’s that bad. And it’s very common.

WHY?

Of course, the logic for the feature is obvious. The designers of PHP decided that it was better for the content to get jacked up than for millions of developers everywhere to get fired for letting 15-year-old hackers run “DROP DATABASE” on their corporate servers (for those of you who don’t know what I mean, that command equates to Armageddon on your servers).

But I still hate it. It smells of noobishness, and it encourages sloppy coding. Having to strip slashes out of your code is not the way you should be doing things. There are four reasons I argue against magic quotes.

  1. If you are having to use magic quotes, you’re already committing tons of SQL sin.
  2. Not all servers you work with will have magic quotes on by default. Programming for security should be a defensive practice, and, thus, programmers should be trained to assume the least secure environment.
  3. Magic quotes alone won’t protect you from SQL injection attacks. Character encoding can be used to pass in un-escaped single quotes.
  4. This feature will be gone in PHP 6.

I’m not going to go too much into #1 because that list is way too long to cover. In short, you should be using prepared statements to minimize SQL injection, and a single, unified database abstraction class to handle all your querying so your defenses live in one place.

The second point is important. You can’t be a great body guard if you assume nothing is ever suspicious. Same goes for being a programmer writing secure SQL. Of course, the whole point of magic quotes is to prevent the body guard from forgetting to check one of the closets, even though he checked every other room in the 1000 room house. People forget, and that’s unfortunate. Magic quotes is that paranoid body guard that goes around handcuffing anything that moves, and it’s your job to go around and free each person. It’s the guy that walks around nailing every door shut and insists everybody lives in a bullet proof glass box. There has to be a better way, and there is.

The third point is the one novices rarely know. You can pass characters into a query that then get converted, thanks to Mr. addslashes, into an unescaped single quote. This relates to the inconsistent handling of multi-byte character encodings. As the article quoted here mentions:

Whenever a multi-byte character ends in 0x5c (a backslash), an attacker can inject the beginning byte(s) of that character just prior to a single quote, and addslashes() will complete the character rather than escape the single quote. In essence, the backslash gets absorbed, and the single quote is successfully injected.

Yeah, it’s gibberish to me too. The point is, certain byte sequences in certain multi-byte character sets can screw up the escaping and leave you with a hanging single quote. Magic quotes don’t save you from that.

Lastly, PHP 6 isn’t going to have magic quotes on by default. Might as well take off those training wheels now.

The Fix

I have two solutions for you.

The first is to write a database abstraction layer. What’s that? A database abstraction “layer” is a fancy term for a class that manages your database connection and data manipulation. I gave a brief example of how to write one of these a while back. Another option is to create a function (or method, if you’re talking about classes) that does INSERT and UPDATE querying for you. Such a function’s prototype would look like this:

function perform($tableName, $data, $whereClause)

The $tableName variable is a string, such as “user”. The $data variable is an array where each key is a column name and each value is the value to be assigned to that column. This data is sanitized (addslashes and what not) before being inserted. The $whereClause variable is used as a suffix to an UPDATE query (ex. “WHERE user_id = 1”). This method commonly exists in many open source projects.
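Here’s a rough sketch of how such a perform() function might be implemented (this is an illustration, not code from any particular project; addslashes() stands in for a proper driver-level escape function):

```php
<?php
// Build an UPDATE when a WHERE clause is given, an INSERT otherwise.
// Values are escaped; the WHERE clause is assumed to be pre-escaped,
// which is exactly the weakness discussed below.
function perform($tableName, array $data, $whereClause = '') {
    $escaped = array();
    foreach ($data as $column => $value) {
        $escaped[$column] = "'" . addslashes($value) . "'";
    }
    if ($whereClause !== '') {
        $pairs = array();
        foreach ($escaped as $column => $value) {
            $pairs[] = "$column = $value";
        }
        return "UPDATE $tableName SET " . implode(', ', $pairs) . " $whereClause";
    }
    return "INSERT INTO $tableName (" . implode(', ', array_keys($escaped)) . ')'
         . ' VALUES (' . implode(', ', $escaped) . ')';
}

echo perform('user', array('name' => "O'Reilly"), 'WHERE user_id = 1'), "\n";
```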

The problem with this method is that you still have to escape variables manually for the WHERE clause. And it doesn’t even cover how to do SELECT statements safely.

So I sat here thinking for a bit about this, trying to think of a simple, elegant solution to this problem that would help PHP beginners everywhere. Before I hand out my solution for free, let’s go over the main points:

  1. SQL injection attacks (vulnerabilities) come from putting user provided input directly into SQL queries. User input comes from $_POST, $_GET, and sometimes $_COOKIE.
  2. Data that is passed around by the developer that was retrieved from the database is mostly safe since it has already been sanitized (if sanitization happened before it was loaded in).
  3. Data from files, XML, or other forms of potential stream input need to be sanitized as well. But this would be done manually.

The Class

That said, I wrote a class that solves the main point. With it, all POST, GET, request, and cookie data can be accessed through a nice clean abstraction layer. The goal is that if you’re using this, you’d avoid using un-sanitized data unless you meant to. For example, to access the $_GET['name'], $_POST['name'], $_REQUEST['name'], or $_COOKIE['name'] variables, you’d call:

$safe = new DbSafe();
// O'Reilly becomes O\'Reilly
$name = $safe->get('name');
$name = $safe->post('name');
$name = $safe->request('name');
$name = $safe->cookie('name');

If you wanted to get the original unmodified values, you’d call:

$safe = new DbSafe();
// O'Reilly is still O'Reilly
$name = $safe->get('name', TRUE); // notice the second parameter
$name = $safe->post('name', TRUE);
$name = $safe->request('name', TRUE);
$name = $safe->cookie('name', TRUE);

That even takes into consideration whether or not magic quotes are on. In other words, if magic quotes are on and your variables are getting slashed up, the code I show above would spit out the original version that was typed in by the user. What good is my library if it didn’t do some auto-detection, eh? =)

If you wanted to escape a value manually, you’d say:

$safe = new DbSafe();
// It's becomes It\'s
$escapedValue = $safe->escape($value);

Or…

// It's becomes It\'s
$escapedValue = DbSafe::escape($value);

If you wanted to escape an entire array, you’d say:

$safe = new DbSafe();
$escapedArray = $safe->escapeArray($array);

Or

$escapedArray = DbSafe::escapeArray($array);

All of these examples would convert a string (or an array of strings) that said:

Hello, my name’s Michi

To:

Hello, my name\’s Michi

When you save this into the database, that little backslash disappears, so the next time you read it, it looks like this:

Hello, my name’s Michi

No need to strip anything! If you want to directly access the values without stupid slashes being automatically added in (“magically,” if you will), my class supports that as a secondary measure.

Is this class the end-all-be-all for secure programming? No. Really, the better solution that I won’t give away today is to write a strong database abstraction layer. But this will do most of your dirty work without requiring magic quotes, and without making developers think PHP has some kind of built in “security.” Remember, you can’t always rely on magic quotes being on, nor should you.
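To make that concrete, here’s a minimal sketch of what a DbSafe-style class might look like (an illustration, not the actual class linked below: addslashes() stands in for a driver-aware escape function, and only the GET and POST accessors are shown):

```php
<?php
// Minimal sketch: escape on read, and undo magic quotes first if the
// server already slashed the input.
class DbSafe
{
    public static function escape($value)
    {
        // addslashes() used for illustration; a real class would use the
        // database driver's escape function
        return addslashes($value);
    }

    public static function escapeArray(array $values)
    {
        return array_map(array('DbSafe', 'escape'), $values);
    }

    public function get($key, $raw = false)
    {
        return $this->fetch($_GET, $key, $raw);
    }

    public function post($key, $raw = false)
    {
        return $this->fetch($_POST, $key, $raw);
    }

    private function fetch(array $source, $key, $raw)
    {
        if (!isset($source[$key])) {
            return null;
        }
        $value = $source[$key];
        // Auto-detect magic quotes and strip their slashes so we never
        // double-escape (the function is gone in modern PHP, hence the check)
        if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc()) {
            $value = stripslashes($value);
        }
        return $raw ? $value : self::escape($value);
    }
}

$_GET['name'] = "O'Reilly";
$safe = new DbSafe();
echo $safe->get('name'), "\n";       // O\'Reilly
echo $safe->get('name', true), "\n"; // O'Reilly
```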

You can get the source here.

Left Join Snafu

How embarrassing. I learned something new today that I really should have known years ago. Left joins can increase the result set size.

Here’s what I thought left joins did: When you combine two tables with a left join, the source table (the one on the left) becomes the “anchor” for the results, guaranteeing that each and every record in the left table shows up in the result. If there are records in the right table that don’t correspond, those records are omitted. If there are records in the left table that don’t have corresponding records in the right table, those records are shown either way. For example…

Let’s say table A has 10 records pertaining to people’s names. And table B has five records pertaining to where those people live. Nobody lives in two places.

If you did a left join on these two tables, you’d end up with five people and their addresses and five people (NULL sets) with no address information.

And…

Let’s say table A has 10 records pertaining to people’s names. And table B has 12 records pertaining to where those people live, where each person in A has a record in B. But two of those records don’t match up with anything in table A because some person records were accidentally deleted (oh no!).

If you did a left join on these two tables, you’d end up with 10 people with information about where each one lives. The extra records in B are simply ignored. 

Okay. That part was easy. Everybody knows that, even your grandmother. Let’s take this a few notches up.

Now if table A has 10 records pertaining to people’s names. And table B has 15 records pertaining to where people live. And this time, those extras are no mistake! Because a bunch of people live in two places, thanks to vacation homes.

If you did a left join on these two tables, what happens? Well, embarrassingly, I predicted this sucker wrong. Assuming all 10 people from A are mentioned in B with some mentioned twice or more, the result would have 15 records!! What!? 15!? Yeah, that was my reaction too. I thought MySQL would spit back 10 and ignore duplicates in B.

Let’s do one more example. How many records will we find if we join the following scenario:

Table A has 10 records pertaining to people’s names. And table B has 15 records pertaining to where people live. One guy has 15 vacation homes and everybody else is homeless (no records in B).

Ok. Do a left join. Not an inner join. Not a regular join. A left join. How many results do we get, huh?

Our result would be 24! Who the hell guessed that? Well, probably some of my more pretentious Computer Science readers, but certainly not me (so that’s what you learn in CS, huh?). It is 24 because you have 15 duplicate records for the one rich guy and 9 default records for the homeless saps. 

Thus, when each row in B matches at most one row in A, the maximum number of records a left join can yield is sizeof(record set A) + sizeof(record set B) - 1. (And if rows can match many-to-many on both sides, the worst case balloons all the way to sizeof(A) * sizeof(B).) Why is this never explicitly mentioned!
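If you don’t believe it, the row-count behavior is easy to reproduce with a nested-loop left join in plain PHP:

```php
<?php
// A pure-PHP nested-loop left join, just to verify the row counts above.
// One output row per match in $right; if nothing matches, one row with
// a NULL filled in for the right side.
function leftJoin(array $left, array $right, $key) {
    $result = array();
    foreach ($left as $l) {
        $matched = false;
        foreach ($right as $r) {
            if ($r[$key] === $l[$key]) {
                $result[] = array_merge($l, $r);
                $matched = true;
            }
        }
        if (!$matched) {
            $result[] = array_merge($l, array('home' => null));
        }
    }
    return $result;
}

// 10 people; person #1 has 15 vacation homes, everyone else is homeless
$people = array();
for ($i = 1; $i <= 10; $i++) {
    $people[] = array('person_id' => $i);
}
$homes = array();
for ($i = 1; $i <= 15; $i++) {
    $homes[] = array('person_id' => 1, 'home' => "home $i");
}

echo count(leftJoin($people, $homes, 'person_id')), "\n"; // 24
```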

For a long time, I thought left joins meant the result set can never be more than the row count of the result set in the left table. I don’t know how I managed to go through this many years without realizing my error, but I suppose through good query structuring and table use, I never encountered a problem with this until now… And, to my credit, it wasn’t a query I wrote either.

I have never seen this behavior mentioned in any documentation (even MySQL documentation). It seems to be an implicitly assumed function of the command. In fact, I found several examples out in “tutorials” about left joins, that conveniently left out mentioning this fact, but still showed it as an unexplained portion of their results. Nice.

For all of you non-Computer Science gurus, I hope you learned something new from reading this post. It wasted about an hour of my time.