Something New #6 and #7

I forgot to post something new I learned yesterday so today I will post two. One will be about IE7 and the other will be a highly educational MySQL factoid and index tutorial.

I found a bug in IE7 today. I won’t go and say it’s confirmed or official, but I’d say it’s a probably a bug. I wrote a script that reads a PDF file into a variable, cleans out the output buffer (so no garbage appears above the PDF), sets the content type as a PDF, and then dumps the data. What a good, well-behaved browser would do is say, “Oh, it’s a PDF. Let’s download it.” A bad one might say, “Screw you. This is text.” and show a bunch of gibberish. What IE7 did was say “Document not found, sucka!”

Huh? This kept happening in IE7, but did not happen in Opera. After a lot of head banging of the non-Metallica type, I tried dumping the PDF contents straight to the output, producing tons of gibberish. IE7 loved that. My end user did not. But suddenly, a page that “didn’t exist,” did. But as soon as I sent those pesky headers that tell the browser, “Yo, pay attention, this is a PDF file!” IE7 barfed up an error.

After further investigation, I found that the culprit was output buffering. Is this really IE7’s fault? Perhaps not. I can’t say. But what I can say, is that it only happened on IE7. Output buffering is a cool feature in PHP that basically hoards up any output your script made, and gives it to the browser all at once, rather than spitting out little pieces of output as it gets it. This is cool because you can send headers to the browser (such as telling it that the file is a PDF, or to redirect to another page), without worrying if the browser already got output. You see, once the browser has some HTML, it ignores any headers you send it.

So I had output buffering on. And I was telling PHP to clear out the buffer so that the PDF data would start on a clean slate. That was causing IE7 to think there was NOTHING there. Sure, maybe PHP had some weird quirk. Maybe the application I was working on had a I wasn’t aware of regarding how it used output buffering. Maybe. But “document not found?”

I blame IE7.

The other thing I learned today is that a poorly placed OR clause in an SQL query can completely screw up the indexes that query uses, making an otherwise very fast query into an extremely slow and stupid one, even if the fields in question are all indexed. To illustrate the problem I had, the following is a simplified version of the original query that ran absolutely crappy:

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE indexedAColumn = ‘something’ AND (indexedBColumn = ” OR indexedBColumn IS NULL);

(I was looking for blank indexedBColumn records)

There are indexes on all columns involved, which means the queries should run fast. What less experienced database programmers don’t know is that database engines analyze your query to determine which index to use. You see, an index is like those letter tabs in a phone book — they speed up finding specific values because the database knows where to look in its phone book when you ask it to find “something.”

Grab a helmet. This is a crash course.

Well, if I told you to find a person with the last name, “Smith,” in a phone book, you would turn straight to the “S” section and then straight to the “Sm,” “Smi,” “Smit,” and finally “Smith.” If I told you to find “Jason Smith,” you would now look at the first name, which is also conveniently sorted alphabetically. The search for the last name is like a search using a regular index. Finding a first name is easy, even if there are 1000 Smiths, because the first name is also sorted. This is basically a two-key index, a situation where two fields are sorted together. If I told you to count how many Jason Smiths existed, you would be able to tell me as soon as you found the first and last Jason. This is the magic of indexing — you know where to look for information that the index is designed to optimize.

But if I told you to find a Smith which a phone number that started with a 4, it would be a whole different beast. After you found the “Smith” section, you would have to read and analyze every single phone number. If the Smith section was 100,000 people, you would have to look at all 100,000 entries before you knew how many Smiths had a phone number that started with a 4. It could be 0. It could be 100,000. Let’s pretend there are 100 Smiths with phone numbers that start with a 4. How many records would you have to read to find that out? Well, if you didn’t know there are 100, you would have to look at 100,000 records before you can conclude the answer is 100. This is what happens when you search on a field that is NOT INDEXED. Even if one part of your search *IS* indexed, if the other half is not, you still have a ton of foot-work to finish to get your desired result.

Now compare that to counting all “Adam Smiths.” Let’s say there are 100 Adam Smiths as well. How many records would we have to look at to find out there are 100 of them? 100. Maybe 101 if we’re stupid. You flip to the first Adam you see, and start counting. And for you purists out there, note that “flipping to Adam” is *a lot faster* than “reading to Adam.” You scan. A human can find a word in a dictionary in seconds, but that person is not actually reading the records before or after that entry. That is the beauty of indexes.

So back to my original problem.

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE a.indexedAColumn = ‘something’ AND (b.indexedBColumn = ” OR b.indexedBColumn IS NULL);

So this query uses indexes. It is fast. Or at least it should be. What I found out was specific to my database problem and the way MySQL automatically determines what index to use. I give this warning because you probably won’t be able to replicate this problem here. Anyway, there is a command in MySQL called EXPLAIN that shows exactly what indexes MySQL plans to use when that query is executed. Without getting too technical, MySQL decided it would use the index in B first, and manually read through A to prune down the results. Why is this a problem?

B.indexedBColumn is highly redundant. In fact, it was almost not worth indexing. It is like having a 10,000 page dictionary with 26 tabs in it — one for each letter. Sure, they help, but not a whole lot without sub-tabs and sub-letter groupings. Here is a diagram of what I was hoping to do:

What this diagram tries to explain is that the database still did the join in the right direction. There were never more results than what was found in A, even if a ton more matched up in B. But what the database did completely and utterly wrong was use B as the initial filter. If it used A first, it would have two records to use. Then it could look up each related record in B and figure out which one it wanted (resulting in the desired 1 result). If it used B first, which it did, it had some outrageous number of records to then look up in A. Looking up tens of thousands of records is slow. It doesn’t matter if there are indexes or not, looking up that many specific records takes forever. To find that many records, it’s faster for the database to read and compare every record in A, comparing it to a huge, 60,000 record candidate list. To make things even worse, these newly looked up records resemble a bunch of pages torn out of a dictionary — their index is no longer useful. Thus after pulling out all these records, the database now had to look through all of them for the desired value.

Well, the good news is that MySQL is impressively fast. And that crazy operation of comparing 60,000 records would take a minute or less. Unfortunately, I used example numbers in this post; my real life problem had A MILLION records that matched.

How did we fix it? Well, I realized something very interesting that was happening. That OR clause was confusing MySQL to the point that it didn’t know what index was best. If I took one or the other part of that OR statement out, things ran smoothly:

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE a.indexedAColumn = ‘something’ AND (b.indexedBColumn = ” OR b.indexedBColumn IS NULL);

Became:

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE a.indexedAColumn = ‘something’ AND b.indexedBColumn = ”;

This worked exactly like you would expect. For those of you wondering, since I knew the initial result in tableA was extremely small (maybe 10 records, tops), I thought throwing in that OR statement would do little harm.

Well, that’s the problem with letting the database decide what indexes to use. I could have specified the index, but it was 9PM, and I was tired, so I simply removed the IS NULL part and left it at that. My original intention was to find all empty records, so I simply decided to stop using NULL columns instead (which is wiser, in my opinion, anyway). The real solution would be to force MySQL to use the right index, but that’s a whole different article.

The point is that you should be aware of how indexes work. MySQL only uses one index per table per query. Indexing every column in your table won’t do much good if you’re doing a lot of multi-column selects, and it will eat up a lot of hard drive space and cause more CPU and disk write overhead on every insert or update since those require the indexes to be updated too. And if you are doing complicated queries across multiple tables, you should be especially careful on the order of how indexes are being assessed. Notice my point wasn’t that the indexes weren’t be considered, it was the order they were being considered. When you use the EXPLAIN command, it returns the indexes in the order the engine plans to use them. The syntax for this command is:

EXPLAIN the_select_query

More information here.

One thought on “Something New #6 and #7”

  1. On that IE7 bug:
    i’ve noticed that problem in IE6 too, so appears it’s more of IE problem:) i was using ufpdf, sending right headers (content-type, size etc) and still if you tried to open the file, IE said “sorry, doc not found”.

Comments are closed.