Background Checks Are a Strange Beast

Today, I learned someone else’s name came up on a background check under my social security number (as well as my name). Curious, I looked up my credit report from other sources, and discovered only the original background check linked my SSN to this mysterious person. But there is no suspicious activity. My credit is just fine. And the other credit checks only see my name attached to the number. So in conclusion, I learned background checks can turn up pretty wild and crazy results… That, or I am missing something really important.

Catching up (8, 9, 10)

I’ve been a little busy with moving, so that is my excuse for being AWOL. Let’s continue.

I learned that Ikea dressers require far more effort to build than beds. Probably twice as much time and effort.

I learned that the California Department of Corporations has wacky accounting that seems to work against their best interest. I just received a full refund for a processing fee they kept requesting last year. Strange.

I learned that it is unexpectedly rare to find web developers who have tried creating their own object-oriented database abstraction layer. I found it is even more rare to find developers who took this abstraction layer and made the (in my opinion) relatively obvious step toward creating a generalized abstraction layer that removes the need to write SQL 90% of the time. For those of you who haven’t thought about this before, creating such a layer, the pride and joy of the rails movement, is relatively simple. While there are many schools of thought on how to accomplish this, I think a simple place to start is to setup something like this:

// Notice my example assumes any table you want.
$object = new DBLayer(‘tablename’);
// Runs an equivalent of SELECT * FROM tablename WHERE
// primary_key_field = 30;

$object->load(30);
// Overload PHP5’s __set() method (see documentation)
// store this in an internal array so that table fields like
// ‘tablename’ don’t accidentally erase object settings.
// Thus, “$this->mData[‘username’] = $value;”
// Just see the documentation of __set(). Trust me.
$object->username = ‘new username’;
// runs an equivalent of UPDATE tablename SET username=’new
// username’ WHERE primary_key_field = 30;

$object->save();

There are many ways to figure out the primary key. One idea is to standardize the primary key name so that “tablename” always has a primary key of “tablename_id”. Another idea is to dynamically determine it by running a “DESC tablename” and caching the results. Think it over. It’s an interesting, but highly insightful challenge. My example may be a little advanced, but this is the starting point of those shiny “rails frameworks” you hear about.

Something New #6 and #7

I forgot to post something new I learned yesterday so today I will post two. One will be about IE7 and the other will be a highly educational MySQL factoid and index tutorial.

I found a bug in IE7 today. I won’t go and say it’s confirmed or official, but I’d say it’s a probably a bug. I wrote a script that reads a PDF file into a variable, cleans out the output buffer (so no garbage appears above the PDF), sets the content type as a PDF, and then dumps the data. What a good, well-behaved browser would do is say, “Oh, it’s a PDF. Let’s download it.” A bad one might say, “Screw you. This is text.” and show a bunch of gibberish. What IE7 did was say “Document not found, sucka!”

Huh? This kept happening in IE7, but did not happen in Opera. After a lot of head banging of the non-Metallica type, I tried dumping the PDF contents straight to the output, producing tons of gibberish. IE7 loved that. My end user did not. But suddenly, a page that “didn’t exist,” did. But as soon as I sent those pesky headers that tell the browser, “Yo, pay attention, this is a PDF file!” IE7 barfed up an error.

After further investigation, I found that the culprit was output buffering. Is this really IE7’s fault? Perhaps not. I can’t say. But what I can say, is that it only happened on IE7. Output buffering is a cool feature in PHP that basically hoards up any output your script made, and gives it to the browser all at once, rather than spitting out little pieces of output as it gets it. This is cool because you can send headers to the browser (such as telling it that the file is a PDF, or to redirect to another page), without worrying if the browser already got output. You see, once the browser has some HTML, it ignores any headers you send it.

So I had output buffering on. And I was telling PHP to clear out the buffer so that the PDF data would start on a clean slate. That was causing IE7 to think there was NOTHING there. Sure, maybe PHP had some weird quirk. Maybe the application I was working on had a I wasn’t aware of regarding how it used output buffering. Maybe. But “document not found?”

I blame IE7.

The other thing I learned today is that a poorly placed OR clause in an SQL query can completely screw up the indexes that query uses, making an otherwise very fast query into an extremely slow and stupid one, even if the fields in question are all indexed. To illustrate the problem I had, the following is a simplified version of the original query that ran absolutely crappy:

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE indexedAColumn = ‘something’ AND (indexedBColumn = ” OR indexedBColumn IS NULL);

(I was looking for blank indexedBColumn records)

There are indexes on all columns involved, which means the queries should run fast. What less experienced database programmers don’t know is that database engines analyze your query to determine which index to use. You see, an index is like those letter tabs in a phone book — they speed up finding specific values because the database knows where to look in its phone book when you ask it to find “something.”

Grab a helmet. This is a crash course.

Well, if I told you to find a person with the last name, “Smith,” in a phone book, you would turn straight to the “S” section and then straight to the “Sm,” “Smi,” “Smit,” and finally “Smith.” If I told you to find “Jason Smith,” you would now look at the first name, which is also conveniently sorted alphabetically. The search for the last name is like a search using a regular index. Finding a first name is easy, even if there are 1000 Smiths, because the first name is also sorted. This is basically a two-key index, a situation where two fields are sorted together. If I told you to count how many Jason Smiths existed, you would be able to tell me as soon as you found the first and last Jason. This is the magic of indexing — you know where to look for information that the index is designed to optimize.

But if I told you to find a Smith which a phone number that started with a 4, it would be a whole different beast. After you found the “Smith” section, you would have to read and analyze every single phone number. If the Smith section was 100,000 people, you would have to look at all 100,000 entries before you knew how many Smiths had a phone number that started with a 4. It could be 0. It could be 100,000. Let’s pretend there are 100 Smiths with phone numbers that start with a 4. How many records would you have to read to find that out? Well, if you didn’t know there are 100, you would have to look at 100,000 records before you can conclude the answer is 100. This is what happens when you search on a field that is NOT INDEXED. Even if one part of your search *IS* indexed, if the other half is not, you still have a ton of foot-work to finish to get your desired result.

Now compare that to counting all “Adam Smiths.” Let’s say there are 100 Adam Smiths as well. How many records would we have to look at to find out there are 100 of them? 100. Maybe 101 if we’re stupid. You flip to the first Adam you see, and start counting. And for you purists out there, note that “flipping to Adam” is *a lot faster* than “reading to Adam.” You scan. A human can find a word in a dictionary in seconds, but that person is not actually reading the records before or after that entry. That is the beauty of indexes.

So back to my original problem.

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE a.indexedAColumn = ‘something’ AND (b.indexedBColumn = ” OR b.indexedBColumn IS NULL);

So this query uses indexes. It is fast. Or at least it should be. What I found out was specific to my database problem and the way MySQL automatically determines what index to use. I give this warning because you probably won’t be able to replicate this problem here. Anyway, there is a command in MySQL called EXPLAIN that shows exactly what indexes MySQL plans to use when that query is executed. Without getting too technical, MySQL decided it would use the index in B first, and manually read through A to prune down the results. Why is this a problem?

B.indexedBColumn is highly redundant. In fact, it was almost not worth indexing. It is like having a 10,000 page dictionary with 26 tabs in it — one for each letter. Sure, they help, but not a whole lot without sub-tabs and sub-letter groupings. Here is a diagram of what I was hoping to do:

What this diagram tries to explain is that the database still did the join in the right direction. There were never more results than what was found in A, even if a ton more matched up in B. But what the database did completely and utterly wrong was use B as the initial filter. If it used A first, it would have two records to use. Then it could look up each related record in B and figure out which one it wanted (resulting in the desired 1 result). If it used B first, which it did, it had some outrageous number of records to then look up in A. Looking up tens of thousands of records is slow. It doesn’t matter if there are indexes or not, looking up that many specific records takes forever. To find that many records, it’s faster for the database to read and compare every record in A, comparing it to a huge, 60,000 record candidate list. To make things even worse, these newly looked up records resemble a bunch of pages torn out of a dictionary — their index is no longer useful. Thus after pulling out all these records, the database now had to look through all of them for the desired value.

Well, the good news is that MySQL is impressively fast. And that crazy operation of comparing 60,000 records would take a minute or less. Unfortunately, I used example numbers in this post; my real life problem had A MILLION records that matched.

How did we fix it? Well, I realized something very interesting that was happening. That OR clause was confusing MySQL to the point that it didn’t know what index was best. If I took one or the other part of that OR statement out, things ran smoothly:

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE a.indexedAColumn = ‘something’ AND (b.indexedBColumn = ” OR b.indexedBColumn IS NULL);

Became:

SELECT * FROM tableA a LEFT JOIN tableB b USING (‘common’) WHERE a.indexedAColumn = ‘something’ AND b.indexedBColumn = ”;

This worked exactly like you would expect. For those of you wondering, since I knew the initial result in tableA was extremely small (maybe 10 records, tops), I thought throwing in that OR statement would do little harm.

Well, that’s the problem with letting the database decide what indexes to use. I could have specified the index, but it was 9PM, and I was tired, so I simply removed the IS NULL part and left it at that. My original intention was to find all empty records, so I simply decided to stop using NULL columns instead (which is wiser, in my opinion, anyway). The real solution would be to force MySQL to use the right index, but that’s a whole different article.

The point is that you should be aware of how indexes work. MySQL only uses one index per table per query. Indexing every column in your table won’t do much good if you’re doing a lot of multi-column selects, and it will eat up a lot of hard drive space and cause more CPU and disk write overhead on every insert or update since those require the indexes to be updated too. And if you are doing complicated queries across multiple tables, you should be especially careful on the order of how indexes are being assessed. Notice my point wasn’t that the indexes weren’t be considered, it was the order they were being considered. When you use the EXPLAIN command, it returns the indexes in the order the engine plans to use them. The syntax for this command is:

EXPLAIN the_select_query

More information here.

Scum of the Earth Scammer

If you haven’t already heard, a man by the name of Jason Fortuny has been the center of a news-frenzy when he played a cruel prank on hundreds of people. He created a fictitious Craigslist posing as a woman looking for a sex partner. As many would expect, he received nearly 200 replies, many with photos, emails, phone numbers, and full names. What makes me angry, as well as many other people, is that Fortuny then decided to publish every response, verbatim.

A lot of people initially react with “Well, that’s the victim’s dumb fault.” You know, even I had that knee-jerk thought. No. Stop. Think.

Have you ever bought something from eBay? How would you feel if someone published your personal information after they got it from you on there? Better yet, how would you feel if it was just after you finished bidding on that inflatable sex doll you bought for your buddy as a gag-gift. Now imagine someone – a potential employer, a subordinate, a new acquaintance you just made that wants to see if you’re romantically compatible, your parents, your teacher, your schoolmates – decided to Google you and the very first hit was your name, your email, your phone number, a picture of you, a picture of the inflatable doll, and a small blurb about how much you paid for it. And some morons might get on a high horse and tell you not to trust making purchases from strangers, but the bottom line is that you will never, EVER be able to escape this again. Ever. Every job interview you will ever have has the potential to turn awkward when your new employer stumbles across this (assuming they even call you back after seeing it).

Most of the respondents did not indicate they were married. None of the activity was illegal. And they were only providing the basic information you’d expect if you were trying to find an online date, which by the way, is an increasingly popular way to meet people. How is this much different than my eBay story?

I noticed a lot of commentary saying those victims deserved what they got because they “posted” their private information online by replying. Pretend Craigslist was a newspaper for a moment, say the Washington Post. You see a personal ad for a person and you send them a little information about yourself. The next day, that person posts all 180 responses verbatim in the same newspaper. Never once did YOU post your personal information for the world to see, just like I don’t post my social security number on the Internet when I log into my banking website. There’s implicit (and often legally protected) expectation of privacy when you are communicating to another person over a private medium (email). Email is private. Sure, it goes through public servers, but so does your snail mail when it goes through your local post office and sits in your mail box for a few hours while you come home from work. It’s a federal crime in either instance to intercept that communication without a warrant. I really don’t understand how people can blame the victims like that. Nothing about their responses was public — nothing.

There was even a case of a man who claimed he never contacted Fortuny – that someone else, perhaps Fortuny himself, used his personal information without his consent. There was the instance of a couple who wanted a third partner for a threesome — a lifestyle, they hid from their highly religious friends and family. There was the Microsoft contractor that first heard about this through a co-worker who saw his name. These are random people who’s lives are now turned upside-down for no good reason except as a good laugh to Fortuny.

I think the moron wanted attention. He was actually happy to report his interview with the New York Times. Funny. Does he actually expect any good press coming out of this? The worst part is that Fortuny, as indicated by his journal, is proud of his accomplishments. He has taunted his victims to sue him, and he has even publicly discussed the idea of making money out of this by publishing a book.

Who cares if the victims were looking for sex. That’s not the point. They could have been looking for roommates or hoping to buy old furniture. The point is now that their personal information, along with photos, are now posted online, opening them up to identity theft and public humiliation. This could have been an angry ex posting half naked photos of you, your cell phone, full name, and some dirty AIM chats you had with them a long time ago. Why doesn’t this sort of stuff happen more often? That’s because it’s illegal, and for good reason.

On to the good news.

First, he published his personal information in great detail.  This makes it very easy for people to send him lawsuits. Which introduces the second point.

There is the very long list of potential legal and personal troubles awaiting this idiot.

  • While you can debate his right to publish the emails, he also published photos that he did not own, opening him up to lawsuits for copyright infringement. As this indicates, infringement penalties range from $750 to $150,000 per violation.
  • Many of those very same photos are sexual by nature. How many of the 150 photos he published do you think are of minors? Well, US law requires publishers of sexual material to provide documentation that participants of said photos are of legal consenting age. As indicated in the link, “prison terms up to 5 years and $25k.”
  • He is also open to intentional tort under intentional infliction of emotional distress. What’s interesting is that this is a rare type of tort to win due to the extreme difficulty in proving the action was “heinous and beyond the standards of civilized decency or utterly intolerable in a civilized society.” Usually this is abused as a type of frivolous lawsuit, but I think the victims may have a chance here.
  • Additionally, he published private information. This is another tort claim. The following two points really show just how much of a home run case prosecutors would have against this guy:

One who intentionally intrudes, physically or otherwise, upon the solitude or seclusion of another or his private affairs or concerns, is subject to liability to the other for invasion of his privacy, if the intrusion would be highly offensive to a reasonable person.

One who gives publicity to a matter concerning the private life of another is subject to liability to the other for invasion of his privacy, if the matter publicized is of a kind that

  • would be highly offensive to a reasonable person, and
  • is not of legitimate concern to the public.
  • It’s also likely he broke this privacy law in Washington.
  • It’s an election year. District attorneys are looking for a nice public case to win.
  • He is a Network Administrator, a position typically tightly bundled with trust and company security. I doubt he’ll keep that job much longer, let alone get a new one if he ever gets fired. Judging from how he posted his personal information on his site (a lot of it), it’ll be easy for angry people to track his employers and let them know about this scandal.
  • His victims will file claims all over the country. It will be very expensive and unpleasant for him to defend himself due to all the traveling costs as well as coordinating defenses concurrently across the country. This will severely impact his working life, which, as I indicated in the previous bullet, will be very difficult from which to recover.

I hope he gets buried in court.

Something New #4

I’ve seen a lot of people directly compare Ruby on Rails to PHP. Even after so much media attention, people continue to misunderstand that the valid comparison is PHP to Ruby (with no Rails). Rails is a framework. PHP has frameworks, but they are not (necessarily) called “rails.” Ruby on its own is a language, just like PHP. It has no miraculous way of doing a Flikr photo browser blog in 5 minutes, much like what a popular band-wagon programmer may want to believe. Ruby on Rails, arguably, probably can do a Flikr photo browser blog thingy in 5 minutes. But so could a good PHP framework.

Frameworks make certain common assumptions for the developer to speed up development times. By having predefined database abstraction layers and methods for presenting output, they can significantly reduce the amount of overhead required in producing simple CRUD (Create, Read, Update, Delete) applications.

I’ve looked over Ruby. It’s a neat language. It has its ups, and I can see why some people like its simplicity. But is it better than PHP? Only arguably. Probably only incrementally. But with PHP being a tried-and-tested production class web language, I’d rather bet my money on that horse. Ruby by itself offers no true productivity gains over PHP. And there are now several prominent PHP application frameworks that are closing the gap to Ruby on Rails.