Strong Visions Can Cloud Judgement

A while ago, I was involved in a startup venture. The company staked its future on social networking. We had a business model, angel investments, employees, contracts, marketing plans, and a clear vision. This was over two years ago, before Facebook was truly a mainstream phenomenon. To be fair, if any new social network had a chance to gain traction, that was the time.

Looking back now, I can tell you a million and ten reasons why it was destined to fail, despite having a feature list that, at the time (and even today), was far ahead of most of the competition. Perhaps in another post I can elaborate on the many lessons I learned from that venture, but not today.

A variety of news outlets and blogs have covered Google closing its Answers service. They also cover how Yahoo came in late and cleaned up.

Well, two years ago, while I was still in the planning stages of the startup, my friend (Brian-Ji) pitched me the idea of an answers service. He pitched it very convincingly, and explained why it was destined to become awesome: it would fill a market that was not yet saturated. I cited Google Answers as a reason why it would fail, despite the fact that my startup was competing against Facebook and Myspace. But where’s the fun in being naive if you can’t be a hypocrite, right? We chose to stick with our original social networking idea and abandoned the seemingly random questions idea.

Upon reading the news of Yahoo beating Google down in under a year, I exclaimed to my friend how I should have listened to him. But this is the very next thought that came out of my brain:

Of course, had we done that, it would have been a lame social networking questions hybrid and failed anyway. Ultimately, it would have been a social networking site first, and an answering service second.

At the time, my mind was so preoccupied with one idea that I couldn’t see the full value of another. And even if I had seen the value, I would have screwed up the execution. At least I recognize that today. Let’s call that wisdom.

Posting Arrays

While hacking away on a problem at work today, I suddenly realized my DbSafe class would trip over itself if someone tried to access a value that was an array. It would try to escape the array as if it were a string, leaving you with the text “Array”! What should happen is that you get back an actual array full of escaped values.
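To make the fix concrete, here is a minimal sketch of the idea. It is not the actual DbSafe source: the function name escapeDeep is made up for illustration, and addslashes stands in for whatever escaping the class really uses.

```php
<?php
// Hypothetical helper (not the real DbSafe code): escape a scalar, or walk
// an array and escape every element while keeping the keys intact.
function escapeDeep($value)
{
    if (is_array($value)) {
        // Recurse so nested arrays (for example name="tags[]" form fields) work too.
        return array_map('escapeDeep', $value);
    }
    return addslashes((string) $value);
}

// A string comes back as an escaped string, an array comes back as an array.
$single = escapeDeep("it's");
$tags   = escapeDeep(array("it's", array('a "b"')));
```

The point is simply that the array check happens before any string escaping, so nothing ever gets cast to the text “Array”.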

I’m very sorry for this simple oversight. It has now been corrected. You can get the source here. Let me know if anybody finds any other bugs.

I Hate Magic Quotes

Today, I’m going to give away some source code! Celebrate! I wrote the code to address a relatively common problem among new programmers: the over-reliance on Magic Quotes.

Do you know what Magic Quotes are? It’s the annoying feature in PHP that goes around randomly (okay, not so random, per se) modifying your data to protect you from yourself. If you have it turned on and someone types in crap in your web form, well… Let me show you.

Original:

I hate magic quotes because it’s awesome at screwing up otherwise good content. And if you’re unlucky and decide to edit your text, it tends to add even more backslashes into your pretty content. Thus, stuff like ‘\’ becomes \’\\\’.

Freak version after I decide to edit the content:

I hate magic quotes because it\’s awesome at screwing up otherwise good content. And if you\’re unlucky and decide to edit your text, it tends to add even more backslashes into your pretty content. Thus, stuff like \’\\\’ becomes \\’\\\\\\’.

Awesome, huh? Do you see that freak-show at the end there? That’s right: if you forget to strip backslashes out of your content before you let people edit stuff they’ve saved, it can get progressively worse, adding in backslashes like your mother added veggies to your meal when you were a kid. It’s that bad. And it’s very common.
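If you want to watch the pile-up happen, this little sketch simulates three save/edit/save rounds where nobody ever strips the slashes. It uses addslashes directly, which does the same escaping magic quotes applies to incoming data.

```php
<?php
// Simulate a save/edit/save cycle where the stored text is never unescaped.
$text = "it's";
$history = array($text);
for ($round = 1; $round <= 3; $round++) {
    // Each round escapes the quote again AND escapes every backslash
    // that the previous round added.
    $text = addslashes($text);
    $history[] = $text;
}
foreach ($history as $round => $version) {
    // The backslash count goes 0, 1, 3, 7: it roughly doubles every round.
    echo $round, ': ', $version, ' (', substr_count($version, '\\'), " backslashes)\n";
}
```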

WHY?

Of course, the logic for the feature is obvious. The designers of PHP decided that it was better for the content to get jacked up than for millions of developers everywhere to get fired for letting 15-year-old hackers run “DROP DATABASE” on their corporate servers (for those of you who don’t know what I mean, that command equates to Armageddon on your servers).

But I still hate it. It smells of noobishness, and it encourages sloppy coding. Having to strip slashes out of your data is not the way you should be doing things. There are four reasons I argue against magic quotes.

  1. If you are having to use magic quotes, you’re already committing tons of SQL sin.
  2. Not all servers you work with will have magic quotes on by default. Programming for security should be a defensive practice, and, thus, programmers should be trained to assume the least secure environment.
  3. Magic quotes alone won’t protect you from SQL injection attacks. Character encoding can be used to pass in un-escaped single quotes.
  4. This feature will be gone in PHP 6.

I’m not going to go too much into #1 because that list is way too long to cover. In short, you should be using prepared statements to minimize SQL injection, and using a single, unified database abstraction class to handle all your querying so that the security-sensitive code lives in one place.
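For anyone who hasn’t seen a prepared statement before, here is a rough sketch using PDO. It assumes the pdo_sqlite extension is available, and the table and column names are invented for the example; with MySQL only the connection string would change.

```php
<?php
// Assumes the pdo_sqlite extension; "users" and "name" are made-up names.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)');

$name = "O'Reilly"; // pretend this came from a web form

// The value travels separately from the SQL text, so the quote is just data.
$insert = $db->prepare('INSERT INTO users (name) VALUES (:name)');
$insert->execute(array(':name' => $name));

$select = $db->prepare('SELECT name FROM users WHERE name = :name');
$select->execute(array(':name' => $name));
$stored = $select->fetchColumn(); // comes back exactly as typed, no slashes
```

Notice there is no escaping and no stripping anywhere in that code.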

The second point is important. You can’t be a great bodyguard if you assume nothing is ever suspicious. The same goes for being a programmer writing secure SQL. Of course, the whole point of magic quotes is to prevent the bodyguard from forgetting to check one of the closets, even though he checked every other room in the 1000-room house. People forget, and that’s unfortunate. Magic quotes is that paranoid bodyguard who goes around handcuffing anything that moves, and it’s your job to go around and free each person. It’s the guy who walks around nailing every door shut and insists everybody live in a bulletproof glass box. There has to be a better way, and there is.

The third point is the one novices rarely know. You can pass invalid characters into a query that, thanks to Mr. addslashes, end up producing a bare single quote. This comes down to addslashes working one byte at a time while the database may read the same bytes as multi-byte characters. As the article quoted here mentions:

Whenever a multi-byte character ends in 0x5c (a backslash), an attacker can inject the beginning byte(s) of that character just prior to a single quote, and addslashes() will complete the character rather than escape the single quote. In essence, the backslash gets absorbed, and the single quote is successfully injected.

Yeah, it’s gibberish to me too. The point is that in certain multi-byte character encodings, the backslash that addslashes inserts can get swallowed as the tail end of a character, which leaves you with a hanging single quote. Magic quotes don’t save you from that.
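Here is the byte-level version of that trick, as a small sketch. The byte pair is the commonly cited example for the GBK encoding, where 0xbf5c is a valid two-byte character; whether it actually bites you depends on the character set your database connection uses.

```php
<?php
// 0xbf is the first half of a multi-byte character; 0x27 is a single quote.
$input = "\xbf\x27";

// addslashes() works byte by byte, so it drops a backslash (0x5c)
// in front of the quote and knows nothing about character sets.
$escaped = addslashes($input);

echo bin2hex($escaped), "\n"; // prints bf5c27

// A database reading those bytes as GBK sees 0xbf5c as ONE character,
// followed by 0x27: an unescaped single quote.
```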

Lastly, PHP 6 is slated to drop magic quotes entirely. Might as well take off those training wheels now.

The Fix

I have two solutions for you.

The first is to write a database abstraction layer. What? A database abstraction “layer” is a fancy term for a class that manages your database connection and data manipulation. I gave a brief example of how to write one of these a while back. The second is to create a function (or method, if you’re talking about classes) that does INSERT and UPDATE querying for you. Such a function’s prototype would look like this:

function perform($tableName, $data, $whereClause)

The $tableName variable is a string, such as “user”. The $data variable is an array where each key is a column name and each value is the value to be assigned to that column. This data is sanitized (addslashes and what not) before being inserted. The $whereClause variable is used as a suffix to an UPDATE query (ex. “WHERE user_id = 1”). This method commonly exists in many open source projects.

The problem with this method is that you still have to escape variables manually for the WHERE clause. And it doesn’t even cover how to do SELECT statements safely.
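To show what I mean, here is a bare-bones sketch of such a function. It is my own guess at the general shape, not code from any particular project, and it only builds the query string instead of running it.

```php
<?php
// Hypothetical sketch of perform(): builds an INSERT when there is no
// WHERE clause, otherwise an UPDATE. It returns the SQL instead of running it.
function perform($tableName, array $data, $whereClause = '')
{
    $columns = array_keys($data);
    $values = array();
    foreach ($data as $value) {
        // The $data values get escaped for you...
        $values[] = "'" . addslashes((string) $value) . "'";
    }
    if ($whereClause === '') {
        return "INSERT INTO $tableName (" . implode(', ', $columns)
            . ') VALUES (' . implode(', ', $values) . ')';
    }
    $pairs = array();
    foreach ($columns as $i => $column) {
        $pairs[] = "$column = " . $values[$i];
    }
    // ...but the WHERE clause is pasted in exactly as the caller wrote it.
    return "UPDATE $tableName SET " . implode(', ', $pairs) . ' ' . $whereClause;
}

$sql = perform('user', array('name' => "O'Reilly"), 'WHERE user_id = 1');
```

The last line of the function is the weak spot: anything the caller drops into $whereClause goes into the query untouched.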

So I sat here thinking for a bit about this, trying to think of a simple, elegant solution to this problem that would help PHP beginners everywhere. Before I hand out my solution for free, let’s go over the main points:

  1. SQL injection attacks (vulnerabilities) come from putting user-provided input directly into SQL queries. User input comes from $_POST, $_GET, and sometimes $_COOKIE.
  2. Data that is passed around by the developer that was retrieved from the database is mostly safe since it has already been sanitized (if sanitization happened before it was loaded in).
  3. Data from files, XML, or other forms of potential stream input needs to be sanitized as well. But this would be done manually.

The Class

That said, I wrote a class that solves the main point. With it, all POST, GET, request, and cookie data can be accessed through a nice clean abstraction layer. The goal is that if you’re using this, you’d avoid using un-sanitized data, unless you meant to. For example, to access the $_GET['name'], $_POST['name'], $_REQUEST['name'], or $_COOKIE['name'] variables, you’d call:

$safe = new DbSafe();
// O'Reilly becomes O\'Reilly
$name = $safe->get('name');
$name = $safe->post('name');
$name = $safe->request('name');
$name = $safe->cookie('name');

If you wanted to get the original unmodified values, you’d call:

$safe = new DbSafe();
// O'Reilly is still O'Reilly
$name = $safe->get('name', TRUE); // notice the second parameter
$name = $safe->post('name', TRUE);
$name = $safe->request('name', TRUE);
$name = $safe->cookie('name', TRUE);

That even takes into consideration whether or not magic quotes are on. In other words, if magic quotes are on and your variables are getting slashed up, the code I show above would spit out the original version that was typed in by the user. What good is my library if it didn’t do some auto-detection, eh? =)
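The auto-detection boils down to something like the sketch below. This is not the class source: rawValue is a made-up name, and the function_exists check is there only because get_magic_quotes_gpc() no longer exists on newer PHP versions.

```php
<?php
// Hypothetical helper (not the real DbSafe code): return a request value
// exactly as the user typed it, whether or not magic quotes touched it.
function rawValue(array $source, $key)
{
    if (!isset($source[$key])) {
        return null;
    }
    $value = $source[$key];
    // Guarded because the function is gone in later PHP versions; the @
    // hides the deprecation notice on versions that still have it.
    $magic = function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc();
    // Only undo the slashes if PHP actually added them.
    return ($magic && is_string($value)) ? stripslashes($value) : $value;
}

$name = rawValue(array('name' => "O'Reilly"), 'name');
```

In real use you would pass $_GET, $_POST, $_REQUEST, or $_COOKIE as the first argument.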

If you wanted to escape a value manually, you’d say:

$safe = new DbSafe();
// It’s becomes It\’s
$escapedValue = $safe->escape($value);

Or…

// It’s becomes It\’s
$escapedValue = DbSafe::escape($value);

If you wanted to escape an entire array, you’d say:

$safe = new DbSafe();
$escapedArray = $safe->escapeArray($array);

Or

$escapedArray = DbSafe::escapeArray($array);

All of these examples would convert a string (or an array of strings) that said:

Hello, my name’s Michi

To:

Hello, my name\’s Michi

When you save this into the database, that little backslash disappears, so the next time you read it, it looks like this:

Hello, my name’s Michi

No need to strip anything! If you want to directly access the values without stupid slashes being automatically added in (“magically,” if you will), my class supports that as a secondary measure.

Is this class the end-all-be-all for secure programming? No. Really, the better solution that I won’t give away today is to write a strong database abstraction layer. But this will do most of your dirty work without requiring magic quotes, and without making developers think PHP has some kind of built-in “security.” Remember, you can’t always rely on magic quotes being on, nor should you.

You can get the source here.

Left Join Snafu

How embarrassing. I learned something new today that I really should have known for some number of years now. Left joins can increase the result set size. 

Here’s what I thought left joins do: When you combine two tables together with a left join, the source table (the one on the left) becomes the “anchor” for the results, guaranteeing that each and every record in the left table shows up in the result. If there are results in the right table that don’t correspond, those results are omitted. If there are results in the left table that don’t have corresponding records with the right table, those records are shown either way. For example…

Let’s say table A has 10 records pertaining to people’s names. And table B has five records pertaining to where those people live. No people live in two places.

If you did a left join on these two tables, you’d end up with five people and their addresses and five people (NULL sets) with no address information.

And…

Let’s say table A has 10 records pertaining to people’s names. And table B has 12 records pertaining to where those people live, where each person in A has a record in B. But two of those records don’t match up with anything in table A because some person records were accidentally deleted (oh no!).

If you did a left join on these two tables, you’d end up with 10 people with information about where each one lives. The extra records in B are simply ignored. 

Okay. That part was easy. Everybody knows that, even your grandmother. Let’s take this a few notches up.

Now let’s say table A has 10 records pertaining to people’s names, and table B has 15 records pertaining to where people live. This time, those extras are no mistake, because a bunch of people live in two places, thanks to vacation homes.

If you did a left join on these two tables, what happens? Well, embarrassingly, I predicted this sucker wrong. Assuming all 10 people from A are mentioned in B with some mentioned twice or more, the result would have 15 records!! What!? 15!? Yeah, that was my reaction too. I thought MySQL would spit back 10 and ignore duplicates in B.

Let’s do one more example. How many records will we find if we join the following scenario:

Table A has 10 records pertaining to people’s names. And table B has 15 records pertaining to where people live. One guy has 15 vacation homes and everybody else is homeless (no records in B).

Ok. Do a left join. Not an inner join. Not a regular join. A left join. How many results do we get, huh?

Our result would be 24! Who the hell guessed that? Well, probably some of my more pretentious Computer Science readers, but certainly not me (so that’s what you learn in CS, huh?). It is 24 because you have 15 duplicate records for the one rich guy and 9 default records for the homeless saps. 

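Thus, as long as each record in B matches at most one record in A, the maximum number of records a left join can yield is sizeof(record set A) + sizeof(record set B) - 1. (If records on both sides can match each other many-to-many, the result can grow as large as sizeof(A) × sizeof(B).) Why is this never explicitly mentioned?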
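If you don’t have a database handy, you can convince yourself with a few lines of PHP. The sketch below is a plain-array stand-in for a left join, not how MySQL actually executes one, and the table and column names are invented for the example.

```php
<?php
// Pure-PHP stand-in for:
//   SELECT * FROM person LEFT JOIN home ON home.person_id = person.id
function leftJoin(array $left, array $right)
{
    $rows = array();
    foreach ($left as $person) {
        $matched = false;
        foreach ($right as $home) {
            if ($home['person_id'] === $person['id']) {
                // One output row per matching record on the right.
                $rows[] = array($person['id'], $home['city']);
                $matched = true;
            }
        }
        if (!$matched) {
            // No match: the left record still shows up, padded with NULL.
            $rows[] = array($person['id'], null);
        }
    }
    return $rows;
}

// 10 people; person 1 owns all 15 homes, everybody else has none.
$people = array();
for ($id = 1; $id <= 10; $id++) {
    $people[] = array('id' => $id);
}
$homes = array();
for ($n = 1; $n <= 15; $n++) {
    $homes[] = array('person_id' => 1, 'city' => "City $n");
}

$result = leftJoin($people, $homes);
echo count($result), "\n"; // prints 24: 15 rows for the rich guy + 9 NULL rows
```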

For a long time, I thought a left join could never return more rows than the left table contains. I don’t know how I managed to go through this many years without realizing my error, but I suppose through good query structuring and table use, I never encountered a problem with this until now… And, to my credit, it wasn’t a query I wrote either.

I have never seen this behavior mentioned in any documentation (even MySQL documentation). It seems to be an implicitly assumed function of the command. In fact, I found several examples out in “tutorials” about left joins, that conveniently left out mentioning this fact, but still showed it as an unexplained portion of their results. Nice.

For all of you non-Computer Science gurus, I hope you learned something new from reading this post. Figuring it out wasted about an hour of my time.

Finding Your First Job

I have a lot of friends who are graduating or recently graduated looking for jobs. One thing I’ve noticed is that my peers seem to have highly inflated expectations about what a degree at UCLA means. On occasion, we’ve all read those “how-to” guides on getting jobs. Well, most of them suck. Today, I came across one that really hit the nail on the head.

There was one minor point made that I did not agree with.

Salary.com is a horrible place to go figure out what you’re worth, at least according to how the article frames it. I don’t know if it’s just my experience, but the numbers on that site seem to be randomly inflated for many of the positions I’ve looked up in various fields. Also, I believe they use some self-reporting to hone their numbers, which tends to skew their salaries upwards.

But most important are your interview skills. Know what you want when you walk into their office. I’m not just talking about how much you want to make. What do you want to do with your life? What makes you happy at work? What aspects are deal breakers for you? Why this job? Why hire you over the more qualified guy? Know yourself before you go in.

Soooo, Net Neutrality or What?

The Democrats are probably going to take the Senate. They’ve clobbered the House.

Some people are speculating about how this will affect IT. But what about net neutrality? Now that the Dems are in power, I am hoping the issue will come to center stage again. If I recall correctly, the Democrats supported net neutrality.

Of course, while we’re on the subject of how backward America’s Internet policy is, I’d also like to mention how horrible the Internet connections are here. Other countries have up to 20 times the available household bandwidth. People elsewhere have cell phones with faster connections than my cable modem, at half the price.

When’s that coming to America?

IT Worker Shortage?

So there was yet another article on Slashdot about the IT worker shortage. A bunch of people replied saying the culprit was that corporations want to pay crappy wages and want legislation changed so they can import cheap labor. You hear this story all the time.

“Cheap” labor? The problem isn’t with “cheap” labor. I call bullshit.

Expectations too High

I’ve noticed that there are way too many people in this industry who think that a CS degree means they’re entitled to make upwards of double what some of their non-CS peers will be making. How can a company justify someone being worth double another entry level worker who is equally smart (all things being equal here), but one has a different degree than the other?

Sure, maybe CS is harder than musicology and thus deserves more money. Maybe. But double? When you’re talking about job applicants who send you homework assignments as sample code, it’s difficult to gauge how good they are at developing. How do you know they didn’t copy a sample solution their professor gave out? Unlike in the real world, the 1000 lines of sample code could have taken four weeks and three graded revisions to get to that point. Maybe the TA helped them. Maybe it was a partner assignment.

Companies Want IT Gods

As in most industries, there is a huge shortage of highly qualified IT professionals. The main reason this “shortage” exists is that companies are trying to consolidate many responsibilities into one godly position.

Companies these days are looking for a very diverse skill set from their employees. If this were a fast food joint, it’d be like looking for applicants who know how to flip burgers, take orders, mop the floor, and serve the food, all at once and without letting the quality of the work suffer.

Inexperience Costs Money

Very few people can do this because companies are looking for non-academic skills. Does school teach you how to optimize an Oracle query for speed, set up an MSSQL database cluster, install packages on Linux, use version control software, or even how to write standards-compliant JavaScript? Probably not. Schools don’t emphasize current technologies because they focus on the theories, not the application of those theories (nothing wrong with that, btw). And if your school does offer courses that teach you these things, I hope you’re taking those courses because your Advanced Algorithm course won’t mean much to 99% of the employers you will be talking to (ironically, mine might).

I’m not saying some applicants aren’t truly worth double. Some are. But everybody has to start at the bottom and prove their worth. Otherwise there would be a whole lot more crashing servers and unworkable software. And for every mistake you make — and as a rookie, you will make many — you cost the company. And it’s easy to extrapolate that these mistakes are very expensive if you think about who would be fixing those mistakes (and what these senior people get paid). That’s what experience is: past failures that you’ve learned from. You can’t substitute that with a degree.

A degree is another bullet point in your résumé, not a résumé in itself.

The Bust Screwed Us

Also, thanks to the Dot Com Bust, lots of otherwise qualified people exited the IT industry, leaving a pretty prominent gap. On the flip side, the Bust also generated a huge influx of less-than-qualified IT professionals who got a job as a web “developer” making $70k because they knew HTML. This helps to fuel the over-valuation of under-qualified IT employees that still exists today (not to knock on my own job or anything).

The Shortage is up Top

So as I was saying about those super burger flipper employees, if you think about it, the problem isn’t a shortage of entry level employees. There are tons of those, if you can’t infer it from my post. And to be honest, no employer expects an entry level employee to know every technology under the sun. The true IT shortage exists in the top tier of the IT employees: those responsible for managing all the little pieces. Very, very, very few people have enough experience to know how all of the little pieces in an IT department interact, or even how they should be interacting.

Think about it: how would a developer ever learn how to administer a server? How will a database administrator ever gain an understanding of how to merge the production source code with the development branch? How does the Linux system administrator understand what it takes to secure all of the Windows machines in the office? Frankly, most of the time, they won’t. The only ones who do are the ones who really stretch beyond their roles to learn something new, and we all know that is both unusual and difficult in most organizations. As a result, when you want to hire someone to manage the database administrators, system administrators, programmers, and network administrators, it’s hard to find someone with a broad enough understanding of IT to even know if the people below him are competent.

And even if you’re just trying to hire a new senior developer, the problem with IT is that technology changes constantly. As I highlighted above, companies don’t want to pay for mistakes their employees may make due to inexperience. You want to hire someone “very experienced” in a particular technology, but the problem is that the technology is still new. Of course it’s difficult to find someone who is qualified!

Conclusion

So is there an IT shortage? Not in the general sense, no. IT is like the gold rush of the new millennium. Lots of people got rich in the spotlight doing things in IT (think MS, Apple, Google, Yahoo, etc.). In fact, I have a friend who bought a house (in full) when he was 19 after getting a job at 17 doing basic system administration. Granted, it was during the Boom, but when you hear stories like that, how can you not inflate your expectations?

Lots of people flocked to the field with very high expectations, only to realize it’s not all gold-lined monitors and silk-laced mouse pads. This is an issue where employees are frustrated at salaries that are lower than they thought they would be getting, and employers are unable to find qualified all-star IT staff. The times have changed since 2000: the IT employment economy became realistic.

Error Reporting

For those of you striving to become great PHP developers, make sure you code with error reporting set to “strict” mode. This particular error reporting type gives suggestions on your code for commonly made mistakes due to sloppiness. The ideal configuration setting for error reporting is:

E_STRICT | E_ALL

Or in PHP, you would start the page with:

error_reporting(E_STRICT | E_ALL);

Voting Machine Software

So I’m sure you’ve read a thing or two about all those crazy electronic voting machines being inaccurate. One thing I find slightly perplexing is why the misrepresented votes always seem to be in favor of the Republican party. I don’t get it. It’s not like the voting machine companies would be so blatant or stupid as to try to rig an election so outright, especially in a world that is already so suspicious of electronic machines. But if it were purely a bug, wouldn’t it be equally likely that Democrats benefit? Of course, this could be explained by there being a procedure in place for inputting candidates, where Republicans and Democrats get placed into the system in a specific order (such as Democrats being added first). Who knows.

So with the constant attention those digital voting machines get, a lot of people ask, “WHAT is so difficult about writing software that tallies votes?” Now I’m not one to study the Diebold machines, but I thought it would be interesting to pick at the problem.

Database Issues

First of all, the votes must be logged. But not just any log. It must be secure and immune from tampering. And when I say “tamper,” I am talking about tampering by anybody. That includes the developers, the database administrators, the voters, and the polling staff. I can only begin to imagine that they use a bunch of one-way hardware encryption and MD5 checksums.
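I have no idea what the real machines do, but one way checksums could make a log at least tamper-evident is to chain them: every entry’s hash covers the previous entry’s hash. The sketch below is my own illustration, with made-up function names and SHA-256 instead of MD5.

```php
<?php
// Hypothetical sketch: each log entry stores a hash of (previous hash + vote),
// so changing any earlier entry breaks every hash that follows it.
function appendEntry(array $log, $vote)
{
    $prev = empty($log) ? str_repeat('0', 64) : $log[count($log) - 1]['hash'];
    $log[] = array(
        'vote' => $vote,
        'prev' => $prev,
        'hash' => hash('sha256', $prev . '|' . $vote),
    );
    return $log;
}

// Recompute the chain from the start and report whether it still holds.
function verifyLog(array $log)
{
    $prev = str_repeat('0', 64);
    foreach ($log as $entry) {
        $expected = hash('sha256', $prev . '|' . $entry['vote']);
        if ($entry['prev'] !== $prev || $entry['hash'] !== $expected) {
            return false;
        }
        $prev = $entry['hash'];
    }
    return true;
}

$log = array();
foreach (array(1, 2, 1) as $candidateId) {
    $log = appendEntry($log, $candidateId);
}

$tampered = $log;
$tampered[1]['vote'] = 1; // quietly flip the second vote
```

Here verifyLog($log) passes and verifyLog($tampered) fails. This only detects edits; someone who can rewrite the entire log can also recompute every hash, so the latest hash would have to be kept somewhere the attacker can’t reach.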

The votes would need to be isolated from each other from a data integrity perspective: if vote #35252 breaks the system, all prior votes (#1 through #35251) must remain unscathed. Although most modern databases use transactions to ensure data integrity, I would imagine there is no foolproof means without creating a replica of the vote in a second or third physical location.

Of course, such data replication causes problems when the copies disagree. What happens if the primary fails and the vote was only recorded on one of the two replicas? Do you count that half vote? What if a replication error occurred and one replica copied something differently from the primary? Which is right? These things happen (database corruption), and they tend to clump together and result in catastrophic failures.

Purposeful Fraud Issues

Let’s attack this from another angle. The main culprit behind election-day problems will probably be human “error.” An electronic machine must protect against this. Unlike a punch card that the actual human physically pokes, a digital machine does the card punching for you (on its hard drive), which is almost like telling someone else to punch in your vote as you specify.

There have been instances of a programmer placing bugs in slot machines that gave them jackpots if they bet in a certain order. There have been cases of system administrators leaving back doors into the servers. There’s a huge list of historical events that show that no system, no matter how hard a company tries, is secure from malicious employees. But that is exactly what this system must be designed to fight. How would you ensure it is safe? Peer code reviews? Multi-part passwords that require three separate people with three separate passwords to authenticate? Physical keys, like the ones you see in movies, where two people have to turn different keys at the same time to open a machine? Okay, so let’s say you somehow secure your employees. The problem doesn’t stop there.

I’m setting up the machines. “Let’s see,” I say to myself with a grin, “Kerry is going to be candidate 1, and Bush will be candidate 2… for now.” At the end of the night, I go back and say, “Oops, I meant 1 equates to Bush and 2 equates to Kerry!” With any regular database, this is entirely possible, and everybody’s votes just got reversed. Of course a smart voting machine would never let you change around the names for a created record. But then again, hackers don’t need to worry about that.

So the voting machine company decides that you “can’t” change the name of a candidate after it’s been put into the system. What happens if I were to put in a second “Bush” to split his votes with his mystical twin? Or what happens if I create a new candidate halfway through the election under his name? Well, in some instances, the software might just show him twice (this is good), and in others, it would show him once (this is very bad). In crappier software, that of course means voters would be voting for one OR the other “Bush,” but nobody would know exactly which.

Of course the voting company would protect us from ourselves by ensuring candidates can’t be added in after the machine is shipped out. But therein lies another problem.

Synchronizing Issues

Let’s say you’re running the voting company that is running an election across a few dozen districts. Of course, all the votes must be tallied. A “Bush” vote in one county must group up with a “Bush” vote in another. But how? The human answer is to use the name, but realistically, we know that another “Bush” might be running for a different position in some counties. You can’t just use the name as the qualifier because it is not unique. So you would use IDs, I presume.

But of course this means every machine must use an ID that is not internal to it. You would say, “All 1’s are Kerry’s and all 2’s are for Bush!” Now that this is decided, you would have shipped out all of the machines to only accept votes for Kerry = 1 and Bush = 2. And when the machine gets back, you would save it into the main system as 1 = Kerry and 2 = Bush.

But where’s the sanity check? Who knows what happened while that box was out there in the wild. How do you know that 1 is indeed still representing Kerry for that box? How do you know that everybody that voted “Kerry” on that box got saved in as a “1”? This is even more of a problem if you do the counting right in the same place that the voting is taking place.

And even if you did use names, despite it being a horrible idea, how do you know that a “Kerry” vote got saved as “Kerry”? For all you know, there is a bug, and all Kerry votes are getting saved as “Bush” and all Bush votes are getting saved as “Nader” because someone forgot that array indices start at 0, not 1 (a theoretical technical explanation for how these bugs could arise).
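That off-by-one scenario is easy to reproduce. The sketch below is purely hypothetical (the function names and the three-candidate ballot are mine), but it shows how a one-character mistake shifts every vote to the next candidate.

```php
<?php
// Hypothetical ballot: array indices are 0, 1, 2,
// but the buttons on the screen are numbered 1, 2, 3.
$candidates = array('Kerry', 'Bush', 'Nader');

// Buggy: treats the button number as if it were the array index.
function recordBuggy(array $candidates, $button)
{
    return isset($candidates[$button]) ? $candidates[$button] : null;
}

// Fixed: converts the 1-based button number into a 0-based index.
function recordFixed(array $candidates, $button)
{
    return isset($candidates[$button - 1]) ? $candidates[$button - 1] : null;
}

echo recordBuggy($candidates, 1), "\n"; // prints Bush (the voter pressed Kerry)
echo recordBuggy($candidates, 2), "\n"; // prints Nader (the voter pressed Bush)
echo recordFixed($candidates, 1), "\n"; // prints Kerry
```

With the buggy version, the third button records nothing at all, so those votes simply vanish.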

So of course, that means you would write a binary log of all activities that box experienced. But what is this log for? Auditing? Shouldn’t auditing be happening at every step of the way regardless? If anything, problems are much harder to catch in the digital version of voting so this audit trail would rarely if ever be used except in the most extreme cases. Okay, so I’ve convinced you that it should be used all the time, right? Okay, but then what?

Is it being replicated? Is it safe from incomplete transactions? Will a corrupted insert break the entire file? What happens if the power cuts out right as it is writing a record? Is the whole file toast? Suddenly you realize the log file must also use a database to ensure its integrity. Possibly on a separate process to ensure it is isolated from the main vote records.

But what the hell is the point of all this? If there is going to be a discrepancy, shouldn’t it have been caught during testing? Why go through all this trouble double logging and replicating all of this data?

Conclusion

The last point is the most important. You’ll notice that through simple logic, we suddenly had to have tons of auditing overhead to do something so simple. And despite your best testing efforts, things that should be absolutely positively without error are still being audited to ensure their integrity. So what happens when you overlook one of these “no-brainer” assumptions?

You get voter fraud.

This only covers some theoretical problems that I might face when trying to put together a voting machine. I would assume a well-funded corporation would generate a list of problems 10x this length. While tallying votes may be simple in concept, if your application must be 200% bug free and hacker proof, developing the application becomes immensely difficult.

This still doesn’t explain my original thought about the Republican vote bias though.