The Mathematics of Evolution

I found a really neat comment on Slashdot about the mathematical justification of evolution:

First two minor points, then I’ll get to the real subject, the math of evolution.

theory is a theory my friend

Every field of science is a theory, my friend. Everything from the theory of the atom to the theory of zymosis (that's fermentation). You may as well try to attack relativity as being “just a theory”.

sort of like the un-provable assumption of evolution?????

What un-provable assumption of evolution? Evolution fundamentally says that if you have heritable variation and mutations and selection pressures on that variation, then you will get evolution over generations. This is a trivially observable fact. There is no genuine scientific dispute over biological evolution exactly because there is so much evidence that cross-checks and cross-validates across so many fields, both current observations and the study of prehistoric evidence left behind. Trying to even scratch the surface of this mountain of evidence in this post would be hopeless. If you are questioning the quantity and quality of the evidence, I suggest you either crack open a textbook on the subject or at least browse the talkorigins [talkorigins.org] website. It’s all well documented if you actually question the issue. If you don’t truly question the issue and you instead simply reject the entire subject on non-rational grounds, well, obviously you’re not going to be swayed by something silly like actual evidence and actual science.

Anyway, the real issue I wanted to address was this one:

the sheer numeric improbability of evolution

Correction: the sheer numeric CERTAINTY. There’s powerful mathematics to evolution, powerful effects going on that you don’t hear about in the common explanations of evolution. The common idea of evolution is as a sequence of individual beneficial mutations, like climbing a ladder. If that’s how evolution actually worked, then critics would be right; it would have been mathematically impossible for evolution to produce the incredible complexity we see today.

To show the true mathematical power of evolution I will first abandon that “ladder climbing” of beneficial mutations. In fact, let’s assume that every single mutation that occurs is either neutral or harmful. I’ll demonstrate that we still get the real and powerful mechanism of evolution, the math of evolution.

A good place to start is with the common complaint of creationists that mutation and evolution “cannot create information”. Well, in the initial mutation phase they are right. When a mutation occurs it introduces noise; it tends to degrade information. But look what happens the moment that mutation gets passed on to an offspring. That mutation is no longer random noise; it now carries a small bit of information. It carries a little tag saying “this is a nonfatal mutation”. The presence of this mutation in the offspring is newly created information, the discovery and living record of a new nonfatal mutation. Over time the population builds up a LIBRARY of nonfatal mutations. This library is a vast accumulation of new information.

That information actually undergoes even more processing and synthesis. Over generations beneficial mutations would obviously multiply, but we’re assuming there are none of those here. However, entirely neutral mutations will also tend to accumulate and multiply. Nearly harmless mutations would also accumulate and multiply to a lesser extent. Somewhat harmful mutations will even accumulate, and extremely harmful-but-nonfatal mutations will pop up and disappear at the rarest frequencies. So not only do we build up a library of nonfatal mutations, each mutation gets tagged with a frequency: the percentage of the population carrying it. Every mutation now carries a cost/benefit measurement at the population level. The best ones have a high percentage representation and the most harmful ones have a near-zero representation percentage. Our library now contains far more valuable and sophisticated newly created information.

The individuals in the population are on average going to carry a roughly stable load of harmful mutations, a roughly constant “cost” in harmful mutations. Individuals loaded with more than the average cost are generally going to die, removing a more-than-average load of harm from the population, and individuals with a less-than-average load will multiply; both effects pull the population’s average load back down. The cleansing effect of selection removing “damage” from the gene pool will automatically scale to offset the exact rate at which mutation is causing “damage”. Harm/cost/damage will be weeded out by selection at the same rate it is added by mutation. Neutral mutations will steadily accumulate in the library, and negative mutations will remain at a roughly fixed level, constantly measured and scaled by the cost of each. Some mutations will disappear while new ones appear.
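This balance between mutation pressure and selection pressure can be sketched numerically. The following is a minimal illustration (my own, not from the original comment): a deterministic haploid model in which a harmful mutation arises at an assumed rate u per generation and carriers reproduce at a rate reduced by a selection coefficient s. The frequency settles near the classic u/s equilibrium instead of growing without bound.

```python
def next_freq(q, u, s):
    """One generation: selection against carriers, then fresh mutation."""
    q_sel = q * (1 - s) / (1 - s * q)   # carriers reproduce at rate (1 - s)
    return q_sel + u * (1 - q_sel)      # a fraction u of non-carriers mutate

# Illustrative, assumed rates: mutation rate u, selection coefficient s.
u, s = 1e-4, 0.01
q = 0.0
for _ in range(20000):
    q = next_freq(q, u, s)

print(q)  # hovers near u/s = 0.01: harm is removed as fast as it is added
```

The exact numbers are made up; the point is only that the equilibrium load is set by the ratio of the two rates, which is the “automatic scaling” described above.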

The real power in evolution is recombination. Every offspring contains a random mixture of mutations from that library; every offspring is a test case searching for a jackpot beneficial combination of mutations. Let’s assume an individual has a million random mutations across its entire code. There are roughly 500,000,000,000 mutation-pairs being simultaneously tested within that individual in parallel. Perhaps one is a mutation creating a toxin and another is a mutation for mutant skin pores. Either mutation alone may be harmful, but the pairing could be a breakthrough protecting against predators.

There are roughly 167,000,000,000,000,000 mutation-triples. Each individual is also testing all of these triples in parallel. One mutation might be for a toxin, a second might crank up production of that toxin to fatal levels (which would ordinarily be a fatal evolutionary dead end), and the third might be a costly and ordinarily useless anti-toxin. The triplet is now a breakthrough: either a powerful defense against predators or a weapon for a predator to use, or even both at once.

Each individual is also testing roughly 40,000,000,000,000,000,000,000 mutation-quadruples in parallel for free. Maybe those four mutations individually yield useless proteins and enzymes, but the chain of four together may yield a new breakthrough digestive pathway.

Each individual also tests a near-infinite number of mutation quintuplets and mutation sextuplets and more. Each individual actually acts as a test of a near-infinite number of possibilities, and it does this testing in parallel and for free. This is called implicit parallelism. It astronomically multiplies the power of evolution to search for jackpot breakthroughs.
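Those combination counts can be checked directly. A quick sketch (the one-million figure is the comment’s own assumption):

```python
import math

n = 1_000_000  # assumed number of nonfatal mutations carried by one individual

pairs = math.comb(n, 2)       # ~5.0 x 10^11 mutation-pairs
triples = math.comb(n, 3)     # ~1.7 x 10^17 mutation-triples
quadruples = math.comb(n, 4)  # ~4.2 x 10^22 mutation-quadruples

for k, count in [(2, pairs), (3, triples), (4, quadruples)]:
    print(f"{k}-way combinations: {count:.2e}")
```

Every one of these k-way combinations rides along in a single genome, which is what makes the implicit parallelism so staggering.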

Another point that I raised and haven’t actually applied yet is the fact that each mutation is present with a frequency percentage in the population: the measurement of the cost/benefit of that mutation. When you want the most efficient search pattern, you want to minimize wasted effort, minimize your costs, and maximize your return on investment for your available resources. Each offspring is an investment of resources, a test effort. When you are investing your effort looking for a payoff, you want to expend most of your effort on the mutations that have paid off the best in the past and the least effort on the almost-fatal mutations. You mostly want to test combinations of good stuff with good stuff, and you almost never want to bother testing two nearly fatal mutations that will most likely combine to cause a dead offspring and a wasted investment. However, you do still want to make a very rare test of two nearly fatal mutations, because it *might* just be a jackpot payoff.

In mathematics this exact investment-of-effort search pattern has already been studied and a mathematical optimization pattern found. And guess what? By an almost staggering coincidence, the population frequency of each mutation produces exactly the mathematically optimal and most efficient search pattern for the next generation of offspring. You invest lots of effort and lots of offspring in testing the best mutations and groups of the best mutations, and you invest exactly the right level of very rare testing of really bad combinations that will probably be fatal but which *might* just find a jackpot payoff. Mutations at all levels are tested in proportion to the measured cost or benefit they impose on the host.
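In genetic-algorithm terms, the investment pattern described here is fitness-proportionate (“roulette wheel”) selection: each variant gets trials in proportion to how well it has measured out so far. A minimal sketch, with made-up fitness numbers:

```python
import random

def roulette_select(population, fitness, k, rng):
    # Draw k parents, each with probability proportional to its fitness,
    # so proven variants get most of the trials while nearly fatal ones
    # still get the occasional long-shot test.
    weights = [fitness[ind] for ind in population]
    return rng.choices(population, weights=weights, k=k)

rng = random.Random(42)
variants = ["beneficial", "neutral", "nearly_fatal"]
fitness = {"beneficial": 10.0, "neutral": 1.0, "nearly_fatal": 0.01}  # assumed

parents = roulette_select(variants, fitness, 1000, rng)
for v in variants:
    print(v, parents.count(v))
```

With these weights roughly nine in ten trials go to the proven variant, while the nearly fatal one is sampled once in a rare while rather than never, exactly the mostly-exploit, rarely-explore allocation the comment describes.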

So evolution has a nearly infinite multiplier on its search power and it just happens to invest its search effort in the mathematically optimal most efficient search allocation. Two fairly deep and powerful mathematical results that are hardly apparent in the usual way evolution is explained.

A further point is that once some beneficial mutation or combination of mutations is found, evolution then searches that vast library of stored nonfatal mutations. Most new breakthroughs will be extremely crude at whatever it is they do, and they will probably come with harmful side effects. A set of limbs might be mutated into some useful form for getting some new food source, yet be horribly mutated and otherwise dysfunctional. Evolution then searches the library for mutations that combine to further improve that new breakthrough, and it also searches the library for mutations that will repair or offset any harmful side effects of the breakthrough. A search for ways to further improve the mutated limbs for the new purpose, and a search for modifications to repair problems caused by these malformed limbs.

Evolution is very rarely a simple ladder-climb series of beneficial mutations. Evolution is an information processing system: it builds a vast database of information, synthesizes complex measurements of that information, and performs an incredibly powerful search and mining of that database to discover and refine improvements.

And this fits in perfectly with punctuated equilibrium. During the quiet phase the library is accumulating new mutation contributions and measuring those mutations into a percentage of the population; then, when a breakthrough is discovered or the environment shifts, evolution goes into overdrive. It mines the database for contributions to the new development or for adaptations to the new environment. The frequencies of all of the mutations also get re-measured to re-weigh their cost/benefit ratios in light of the new development or the new environment. Not only can this radically shift the frequency of a vast portion of the genes and mutations in the population, it can quite easily trigger the discovery of other independent breakthroughs. If the population underwent heavy selection pressure, if most of the population was exterminated or displaced by this change, then the gene pool gets decimated. Much of that accumulated library gets wiped out along with the losing majority of the population. With a depleted library in the new population you are naturally going to see little change and progress. You see a stable population, equilibrium, until that library can be slowly rebuilt through the accumulation of new mutations.

AllOfMP3 Executive Cleared of Charges

A Russian court ruled that a former executive at AllOfMP3 was not guilty of copyright infringement. After this verdict, it seems the site is ready to re-launch. This puts Russia’s bid to join the WTO in murky waters again and will likely cause a whole new media circus.

In the meantime, you can use AllOfMP3’s reincarnation.

I’m always torn on this particular issue. I think AllOfMP3 fills a void that currently exists in digital music: DRM-free, cheap music. While labels are finally starting to see the light, so long as DRM is the standard, sites like AllOfMP3 will prosper. As for its *really* low prices, that’s another point. Music tends to be very expensive when you buy it in CD form, and a per-track price that depends on the file size (quality) is very fair.

That said, selling music without paying royalties to the labels is clearly wrong, but I think the issue has always been how much in royalties the labels deserve, since AllOfMP3 has always offered to pay (small) royalties. Lastly, while it’s clearly a little shady, it was also legal under Russian copyright law.

What do you think of AllOfMP3’s business practices?

Social Network Screening on the Rise

A new report indicates that one in ten employers is looking at applicants’ social networking profiles.

More than 60 percent said the information they see on these profiles will influence what they think about the job candidate, and more importantly, who gets hired and who doesn’t… Employers have a lot of leeway when deciding who they should and should not hire. Unless an applicant is being discriminated against because of race, age, gender, or ethnicity, there is very little the applicant can complain about later on.

I’ve been trying to warn people about this for a long time now. This all goes back to controlling your online image. Everybody goes out once in a while and gets a little plastered, but not everybody proudly displays photographic proof on their profiles.

With social networking becoming increasingly pervasive, it is becoming harder and harder to stay off the grid. That said, whatever part of you is on sites such as Myspace or Facebook needs to be tempered. This raises some scary questions about the future, since I think social networks will eventually use an open directory system that centralizes the data in a decentralized, distributed grid. When that happens, it will be very hard to protect your identity and image across different sites while still keeping them an accurate reflection of you.

Still, I can’t wait for the elections in 2020 when the first of the Myspace kids begin running for president. I just know there will be a scandal around something they posted when they were 18. Accountability lasts a lifetime now that the Internet caches everything.

Cool! Content Aware Image Resizing

I just saw a really interesting video showing off “content aware” image resizing.

The idea is that an image can’t be readily resized without distorting it or making it too small to make out. The author of the research suggests instead to use algorithms that detect and remove less “important” parts of the image. For example, background space between two people might get removed, pixel by pixel, as you shrink an image.
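The technique in the video is known as seam carving. As a rough sketch of the core idea (a toy pure-Python version over a grayscale grid; real implementations are far more sophisticated): compute a per-pixel “energy”, find the connected top-to-bottom path with the least total energy via dynamic programming, and remove it, so low-importance background pixels disappear first.

```python
def energy(img):
    # Simple energy: horizontal plus vertical brightness change (edges clamped).
    h, w = len(img), len(img[0])
    return [[abs(img[y][min(x + 1, w - 1)] - img[y][x]) +
             abs(img[min(y + 1, h - 1)][x] - img[y][x])
             for x in range(w)] for y in range(h)]

def find_seam(img):
    # Dynamic programming: cheapest connected top-to-bottom path.
    e = energy(img)
    h, w = len(e), len(e[0])
    cost = [e[0][:]]
    for y in range(1, h):
        cost.append([e[y][x] + min(cost[y - 1][max(x - 1, 0):min(x + 2, w)])
                     for x in range(w)])
    seam = [min(range(w), key=lambda x: cost[h - 1][x])]
    for y in range(h - 2, -1, -1):
        x = seam[-1]
        seam.append(min(range(max(x - 1, 0), min(x + 2, w)),
                        key=lambda c: cost[y][c]))
    seam.reverse()
    return seam  # one column index per row

def remove_seam(img, seam):
    return [row[:x] + row[x + 1:] for row, x in zip(img, seam)]

# Toy image: flat background with one "important" bright column.
img = [[0, 0, 9, 0, 0]] * 4
shrunk = remove_seam(img, find_seam(img))
print(shrunk)  # one column narrower; the bright column survives
```

Repeating this seam removal shrinks the image column by column while the high-energy (interesting) content is preserved for as long as possible.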

The idea has an amazing amount of potential for use in commercial applications, especially mobile devices.

See the video for more.

YouTube Eats up "Funny Videos" Searches

While there is very little visibility into the searches performed on YouTube, Hitwise noticed that some things can be inferred about its traffic. For example, they found that searches for “Funny Videos” dropped steadily as searches for “YouTube” grew.

As in, people figured out that funny videos always ended up on YouTube, and thus there was little purpose in searching on Google for them.

I think this is one of the first real pieces of evidence showing that YouTube was a good buy for Google. If YouTube were owned by Yahoo, that would be a lot of searches gobbled up by a competitor’s site. YouTube is becoming an actual video search engine, at times completely bypassing Google. In short, Google saw an emerging search market on YouTube — now that is some good foresight.


Class Action Suit Against RIAA Brewing

About time. It seems a class action suit is now brewing against the RIAA. I’m not so sure they can win on all of their claims, but they’ve got the main one in there (malicious prosecution).

The development, first reported by p2pnet, hopes to make a class out of those “who were sued or were threatened with sued by Defendants for file-sharing, downloading or other similar activities, who have not actually engaged in actual copyright infringement.”

SCO Loses to Novell: UNIX Code Belongs to Novell

Many of you may not read about this until Monday (it’s Friday now), but in the case between Novell and SCO regarding the copyright of some UNIX code, the court ruled in favor of Novell. The ruling also (explicitly) destroys SCO’s case against IBM. The court denied SCO’s motions regarding slander, breach of contract, and other bogus claims. Lastly, SCO owes Novell a ton of money from the cross-licensing deal it made with Microsoft (to be worked out later).

For those of you who don’t know, SCO started threatening companies and even sued Novell and IBM over its supposed mystical ownership of key UNIX source code, the foundation of the now-popular Linux. Had SCO won here, it could have been a major legal black eye for the open source community, probably deterring any major corporation from wanting to take up Linux out of fear of being sued.

But Novell won, and buried SCO in the process, dispelling any lingering doubts about whether open source code is safe. I was looking forward to IBM firing its cannons too (that would have been fun). 🙁

Proof of a Bubble: Success of Myspace is Pretty Overrated

How much yearly profit would you expect from a $580 million purchase? How about a profit of just $10 million, equating to a profit margin just under 2%? That’s right, the world’s largest social networking site, constantly in the top 10 web sites in the world, managed to make only $10 million on $550 million in revenue!

I’m not an expert, but a 1.8% margin is pretty low. For example, the average profit margin for a company in the technology sector is currently 14%. It’s insane to think Myspace is worth “$20 billion.” Even at a billion-dollar valuation with profit increased fivefold, it would take 20 years to repay the purchase price, and that assumes social networking stays hot the entire time!

The most important distinction to make is that Myspace is in the notoriously fickle and very untested social networking market. It must recruit a completely fresh batch of users every few years as people grow older and move on. It must fight against social stigmas that come from the younger generations that might sound something like, “Ew, Myspace? My mom is on there.” For all we know, social networking as we know it may fade out of prominence in the next three years. Or even more likely is that another new competitor will eat into Myspace and take away its page views.

I have been observing signs of a significant bubble re-emerging, and this is the straw that breaks the camel’s back. Worse yet, when professional analysts throw out insane multi-billion dollar valuations on Myspace without sound financial reasoning, it’s time to be scared. Valuations are always relative, but I disagree with this valuation without access to some more impressive metrics. Myspace is already at the top of the web — it doesn’t have opportunities to grow 1000% in the next few years.

Let me frame this in a more understandable way: if I told you this blog makes $200 a year in profit, would you be willing to buy it for $10,000? That’s roughly the same ratio as Myspace’s original purchase price ($580M) to its profit. But if Myspace is worth a little more than $5 billion, as some people seem to believe, it would be like selling this blog for $100,000 on that $200-a-year profit. Granted, maybe you could improve the profit by a factor of ten to $2,000 a year (20% margins) — good luck.
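The arithmetic behind these comparisons, using the (approximate) figures from the post:

```python
# Approximate figures from the post.
purchase_price = 580e6   # what News Corp paid for Myspace
annual_profit = 10e6     # Myspace's reported yearly profit
revenue = 550e6

margin = annual_profit / revenue
multiple = purchase_price / annual_profit  # years to earn back the price

print(f"profit margin: {margin:.1%}")                      # just under 2%
print(f"payback: {multiple:.0f} years at current profit")  # 58 years
print(f"blog price at the same multiple: ${200 * multiple:,.0f}")
```

The exact 58x multiple actually makes the blog analogy closer to $11,600 than the round $10,000, but the order of magnitude is the point.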

Doesn’t seem like such a sound investment now, does it? The top social networking site in the world is barely profitable, and there’s talk of it being worth 20x its purchase price. There’s a bubble, folks.

The Secret of SQL_CALC_FOUND_ROWS

Today, I wanted to go over a relatively simple MySQL feature that a lot of people don’t understand: SQL_CALC_FOUND_ROWS. To use this mystical keyword, simply put it in your query right after the SELECT statement. For example:

SELECT * FROM USER WHERE id > 10 LIMIT 2,1 -- return only the third matching record

Becomes

SELECT SQL_CALC_FOUND_ROWS * FROM USER WHERE id > 10 LIMIT 2,1

This won’t change your results. It may, however, make your query run slower than when you select just one row the regular way. What this statement does is tell MySQL to find out just how many total records exist that match your criteria (in this case, where id is bigger than 10). For example, if the USER table has 100 records with an id bigger than 10, then the query will take about as long as it would have taken the engine to find all 100 of those records.

The returned result will still be the record you are expecting (in this case, the third matching record, since LIMIT 2,1 skips two rows and returns one). But here is where the magic starts: if the very next query you run is a special select statement, you will have access to the total that was found. As in:

SELECT FOUND_ROWS(); -- returns 100

The MySQL documentation on this subject says:

[SELECT FOUND_ROWS()] returns a number indicating how many rows the first SELECT would have returned had it been written without the LIMIT clause. In the absence of the SQL_CALC_FOUND_ROWS option in the most recent SELECT statement, FOUND_ROWS() returns the number of rows in the result set returned by that statement.

No matter what your LIMIT clause looks like (such as LIMIT 10, 1), this second query will still return the same number (in this example, 100). Why is this useful? Pagination. Oftentimes, beginners (including me a few years ago) are stuck doing something like this:

SELECT count(*) FROM USER WHERE id > 10 -- figure out how many total records there are
SELECT * FROM USER WHERE id > 10 LIMIT 50, 1 -- skip 50 rows and fetch record #51

People do this because you need the total to know if other matching results exist or what the last page number is.

This requires the engine to effectively run the same query twice, which can be disastrous when that query already takes a very long time to run. By including SQL_CALC_FOUND_ROWS, the overhead of running that count is folded into the process of actually retrieving the row of interest. So while the initial query might take a little longer than if you hadn’t tried to do a count, it is definitely faster than running the same query twice.

To take this to the next level, your pagination code should omit the use of SQL_CALC_FOUND_ROWS in subsequent page loads by caching the total count in the URL or session.
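That caching step might look something like the following sketch (Python, with a hypothetical run_query helper standing in for your database layer; the names are illustrative): only the first page load pays for SQL_CALC_FOUND_ROWS, and later pages reuse the total stored in the session.

```python
def fetch_page(run_query, page, per_page, session):
    # run_query is assumed to execute SQL and return a list of rows.
    offset = (page - 1) * per_page
    if "total" in session:
        # Total already cached: skip the counting overhead entirely.
        rows = run_query(
            f"SELECT * FROM USER WHERE id > 10 LIMIT {offset}, {per_page}")
    else:
        # First page load: count and fetch in one pass.
        rows = run_query(
            "SELECT SQL_CALC_FOUND_ROWS * FROM USER WHERE id > 10 "
            f"LIMIT {offset}, {per_page}")
        session["total"] = run_query("SELECT FOUND_ROWS()")[0]
    return rows, session["total"]

# Demo with a stand-in query runner (no real database needed here).
issued = []
def fake_run(sql):
    issued.append(sql)
    return [100] if "FOUND_ROWS()" in sql else ["row"] * 3

session = {}
fetch_page(fake_run, 1, 10, session)  # pays for the count
fetch_page(fake_run, 2, 10, session)  # reuses session["total"]
print(len(issued))  # 3 queries total instead of 4
```

In a real application you would also invalidate the cached total when rows are added or removed, or when the filter criteria change.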

Happy hunting!