Tuesday, November 24, 2015

NP Complete

An excerpt I found in MIT OpenCourseWare notes on NP-complete problems in nature...

"1.2 In Nature

Nature deals with NP-complete problems constantly; is there anything we can learn from it? Biology employs what computer scientists know as genetic algorithms. Sometimes biology simply mutates the candidate solutions, but other times it combines them (which we call sexual reproduction).

Biology has adopted the sexual reproduction technique so widely that it must have something going for it, even if we don’t yet fully understand what it is. In experiments, local search algorithms that employ only mutations (i.e. “asexual reproduction”) consistently do better at solving NP-complete problems than do algorithms that incorporate sex.

The best theory people currently have as to why nature relies so heavily on sex is that nature isn’t trying to solve a fixed problem; the constraints are always changing. For example, the features that were ideal for an organism living in the dinosaur age are different from the features that are ideal for modern-day organisms. This is still very much an open issue.

As another example of nature apparently solving NP-complete problems, let’s look at protein folding. Proteins are one of the basic building blocks of a cell. DNA gets converted into RNA, which gets converted into a protein, which is a chain of amino acids. We can think of that chain of amino acids as just encoded information. At some point, however, that information needs to be converted into a chemical that interacts with the body in some way. It does this by folding up into a molecule or other special shape that has the requisite chemical properties. The folding is done based on the amino acids in the chain, and it’s an extremely complicated process.

The key computational problem is the following: given a chain of amino acids, can you predict how it’s going to fold up? In principle, you could simulate all the low-level physics, but that’s computationally prohibitive. Biologists like to simplify the problem by saying that every folded state has potential energy, and the folding is designed to minimize that energy. For idealized versions of the problem, it’s possible to prove that finding a minimum-energy configuration is NP-complete. But is every cell in our body really solving hard instances of NP-complete problems millions of times per second?

There is another explanation. Evolution wouldn’t select for proteins that are extremely hard to fold (e.g., such that any optimal folding must encode a proof of the Riemann Hypothesis). Instead, it would select for proteins whose folding problems are computationally easy, since those are the ones that will fold reliably into the desired configuration.

Note that when things go wrong, and the folding does get caught in a local optimum, you can have anomalies such as prions, which are the cause of Mad Cow Disease."

Pretty interesting stuff if you're into it...
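The mutation-only local search the excerpt describes is easy to sketch on a toy satisfiability instance. Everything below is made up for illustration: a tiny MAX-SAT problem and the simplest possible hill climb, not a serious solver.

```python
import random

# "Asexual" local search: mutate one bit at a time, keep the mutation
# only if it doesn't hurt the score. In the made-up MAX-SAT instance
# below, literal k means variable |k| is true if k > 0, false if k < 0.
clauses = [(1, 2), (-1, 3), (-2, -3), (2, 3), (1, -3)]

def num_satisfied(assign, clauses):
    return sum(any((lit > 0) == assign[abs(lit)] for lit in clause)
               for clause in clauses)

random.seed(0)
assign = {v: random.random() < 0.5 for v in (1, 2, 3)}
best = num_satisfied(assign, clauses)
for _ in range(200):
    v = random.choice((1, 2, 3))        # mutate a single "gene"
    assign[v] = not assign[v]
    score = num_satisfied(assign, clauses)
    if score >= best:
        best = score                    # keep the mutation
    else:
        assign[v] = not assign[v]       # revert it
print(best, "of", len(clauses), "clauses satisfied")
```

With only eight possible assignments this always finds the optimum quickly; the point is the shape of the loop, which is the same whether the candidate is a bit string or an organism's genome.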

SQL Reversion Scripts

When people new to SQL think of databases, they tend to think of a table of numbers that ends up in a graph of some sort. What they don't think of is the configuration and user settings that are also stored in that database. So let's say you have a column that should hold a series of comma-separated values but, for whatever reason, is actually tab-delimited. You want to update every row, but you don't want to risk mangling all of them with a bad UPDATE. Here we can write a query that generates a series of queries, each setting the column back to its original value. Check out the example/gist below. You can simply copy-paste the results into a new SQL file, then run that file if anything goes wrong! Keep in mind that the triple quotes are for the case where the value being changed is a string; if it's an integer you won't need them.
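Here's a sketch of the trick using Python's built-in sqlite3 as a stand-in for SQL Server (the settings table and its columns are invented): a SELECT builds one UPDATE per row, each restoring that row's current value, and the doubled quotes are SQL's way of escaping a quote inside a string literal.

```python
import sqlite3

# Invented stand-in table; in real life this is your config/settings table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (id INTEGER PRIMARY KEY, config TEXT)")
conn.executemany("INSERT INTO settings VALUES (?, ?)",
                 [(1, "a\tb\tc"), (2, "x\ty")])

# The query that generates queries: one reversion UPDATE per row,
# capturing the current value before anything is changed. The '''
# in the literal is what wraps quotes around the (string) column value.
reversion = conn.execute(
    "SELECT 'UPDATE settings SET config = ''' || config || "
    "''' WHERE id = ' || id || ';' FROM settings"
).fetchall()
for (stmt,) in reversion:
    print(stmt)
```

Dump those generated rows into a .sql file before running your real UPDATE. SQL Server concatenates with + rather than ||, but the doubled-quote escaping works the same way.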

Thursday, November 12, 2015

Timsort

So quicksort is pretty cool. Everybody knows "how" it works, meaning we've all seen a visualization of quicksort picking a pivot, scanning for values less than or greater than it, and moving them around the pivot. Unfortunately, very few people implement the algorithm correctly on their first try, and those who do have rarely implemented it since.

Less well known than quicksort is the man behind the curtain (edit: I made this post when I was pretty exhausted; is "man behind the curtain" even a phrase? Does it make sense here?)... Timsort. Timsort is based on Tim Peters's list sort for Python, described here. It runs in O(n lg n) in the worst case, and often faster on real-world data! As an adaptive sorting routine, Timsort uses binary insertion sort on short runs and merges stretches of data that were ordered to begin with. (More edits... Timsort is also a stable sorting algorithm, meaning elements that compare as equal keep their original relative order. So objects with equal keys will not be reordered arbitrarily, which matters when you sort the same records by more than one key in succession.)
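Real Timsort is a few hundred lines of careful engineering (minrun sizing, binary insertion sort for short runs, galloping merges), but the core idea of reusing runs that are already ordered can be sketched as a natural merge sort:

```python
def natural_merge_sort(a):
    """Drastically simplified, Timsort-flavored sort: find the runs that
    are already in order, then merge them pairwise until one remains."""
    if not a:
        return []
    # 1. Split the input into maximal non-decreasing runs.
    runs, run = [], [a[0]]
    for x in a[1:]:
        if x >= run[-1]:
            run.append(x)
        else:
            runs.append(run)
            run = [x]
    runs.append(run)

    # 2. Stable two-way merge of a pair of runs.
    def merge(left, right):
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:          # <= keeps the sort stable
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    # 3. Merge runs pairwise until a single sorted run remains.
    while len(runs) > 1:
        runs = [merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0]

print(natural_merge_sort([3, 4, 9, 1, 2, 8, 5]))  # → [1, 2, 3, 4, 5, 8, 9]
```

On already-sorted input step 1 finds a single run and there's nothing left to merge, which is exactly where the "often faster on real-world data" behavior comes from.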

Aside from the coolness of having beaten quicksort, Timsort has its downsides; for example, it uses up to n/2 extra space for merging, whereas quicksort sorts in place. In most applications I'd have, the extra space would be no problem, but for big-data analysis n/2 extra space could mean gigabytes or terabytes.

Tuesday, November 10, 2015

SQL Server 2005 Memory

SQL Server 2005 is a complete memory hog. Keep in mind this problem is fixed in newer versions, though I've seen those suck up a lot of RAM as well. Today I was working on a server with 32 gigs of RAM, tried to delete a few million rows, and watched SQL Server's memory allocation tick up from 2.5 gigs, to 3, to 5, all the way to 9 gigs. Basically it grabs all that RAM to do the deletion but never releases it, so five minutes later, when I noticed the deletes had completed, SQL Server was still holding 9 gigs. Old problems still exist, I guess.

Sunday, November 8, 2015

VLC Streams

Today I found out you can stream videos using VLC media player. Given an address you found by, say, googling "Giants live stream reddit", you can paste it into Media -> Open Network Stream... The streams come in all sorts of qualities, generally very good ones.

Saturday, November 7, 2015

Zipf's Law

I was reading Sam Kean's The Violinist's Thumb when he brought up an interesting empirical fact about natural language commonly known as Zipf's Law. The law states that the frequency of any word is inversely proportional to its rank in the frequency table. That's straight from Wikipedia, but in plain language it means the most popular word in English (the) appears about twice as often as the second most popular word (be), three times as often as the third, and so on.
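The prediction is simple enough to state in code: freq(rank r) ≈ freq(rank 1) / r. Here's a toy check on a made-up sixteen-word "corpus" — real tests of the law need millions of words, so this only illustrates the computation:

```python
from collections import Counter

# Zipf's prediction: freq(rank r) ≈ freq(rank 1) / r.
# The "corpus" below is a made-up toy.
text = "the cat sat on the mat and the dog saw the cat nap on a mat"
counts = Counter(text.split()).most_common()
top = counts[0][1]
for rank, (word, freq) in enumerate(counts[:4], start=1):
    print(f"{rank}. {word!r}: actual {freq}, Zipf predicts {top / rank:.1f}")
```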

Zipf's Law has actually been applied to fields far beyond language. The example used in The Violinist's Thumb was DNA. In our DNA the bases A, C, T, and G are read in triplets to form genes, so a run of ACC (whose complement on the opposite strand would be TGG) is one triplet. As it turns out, the most common triplet occurs about twice as often as the second most common, and so on down the ranks.

Computer science researchers at UC Berkeley even found that web requests follow a Zipf-like distribution. Pretty interesting stuff; it'd be cool to see whether the most popular method names or data types in a codebase follow Zipf's law as well. Maybe someday in our infinite spare time.

Friday, November 6, 2015

XOR Linked List

Today was fairly slow at work, so when Walter messaged me on gchat I responded. Having just finished App Academy, Walter is very up to date on efficient data structures, and had apparently spent the morning researching xor (exclusive or) linked lists. I'm currently taking an evening class in data structures, so when he messaged me to teach me about them I was interested to hear more. In fact, my professor had mentioned them but never demonstrated an implementation.

To understand xor linked lists we need to understand the problem they solve. Suppose you're severely memory-constrained and want a linked list you can traverse both forward and backward, but with only one extra reference per node rather than the prev and next references typical of doubly linked lists. That's where xor linked lists come in: one field per node that encodes both next and previous.

As the name indicates, the crucial part of this is the xor operator (^ in javascript), which sets each bit of the result to 1 where the corresponding bits of its operands differ, and to 0 where they match. So when xoring two 16-bit ints, 0000000000000001 ^ 0000000000000010 = 0000000000000011. This means 1 xor 2 = 3, and as Walter told me to do, you can check this in the javascript console of your web browser with 1^2. Walt also informed me that xor is reversible: 1^2 = 3, 1^3 = 2, 2^3 = 1 — xoring any two of the three numbers recovers the third.

(Before I really get into it, here's some notation I just invented. It won't hold true elsewhere, but in the following, {} is a value resulting from the xor operation and anything in [] is an object storing our node data and the previously mentioned value.)

Now notice that memory addresses are just numbers, so this property of xor holds for them too. Suppose we had [null], [1], [2], [3], [4], [null] as our example list, with {[null] xor [2]} stored in node [1], {[1] xor [3]} stored in node [2], and so on, where the things being xor'd are actually addresses in memory. Then given [null], [1], and {[null] xor [2]}, we can take {[null] xor [2]} xor [null] to get the memory address where [2] is stored. We can now drop [null] and keep just [1] and [2]. In [2] we'll also find the value {[1] xor [3]}, and by taking {[1] xor [3]} xor [1] we find that the address of the value after [2] is [3].

Continuing this pattern we could traverse the whole list. In all we would need to store four pieces of information — our first two and last two nodes — then use the above pattern to traverse the list forward or backward. I did some googling, and there are a few problems with this technique. First, it's pretty risky in Java, where you can't access memory directly and your unreferenced objects are liable to get garbage-collected out from under you. It also apparently really messes with debuggers and makes it hard or impossible to step through code. Finally, memory is cheap; this is a data structure I could see used in some kind of controller with little memory, but not on a modern desktop or laptop.
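Since most high-level languages won't let you xor raw pointers, here's a toy Python simulation of the idea (everything here is invented for illustration): memory is faked with a dict of integer "addresses", each node stores its value plus prev_address xor next_address, and address 0 plays the role of null.

```python
# Fake "memory": integer address -> (value, link), where
# link = addr(prev) xor addr(next), and address 0 stands in for null.
memory = {}

def build(values):
    """Build an xor linked list from a non-empty list of values.
    Returns the (fake) addresses of the head and tail nodes."""
    addrs = list(range(1, len(values) + 1))      # fake addresses 1..n
    for i, addr in enumerate(addrs):
        prev_a = addrs[i - 1] if i > 0 else 0
        next_a = addrs[i + 1] if i < len(addrs) - 1 else 0
        memory[addr] = (values[i], prev_a ^ next_a)
    return addrs[0], addrs[-1]

def traverse(start):
    """Walk the list from one end; carrying the previous address is
    what lets the single xor field stand in for two pointers."""
    out, prev_a, addr = [], 0, start
    while addr != 0:
        value, link = memory[addr]
        out.append(value)
        prev_a, addr = addr, link ^ prev_a       # recover the next address
    return out

head, tail = build([1, 2, 3, 4])
print(traverse(head))  # → [1, 2, 3, 4]
print(traverse(tail))  # → [4, 3, 2, 1]
```

Note that backward traversal is the exact same loop started from the tail, which is the whole payoff of the structure.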

I know this was a long post and probably difficult to read, but thanks for reading anyways. Future posts will be much less daunting, and there will probably be days where I post something like "I'm hungover and learned that naps are great". Thanks to Walter for having sparked this post, check out his github.

An Introduction

Before we get into a first post on what I've learned today I want to introduce myself. 

I was raised in a small town in New Jersey called Kenilworth, the type of place where your gym teacher yelled things like 'I remember when your father climbed this rope in ten seconds!'. Luckily this was never directed at me; my parents had moved to Elizabeth, NJ from Cuba. When I was two years old we moved out of Elizabeth to Kenilworth, to a great corner of town with a cul-de-sac and kids my age to play manhunt with. I've tried every sport I've had the chance to, and in college I picked up rugby. Now, during seasons when I'm not too busy, I play with the Bayonne Bombers in Bayonne, NJ. My favorite beer changes with my mood, but I like Lagunitas IPA.

Welcome!

Hi Everyone and welcome to my blog! Hopefully I'll post something new that I've learned.