Wednesday, May 17, 2017

Apartment Pricing Model

Note: The following was copied directly from my GitHub project readme.

Project
I am going to be looking for a new one-bedroom apartment pretty soon. Given that I just finished a course in Machine Learning and a semester of computer science, my first thought was to see whether I could find data to model the rental market in my area. That should be simple, right?

Finding Data

This was much more difficult than I expected. Once I started googling "Hoboken, NJ rent data" it was pretty clear there was no obvious data set or survey that I could get for free. My next best bet was to use US Census data. Specifically, I used the American Housing Survey (AHS) 2015 National Public Use File and the AHS 2015 Summary Tables. Both can be found here.

Understanding The Data

The first task was to understand my dataset and what all of its values mean. Luckily for me, Hudson County, NJ is included in the New York data, as specified here. Understanding the data also meant going through the Summary Tables and deciding which fields I wanted to include in my model. The fields had to be things I could estimate after visiting an apartment and plug back into my model. If you look at the get_descriptions() function in HousingData.py you'll see exactly which features I included, along with their descriptions.

Cleaning The Data

Next I needed to understand how individual features behaved in order to decide whether any features or rows should be removed. Specifically, I wanted to understand the continuous features like rent price and square footage. I found that the maximum rent was $10,600, which seemed much too high, so I simply removed those rows. The same problem arose with people who paid $0 in rent. Alternatively, I'm considering leaving the maximum rent untouched and changing the zeros to the reported monthly mortgage for those same rows, where a mortgage is reported. I also found that unreported square footages were marked as -6; those rows I also just removed from the data set.
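As a rough sketch of what that filtering might look like, assuming the data is loaded into a pandas DataFrame (the file name and column names here are made up for illustration; the real AHS field codes are the ones listed in HousingData.py):

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    df = pd.read_csv("ahs_2015_national.csv")

    # Drop rows where the rent is $0 or sits at the suspicious $10,600 maximum.
    df = df[(df["RENT"] > 0) & (df["RENT"] < 10600)]

    # Drop rows where square footage was not reported (coded as -6).
    df = df[df["SQFT"] != -6]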

Training A Model

Finally, I needed to train a model. I decided to begin with a simple linear model and leave a more complicated model for the future. To account for non-linear relationships between my predicted value (rent) and the features, I trained the linear model seven times, on polynomial feature sets of degree 0 through 6, and then chose the model that scored best on my testing data. Because I was using polynomial features I needed to penalize overfitting, and therefore used a Bayesian linear model. Bayesian linear models protect against overfitting by introducing a prior distribution on the model parameters. The log likelihood then has a term of the form (lambda/2) * ||w||^2 subtracted from it, penalizing solutions with many, or large, weights. Specifically, I decided to use the BayesianRidge model provided by sklearn with the default distribution parameters and normalized data. I saw no reason to change the default distribution parameters yet because I really had no prior intuition about their values.
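Here is a sketch of that training loop, assuming the cleaned features are in X and the rents are in y (BayesianRidge's normalize flag existed in the sklearn releases current at the time of writing):

    from sklearn.linear_model import BayesianRidge
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    best_score, best_model = float("-inf"), None
    for degree in range(7):  # polynomial feature sets of degree 0 through 6
        model = make_pipeline(PolynomialFeatures(degree),
                              BayesianRidge(normalize=True))
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)  # R^2 on the held-out test data
        if score > best_score:
            best_score, best_model = score, model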

Results

This model succeeded pretty well in spite of only using ~600 of the ~2000 New York rows reported. The score and model could be improved by following through on what was mentioned in Understanding The Data, or maybe by using a neural network (next project :).

Questions

  1. How could I go about deciding better values for my prior parameters than the defaults?
  2. Should I try higher degree polynomials?
  3. Would a model other than Bayesian Ridge work better? Why?
  4. Is it worth normalizing the data in this case? When would one NOT normalize the data?
Please contact me if you think there's something I could improve or am misunderstanding. Specifically, I am worried I am misunderstanding the score value returned by the model on testing data. The sklearn documentation says R^2 is defined as 1 - u/v, but also says that the best possible score is 1.0. I am measuring distance from 1.0, but worry that this is already being done in the definition of the score.
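For reference, here is how I currently understand the score (a sketch, reusing best_model and the test split from the training sketch above):

    from sklearn.metrics import r2_score

    # For regressors, model.score(X_test, y_test) returns R^2 = 1 - u/v,
    # where u is the sum of squared residuals and v is the sum of squared
    # deviations of y_test from its mean. 1.0 means a perfect fit, 0.0 means
    # no better than always predicting the mean, and the score can go negative.
    r2 = r2_score(y_test, best_model.predict(X_test))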

Monday, May 15, 2017

Jacobian

The Jacobian matrix is used when performing a change of coordinates, often to simplify an integration. This is because the Jacobian tells us how much the area around a point in N-dimensional space is stretched or skewed when changing bases. Intuitively we might think that for some change of basis function
$$T(u,v)\ =\ (x(u,v), y(u,v))$$ 
that the integral 
$$\int\int_D\ f(x,y) dx\ dy = \int\int_{D^*} f(x(u,v), y(u,v)) du\ dv$$
However, it is easy to find cases where this does not hold. For example, take
$$T(u, v) = (-u^2 + 4u, v)$$ 
Over the unit square 
The area of the unit square is obviously 1, but after applying the change of basis function T, we arrive at the region shown below.

Please note that in this graph the x intercept is 3, making the area captured under the new basis 3. You can prove this to yourself by seeing that
$$T(1,1) = (-1 + 4, 1) = (3, 1)$$
And similarly for the other corners of the unit square, showing that
$$\int\int_D f(x,y) dx dy \neq \int\int_{D^*} f(x(u,v), y(u,v)) du dv$$
What this really shows is that our intuition was wrong: we need to multiply the result of our coordinate change by some value indicating the sensitivity of the points to the change. That value comes from the Jacobian. More specifically, we multiply by the absolute value of the determinant of the Jacobian matrix. Please take a look at the bottom of this post to learn about the man himself, Carl Gustav Jacob Jacobi.
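Written out, the corrected version of the formula above is
$$\int\int_D\ f(x,y)\ dx\ dy = \int\int_{D^*} f(x(u,v), y(u,v))\ \left|\det J\right|\ du\ dv$$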
The Jacobian matrix for a change of variables from (u, v) to (x, y) is defined entrywise as
$$J_{ij}\ =\ \frac{\partial x_i}{\partial u_j}$$
That is, each entry is the derivative of one of the change of basis functions with respect to one of the new variables. The result is a matrix, but we are looking for a single value to multiply our integral by, so we take the determinant. For a 2x2 matrix
$$\begin{pmatrix} a_{(0,0)} & a_{(0,1)} \\ a_{(1,0)} & a_{(1,1)} \end{pmatrix}$$
where subscripts indicate the (row, col) position, the determinant is
$$a_{(0,0)}\ a_{(1,1)} - a_{(1,0)}\ a_{(0,1)}$$
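Returning to the example above, we can check that the determinant of the Jacobian supplies the missing factor of 3. For T(u, v) = (-u^2 + 4u, v), the Jacobian matrix is
$$J = \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix} = \begin{pmatrix} 4 - 2u & 0 \\ 0 & 1 \end{pmatrix}$$
so the determinant is 4 - 2u (positive on the unit square), and
$$\int_0^1\int_0^1 (4 - 2u)\ du\ dv = 4 - 1 = 3$$
which is exactly the area we found after the change of basis.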
Above is the most well known application of the Jacobian matrix. However, the Jacobian is really just a set of derivatives that indicate the sensitivity of functions to changes in their input values. In fact, the Jacobian matrix comes up in machine learning when using backpropagation to propagate errors through a particular component of a modular neural network.
In the modular network above, note that minimizing the error with respect to the weights w requires finding the intermediate derivatives of each module output y_k with respect to its inputs x_i. Since the Jacobian of all the y_k with respect to all the x_i tells us how sensitive the outputs are to changes in the inputs, we can use it to estimate how small errors in the inputs propagate to the outputs, and from there to the overall error. We can think of this as a measure of the sensitivity of the error with respect to the inputs. For a more detailed explanation please see Pattern Recognition And Machine Learning by Christopher M. Bishop, page 247, section 5.3.4, from which I took this example and the above image.
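In Bishop's notation, the approximation is that a small change (or error) in the inputs produces a change in the outputs of roughly
$$\Delta y_k \approx \sum_i \frac{\partial y_k}{\partial x_i}\ \Delta x_i$$
where the partial derivatives are exactly the entries of the Jacobian matrix, evaluated at the current inputs.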
The Jacobian matrix itself is interesting in the information it provides and its applications, but before looking at it in more detail I knew nothing about its creator, Carl Gustav Jacob Jacobi. A quick look at the Wikipedia page shows he was an impressive mathematician and part of a family of distinguished people. His older brother Moritz von Jacobi contributed to physics and engineering, and Carl was homeschooled by his uncle until the age of 12. At 12 he was moved to a private school, and within half a year was put into senior level courses. The only reason he didn't go to college immediately is that the universities were not accepting applicants under 16 years old. Before going to college, by now bored by his education, he tried to solve the quintic equation by radicals (whatever that is). At 21 years old he was lecturing on the theory of curved surfaces at the University of Berlin. On the other hand, when I was 21 I was either learning or forgetting about the Jacobian matrix in my multivariate calculus course.

graph 1 (unit square) - source
graph 2 - wolfram alpha
example - Vector Calculus 5th edition by Marsden and Tromba 
modular network image - Pattern Recognition and Machine Learning - Bishop pg. 247
Jacobian application - Pattern Recognition and Machine Learning - Bishop pg. 247

Thursday, January 12, 2017

Lucky 13

The following was taken from the Nautilus article "How Designers Engineer Luck Into Video Games" by Simon Parkin, which can be found here.

"Olaf Haraldsson, an 11th-century Norwegian king, once wagered a kingdom in a faith-testing game of dice. Olaf was locked in a territorial dispute with the king of Sweden over the island of Hising; eventually the two agreed to settle the matter with a dice throw. The Swedish king rolled two sixes, the highest possible score, and said there was no point in continuing the game. Olaf insisted on taking his throw; a recent convert to Christianity, he was certain that God would steer the dice in his favour. His faith was vindicated with double sixes. The men continued to take turns throwing their dice, twelve after twelve. The matter was finally settled when, during Olaf’s final throw, one of the dice split in two, to show both a six and a one, winning him the kingdom on an unprecedentedly lucky 13."

Friday, December 2, 2016

Python Swap Function

In the coming semester it is likely that I will T.A. a course in Python. Lately I've been researching the language and using it for my pet projects that never get finished, interviews, etc...

One thing I found interesting is writing a swap function. For some sorting algorithms we often have to swap the values at two indices in an array, so this comes up fairly often. What I found especially interesting was that you can do this in one line in Python, whereas in most programming languages you have to write something like...
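Something along these lines, using a temporary variable (a sketch written in Python syntax for easy comparison, since the original snippet was posted as an image):

    # the usual three-step swap through a temporary variable
    temp = arr[i]
    arr[i] = arr[j]
    arr[j] = temp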

In Python, this can be written as... 
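(again, a sketch of the one-liner)

    # swap arr[i] and arr[j] in a single line via tuple packing and unpacking
    arr[i], arr[j] = arr[j], arr[i]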

This is because Python evaluates the entire right-hand side of an assignment before doing any assigning. The right-hand side gets packed into a tuple, and the targets on the left are then assigned their corresponding tuple values.

Most experienced Python programmers already knew this, but I found it pretty cool. All credit for the knowledge goes to this Stack Overflow post.

Wednesday, November 23, 2016

Researching Encryption in Windows and C# .NET

I was recently tasked with researching different methods of encryption for a license management module within my company and thought it might be useful to write down some of my findings. 

Cryptography has been around for a long time. One of the more famous forms of old cryptography comes from the Roman army, which used the "Caesar cipher" to shift every letter three places to the left: "A" would become "X", "B" would become "Y", and so on.
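Just to make the mechanics concrete, here is a quick sketch of that shift in Python (purely illustrative, unrelated to the .NET research below):

    def caesar_encrypt(text, shift=3):
        """Shift each letter `shift` places to the left, wrapping around the alphabet."""
        out = []
        for ch in text.upper():
            if ch.isalpha():
                out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
            else:
                out.append(ch)
        return "".join(out)

    print(caesar_encrypt("ABC"))  # XYZ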

These days there are many more forms of data, and many more ways to intercept it. Specifically, we distinguish between data in transit and data at rest.

For data in transit we use public key / private key (asymmetric) methods. The security of the classic example, RSA, rests on how difficult it is to factor extremely large numbers into their prime factors. If you're interested, this is a great video to watch.

With data at rest we typically use symmetric encryption algorithms. In the Microsoft .NET library, these encryption methods are very well documented and live in the System.Security.Cryptography namespace, which includes classes for both in-transit and at-rest cryptography.

Within the namespace there are two main symmetric encryption classes, Rijndael and Aes. The Microsoft-recommended method for encrypting data at rest is Aes, because Rijndael will not work when the FIPS-compliant security setting is enabled on a Windows operating system.

In conclusion, if you are encrypting a file, database, or some other data on disk in a Windows environment, you should be using either the AesManaged or AesCryptoServiceProvider class from the System.Security.Cryptography namespace; if the FIPS setting matters, note that AesCryptoServiceProvider is the one backed by a FIPS-validated implementation.

Tuesday, September 27, 2016

Fermat's Near Miss

A lot of people have heard about Fermat's Last Theorem. Originally proposed in the 1600s, the theorem was finally proved in the 20th century by a mathematician named Andrew Wiles.

Fermat's Last Theorem says that there are no positive integer solutions (x, y, z) to the equation x^n + y^n = z^n for any integer n > 2. If you've studied math or physics at a university, or in some cases at the high school level, you have likely heard of it. Today I was reminded of an example from The Simpsons which purports to show a solution where x^n + y^n = z^n. Specifically...


There we see 3987^12 + 4365^12 = 4472^12. And if you were to plug this into a handheld calculator, it would appear to be correct!

So what is going on? How could Homer Simpson have a counter-example to Andrew Wiles's proof? In fact, these numbers are so large that most calculators round off the error, so the two sides appear to be equal! Combinations of x, y, and z like this that "solve" Fermat's Last Theorem are known as near misses. I was reminded of this interesting example by this video, but have also read about it in this book.
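You can see both the failure and why calculators are fooled with a few lines of Python, which uses exact integer arithmetic (a quick sketch):

    lhs = 3987**12 + 4365**12
    rhs = 4472**12

    print(lhs == rhs)       # False -- the two sides are not actually equal
    print(lhs - rhs)        # nonzero, though tiny relative to the numbers involved

    # The 12th root of the left-hand side agrees with 4472 to better than ten
    # significant digits, which is why a handheld calculator shows equality.
    print(lhs ** (1 / 12))  # roughly 4472.000000007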

Also of note: in the progression from a holey doughnut to a filled-in doughnut along the bottom of the chalkboard, Homer is "solving" a question in topology! Unfortunately, in topology, a doughnut and a sphere are not equivalent...

Monday, August 15, 2016

C# .Net Utility in WPF

I very recently had the opportunity to develop a simple tool in WPF and C# .NET to execute some procedural actions necessary to quickly make copies of our application. Surely there is an "easier" way to do this with Windows system scripting, but for the sake of education and usability (we don't want our clients running Windows scripts) we chose to write an executable.

Several things surprised me during development. First, WPF has become much easier to work with. The most useful resource I found was WPF MVVM In Depth by Brian Noyes. In his tutorial he builds WPF up from code-behinds and simple applications that take inputs and generate outputs. The most essential parts for me were the building blocks of implementing INotifyPropertyChanged and ICommand. This was good enough for my current application, but more complex applications meant to display and manipulate data will probably mean using ObservableCollections and defining SQL Server data sources.

This project, along with other personal things, kept me very busy over the last few weeks, but I hope to return to posting more frequently. I've now changed the blog name from "Daily Blog" to simply "Blog".