## Efficient Matlab (I): pairwise distances

Changed some code from impossibly shore to merely slow! Machine Learning Rumination

I want to share some tricks for making Matlab function more efficient and robust. I decide to write a series of blog posts. This is the first one of this series, in which I want to show a simple function for computing pairwise Euclidean distances between points in high dimensional vector space.

The most important thing for efficient Matlab code is VECTORIZATION! Here I will take distance function as an example to demonstrate how accelerate your code by getting ride of “for-loops”.
Say, we have two sets of points X and Y in d-dimensional space. We want to compute the distances between any point in X and any other point in Y. This function is very useful. For example, if you want to find the k-nearest-neighbor of a point x in a database Y, an naive method is to first compute the distances between the point x and all points in Y…

View original post 335 more words

## Nim game

The other I was cutting the hedge and the smell of cut box reminded me of my late grandad. He was a pretty cool guy and inspired me a lot to study maths and physics, as he was always busy solving some puzzle or other.

One of the games he taught us was “take the stick”. This is a two player game where a certain number of ticks are drawn on each row. Each player has to remove at least one or more ticks from a particular row. The last player to remove a stick loses. We used to play with five rows containing five, four, three, two and one stick respectively. I always has a stinking suspicion that the first player always wins if they play optimally, but could never prove it. I figured, now that I am all grown up I should give it a go. I should add the old man was a bit of a luddite, so I probably should not do it with a computer!

The technique that I will use to prove this, is similar to a minimax algorithm. In particular, I observe that there are a finite number of states in the game and that each of these can be converted into some of the other ones. This means that the game can be represented as a directed graph. For example, if we were playing “take the stick” with two rows, the first with two ticks, the second with one, the directed graph representing every possible game would be: In this picture each box contains the disposition of sticks in a particular state of the game. Since the player to remove the last stick loses, the box with one tick is a losing state. In addition, any configuration of the game which can lead to a losing state, is itself a winning state. Recapping there are two types of states:

1. Winning states can lead in the next moves to a losing state
2. Anything that is not a winning state is a losing state. The state with a single tick is by default a losing state

For any game except for the 2-1 game, drawing out the entire graph by hand is difficult, too many states and too many arrows. I therefore invent a bit of notation. I am going to therefore invent a bit of notation: winning states are going to be drawn in a rectangular square and losing games are going to be drawing in a circular one. Each losing game will be labelled A,B,C etc. and each winning game will be labelled with the letter of one of the losing games that it is connected to. Furthermore I am going to cunningly order the states by ensuring that the possible successors of a state are always after it (*).

For the 3-2-1 game the graph now looks like this: This shows that in the 3-2-1 game, the starting position is a losing position. Whatever the starting player does, the next player will be in a winning position. Clearly if the 3-2-1 game is a losing position, the 4-3-2-1 is a winning one. But what about 5-4-3-2-1 that we used to play so often? This is at the limit of being doable by hand as it has about 130  individual states. The following pictures show my graphic for this case:  So, clearly, the 5-4-3-2-1 game is a winning game. (A) winning strategy is to remove one of the ticks from the row of five, leaving the second player in the 4-4-3-2-1 position which is a losing one!

(*) This is ensured by sorting the rows of strings such that the longest one comes first and then ordering the states such that they are in lexical order.

## Unit testing destructors

I asked the excellent Kevlin Henney about techniques to test the destruction of pointers in C++. He suggested that the basic idea is to overload the “new” and “delete” operators and count how many times each is called. A testing framework can then check that the number of calls to new and delete balances out.

Testing destructors should definitely be tested in unit tests, but it strikes me that simply ensuring that new is called as many times as delete might not be sufficient. For example this function:

void leaky_function () {
new double;
delete NULL;
}

This code shows an overloaded new and delete that keeps counts allocation/deallocation. Note that the register that keeps track of which pointers have been allocated/deallocated cannot be an stl container, as these call “new” and delete themselves causing a stack overflow.

Posted in Uncategorized | 3 Comments Cook's Tales

Augusta suggested a new recipe she’d seen on TV. So, ever ready to please…

I whizzed a thick slice of white bread with an anchovy, a handful of capers, some black olives, a few small cherry tomatoes, herbs from the terrace and a little salt and pepper. I put some seasoned cod fillets into a lightly oiled dish, covered with the “crumble” topping and baked for 10-15 minutes at 180 C.

Next time I try this dish, and there will be a next time because it certainly has promise, I need to find a  a way of giving the crumble more crunch. maybe by increasing the oven temperature?

View original post

## Would Icarus’s wings _actually_ melt on way to the sun?

The other day, we were arguing in the office about Icarus’s flight to the sun. As you all know, the legend goes that Icarus made himself some wings out of wax and feathers. Unfortunately his hubris led him to fly higher and higher until his wings melted due to the heat from the sun.

Our argument was about whether, having abandoned the safety of the atmosphere, would the wax melt or freeze when in sunlight? In order to answer this question, we need to make use of the principle of detailed balance. According to detailed balance, energy exchanges between bodies at equilibrium must cancel out. Therefore if Icarus is receiving a certain flux of energy from the sun, he must also be emitting the same amount of energy. The corresponding equation is: $E_{incoming} = E_{outgoing}$, The only way for Icarus to emit energy is by radiating black body radiation. Since the intensity of black body radiation is related to the temperature of the body by Stefans law, we know that: $E_{outgoing}=S\sigma T^4$.

where T is Icarus’s temperature, $\sigma$ is Stefan’s constant and $S$ is Icarus’s total surface area. Similarly, the incoming energy depends on the solar flux and on the surface of Icarus that is exposed to sunlight: $E_{incoming} = S_{solar} \Phi$.

I will assume that the surface of Icarus exposed to the sun is half of the total surface area. Therefore Icarus’s temperature will be: $T = ( \frac{\Phi}{2 \sigma} )^{1/4}$ .

Now, the solar flux $\Phi$ is approximately 1 $kW m^{-2}$ at the distance from the Earth to the sun, therefore Icarus’s temperature is approximately 306 degrees Kelvin. Or about 40 degrees Celsius. The melting point of wax is 60 degrees, so Icarus would be fine. In your face legends!

Posted in Uncategorized | 3 Comments

## Solving Sudoku by depth first search

I am trying to remind myself of a little C++ and have decided the ideal project to do so would be to write a little solver for Sudoku. My aim is, given a board,  to find all valid solutions. A solution is valid if each number is present exactly once on each line and row and in each quadrant. There are two ideas that I use: updating the constraints and performing a depth first search.

# Updating the constraints

Given a particular assigned position on the board, no other position on the same row or in the same quadrant is allowed to have that value. I therefore store in an array of 9bits the allowed values for each position. For example, if a particular position is allowed to be in any position its state would be “111111111”. If, however it is allowed to only have a value of ‘1’ or a value of ‘3’ its’ state would be: “000000101”.

There are two particular important states. If a position is not allowed to take any value, then its state will be 0. This means that the board configuration I am considering is not valid. If on the other hand the state has a single true bit, then the position is in a definite state and I can assign that position.  So the pseudo code for updating the constraints is:

1. UpdateConstraints(AssignedPositions):
2. for aPos in AssignedPositions:
3.       if aPos has not yet been assigned:
4.            assign aPos
5.            for neigh in (list of neighbour of aPos):
6.                 unset aPos as possible state of neigh
7.                 if neigh has no possible states left: return -1
8.            for Pos in AllPositions:
9.                  if Pos has a single state left: GOTO 1
10. return 0

# Depth first search

The approach above is not sufficient, as given the initially assigned cells, it is possible for certain positions to still not have a determined state even after updating the constraints. In order to try out all possible states, I perform a depth first search, by creating all board states  that differ from the original one by having a single cell occupied and performing the algorithm above.

# Testing

I first tested on a simple news paper problem.

Starting problem:

_____________
|304|610|005|
|708|000|306|
|000|903|400|
_____________
|807|000|510|
|020|705|040|
|600|091|002|
_____________
|480|352|007|
|000|000|900|
|106|009|280|
_____________

After a single guess I can find the solution to  be:

_____________
|394|617|825|
|718|524|396|
|562|983|471|
_____________
|837|246|519|
|921|735|648|
|645|891|732|
_____________
|489|352|167|
|273|168|954|
|156|479|283|
_____________

I then tried a 17 given problem from  Wolfram Alpha.

_____________
|030|000|900|
|006|000|000|
|000|241|030|
_____________
|000|900|700|
|000|002|004|
|080|070|020|
_____________
|850|000|000|
|090|704|000|
|000|006|001|
_____________

this problem is clearly much much harder. The problem still is solved in a fraction of a second, having tried just 88 guesses with solution:

_____________
|532|768|941|
|416|395|872|
|978|241|635|
_____________
|325|984|716|
|769|512|384|
|184|673|529|
_____________
|853|129|467|
|691|437|258|
|247|856|193|
_____________

unfortunately I seem to have also spat out another valid solution:

_____________
|532|768|941|
|416|593|872|
|978|241|635|
_____________
|325|984|716|
|769|312|584|
|184|675|329|
_____________
|853|129|467|
|691|437|258|
|247|856|193|
_____________

It turns out that a “mathematician’s” Sudoku is different from a regular Jo one. The maths ones are allowed to differ by certain symmetry operations that I do not understand. Still main objective – to brush up on my C++ is achieved! Here is the source code – if you like  you can inspect my code here

## Are you more likely to give birth on a full moon? Revised

In my previous post Ratbag and Roland pointed out that there seems to be a slight increase in probability in the birth rate for the week after a full moon and they suggested I carry out a significance test. I returned to the original data and found that the assumption that birth is equally likely on all days is in  significant disagreement with the data: could it be that the moon actually has an effect on the probability of birth? Luckily, if I correct my assumptions by noting that births are less likely on weekends, the influence of the full moon disappears again. So you are not more likely to give birth on a full moon, but you are more likely to give birth during the week.

In order to make a significance test it is necessary to have  a model of the underlying probability. In this case, we need to make some assumption about the probability of being born at a certain number of days distance from the nearest full moon. The most naive assumption is that babies are born with equal probability on each day of the year. In 1969 there were 13 full moons. Of the 12 intervals between each full moon, 6 were 29 days long and 6 were 30 days  long. Thus, assuming each day of the year is equally probable for birth, we can compute that the  probability of a baby being born between -14 and +14 days from a full moon is 0.03391 (to 4 s.f.) and  that  the probability of a baby being born 15 days from a full moon is 0.01667 (to 4 s.f.). The variance of a binomial distribution is $v=p(1-p)n$, where p is the probability of an even and n is the number of events considered. Since about 1.8 million births is the sample size used, this leads to a standard deviation $\sigma=0.000135$. Since any deviations from the probabilities of more than $3~\sigma$ would be significant, the increase in probability in the ten days after full moons seems significant.

This result surprised me: are women’s wombs really in tune to the music of the spheres? To make my mind up, I decided to look at the distribution of births as a function of day of the year, which is shown in the graph below. Every 7 days there seems to be an event which leads to a significant reduction in births. This event happens on Saturdays and Sundays.  I am not sure why, but I reckon that doctors and nurses being on holidays might have something to do with it :D.

A better model for the probability of being born a certain number of days away from the full moon, must take this variation into account. If I do this, I obtain the curve shown in green in the graph below. The error bars on the green and red line show the 99.7% confidence interval (3 $\sigma$) . Notice that they match the data very closely indeed.

So the bottom line is that you are certainly more likely to give birth during the week, but the moon has nothing to do with it.

## Are you more likely to give birth on a full moon?

I have been attending ante-natal classes recently and, although I greatly appreciated our teacher, I found myself skeptical about some of her claims.  For example, our instructor was adamant that it is more likely to give birth during a full moon than on other days. She even told us that maternity wards are over-run during full moon nights!

In order to investigate this claim, I decided to look at some birth records. The centre for disease control in the US keeps very thorough records of births. These can be freely downloaded from: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm . It is also possible to download the date for full moons from this website .

Having obtained data for dates of birth and dates of full moon, I  wrote a program to extract the probability of giving birth as a function of phase in the moon. The following graph shows the results for births in the US in 1969. The x axis shows the number of days between the birth and the full moon and the y axis shows the probability of birth. If there was a correlation between full moon and likelihood of giving birth, there should be a peak in the distribution about x=0. However, clearly, the distribution is flat. Therefore there is no correlation between these two quantities. I guess it is still possible that only british women feel the influence of the moon. I somehow doubt that!

Here is the code I used to analyse the data:


import bisect
#birth data from CDC: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
#Phases of the moon from http://moonphases.info/past_full_moon_dates_calendar.html#1970

codeToYear={'9':1969, '0':1970, '1':1971}

FullMoon1969 = [ [03,01,1969], [02,02,1969], [04,03,1969], [02,04,1969], [02,05,1969], [31,05,1969], [29,06,1969], \
[29,07,1969], [27, 8,1969], [25, 9,1969], [25,10,1969], [23,11,1969], [23,12,1969] ]

#here i build the cumulative number of days from the beginning of a year to a certain month, in order to convert a date
#into an integer describing the # of the day of the year
length_of_month = [ 31, 28, 31, 30, 31,30,31,31,30,31, 30,31]
cum_day_by_month = dict()
cum = 0;
for i in range(1, 13):
cum_day_by_month[i] = cum
cum += length_of_month[i-1]

def get_birth_day(s):
year = codeToYear[s]
month = int(s[83:85])
day   = int(s[85:87])
return [ day, month, year]

def hash_date(date):
[d,m,y] = date
# when the day is not recored, the field is entered with a 99
if d != 99 :
return cum_day_by_month[m] + d
else :
return - 1

def get_hash_birth_day(s):
return hash_date(get_birth_day(s))

# this list contains the day of the year when a full moon occured
full_moon_hashed = map(lambda x : hash_date(x) , FullMoon1969)

def d_from_full_moon(h):
# bisect_left ensures that h <= full_moon[i]
i = bisect.bisect_left(full_moon_hashed, h)
if i != len(full_moon_hashed) :
d1 = h-full_moon_hashed[i]
d2 = h-full_moon_hashed[i-1]
if abs(d1) < abs(d2) :
return d1
else:
return d2
else:
return h-full_moon_hashed[i-1]

distribution_full_moon_d = list()
for i in range(36):
distribution_full_moon_d.append(0)

f = open("NATL1969.PUB", 'r')
count = 0
for line in f:
hd = get_hash_birth_day(line)
#ignore the births where the date is non-reported
if hd != -1 :
d = d_from_full_moon(hd)
distribution_full_moon_d[d+18] += 1
count += 1

print "#number of records counted: " , count
print "#days from/to full moon, probability"
for d in range(36):
print d-18, float(distribution_full_moon_d[d])/count


Posted in Uncategorized | 5 Comments

## Which way round do the pedals go?

Since my last post was cycle related, I thought I would share the most fiendish problem regarding bikes that I have come across so far, courtesy of one my colleagues from Southampton (GR!).  It has resurfaced with a colleague at Imperial (PB) and I fear I may in the spur of the moment given the wrong solution. Here is my attempt to right my wrongs.

Imagine you have a bike with its pedals in the vertical position. The bike is assumed to be magically balancing and not falling sidewise, whilst still being free to move backwards and forwards. You grab the bottom pedal and push it backwards. Will the bike move forwards or backwards?

Consider the diagram below: The two important points on this diagram are: 1), the bottom of the rear wheel and 2) the bottom pedal.What the problem boils down to is the position of the bike as a function of the position of 2.  The other quantities shown on the diagram are the angles φ and θ describing the positions of 1 and 2 relative to the bike itself, the radius of the back wheel ‘r’ and the length of the pedal cranks ‘k’.

Assuming that the bike travels in a straight line, the centre of the cranks (the bottom bracket) is at a point with coordinates (d,0), where d is the distance that the bike has travelled. I will define d such that it is positive if the bike is moving forward and that it has a value of 0 when the pedal and the bottom of the wheel are in the positions shown in the diagram. It is very straightforward to relate d to the angle φ: $d=r \phi$,

this is simply because the wheel is not allowed to slip, therefore for each inch that the bike moves forward the wheel must rotate anti-clockwise. Notice that φ is implicitly defined to be positive when the wheel is rotating anti-clockwise.

Next we want to relate φ to θ. In order to do this, we need to know the gearing g of the bike. I define this as the number of turns the wheel makes for every turn the pedals make. So a high gearing is the one you use when speeding downhill and a low gearing when struggling to get up a steep mountain.  With this definition in mind, it is clear that the two angles can be related by: $\phi=g\theta$

Now we know φ, all that is left to do is to write down the position of the pedal in cartesian coordinates. Since the position of the centre of the cranks is (d,0) and since the pedals are turning in a circle of radius k and defining φ as 0 when in the position shown in the diagram above we conclude that the position of the cranks can be written as: $p=\left( d-k~sin(\phi),-k~cos(\phi)\right)$

we can use our other formulae to eliminate d for φ: $p=\left( \phi~rg-k~sin(\phi),-k~cos(\phi)\right)$

When we push the pedal backwards, we are making the x position of the pedal small and negative. The question boils down to whether this corresponds to φ being positive or negative. It is evident from the formula above that the answer must be “it depends” on the values of g,r and k. Let’s do a little bit more analysis to show exactly how. If the x position is small, then we can assume that φ is also small and that the sin can be approximated linearly. If we then measure distance in units of k and divide the whole problem through by k we find: $x=\phi\left(g'-1\right)$

where g’ has been defined as $g'=\frac{g r}{k}$. So there are two cases. If g’ is greater than 1, then a negative x corresponds to a negative φ. This means that moving the pedal backwards corresponds to the cranks and the wheel rotating clockwise and hence to the bike moving backwards. On the other hand, if g’ is smaller than 1, then a negative x corresponds to a positive φ. In this case, moving the pedal backward makes the wheel turn anti-clockwise and hence the bike moves forward. Since most normal bikes have a gearing g greater than 1, and since it would be hard to design a bike with cranks longer than the radius of the wheel, g’ is almost always greater than 1 therefore for a normal bike, if you push the pedals back, the bike moves back!

Posted in Uncategorized | 4 Comments

## Times cycle campaign I invite you guys to subscribe to the Times’ cycle campaign to make British roads safer for cyclist.