Deep learning and Bayesian Nonparametrics

Deep learning thrives on problems where automatic feature learning is crucial and lots of data are available. Large amounts of data are needed to optimize the large number of free parameters in a deep network. In many domains, that would benefit from automatic feature learning, large amounts of data are not available, or not available initially. This means that models with low capacity must be used with hand crafted features, and if large amounts of data become available at a later time, it may be worthwhile to learn a deep network on the problem. If deep learning really reflects the way the human brain works, then I imagine that the network being trained by a baby must not be very deep, and must have few parameters than networks that are trained later in life, as more data becomes available. This is reminiscent of Bayesian Nonparametric models, wherein the model capacity is allowed to grow with increasing amounts of data. 

Religion, GDP and Happiness

The percentage of a country's population that feels religion plays an important role in their lives (x-axis) is negatively correlated with the GDP per capita (Y-axis, in log-scale):

Here are the heatmaps, first for religion:

and per capita GDP (PPP):

From the heat maps above, several countries in Africa and Asia stand out as having low per capita GDP and high importance of religion, while all of Scandinavia and Australia stand out as places with high per capita GDP and low importance of religion. The US is somewhat of an outlier with with highest per capita GDP in the world of ~$35k and about 65% of the population reporting that religion plays an important role in their lives.

For completeness sake, here is the heat map for life satisfaction, or long-term happiness:

Combining data on religion and GDP with data with data on long term happiness, we get roughly the following picture:

  1. Long term happiness correlates positively with per capita GDP.
  2. Long term happiness correlates negatively with importance of religion.
  3. Importance of religion correlates negatively with per capita GDP.

My guess is that the causal structure between the 3 variables: Religion (R), Happiness (H) and per capita GDP is the following:

When per capita  GDP is high, the basic needs of life are met, people are relatively comfortable, and therefore long term happiness is high, but for the same reason, very few people feel compelled to ask fundamental questions of the type that people turn to religion for. This has the effect of creating a negative correlation between Religion and Happiness. In the causal structure above, religion and happiness are related, but given per capita GDP, they are independent.

Understanding Happiness

Happiness is as hard to define as it is to achieve. Everybody wants to be happy. Even masochists. I think it is best to use  a non-constructive definition:

Happiness is the goal that drives all human actions and desires.

If long term happiness if everybody's ultimate goal, then it is worth learning how to achieve long term happiness. In fact, if being happy is the ultimate goal (as opposed to say, being wealthy), then our education system should also be teaching us how to be happy over a life time, rather than purely technical or vocational skills. Simple GDP growth does not imply an increase in the happiness of a society -- as indicated by data from the last ~40 years in the US, comparing per capita GDP and happiness levels:

Screen Shot 2012-12-17 at 3.09.43 PM

While per capita GDP has risen more or less steadily, happiness levels have remained more or less stagnant in the last ~40 years.

Should countries develop public policy with the goal of making a society happier, rather than with the goal of increasing GDP? I think it is an idea worth exploring (Scandinavian countries seem to rank highest in in the world in happiness scores, despite high taxes). The government of Bhutan came up with the Gross National Happiness index, which measures the average life satisfaction of the citizens in a country.


This correlates well with health, access to education, and wealth (GDP per capita). At any given time, the relationship between average happiness of a country and per capita GDP seems to log-linear, meaning that happiness is roughly linear in the log of the per capita GDP.

Screen Shot 2012-12-15 at 6.07.29 PM

This is because in order to increase the happiness level of a society by 1 unit, the increase in wealth required is proportional to the current welath. For e.g., if the required amount of increase in personal wealth for a group with per capita income of $1000 is $x, then it is $10x for a group with per capita income of $10,000.

Near the end of this talk, Daniel Kahneman says that in a study done with the Gallup organization, he found that:

Below an income of … $60,000 a year, people are unhappy, and they get progressively unhappier the poorer they get. Above that, we get an absolutely flat line. … Money does not buy you experiential happiness, but lack of money certainly buys you misery.

Kahneman distinguishes between two types of happiness: that of the experiencing self and that of the reflecting self. It is possible to be happy in the experiencing self but have a poor happiness score when reflecting on a long time frame in the past, and vice-versa. For the type of happiness that measure life satisfaction in retrospect, there is no flat time -- i.e. it continues to increase with increasing wealth. I don't find this too surprising. It is the difference between short term and long term happiness. It is easy to be happy in the short term at the expense of the long term. On the other hand, tolerating displeasure during hard work in the present can have a huge payoff in long term happiness in the future.

In this TED talk, Dan Gilbert showcases his research that shows that happiness can be synthesized by individuals. So happiness is not some finite resource that needs to be distributed among people, instead one can simply choose to be happy, despite seemingly adverse conditions. This is fascinating, because it provides experimental evidence that happiness has to do not just with our external circumstances (such as GDP per capita), but also with how we process information in our minds. Several religions have the concept of expressing gratitude. The act of being grateful basically synthesizes happiness out of thin air.

Average age of first-time Mothers

The average age of first time mothers in the developed countries of the world has been rising for the last ~40 years.

Here is another plot that shows the rate of occurrence of Down Syndrome, a chromosomal defect, as a function of the age of the mother at the time of child birth.

The curve really starts to shoot up at 30. In the UK, the average age of a first time mother is 30 years. It is well known that the fertility rate in women decreases after the age of 30 and drops rapidly after 35. Older mothers are likely to find it harder to have a baby and if they do, then they run a higher risk of chromosomal defects. Given the possibilities of all these negative consequences, the increase in the average age is a bit disturbing. It seems like there is a hidden cost to more women working and for longer.

Why is it that women are waiting longer before having their first born despite the risks? Most of my hypotheses (of which more than one, or none, may be true) have to do with women working:

  • Having invested significantly into an education, greater number of women are entering the workforce, with the desire to be financially independent.
  • There is greater financial pressure in families for women to work.
  • The policies of workplaces in these countries are not favourable to childbirth. I can see this is true in the US, but I doubt this holds for Wester European countries, which I know have policies favorable to child birth.

One source of further information is the following map, showing the absolute increase, in years, in the age of a first time mother, over the last 40 years, state by state in the US:

This number is highest in the North Eastern states of NY, NJ, MA, CT, etc. The intensity of the colors in the map above correlates well with population density, and with economic activity in general (meaning more working women).  Here are two more plots I came across in a US based study done by P&G, that suggest that at least in the US, employer policies may be responsible.

What would a mother value most from their employers?:

How much guilt to mothers feel about work-life balance?:

In the eye of the Beholder

Why is the picture on the right more appealing than the one on the left?

What is it that we find more interesting about the picture on the right, compared to the one on the left? The picture on the left contains more information. So we are certainly not looking for more information. One might say we don't know how to interpret the image on the left into anything familiar, but it is television static. A more precise answer is given by Jurgen Schmidhuber, who argues convincingly that:

Artists (and observers of art) get rewarded for making (and observing) novel patterns: data that is neither arbitrary (like incompressible random white noise) nor regular in an already known way, but regular in way that is new with respect to the observer's current knowledge, yet learnable (that is, after learning fewer bits are needed to encode the data).

This explains the pictures on top. The picture on the left is not compressible because it is a matrix of uniformly random 0/1 pixels. The Monet on the right evokes familiar feelings, and yet adds something new. I think what Schmidhuber is saying is that the amount of compressibility should neither be too little, nor too much. If  something is not very compressible, then it is too unfamiliar. If something is too compressible, then it is basically boring. In other words, the pleasure derived first increases and then decreases with the compressibility, not unlike this binary entropy curve.

Let us ask the same question again for the following pair of images (you have to pick one over the other):

My guess is that most people will find the image on the right more appealing (it is for me at least). Please drop me a comment with a reason if you differ. When I look at the image on the right, it feels a little more familiar, there are some experiences in my mind that I can relate to the image - for example looking straight up at the sky through a canopy of trees (white = sky, black=tree leaves), or a splatter of semisolid food in the kitchen.

In order for an object to be appealing, the beholder must have some side information, or familiarity with the object beforehand. I learnt this lesson the hard way. About 2 years ago, I gave a talk at a premier research institution in the New York area. Even though I had received complements when I'd given this talk at other venues, to my surprise, this time, audience almost slept through my talk. I learnt later that I had made the following mistake: in the abstract I'd sent to the talk's organizer, I had failed to signal that my work would likely appeal to an audience of information theorists and signal processing researchers. My audience had ended up being a bunch of systems researchers. The reason they dozed through my talk was that they had just a bit less than the required background to connect the dots I was showing them.

It is the same with cultural side information or context -- the familiar portion of the object allows the observer to latch on. The extra portion is the fun. Without the familiar, there is nothing to latch on to. The following phrases suddenly take on a precise quantifiable meaning:

  • "Beauty lies in the eyes of the beholder": the beholder carries a codebook that allows her to compress the object she is observing. Each beholder has a different codebook, and this explains 'subjective taste'.
  • "Ahead of its time": Something is ahead of its time if it is very good but does not have enough of a familiar portion to it, to be appreciated by the majority of observers.

I can think of lots of examples of art forms that deliberately incorporate partial familiarity into them -- e.g. music remixes, Bollywood story lines. Even classical music first establishes a base pattern and then builds on top of it. In this TED talk, Kirby Ferguson argues that all successful creative activity is a type of remix, meaning that it builds upon something familiar.


  1. When writing a paper or giving a talk, always make sure the audience has something familiar to latch on to first. Otherwise, even a breakthrough result will appear uninteresting
  2. Ditto for telling a story or a joke, DJing music at a party, or building a consumer facing startup. Need something familiar to latch on to.
  3. In some situations it may be possible to gauge the codebook of the audience (e.g. having a dinner party conversation with a person you just met), to make sure you seem neither too familiar, nor too obscure.

Big data at Foursquare

Attended a meeting of the NY ML meetup. The speakers presented on the BIG DATA work happening at foursquare. I noticed that the speakers as well as the audience is heavily tilted towards programming rather than statistics/ml. Most questions were about hadoop, hive and mongo db. Hardly any questions about the statistical aspects. This could be because the audience understood the math behind recommendation systems quite well. One exception was the 'cold start' problem. One thing that surprised me was that Foursquare doesn't use matrix factorization but simpler nearest neighbor methods! I asked. Bell, koren and volinsky showed in the Netflix prize that latent models based on matrix factorization are better than NN.

Also felt that some of the questions were interesting ideas. For example, how they can collect implicit feedback by noticing what a user does after being offered recommendations. I thought this was a really neat idea for tuning the recommendations for a user -- someone in the audience asked if they were doing it, and they replied they aren't but they are thinking about it.

Voice morphing Why this is so cool (and dangerous): Imagine that I have a large enough sample of your voice (if I have access to some of your phone messages, or voice chats, this is easy). I can then go and build a model of your voice. Using the technique in the papers above, I can then transform my voice into your voice *in real time*. This means that I maybe able to defeat a speaker-recognition/speaker-verification system, thereby breaking biometric systems that rely on voice for authentication. For example, if the biometric system asks me to speak a particular random phrase or sentence, I speak into a computer, that modifies my voice to sound like your voice in software, and then plays it back to the biometric system. Biometric systems should have 2 or 3 factor authentication -- biometric alone are broken!