Monday, April 29, 2019

Mota Bhai, Let’s Listen to the Messages from NOTA

In India, NOTA, i.e., ‘None of the Above’, was introduced as a choice on the Electronic Voting Machines and Postal Ballots in the General Elections in 2014. It is widely accepted that a NOTA vote signals a lack of endorsement by the voter for any of the candidates contesting in a given constituency. Regarding NOTA, the Election Commission of India has clarified the following:
  1. Even in the extreme case when the NOTA votes in a constituency exceed the votes of every candidate, the candidate securing the highest number of votes will be declared the winner. In other words, as per the norms of first past the post, NOTA cannot be declared the winner.
  2. NOTA votes will not be considered for forfeiture of deposits in an election. Please note that in an election in India, if a candidate fails to secure at least 1/6th (16.67%) of the valid votes cast, he or she forfeits the security deposit.     
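
The two clarifications above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample vote counts are hypothetical, and the sketch assumes that "valid votes" for the deposit rule means the votes polled by candidates, excluding NOTA.

```python
from typing import Dict


def declare_winner(candidate_votes: Dict[str, int], nota_votes: int) -> str:
    """Return the candidate with the most votes; NOTA can never be the winner."""
    # nota_votes is deliberately unused: it has no effect on who is declared winner.
    return max(candidate_votes, key=candidate_votes.get)


def forfeits_deposit(votes: int, candidate_votes: Dict[str, int]) -> bool:
    """True if a candidate polled less than one-sixth of the valid votes.

    Assumption: valid votes = sum of votes polled by candidates (NOTA excluded).
    """
    valid_votes = sum(candidate_votes.values())
    return votes * 6 < valid_votes


# Hypothetical constituency, not real election data.
votes = {"A": 400, "B": 350, "C": 100}
nota = 500  # more than any single candidate

winner = declare_winner(votes, nota)
forfeited = [name for name, v in votes.items() if forfeits_deposit(v, votes)]
print(winner, forfeited)
```

In this made-up example NOTA polls more than every candidate, yet candidate A is still declared the winner, and only candidate C falls below the one-sixth threshold.
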
Much has been said about NOTA, and political scientists have put forward different viewpoints on it. I am not discussing or debating these viewpoints. In this blog, I will analyze the NOTA votes cast in 2014 and try to draw some conclusions objectively.

In the general elections in 2014, roughly 1.08% of the votes cast were NOTA. This amounts to 5.99 million votes out of a total of 547 million votes. Some of the basic statistics pertaining to NOTA votes are in the table below.

 Table I: Summary Statistics of NOTA Votes by Constituency, 2014 General Elections

As we can see, the average NOTA share in a constituency was 1.12%. In a large democracy like India, which has more than 800 million voters, such NOTA votes represent a very small percentage. Therefore, there could always be an attempt to brush the NOTA votes under the carpet or to treat them as having nuisance value at best. However, we can also see from the table above that the maximum NOTA share in a constituency was slightly higher than 5% and that the 90th percentile is 2.4%. As with any statistical analysis, it is important to look at the frequency distribution instead of just the mean. The frequency distribution of the percentage of NOTA votes in all Lok Sabha constituencies is presented below in Chart A.

Chart A: Frequency distribution of the percentage of NOTA votes in All Lok Sabha Constituencies
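
A frequency distribution like the one in Chart A can be built by bucketing each constituency's NOTA percentage into fixed-width bands. The sketch below is a minimal illustration with made-up numbers; the 0.5-point band width and the function names are assumptions, not the exact settings used for the chart.

```python
from collections import Counter


def bucket_label(pct: float, width: float = 0.5) -> str:
    """Label the half-open band [lo, lo + width) that a percentage falls into."""
    lo = int(pct // width) * width
    return f"{lo:.1f}-{lo + width:.1f}%"


def frequency_distribution(nota_pcts, width: float = 0.5) -> Counter:
    """Count how many constituencies fall into each NOTA-share band."""
    return Counter(bucket_label(p, width) for p in nota_pcts)


# Hypothetical NOTA shares (in %), not the 2014 results.
sample = [0.4, 0.7, 0.9, 1.1, 1.2, 1.4, 2.1, 2.6, 5.0]
dist = frequency_distribution(sample)
for band in sorted(dist, key=lambda b: float(b.split("-")[0])):
    print(band, dist[band])
```

Note that these bands include their lower edge and exclude their upper edge, so a statement such as "1.5% or lower" would need the opposite convention at the boundary.
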
 
From the frequency distribution we can see that in the 2014 General Elections, in 417 constituencies out of a total of 543 the NOTA votes were 1.5% or lower. In addition, there are 25 constituencies in which the percentage of NOTA votes was 2.5% or higher. I decided to look at the characteristics of these constituencies which recorded much higher than average NOTA votes. Table II below shows the top ten constituencies with the highest percentage of NOTA votes.

Table II: Top 10 Constituencies which recorded the highest percentage of NOTA Votes in 2014 General Elections 
 
It is quite apparent from the table above that these top ten constituencies are all in the reserved category and 8 out of these 10 are reserved for Scheduled Tribes. In addition, of the top 25 constituencies by highest percentage of NOTA votes, 17 are reserved: 3 for Scheduled Castes and 14 for Scheduled Tribes. And of the top 50, 30 are reserved: 7 for Scheduled Castes and 23 for Scheduled Tribes. One can thus conclude that in the constituencies reserved for Scheduled Tribes, the percentage of NOTA votes has been higher than the norm. It is also noteworthy that the Nilgiris constituency in Tamil Nadu, which recorded the second highest proportion of NOTA votes in the country, is the home of the Badaga tribe. This tribe has been agitating to be recognized as a Scheduled Tribe for a long time.
I decided to go one step further and analyze the frequency distribution of NOTA Vote % for the reserved and general constituencies. The objective was to evaluate if the distribution of seats with different NOTA Vote percentages is markedly different in reserved seats versus the general category ones. From the distribution (see Chart B), it is quite evident that the pattern of NOTA votes cast in seats reserved for Scheduled Tribes is significantly different from that in the general seats as well as those seats reserved for Scheduled Castes.

Chart B: Frequency distribution of the percentage of NOTA votes for General and Reserved Constituencies
 
Let’s take for example the constituencies where the percentage of NOTA votes was between 2.0 and 2.5%. As evident from the chart above, in less than 10% of the general constituencies and those reserved for Scheduled Castes, the proportion of NOTA votes was in the range 2.0 to 2.5%. By contrast, slightly more than 25% of the constituencies reserved for Scheduled Tribes recorded NOTA votes in that range. Next, let us consider those constituencies where the percentage of NOTA votes was 0.5 to 1.0%. In the general constituencies and the ones reserved for Scheduled Castes, more than 35% of the constituencies recorded NOTA votes between 0.5 and 1.0%. By contrast, only about 5% of the constituencies reserved for Scheduled Tribes recorded NOTA votes in that range.
Finally, we look at the proportion of NOTA votes in constituencies reserved for Scheduled Tribes in all the states where there is at least 1 seat reserved for the Scheduled Tribes. We then compare the proportion of NOTA votes in these reserved constituencies to the proportion of NOTA votes in general constituencies and the ones reserved for Scheduled Castes (see Table III below).

Table III: Comparison of NOTA vote % in constituencies reserved for ST versus other constituencies
 
From this table it is evident that in every single state where there is more than 1 seat reserved for the Scheduled Tribes, the proportion of NOTA votes in those seats has been significantly higher than in the other constituencies.
In Gujarati, the elder brother is affectionately referred to as ‘Mota Bhai’. I want to borrow this phrase and make an impassioned plea not to ignore the NOTA votes in the constituencies reserved for the Scheduled Tribes. Even with a low percentage of NOTA votes, the Scheduled Tribes in India are conveying a message. It could be that they are genuinely unhappy with the candidates who are in the fray and hence decide to cast a NOTA vote. But I fear that if they are casting a NOTA vote because they feel marginalized in the economy and polity of modern India, then it is unfortunate and is probably leading to an incendiary situation. Mota Bhai, the least we can do is to understand the root cause and try to address it.

Sunday, April 28, 2019

The Consumer Centric Behavior of Indian Stock Markets

Gross Domestic Product (GDP) is the sum of Consumption (C), Investment (I), Government Spending (G) and net exports, i.e., the difference of Exports and Imports (X – M). Mathematically, GDP = C + I + G + X – M. Different countries exhibit different characteristics in GDP and GDP growth. For example, in the United States, the GDP is primarily fueled by consumption. Consumption accounts for about 70% of the total GDP in the United States. On the other hand, in China, investment and exports contribute a significant portion of the country’s GDP. In India, GDP is primarily driven by consumption as seen in Chart A below.

Chart A: GDP and Household Consumption in India (1960 – 2017) Source: World Bank

In 2017, India’s GDP was US $2.66 trillion of which US $1.488 trillion came from household consumption - a contribution of 56%. One could argue from the chart above that over the years, the contribution of consumption to GDP has gone down significantly and the argument is true. In 1990, household consumption accounted for 67% of India’s GDP whereas in 2017, the contribution was 56%, a drop of 11 percentage points in 27 years. However, one has to understand that from 1990 to 2017, India’s GDP has increased from US $507 billion to US $2.66 trillion which is more than a 5-fold increase. Consequently, household consumption has also grown significantly from US $338 billion to US $1.488 trillion – a 4.4 times increase which is not insignificant by any means. (Note: All GDP and related figures are in constant 2010 US $ terms to eliminate any effects of foreign exchange translation).
In addition, we also have to look at how the population and per capita consumption have increased from 1990 to 2017. In 2017, India’s population was 1.339 billion as compared to 870 million in 1990 (see Chart B below). This amounts to an increase of 54% over 27 years or a population growth of about 1.6% per annum.
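
The growth arithmetic in the last two paragraphs can be checked with a short Python sketch. The helper name `cagr` is my own; the inputs are the GDP and population figures quoted in the text.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, end value and span."""
    return (end / start) ** (1 / years) - 1


# GDP in billions of constant 2010 US $ (1990 and 2017, as quoted above).
gdp_fold = 2660 / 507

# Population in millions (1990 and 2017, as quoted above).
pop_growth_total = 1339 / 870 - 1
pop_growth_annual = cagr(870, 1339, 27)

print(f"GDP multiple: {gdp_fold:.2f}x")
print(f"Population growth: {pop_growth_total:.1%} total, {pop_growth_annual:.2%} per annum")
```

The compound rate is used here rather than simply dividing 54% by 27, since population growth builds on itself from year to year.
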

Chart B: Population in India (1960-2017) Source: World Bank

During the same period from 1990 to 2017, per capita household consumption in India has increased from US $388 to US $1,111 (see Chart C below). It is also interesting that the rate at which per capita household consumption has increased has itself accelerated. Between 2000 and 2010, per capita household consumption increased by 47% whereas the same increase between 1990 and 2000 was 31%. In the seven years between 2010 and 2017, per capita household consumption has increased by 49%. At this pace, in 2020 per capita household consumption in India might be 70% more than what it used to be in 2010.
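
As a rough check on the "at this pace" projection, the sketch below converts the 49% growth over seven years into an implied annual rate and extends it to ten years. It assumes a constant compound rate, which is a simplification; the function name is hypothetical.

```python
def extrapolate(growth: float, years_observed: int, years_target: int) -> float:
    """Scale cumulative growth to a longer window, assuming a constant annual rate."""
    annual = (1 + growth) ** (1 / years_observed) - 1
    return (1 + annual) ** years_target - 1


# 49% growth between 2010 and 2017, as quoted above.
implied_annual = 1.49 ** (1 / 7) - 1
growth_2010_2020 = extrapolate(0.49, 7, 10)

print(f"Implied annual growth: {implied_annual:.2%}")
print(f"Projected 2010-2020 growth: {growth_2010_2020:.1%}")
```

Under this assumption the ten-year figure comes out somewhat above 70%, so the estimate in the text is, if anything, on the conservative side.
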

Chart C: Per Capita Household Consumption in India (1960 – 2017) Source: World Bank
 
Based on the analysis thus far, one can safely arrive at two important conclusions:
  • In India, household consumption is a major driver of GDP.
  • In the coming years, per capita household consumption is expected to increase and could thus provide significant boost to the GDP.
The question then is how has the stock market in India reacted to increasing household consumption. After all, the financial markets are efficient and prices reflect all available information. Therefore, our hypothesis is that the companies in India that cater to household consumer demand should have done well.

In order to validate this hypothesis, we need to classify publicly traded companies. For this purpose, we used the FactSet Revere Business Classification System (RBICS). RBICS is a multi-level industry classification system and we have used the topmost level called ‘RBICS Economy’ for this exercise. Each of the RBICS Economies has been classified into one of the three Economic Sectors – Consumer, Infrastructure and Technology (see Table I below).

Table I: RBICS Economy and classification into Consumer, Infrastructure or Technology


Based on this classification, we then created two stock market indices of publicly traded companies in India - India Consumer Index and India Infrastructure Index. For each of these indices, we consider companies whose RBICS Economy Code falls in the corresponding economic sector as outlined in Table I. After all, the best way to assess the performance of a stock market or a segment of the stock market is to create an index and track the performance of the index over time. In financial engineering, this process is called ‘Back-testing’.

The specifications for these indices are as follows:
  • Number of constituents: 50
  • Currency: Indian Rupee
  • Review and Rebalancing Frequency: Quarterly on Mar 31st, June 30th, Sep 30th and Dec 31st
  • Weighting Methodology: Free-float Market Value Weighted
  • Eligibility Criteria:
    • Average monthly turnover (turnover is monetary value of stocks traded) over last 12 months is at least US $2 million
    • Average market capitalization over the last 3 months is at least US $250 million
  • Ranking:
    • The eligible companies are then ranked by free-float market capitalization and the top 50 are selected in each review period
Free-float Market Capitalization = Free-float Factor * Market Capitalization. Free-float Factor represents the proportion of shares that are tradeable and are not held by entities like government, promoters, holding company, joint venture, sovereign wealth fund, etc. The rationale is that these owners buy and hold the shares for long periods and these shares are not available for trading in the stock markets.
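
The selection and weighting steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual index code: the `Stock` class, its field names and the sample universe are hypothetical, and for brevity the same market capitalization figure is used for both the eligibility screen and the ranking.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stock:
    ticker: str
    market_cap_usd_mn: float            # average over the last 3 months
    avg_monthly_turnover_usd_mn: float  # average over the last 12 months
    free_float_factor: float            # tradeable proportion, between 0 and 1

    @property
    def free_float_cap(self) -> float:
        return self.free_float_factor * self.market_cap_usd_mn


def build_index(universe: List[Stock], n: int = 50) -> Dict[str, float]:
    """Screen, rank by free-float market cap, keep the top n and weight them."""
    eligible = [
        s for s in universe
        if s.avg_monthly_turnover_usd_mn >= 2 and s.market_cap_usd_mn >= 250
    ]
    selected = sorted(eligible, key=lambda s: s.free_float_cap, reverse=True)[:n]
    total = sum(s.free_float_cap for s in selected)
    if total == 0:
        return {}
    return {s.ticker: s.free_float_cap / total for s in selected}


# Hypothetical universe, not real companies.
universe = [
    Stock("AAA", 1000, 10, 0.5),   # free-float cap 500
    Stock("BBB", 400, 5, 0.75),    # free-float cap 300
    Stock("CCC", 200, 5, 1.0),     # fails the market cap screen
    Stock("DDD", 800, 1, 0.9),     # fails the turnover screen
    Stock("EEE", 300, 3, 0.5),     # eligible, free-float cap 150
]
weights = build_index(universe, n=2)
print(weights)
```

At each quarterly review the same function would be run on the refreshed universe, and the resulting weights would replace the previous ones.
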

The performance of the India Consumer Index is very impressive. The index was back-tested from Jan 1st, 2008 to Dec 31st, 2018 – a span of 11 years. The year 2008 was deliberately selected as the starting point to capture the global economic downturn precipitated by the banking crisis worldwide. Over these 11 years, the cumulative return of the India Consumer Index is 270% as compared to 77% for Nifty-50 and 78% for BSE Sensex (see Chart D and Tables II & III below).


Chart D: Comparative Performance of India Consumer Index and Nifty-50
 
Table II: Comparative Performance of India Consumer Index and Nifty-50
Table III: Top 20 constituents of India Consumer Index
By comparison, the India Infrastructure Index underperforms significantly. Over the same period Jan 1st, 2008 to Dec 31st, 2018, the cumulative returns are 26% and in addition, the volatility is higher than Nifty-50 and BSE Sensex by about 3% (See Chart E and Tables IV & V below).

Chart E: Comparative Performance of India Infrastructure Index and Nifty-50
Table IV: Comparative Performance of India Infrastructure Index and Nifty-50
Table V: Top 20 constituents of India Infrastructure Index

Conclusions:
  • In today’s world, financial markets worldwide are very efficient and they incorporate all available information into the prices of securities, and the Indian stock market is no exception. As is evident from the analysis, the Indian equity markets have reacted to the fact that household consumption is a major driver of the Indian economy and will continue to be so in the future.
  • Given the fact that household consumption is becoming such a major driver of the economy, it is worth giving a substantive tax cut to the middle class. That has the potential of driving up household consumption even higher and could spur the economy instantaneously. The skeptics will argue that such a move will result in inflation. The counter argument for that is in recent years in India, inflation has been historically low and this could be the best time to spur economic growth further by boosting household consumption through a substantive tax break.
  • Also, the Indian equity markets are not optimistic about the infrastructure sector. The prices in equity markets project future trends and based on such belief, one has to conclude that in the near future, things do not look very bright for companies in the infrastructure sector in India. There could however be potential value in this sector in the long run say, 20-30 years.

Disclaimer: This analysis is neither an endorsement nor a criticism of the economic and monetary policies of the Government of India or the policies followed by the governments in power since 1990. This study is purely an attempt to understand some of the drivers of the equity markets in India.

 

 

Sunday, April 10, 2016

The Thrill of Analytics in IPL - Part I: What is a Par Score in an IPL Match?

The Indian Premier League (IPL) started today. In 2008, when the inaugural edition of the IPL was launched, the razzle-dazzle of this shorter format of the game definitely caught the attention of cricket lovers. Some even argue that it roped a completely new generation of fans into its fold, particularly the younger generation who would rather indulge in the excitement of this shortest format of the game. Since its inauguration, the IPL has had its own share of problems. The most notorious amongst them was the controversy surrounding the spot-fixing and betting scandal of 2013. The betting scandal subsequently led to the suspension of two franchises - Chennai Super Kings and Rajasthan Royals. In spite of such hiccups, the games must go on and so did the IPL. In the 2016 edition of the IPL there are two new franchises – Rising Pune Super Giants and Gujarat Lions – and cricket lovers are awaiting another fun-filled tournament.
 
As for me, I have always loved watching all forms of cricket but T-20 and the IPL have always been intriguing because they offer the excitement of some phenomenal analytics. In 2011, while visiting London, I remember telling a British friend that T-20 cricket offers some incredible opportunities for sophisticated analytics. He was so amused (or rather shocked) that he choked on the red wine! He asked me if I also watch WWE which, I have to confess, I sometimes do late at night in my hotel rooms to keep myself awake. Skepticism aside, I am sticking to my point and, over the next 6 weeks while the IPL is in progress, I will write a few blogs on analytics surrounding this tournament. This one is the first in that series.
 
An important question that comes to my mind is ‘what is the par score in an IPL match’? In order to answer this question in a scientific manner, I collected data on all the IPL games since 2008 from www.espncricinfo.com. As a first step, I looked at the descriptive statistics of the first innings scores of all the IPL matches played since 2008 (see Chart A). One can see that the average score in the first innings has gone up since 2011 and that is a welcome sign. Overall it means that the scoring efficiency is increasing in this version of the sport. The average score from all seasons is 159 and in a naïve manner, we could infer that this is the par score. By the way, I would define the par score as one where the team has at least a 50% chance of winning the game.
 
Chart A: Descriptive Statistics of First Innings Scores in IPL
 
Now we know from Statistics that the average may not lie just in the center. If the distribution is skewed, the average is not necessarily a good measure of the central tendency and, in those instances, a more reliable measure is the median. In this case, the median of first innings scores is 160, which is close to the average. This suggests that the first innings scores are distributed roughly symmetrically (see Chart B).
 
Chart B: Frequency Distribution of Runs Scored in First Innings - Closely Follows a Bell Curve
 
The bell curve of first innings scores means that a team is equally likely to score above or below the mean. This curve however does not tell us the probability of winning given a first innings score. In order to get that answer, I bucketed the scores in intervals of 10 and, for each such interval, calculated the number of times the team batting first had won the match. I then calculated the probability of winning for each such interval (see Chart C). A few things are quite apparent from this chart:
  • No team has ever won a match in IPL scoring less than 100 runs.
  • Each time a team has scored 220 runs or more, it has always won the match.
  • The probability of winning increases with the first innings score and follows the shape of an S-curve.
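
The bucketing step described above can be sketched in Python as follows. The match list is made up for illustration and is not the ESPNcricinfo data; the function name is hypothetical.

```python
from collections import defaultdict


def win_probability_by_bucket(matches, width: int = 10) -> dict:
    """matches: iterable of (first_innings_score, batting_first_won) pairs.

    Returns {bucket_lower_bound: share of matches won by the team batting first}.
    """
    played = defaultdict(int)
    won = defaultdict(int)
    for score, batting_first_won in matches:
        lo = (score // width) * width  # e.g. 163 falls in the 160-169 bucket
        played[lo] += 1
        won[lo] += int(batting_first_won)
    return {lo: won[lo] / played[lo] for lo in sorted(played)}


# Hypothetical results, for illustration only.
sample = [(95, False), (152, False), (158, True),
          (163, True), (167, False), (224, True)]
probs = win_probability_by_bucket(sample)
print(probs)
```

With real data each bucket would contain many more matches, and sparsely populated buckets at the extremes should be read with caution.
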

Chart C: First Innings Score and Probability of Winning

I fitted the S-curve using the equation shown in the inset. By statistical measures (p-value, etc.), the fit of this S-curve to the underlying data is extremely good. Using this equation, one can verify that the score where the probability of winning is 50% is 163. In other words, the par score is 163. The probability of winning rises quite sharply as the score increases (see inset in Chart C). Whereas the probability of winning is 50% at 163, it rises to almost 60% for another 10 runs. This explains why the runs scored in the last 2 or 3 overs are so important.
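
To make the shape of such a curve concrete, here is a minimal sketch that assumes a logistic form. This is an assumption for illustration only: the fitted equation lives in the chart inset and is not reproduced here. The two parameters are simply backed out from the figures quoted above (50% at 163 runs, roughly 60% ten runs later).

```python
import math

PAR_SCORE = 163.0
# Steepness chosen so that 10 extra runs lift the probability from 50% to 60%.
# This is derived from the two quoted points, not from the original fit.
K = math.log(0.6 / 0.4) / 10


def win_probability(score: float) -> float:
    """Assumed logistic S-curve for the team batting first."""
    return 1.0 / (1.0 + math.exp(-K * (score - PAR_SCORE)))


for score in (140, 163, 173, 200):
    print(score, round(win_probability(score), 2))
```

A two-parameter curve pinned to two points will not reproduce the tails of the real data exactly, so it should be treated as a rough guide rather than a substitute for the fitted equation.
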

I also validated the outcome from the S-curve by calculating the average first innings score for each team and, then comparing the predicted winning ratio (based on the S-Curve formula) with the actual winning ratio (see Chart D). As expected, there are some deviations but, in general, the predicted values are quite close to the actual ratios. 
 
Chart D: Comparison of Predicted and Actual Winning Ratios
 
The readers now know how to calculate the probability of win for the team batting first. I encourage you to use this formula in this edition of the IPL and compare the predicted versus actual results. In the next blog, I will write on how the probability of the team which batted first changes once the other team starts batting.

Monday, November 30, 2015

Two Nations: One Fast and One Slow - What is the Driving Factor?

On a recent trip to London, UK, a friend of mine who is an Investment Banker stated that even though the Chinese economy is slowing down, the tapered growth is still quite meaningful to the entire world. His argument – at present, Chinese economy is so large that even if the growth were to slow down to 6 or 7%, it would still create a lot of demand worldwide. A few friends argued with him but, I believe those arguments were prompted more by Guinness than facts. On the way to my hotel, the discussions at the dinner table prompted me to analyze the growth of the Chinese economy and compare it with that of the Indian economy. As soon as I was back in my hotel room, I downloaded economic data from the World Bank database and started analyzing.

As I delved deeper into my analysis, a few things popped up that are worth mentioning. Firstly, between 1991 and 2014, the Chinese economy has grown at an average rate of 10.1% as compared to 6.6% for the Indian economy. Also, in this span of 24 years, there have been 10 years when the GDP growth rate in China has exceeded 10% (see Chart A below). By contrast, in India, the GDP growth rate has exceeded 10% only once - in 2010. It is important to point out that in China the growth rates in excess of 10% have largely come in consecutive years – from 1992 to 1995 and from 2003 to 2007. Obviously, by all standards, this is a remarkable performance.

Chart A: China & India - GDP Growth Rates (1991-2014), Data Source: World Bank

In addition, the GDP growth rates in China have been more predictable as compared to India. It is easy to prove this assertion solely on the basis of historical data. If we were to plot the current year’s GDP growth rate against the prior year’s GDP growth rates, we can see that in the case of China, the R-Squared is 0.37 whereas, for India the R-Squared is 0.05 (see Chart B). In statistical terms, the higher the R-Squared, the higher the predictability from one year to the next. Obviously, in the case of China, a higher R-squared means that the GDP growth rates in consecutive years are similar and not as volatile as is the case with India.
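
The year-on-year comparison described above amounts to regressing each year's growth rate on the previous year's and reading off the R-squared. A minimal sketch, using two made-up series rather than the World Bank data (function names are my own):

```python
def r_squared(x, y) -> float:
    """R-squared of a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def lag1_r_squared(growth_rates) -> float:
    """Pair each year's growth (y) with the prior year's growth (x)."""
    return r_squared(growth_rates[:-1], growth_rates[1:])


# Hypothetical growth rates in %, for illustration only.
smooth = [9.0, 9.5, 10.0, 10.5, 10.0, 9.5, 9.0, 8.5]
choppy = [5.0, 8.0, 7.0, 4.0, 6.0, 9.0, 5.0, 6.0]

print(round(lag1_r_squared(smooth), 2), round(lag1_r_squared(choppy), 2))
```

One caveat worth keeping in mind: R-squared ignores the sign of the relationship, so a series that flips sharply up and down every year would also score highly. It is worth checking the slope as well before reading a high value as "smooth".
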

Chart B: Relationship between Current and Prior Year's GDP Growth Rates in China & India, Data Source: World Bank

The effect of predictably high GDP growth rates has been that the Chinese economy has grown much faster than the Indian economy during the period 1991 to 2014. In 1991 the size of the Chinese economy was 1.63 times the Indian economy. In 2014, the Chinese economy was 3.3 times the Indian economy (see Chart C). Now here comes the interesting part – if the GDP growth rate in India accelerates to 10% per annum while that in China slows down to 5% per annum, the Chinese economy will still be twice the size of the Indian economy 10 years later in 2025.
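
The "still twice the size" claim is simple compounding and can be verified with a short sketch. The growth rates are the hypothetical 5% and 10% from the paragraph above; the function names are my own.

```python
import math


def projected_ratio(ratio_now: float, growth_a: float, growth_b: float, years: int) -> float:
    """Size of economy A relative to B after `years` of constant growth."""
    return ratio_now * ((1 + growth_a) / (1 + growth_b)) ** years


def years_to_parity(ratio_now: float, growth_a: float, growth_b: float) -> float:
    """Years until B catches up with A, assuming B grows faster than A."""
    return math.log(ratio_now) / math.log((1 + growth_b) / (1 + growth_a))


ratio_after_10 = projected_ratio(3.3, 0.05, 0.10, 10)
parity_years = years_to_parity(3.3, 0.05, 0.10)

print(f"China/India ratio after 10 years: {ratio_after_10:.2f}")
print(f"Years to parity at these rates: {parity_years:.1f}")
```

Even with a five-point growth advantage held constant, it would take India roughly a quarter of a century to close a 3.3x gap.
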

Chart C: China & India - GDP in Constant 2005 US Dollars (1991-2014), Data Source: World Bank

When talking of growth, it is important to look at the effect of Foreign Direct Investments in China and India. It’s a well-known fact that the policies of economic liberalization in both these countries have helped attract foreign capital which consequently has fueled growth. The question is – on a comparative basis, what is the quantum of FDI that these two countries have attracted. Well, the level of FDI in India pales as compared to that in China (see Chart D). Since 1995, there have been several years when the FDI in China has been ten times or more than that in India.

 Chart D: China & India - Foreign Direct Investment (1991-2013), Data Source: World Bank
 
We have looked at the GDP and GDP growth rates in the two countries as well as the levels of FDI since 1991. What needs to be statistically analyzed next is the relative impact of FDI on GDP growth. I tried to study the relationship between the increase in GDP in a given year (in other words the growth in GDP) versus the FDI in the previous year. The underlying assumption is that once the investments are made, it takes at least a year for their impact to show up in the production of goods and services. Also, in order to account for the difference in scale between GDP and FDI, I used the log-log scale, i.e. both the x and y axes were transformed to log scale. As you can see in Chart E, the relationship between the increase in GDP in a given year and FDI in the previous year is quite strong for both countries as evidenced by high R-squared values. Nonetheless, for China, the relationship is stronger. Also, in the case of China, the slope of the line defining this relationship is steeper as compared to India. Since both axes are on a log scale, the slope measures responsiveness in percentage terms: a given percentage increase in FDI is associated with a larger percentage increase in the following year's GDP increment in China than in India. One could draw a corollary that for every unit of FDI in China, the revenue output is higher than in India. This conclusion is quite significant because it helps explain why, between the two countries, China is a natural destination for FDI.
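
The lagged log-log regression described above can be sketched as follows. The data are synthetic, constructed so that the GDP increase equals 2 x (prior-year FDI) raised to the power 1.5, which lets us confirm that the fit recovers a slope of 1.5. The function names are hypothetical, and the sketch assumes every GDP increase and FDI value is positive so that logs can be taken.

```python
import math


def ols(x, y):
    """Slope and intercept of a simple least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loglog_lagged_fit(fdi, gdp):
    """Regress log(GDP[t] - GDP[t-1]) on log(FDI[t-1]); series aligned by year."""
    xs = [math.log(f) for f in fdi[:-1]]
    ys = [math.log(gdp[t] - gdp[t - 1]) for t in range(1, len(gdp))]
    return ols(xs, ys)


# Synthetic series: each year's GDP increase is 2 * (prior-year FDI) ** 1.5.
fdi = [1.0, 2.0, 4.0, 8.0, 16.0]
gdp = [100.0]
for f in fdi[:-1]:
    gdp.append(gdp[-1] + 2 * f ** 1.5)

slope, intercept = loglog_lagged_fit(fdi, gdp)
print(round(slope, 3), round(math.exp(intercept), 3))
```

In a log-log fit the slope is an elasticity, so comparing slopes across two countries compares percentage responses rather than absolute dollars per dollar of FDI.
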


Chart E: Relationship between FDI and GDP in China and India, Data Source: World Bank

At last, the million dollar question is whether we can prove statistically that the difference in GDP growth between China and India is influenced by the difference in FDI. I took the difference in FDI between China and India in the prior year on the x-axis. The y-axis represents the difference between China and India in the increase in GDP in the current year as compared to the prior year (see the formula for the Y-axis in Chart F below). As shown in Chart F, the scatterplot shows an upward trend. Also, a polynomial of second degree fits the data quite well with an R-squared of 0.77. In other words, we have proven that the increased GDP growth in China as compared to India is fueled by increased FDI inflows. The moral of the story – FDI is the fuel for growth in both these countries and the more you can attract FDI, the better it is for the economies.

Chart F: Relationship showing increased amount of FDI fueling increased GDP growth
 

Saturday, February 21, 2015

India versus South Africa: It will be one cracker of a game

Tomorrow, in the ICC World Cup, India will play South Africa in Melbourne. After India’s performance against Pakistan a week ago, we are expecting that the men in blue will continue to maintain their momentum. However, South Africa is a markedly different opponent from Pakistan and if India are to win this one, they will have to fire on all cylinders.

India has played South Africa in the World Cup thrice before - in 1992, 1999 and 2011 - and has never won against the Proteas. In 1992, at the Adelaide Oval, the match was curtailed to 30 overs each because of rain. Azharuddin scored 79 and Kapil Dev supported him with a quick-fire 42 off 29 balls to help India post a respectable score of 180. But the Proteas’ opening pair of Andrew Hudson and Peter Kirsten built a strong partnership and took the game away from India. South Africa won that game by 4 wickets with 5 balls to spare. In 1999, India played South Africa during the English summer in Brighton, a relatively smaller venue. In that game, Ganguly anchored the innings with a patient 97. He was run out by a throw from Jonty Rhodes, probably one of the best fielders in the history of cricket. India managed to score 253 and Srinath got two early wickets – both the openers Herschelle Gibbs and Gary Kirsten were sent back in the 7th over for 22 runs. However, Jacques Kallis stood tall on that day, scored a marvelous 96 off 128 balls and took the game away from India. South Africa won by 4 wickets with 2 overs and 4 balls to spare. I still have bad feelings about India losing to South Africa in 2011. In that game in Nagpur, the openers Sehwag and Tendulkar gave India a dream start – India were 142 in 17.4 overs when Sehwag was bowled by Faf du Plessis. Tendulkar scored a century (in cricketing terms a Nelson, i.e., 111) in that game and built a strong foundation with Gambhir after Sehwag’s departure. But India lost 9 wickets in a span of 9 overs and were restricted to 296. In contrast, South Africa started slowly but eventually accelerated towards the end with some powerful hitting from JP Duminy, Johan Botha and Robin Peterson, and won by 3 wickets with 2 balls to spare.
I remember being extremely upset after that game because after such a wonderful performance of the Indian opening pair, it hurts to lose a game because of such a colossal collapse of the middle order.

If we were to analyze the batting and bowling performance of the Indians and the Proteas in the three World Cup fixtures in 1992, 1999 and 2011 by taking the historical performance of the respective players based on all the games played up to the meet, we can clearly say that in 1992 and 1999, South Africa had a distinct advantage. In 2011, the statistical advantages were split – India had the advantage in bowling whereas South Africa excelled in batting (see Chart A below).
Chart A: Comparative advantage of India and South Africa in previous World Cup encounters

The story of 2015 is that India has a small advantage in batting but South Africa has a significant advantage over India in bowling. The economy rate of the Proteas is half a run better per over, which means that in a 50 over game, they have an advantage of about 25 runs. Also, South Africa has a far better strike rate than India – on average they take 6 fewer balls to take a wicket compared to the Indians.
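
For readers less familiar with the bowling metrics used above, here is a small sketch of how they are computed and how an economy-rate gap translates into runs over a full innings. The function names are my own and the sample figures are made up.

```python
def economy_rate(runs_conceded: int, balls_bowled: int) -> float:
    """Runs conceded per six-ball over."""
    return runs_conceded / (balls_bowled / 6)


def bowling_strike_rate(balls_bowled: int, wickets: int) -> float:
    """Balls bowled per wicket taken (lower is better)."""
    return balls_bowled / wickets


def run_advantage(economy_gap: float, overs: int = 50) -> float:
    """Runs saved over an innings by a bowling side with a better economy rate."""
    return economy_gap * overs


# A gap of half a run per over across a 50 over innings.
advantage = run_advantage(0.5)
print(advantage)

# Hypothetical bowler: 50 runs conceded and 2 wickets in 10 overs (60 balls).
print(economy_rate(50, 60), bowling_strike_rate(60, 2))
```
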

I also compared the individual performances of the batsmen from the two teams (see Chart B below). It is pretty obvious that AB de Villiers, Hashim Amla and David Miller are the danger men. De Villiers has a strike rate close to 100 and an average of 52 runs per game. Hashim Amla has the highest average amongst all the batsmen from both the teams and similarly, David Miller has the highest strike rate amongst them all. India has to restrict at least 2 of these 3 if they are to give themselves a decent chance to win.

Chart B: Comparative Stats of Indian and South African Batsmen

The analysis of bowling figures for the two teams is in line with what we already know. The quartet of Dale Steyn, Morne Morkel, Vernon Philander and Imran Tahir are the best amongst the bowlers from both the countries (see Chart C below).

Chart C: Comparative Stats of Indian and South African Bowlers

As I have said above, the key to winning this game is that India has to fire on all cylinders, which implies a solid opening partnership followed by middle order consolidation, restricting at least 2 out of 3 amongst the trio of de Villiers, Amla and Miller, maintaining the line and length in bowling and refraining from giving too many loose deliveries. India has won against South Africa in recent times. In the Champions Trophy in England, India beat South Africa by 26 runs. Shikhar Dhawan scored 114 and together with Rohit Sharma, the opening pair put up 127 runs. A late innings cameo from Ravindra Jadeja (47 off 29 balls) helped India post a score of 331 runs. Even though de Villiers scored 70, Amla was restricted to 22 runs and David Miller was out for a duck. India's batting and bowling performance in that game is exactly in line with my analysis here and the result was in India’s favor.

I am forecasting one cracker of a contest and I believe India can win against South Africa just like it did at Sophia Gardens, Cardiff during the Champions Trophy in 2013.

Best of Luck Men in Blue!

Sunday, December 28, 2014

Literacy in India Sixty Years After Independence - Some Interesting Observations

India is home to about 1/6th of the world’s population and as such, all measures of quality of life, namely life expectancy, infant mortality, literacy, per capita income, nutrition, etc., are of paramount importance in today’s globalized economy. A lot has been said and written about illiteracy in India and since Independence, a number of initiatives have been launched by the central as well as various state governments to deal with this malaise. National Literacy Mission, Sarva Siksha Abhiyaan, the Midday Meal Program in Tamil Nadu and the 1 Rupee grant per day for school going children in Bihar are some of these initiatives. Though India has taken significant strides in the eradication of illiteracy, we all know that much remains to be done.

In this blog, I will present some interesting facts borne out by the data on literacy rates in India. I happened to download the 2011 Census data from the website www.censusindia.gov.in and thought of performing some analysis on literacy rates. Personally, I did not expect to uncover much from this exercise but, as I delved deeper into the data, a number of very interesting facts stood out.

If we try to analyze the literacy rates and poverty levels in various states, a very interesting pattern emerges. It is well known that poverty results in increased levels of school dropouts because families in poor economic conditions prefer that their children help in augmenting income. Consequently, school dropouts result in illiteracy, and illiteracy definitely does not aid in poverty alleviation. Thus the vicious cycle of illiteracy and poverty continues. Let’s analyze how serious this vicious cycle is in India. The plot of illiteracy versus percentage of people below the poverty line shows a clear relationship between illiteracy and poverty (see Chart A below). As you can clearly observe, higher levels of poverty go hand in hand with higher levels of illiteracy. There are three distinct clusters in this chart. Such clustering is performed in a scientific manner using a statistical method called K-Means. The first cluster (Cluster 1) is the one with poverty levels and illiteracy that are higher than the norm. Some of the states in this cluster, namely Bihar, Uttar Pradesh, Madhya Pradesh and Odisha, are populous states. In fact, the states in this cluster represent a whopping 42% of India’s population. So in summary, one can say that amongst 42% of India’s population, higher levels of poverty are driving illiteracy.


Chart A: Relationship between Illiteracy and Poverty
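
For readers curious about how K-Means forms the clusters in a chart like this, here is a minimal sketch in Python. Please note that the (poverty %, illiteracy %) pairs below are made-up illustrative numbers and not the Census figures, and the way the starting centres are picked (evenly spaced after sorting by poverty) is my simplifying assumption so that the example is repeatable. Statistical packages normally use randomized starting points.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    # Assumption for this sketch: deterministic starting centres taken at
    # evenly spaced positions after sorting the points by the first coordinate.
    ordered = sorted(points)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so we have converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels

# Hypothetical (poverty %, illiteracy %) pairs, NOT actual Census data.
points = [
    (8, 12), (10, 15), (12, 14),   # low poverty, low illiteracy
    (20, 24), (22, 27), (25, 25),  # middle of the pack
    (34, 36), (37, 38), (40, 35),  # high poverty, high illiteracy
]

centroids, labels = kmeans(points, k=3)
for point, label in zip(points, labels):
    print(point, "-> cluster", label)
```

The algorithm simply alternates between two steps, assigning every state to its nearest cluster centre and then moving each centre to the average of its members, until the assignments stop changing.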
 
Next, the analysis of the male-female literacy rate differential yields some disturbing facts. In a perfect world, the male-female literacy rate differential should be independent of the overall literacy rate. We know that gender bias exists in India, and I thought of analyzing how pronounced this gender bias is. The following chart depicts the relationship between the male-female literacy rate differential (Y-axis) and literacy rates (X-axis) in various states.


Chart B: Gender gap in literacy rates versus overall literacy
 
The conclusions from this chart are as follows:
  • There is a strong relationship between illiteracy and the gender gap in literacy rates. In statistical terminology, we use a technique known as Linear Regression to understand relationships between two data series. Here the orange line (also known as the Regression Line) depicts the relationship between the male-female literacy rate differential and overall literacy rates in various states. As is evident from this chart, the higher the literacy rate, the lower the gender gap in literacy and vice versa. This points to the fact that, to some extent, higher levels of illiteracy amongst the masses arise because a larger proportion of women are illiterate. In Statistics, we measure the strength of the relationship between two data series with a metric known as R-squared. R-squared can range from 0 to 1, and the closer its value is to 1, the stronger the relationship. In this case, with an R-squared of 0.45, I would classify the relationship as moderately strong.
  • There are some obvious outliers. States like Meghalaya, Nagaland, Mizoram, Punjab and Assam are far below the orange line. In other words, in these states the gender gap in literacy is relatively low as compared to their peers with the same level of literacy rates. In fact, Punjab and Haryana, two neighboring states, present a contrasting picture. They both have similar overall literacy rates: 67% in Punjab versus 65% in Haryana. Nonetheless, in Punjab the male-female differential in literacy rates is much lower (8%) than in Haryana (15%).
  • Three outliers above the orange line clearly stand out. These are Rajasthan, Dadra and Nagar Haveli, and Daman and Diu. In this state and these two Union Territories, the male-female differential in literacy rates is quite pronounced. Rajasthan poses a major challenge for literacy of women. In this state, the gender bias in literacy rates is the highest in the country. I thought of analyzing the gender gap in literacy for the age group 10 to 24 years (see Chart C below). This age group represents children and young adults who should be in middle school, high school or college. As is evident, among the major states in the union, Rajasthan is an anomaly with a significantly higher male-female literacy rate differential. Even amongst the urban population, this difference is quite significant as compared to the other major states.

Chart C: Gender gap in literacy rates versus overall literacy in age group 10-24
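
For those who would like to see what lies behind a regression line and R-squared of the kind used in the gender gap analysis above, here is a small sketch in Python using ordinary least squares. The literacy rates and gender gaps below are hypothetical numbers chosen purely for illustration and are not the Census figures, and the function names are my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y that is explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data, NOT actual Census figures:
# overall literacy rate (%) and male-female literacy gap (% points).
literacy = [55, 60, 65, 70, 75, 80, 85]
gender_gap = [20, 16, 18, 12, 13, 8, 9]

slope, intercept = fit_line(literacy, gender_gap)
r2 = r_squared(literacy, gender_gap, slope, intercept)

# Residual = actual gap minus the gap predicted by the line.
# A large negative residual means the point sits well below the line.
residuals = [y - (intercept + slope * x) for x, y in zip(literacy, gender_gap)]

print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, R-squared = {r2:.2f}")
```

A negative slope corresponds to the pattern described above, i.e., the gender gap shrinks as overall literacy rises. The residuals are one simple way of spotting outliers that lie far above or below the line.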
 
Some other interesting observations can be made if we analyze the literacy rates by age group for the general category of population, Scheduled Castes and Scheduled Tribes (see Chart D below). The gap in literacy rates for Scheduled Castes and Scheduled Tribes exists in all age groups and gets more pronounced in the higher age groups. This is somewhat expected because we know that a larger proportion of elderly people are illiterate as compared to the younger generation. What is troubling though is that even for the younger population, literacy amongst the Scheduled Castes and Scheduled Tribes lags behind the general population quite considerably. Take for instance the age group 25-29. This is the age when most people start their careers and unfortunately, in this very age group, the literacy rate amongst the Scheduled Castes is 10% lower than in the general category. For the Scheduled Tribes, the corresponding gap is 20%. Let me point out that in India, the Scheduled Castes and Scheduled Tribes represent 25% of the population and therefore, such differentials in literacy rates are definitely not helping the cause of inclusive growth.
 
 
Chart D: Literacy rates by various age groups for the general category, Scheduled Castes and Scheduled Tribes

Now, other than poverty, let me point out another side effect of illiteracy, i.e., infant mortality (see Chart E below). Based on the statistical measure of R-squared, the relationship between infant mortality and illiteracy can be classified as substantially strong. It is quite evident from this chart that higher levels of literacy help in curbing infant mortality rates, as seen in the states of Kerala and Goa, and the Union Territories of Andaman & Nicobar Islands and Lakshadweep.


Chart E: Relationship between infant mortality and literacy rates

In summary, I would state that in order to improve literacy rates in India, we should be focusing on the following:
  • Stress upon improving literacy in the impoverished states like Bihar, Uttar Pradesh, Madhya Pradesh, Odisha, etc. If special incentives have to be given to attract and retain students from poor families in schools and colleges, then so be it. We obviously cannot afford to have 42% of the population falling behind in literacy levels as compared to the rest of the country.
  • Initiatives are required especially in states like Rajasthan to narrow the gender gap in literacy levels. Again, half the population cannot be disadvantaged in terms of education if we are to promote inclusive growth. It is also conceivable that higher levels of literacy amongst women will reduce infant mortality on account of better family planning, immunization, hygiene, sanitation, etc.
  • Similarly, the literacy rates amongst the Scheduled Castes and Scheduled Tribes need to be improved significantly. Again, a quarter of the population in a developing country simply cannot fall behind in education.

Friday, December 26, 2014

Analytically Yours


I finally decided to write my own blogs. A number of friends, colleagues and well-wishers have been requesting me to write blogs related to the application of analytics. We all know that in today’s world, analytics and its application in business has become a hot topic. Obviously, this has been prompted by the movement that is popularly known as ‘Big Data’. I have spent quite a few years working in the field of analytics, and I consider myself extremely fortunate to have been able to work on some very interesting real-life problems in areas like investment banking, risk management, web-based marketing, predictive healthcare, etc. My association with analytics has inculcated a deep-rooted belief that advanced analytics should not just reside in the Ivory Tower of Statisticians but should be applied to the common person’s life.

I am of the firm opinion that in today’s world, solutions to the most complex problems require the application of technology, analytics and human resources. Let’s take the example of high school dropouts in developing or underdeveloped countries. There are lots of reasons for high dropout rates, namely the necessity to supplement family income, lack of infrastructure, inability to cope with the course load, lack of appreciation for the value of education, and so on. Technology can definitely solve the infrastructure issue to a large extent through web-based learning. Massive Open Online Courses (MOOCs) are already here and I expect the MOOC movement to gain rapid momentum in the coming years. Predictive analytics can add the missing touch in preventing dropouts by identifying those who are most likely to drop out, by clustering students based on their abilities and interests and proposing suitable MOOCs, by analyzing the performance of each student at a micro-level and proposing corrective action to the teachers, etc.

My blogs will focus on the application of analytics in all walks of life, not necessarily just business. The blogs that I will write here will focus on addressing the pressing needs of our society today namely – poverty alleviation, eradication of illiteracy, affordable healthcare, sanitation and drinking water, and safety and security of mankind. I will also be inclined to write about various political and social topics with pertinent facts highlighted by the underlying data. I am a keen follower of elections across the globe and particularly in the US, UK and India, and I do intend to write about electoral politics from time to time. I am not a commentator on political or social topics and would only like to focus on the findings based on thorough analysis of the data. Nonetheless, I expect these blogs to stir some thoughtful debates on topics of political and social importance.

At times I do intend to use sophisticated techniques for analyzing the data. But my attempt will always be to explain these techniques and the predicted outcomes in layman’s terms. After all, if analytics is to be used in daily life, we should be able to articulate it in a way that the common man understands and appreciates. At times, I also intend to write on topics that are less serious in nature, for example, the application of analytics in sports or in matching your tastes to movies or books. Hopefully, blogs on such topics will add the necessary spice and keep my audience interested.

I am in this for the long haul and I expect to get moral support and constructive feedback from the readers. Lastly to my audience - If you would like me to analyze and write on a particular topic, please suggest and I promise that I will seriously examine the feasibility of doing so.

Analytically yours,
Partha Sen