Overview of Indian datasets

Data Discovery: A rough guide to microdata in Brazil, China, India and South Africa has an excellent overview of the various available datasets in the four countries. Here’s the screenshot for India:



Hat tip: Markus Eberhardt’s fantastic MEDevEcon.


Text analysis of Rahul Gandhi’s interview

So, Arnab Goswami’s interview of Rahul Gandhi concluded a while ago and now that the transcript is online, it’s time to do some text analysis (I will leave the meta analysis to political commentators/analysts):

Total word count: 12720
Rahul’s word count: 7595 (60%)
Arnab’s word count: 5125 (40%)
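The shares above can be double-checked with a few lines of Python. This is only a sketch: the two counts are the ones reported in this post, while the `shares` helper is a name made up for illustration.

```python
# Word counts as reported above; the helper name is hypothetical.
word_counts = {"Rahul": 7595, "Arnab": 5125}


def shares(counts):
    """Return each speaker's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total_words = sum(word_counts.values())
print(total_words)          # total word count across both speakers
print(shares(word_counts))  # percentage share per speaker
```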

The most frequently used words by Rahul (after filtering out some commonly used words):
system (70)
people (66)
going (52)
party (50)
country (45)
want/wants/wanted (40)
thing/things (37)
congress (34)
power (32)
rti (32)
political (31)
think/thinks/thinking (29)
one (28)
issue (26)
riots (25)

2 word phrase frequency:
i am or i’m (70)
in the (57)
going to (44)
the system (43)
this country (39)
we have (38)
i have (33)
of the (32)
to do (29)

3 word phrase frequency:
the congress party (23)
in this country (22)
i want to (18)
we have to (13)

4 word phrase frequency:
we are going to (9)
are we going to (8)
in the congress party (8)

Rahul’s word cloud

Arnab’s word cloud

Note: Word clouds were created using Wordle, and the text analysis was conducted using Textalyser and ATLAS.ti. The list of English stopwords was taken from Ranks.nl. To download the data in the spreadsheet, click here. (Please click on ‘file’ > ‘download as’ to save a copy of the file on your computer.)
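For readers who would rather script this kind of count than use the tools named above, here is a minimal Python sketch. It is not how the numbers in this post were produced. The stopword set is a tiny stand-in (not the Ranks.nl list), the sample sentence is invented rather than taken from the transcript, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; an assumption, not the Ranks.nl list.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "in", "and", "i", "we", "this"}


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))


def top_words(text, stopwords=STOPWORDS, n=15):
    """Most frequent single words after dropping stopwords."""
    words = [w for w in tokenize(text) if w not in stopwords]
    return Counter(words).most_common(n)


def top_ngrams(text, size, n=10):
    """Most frequent phrases of `size` consecutive words (stopwords kept)."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(size)))
    return Counter(" ".join(g) for g in grams).most_common(n)


# Invented sample text, not a quote from the interview.
sample = ("We are going to change the system. "
          "We are going to empower people in this country.")

print(top_words(sample, n=5))
print(top_ngrams(sample, size=4, n=3))
```

Phrase counts keep the stopwords, as the lists above do ("in the", "of the"), while the single-word count filters them out.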

PS: In case you are wondering, Rahul Gandhi referred to himself in the third person 7 times; he rarely referred to his opponents by name (Akhilesh and Arvind Kejriwal had 0 references, while Modi had 3); and oh, the word empower, or a variant of it such as empowering/empowered/empowerment, had 23 occurrences.
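Counting a word together with its variants, as in the empower tally, can be sketched with a regular expression. The function name and the sample sentence below are invented for illustration; matching on a shared prefix is a simplification and would also catch unrelated words that happen to start with the same letters.

```python
import re


def count_variants(text, stem):
    """Count words that start with `stem`, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return len(pattern.findall(text))


# Invented sample text, not a quote from the interview.
sample = ("We must empower women. Empowerment matters, "
          "and empowering people is the goal.")

print(count_variants(sample, "empower"))
```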

Tools for data wrangling

Did you say you are interested in some data wrangling? Or perhaps some data scraping? Wait, you say you just want to learn how to clean data and maybe geocode it? For all this and much more, log on to School of Data now! You can even take a course online. The following are some of the recommended tools:

Extracting: Google Chrome scraper extension, Google Spreadsheets, ScraperWiki, gImageReader + Tesseract

Cleaning: OpenRefine, Spreadsheets, Nomenklatura

Analysing: Spreadsheets, R, Gephi

Presenting: TileMill, Fusion Tables, Gephi, Many Eyes, D3

Sharing: The Datahub, Google Docs, GitHub

Source: School of Data

PS: Other good places to check out are Codecademy and the resource page at Exposing the Invisible.

Slanted Reporting on Naxalism

Here’s Supriya Sharma reporting on the bias in news:

“In a study of more than 500 stories published in four newspapers in the year 2011, I found nearly half were simply accounts of violent events. An analysis of sources showed that 62 percent of the stories were based on information supplied by security personnel and government spokespersons. Only 5 percent of the stories quoted the Maoists. And just 5 percent gave voice to the villagers.” Source: Caravan

The study she refers to is called “Guns and Protests”, which she undertook at the Reuters Institute. The main finding that jumps out is that even a left-leaning newspaper like The Hindu gives little space to the voices of villagers who are in the midst of this conflict. The key table is:

(The percentage figures are slightly different because I think there might be some typos in her table, so I recalculated the percentages based on her raw numbers.)

To read more about the study click here.
To download the data in the spreadsheet, click here.
(Please click on ‘file’ > ‘download’ to save a copy of the file on your computer.)