Data Mining for the Social Sciences

By David Monaghan, co-author of Data Mining for the Social Sciences: An Introduction

This guest post is published in advance of the American Sociological Association conference in Chicago. Check back every day for new posts through the end of the conference on Tuesday, August 25th.

Many social scientists focus on qualitative methods for their inquiry and analysis. Do you have any examples of how qualitative researchers have employed data mining techniques to assist them in their work? 

Usually, in the social sciences, when we say “quantitative” we mean the analysis of numbers, and when we say “qualitative” we mean the analysis of words (and sometimes images, sound data, etc.). The fact is that just as our stocks of numeric data have exploded in recent decades, so too do we now have far, far more data pertaining to the social world that is “qualitative” in nature. One great example is Twitter feeds. At this point, over 300 million people worldwide use Twitter to discuss everything from the details of their personal relationships to politics and media, and to promote their own businesses or artistic careers. Twitter data can be “scraped” and then analyzed, and this is a tremendous amount of real-time data on the social world. This field is, at this point, wide open – we have only just begun to think about how this data might be best analyzed. The rules that we rely on in standard statistical analyses, which presume that data come from a random sample, clearly won’t hold very well here. And this is a big-data problem par excellence – with this amount of data, standard “qualitative” interpretive techniques won’t work either. This is an instance of the worlds of qualitative research and computationally intensive analytic methods meeting. And it is certainly not just Twitter – the same can be said of Google searches, of Facebook posts, and of the massive numbers of books and other texts which have been digitized.

What are some of the most compelling ways you’ve seen social scientists use data mining in their research?

Social scientists, as a whole, have been rather slow to embrace data mining. In all honesty, social scientists – and sociologists in particular – have spent far more time (and text) discussing the sociological implications of “Big Data” and computationally intensive methods, and the fact that social scientists should be getting into these areas, than they have spent actually performing useful or interesting analyses. At this point, the most interesting “social science” work done with data mining methods has been done by computer scientists. There are probably a number of reasons for this. Perhaps it is because social scientists are not comfortable with the methods themselves or the software necessary to use them. To some degree, these methods have been dismissed as unscientific “data dredging”. It is probably harder to get an article that uses these methods into a top journal, because the norms for how to present these sorts of analyses haven’t really been developed yet. But I think, most importantly, it is because the type of data we now use is neatly fitted to a certain type of social-science question, and in order to profitably use computationally intensive methods, we need to be using different sorts of data (particularly data that is very wide or long) and to be asking different sorts of questions.

What do you think are the most important lessons you have learned about data mining that you would like students of sociology to know?

I think there is a lot of mystique surrounding data mining, in the lay public and even among a lot of social scientists. Data mining methods are discussed as something almost magical, a way of “discovering structure in data” and uncovering otherwise hidden knowledge. At the same time and as a corollary, it is presumed that these methods are very abstract, difficult to understand, and difficult to use. I think the most important thing, first and foremost, is to puncture this mythology. Data mining methods are not magical ways of automatically uncovering knowledge. Like traditional techniques, they are computational tools, and what they tell you depends on what you tell them to do. And they are not particularly hard to understand or inaccessible. In fact, a fair number of methods – like decision trees or association rule mining, to give examples – use very simple algorithms. It is just that they perform fairly simple mathematical operations a huge number of times. And increasingly, software has been developed that makes these methods accessible to people other than computer scientists. Our world has, as has been noted ad nauseam, become much more data-driven than ever in the past. These sorts of methods are being increasingly applied to analyze the massive stocks of data we find in our possession. So it is all the more important that people understand them. The good news is, this is very, very possible.
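
To make the point about simplicity concrete, here is a minimal sketch of association rule mining restricted to pairs of items. It is an illustration only and is not taken from the book: the toy transactions, the function name pair_rules, and the two thresholds are all assumptions chosen for the example. The whole method amounts to counting how often items appear alone and together, then dividing.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set is one "transaction",
# for example the topics mentioned in a single post.
transactions = [
    {"politics", "election", "media"},
    {"politics", "election"},
    {"music", "media"},
    {"politics", "media"},
    {"politics", "election", "music"},
]


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    support    = share of transactions containing both items
    confidence = share of transactions with the antecedent
                 that also contain the consequent
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        item_counts.update(t)
        # Sort so each unordered pair is always counted under the same key.
        pair_counts.update(combinations(sorted(t), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


rules = pair_rules(transactions)
for x, y, s, c in rules:
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

On this toy data the only rules that clear both thresholds link "election" and "politics". Real association rule software handles larger item sets and prunes the search far more cleverly, but the underlying operations are the same counts and ratios, repeated many times.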

David B. Monaghan is a doctoral candidate in Sociology at the Graduate Center of the City University of New York, and has taught courses on quantitative research methods, demography, and education. His research is focused on the relationship between higher education and social stratification.