Collection, storage and analysis of online teenage talk: assets and challenges

I will address a range of issues based on 10 years of experience with sociolinguistic research on informal computer-mediated communication (CMC) produced by youngsters. Starting from the two main datasets we are currently working with (corpus 2007-2013 and corpus 2015-2016), I will discuss some challenges in gathering data on the informants' social profiles, as well as some ethical issues. Next, I will turn to the consequences that the size and (often imbalanced) composition of CMC corpora have for data processing. To illustrate the challenges of the genre, I will briefly deal with a specific methodological issue: whether to operationalize the occurrence of CMC features as binary or as ordinal variables. Finally, while large corpora generally invite (and necessitate) quantitative data processing, I want to stress that supplementary qualitative research may be indispensable if we do not want to lose touch with CMC pragmatics.