Creation of Standards for Social Media Corpora: a Digital Humanities Topic Par Excellence

Even though empirical research on computer-mediated communication (CMC) has a tradition of almost two decades, very few annotated CMC/social media corpora are available to the scientific community and the public. The main reason is the lack of standards and tools for collecting, representing, annotating and providing resources of this type. One crucial issue is the unclear legal status of CMC/social media data. Using the example of a legal opinion that was commissioned for the integration of an existing German chat corpus into CLARIN-D, the talk will outline this issue under German law and describe how it has been handled in the project. Another crucial issue is that, because of the distinct communicative characteristics of CMC/social media discourse, existing standards and tools for the representation and annotation of text corpora cannot be adopted for CMC/social media corpora without modification. The creation of standards and the adaptation of NLP tools for this new type of language resource is a digital humanities topic par excellence, since (1) it focuses on data that are born digital and (2) it requires a combination of expertise from the humanities and the computational sciences.