This chapter describes the TEI encoding mechanisms available for textual data that represents discourse from genres of computer-mediated communication (CMC). It is intended to provide the basic framework needed to encode CMC corpora.
While the term computer-mediated communication might be used broadly to describe all kinds of communications that are mediated by digital technologies (such as text on web pages, written exchanges in chats and forums, interactions with artificial intelligence systems, the spoken conversations in internet video meetings), for the purposes of these Guidelines we use the term to apply to forms of communication that share the following features:
Such communications may be expressed as posts (cf. 9.3.1. CMC Posts), utterances, onscreen activities, or bodily activities exerted by a virtual avatar.
The following kinds of platforms support CMC:
CMC supports multimodal expression combining text, images, and sound. Whereas early CMC systems (e.g. Internet Relay Chat, ‘IRC’ for short, the Usenet ‘newsgroups’, or even the Unix talk system) were completely ASCII-based, most CMC applications now permit combining media formats (e.g. written or spoken language with graphic icons and images) and mixing communication technologies on one platform (e.g. combined use of an audio connection, a chat system, and a 3D interface in which users control a virtual avatar as in many multiplayer online computer games or in virtual worlds).
This section describes the encoding mechanisms for the basic units of CMC and for their combined use to encode CMC data.
We use the term basic CMC unit to refer to a communication produced by an interlocutor to initiate or contribute to an ongoing CMC interaction or joint CMC activity. Contributions to an ongoing interaction are produced to perform a move to develop the interactional sequence, for instance to respond in chats or forum discussions. Contributions to joint CMC activities may not all be directly interactional; some may be part of a collaborative project of the involved individuals. Such collaboration could involve editing activities in a shared text editor or whiteboard in parallel with an ongoing CMC interaction (chat, audio conversation, or audio-video conference) in the same CMC environment in which these editing activities are discussed by the participants.
Basic units of CMC can be described according to three criteria:
A taxonomy of basic CMC units resulting from these criteria is given in the following figure.

The most important distinction in the CMC taxonomy concerns the temporal nature of units exchanged via CMC technologies. The left part of the taxonomy describes units that are performed (by a producer) and perceived (by a recipient) as a continuous stream of behaviour. Units of this type can be performed as
The right part of the CMC taxonomy describes units in which the production, transmission, and perception of contributions to CMC interactions are organized in a strictly consecutive order: The content—verbal, nonverbal, or multimodal—of the contribution has to be produced before it can be transmitted through a network and made available on the computer monitor or mobile screen of any other party as a preserved and persistent unit. We term this type of unit a post. Posts occur in two different variants:
Three of the four basic CMC units described above can be represented with models that are described elsewhere in the TEI Guidelines:
| CMC unit | Type of corpus data | TEI P5 element |
| --- | --- | --- |
| spoken utterance | transcription of speech | u |
| bodily activity | textual description | kinesic |
| onscreen activity | textual description | incident |
The u, kinesic, and incident elements are not limited to CMC, but may be used in CMC corpora to encode textual transcriptions of spoken turns and textual descriptions of bodily activity and onscreen activity. The CMC unit post, which is specific to CMC, is introduced in 9.3.1. CMC Posts.
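As a minimal sketch of how the three elements from the table might appear side by side in a transcribed session, consider the following fragment. The participant identifiers, the type value on div, and the transcribed content are invented for illustration only.

```xml
<div type="session">
  <!-- spoken utterance: transcription of speech -->
  <u who="#player1">shall we meet at the bridge?</u>
  <!-- bodily activity of a virtual avatar: textual description -->
  <kinesic who="#player2">
    <desc>avatar nods and walks towards the bridge</desc>
  </kinesic>
  <!-- onscreen activity of a user: textual description -->
  <incident who="#player1">
    <desc>opens the shared map</desc>
  </incident>
</div>
```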
This section describes elements, attributes, and models which are unique to CMC and the TEI CMC module.
While the concept of a post is not unique to computer-mediated communication (ask anyone who has posted a ‘lost cat’ sign in the local market), this chapter concerns itself only with postings within a framework of a CMC system. Thus the element post is unique to the encoding of computer-mediated communication (CMC).
Posts occur in a broad range of written CMC genres, including (but not limited to) messages in chats and WhatsApp dialogues, tweets in X (Twitter) timelines, comments on Facebook pages, posts in forum threads, and comments or contributions to discussions on Wikipedia talk pages or in the comment sections of weblogs.
Posts can be either written or spoken:
The element post may co-occur with u, kinesic, incident, or other existing TEI elements within a div, or directly within the body, and may contain headings, paragraphs, openers, closers, or salutations.
The post element is a member of several TEI attribute classes, including att.ascribed, att.canonical, att.datable, att.global, att.timed, and att.typed, and as such may take a variety of attributes.
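A simple written post might therefore look roughly like the following sketch. The identifier, the type value, the participant pointer, and the timestamp are hypothetical; the who and when attributes are the ones supplied by att.ascribed and att.datable.

```xml
<post xml:id="p001" type="chat" who="#u01" when="2022-03-30T14:05:00">
  <opener>
    <salute>Hi all,</salute>
  </opener>
  <p>is the meeting still on for tomorrow?</p>
  <closer>
    <signed>Sam</signed>
  </closer>
</post>
```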
| modality | written or spoken mode. Suggested values include: 1] written; 2] spoken (for audio (or audio-visual) posts) |
| replyTo | indicates to which previous post the current post replies or refers. |
| indentLevel | specifies the level of indentation of an item using a numeric value. |
The replyTo attribute is used to capture information drawn from the original metadata associated with a post that asserts to which previous post the current post is a response, or to which previous post it refers. This metadata is included by many, but not all, CMC environments, when the user executes a formal reply action (e.g., by clicking or tapping a reply button). This attribute should not be used to encode interpreted or inferred reply relations based on linguistic cues or discourse markers.
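The following sketch illustrates one way a formal reply relation might be recorded, assuming that the source metadata of the second message names the first as its target. All identifiers, timestamps, and message texts are invented.

```xml
<post xml:id="m01" who="#u01" when="2021-05-02T10:15:00">
  <p>Anyone up for lunch?</p>
</post>
<!-- the platform metadata records that m02 was sent via the reply button on m01 -->
<post xml:id="m02" who="#u02" when="2021-05-02T10:16:30" replyTo="#m01">
  <p>Yes, 12:30 at the usual place?</p>
</post>
```

A post that merely seems to answer an earlier one, without such metadata, would carry no replyTo attribute in this approach.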
In the CMC genre of wiki talk, users insert their contribution to a discussion by modifying the wiki page of the discussion, known as the talk page. Since there is no technical reply action available in wiki software, users apply textual indentation in the wiki code to indicate a reply to a previous message, and a threaded structure is formed by a series of such indentations. The attribute indentLevel records the level of indentation, that is, the nesting depth of the current post in such a thread-like structure (as defined by its author and in relation to the standard level of non-indentation, which should be encoded with an indentLevel of 0). It is used in wiki talk corpora but may also be used for other threaded genres, for example when HTML is used as a source.
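A short wiki talk thread could be sketched along these lines. The thread heading, user identifiers, and post texts are hypothetical, and the posts are encoded sequentially, with indentLevel preserving the nesting depth found in the source.

```xml
<div type="thread">
  <head>Sources for the history section</head>
  <!-- thread-initial post: not indented in the wiki source -->
  <post xml:id="wt01" who="#userA" modality="written" indentLevel="0">
    <p>The history section cites no sources at all.</p>
  </post>
  <!-- indented once in the wiki source -->
  <post xml:id="wt02" who="#userB" modality="written" indentLevel="1">
    <p>I have added two references.</p>
  </post>
  <!-- indented twice in the wiki source -->
  <post xml:id="wt03" who="#userA" modality="written" indentLevel="2">
    <p>Thanks, that looks much better.</p>
  </post>
</div>
```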
The attribute generatedBy is also unique to CMC encoding. But unlike modality, replyTo, and indentLevel, generatedBy is available not only on the post element, but on any of its descendants as well.
| generatedBy | (generated by) categorizes how the content of an element was generated in a CMC environment. Suggested values include: 1] human; 2] template; 3] system; 4] bot; 5] unspecified |
The generatedBy attribute may indicate, for post or any of its descendants, how the content transcribed in an element was generated in a CMC environment. That is, whether the source text being transcribed was created by a human user, created by the CMC system at the request of a human user (e.g., when the user activates a template that generates the content, such as in a signature), generated by the CMC system (e.g. a status message or a timestamp), or generated by an automated process external to the CMC system itself. This attribute is optional; when it is not specified on a post element its value is presumed to be unspecified; when it is unspecified on any descendant of post its value is inherited from the immediately enclosing element. In turn, if generatedBy is not specified on that element it inherits the value from its immediately enclosing element, and so on up the document hierarchy until a post is reached; the post either has a generatedBy attribute specified or its presumed value is unspecified.
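The inheritance rule can be illustrated with a small sketch; the identifiers and content are assumptions made for the example. In the first post the paragraph inherits human from the post, while the signature overrides it; the second post is a status message produced by the CMC system itself.

```xml
<post xml:id="wt10" who="#userA" modality="written" generatedBy="human">
  <!-- no generatedBy here: the value "human" is inherited from the post -->
  <p>I think the second paragraph needs a source.</p>
  <closer>
    <!-- signature inserted by the software when the user activates a template -->
    <signed generatedBy="template">UserA</signed>
  </closer>
</post>
<post xml:id="c01" modality="written" generatedBy="system">
  <!-- status message created by the CMC system, not by a user -->
  <p>UserB has joined the channel.</p>
</post>
```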
A list of suggested values for generatedBy follows:
In many CMC genres, posts may occur in a variety of ways: e.g. in a sequence or in threads, or grouped in some other way. For example, in chat communication such as WhatsApp, posts are part of ‘a chat’ of one user with another user or among a group of users. When an entire chat is saved, typically a ‘logfile’ of the chat is obtained from the CMC system and downloaded. Similarly, Wikipedia discussions occur on a talk page, which ultimately is a web page containing the user posts, sub-structured in threads. Likewise, YouTube comments occur on a webpage containing the YouTube video along with comment posts and posts replying to those comments displayed below the video. The video serves as a prompt for the whole discussion. In forum discussions, the prompt may be a news item, and in Wikipedia, an article may be viewed as the prompt for the discussion on the talk page associated with that article.
The corpus level is that of a corpus or collection of CMC texts of a particular genre, generally obtained from a particular CMC platform and sometimes from several platforms. This level may be represented by either a TEI element or a teiCorpus element. The teiHeader of the corpus (i.e., the teiHeader that is a child of the outermost TEI or teiCorpus) will contain metadata in its sourceDesc about the CMC platform(s). Metadata about the project responsible for collecting the data and building the corpus, if applicable, should be recorded as well.
The document level is that of a set of posts collected (or sampled) by a researcher for analysis. The posts of the document will often map directly to the set of posts grouped on an existing web page, thread, or document within a CMC environment. Within the CMC environment the document as such is often created by a particular user, thereby initiating the communication which other users may read, and to which some other users might contribute. This level will naturally be represented by the TEI element. The teiCorpus (or TEI) element that represents the corpus will contain one or more TEI elements as usual.
In the teiHeader of a document level TEI, the sourceDesc will contain metadata about the CMC document such as a title, its author or owner, its URL, the date of its creation, the date of the last change made to it, and other metadata that are available and to be recorded such as one or more categories associated with the document.
The document sometimes contains, or is associated with, a prompt such as a video or a news item, either provided by the initiating user herself or located elsewhere and referenced at the beginning of the document. In such cases, the teiHeader of the document should also contain metadata about this prompt.
The level of the individual post is naturally represented by the post element; its encoding is further described in section 9.3.1. CMC Posts. A TEI element will contain a number of post elements, which can be grouped or ordered in div elements representing sequences or threads (section 9.4.2. Sequences, Sections, Threads) if appropriate.
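The three levels (corpus, document, post) might be combined as in the following skeleton. It is only a sketch: the titles, header prose, and identifiers are placeholders, and a real header would carry far more detail.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <!-- corpus level: platform and project metadata -->
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example chat corpus</title></titleStmt>
      <publicationStmt><p>Unpublished sketch.</p></publicationStmt>
      <sourceDesc><p>Collected from a hypothetical chat platform.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- document level: one chat log, talk page, or thread -->
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt><title>Example thread</title></titleStmt>
        <publicationStmt><p>Unpublished sketch.</p></publicationStmt>
        <sourceDesc><p>One thread from the hypothetical platform.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <text>
      <body>
        <!-- post level: posts grouped into a thread -->
        <div type="thread">
          <post xml:id="d1.p1" who="#u01"><p>First post.</p></post>
          <post xml:id="d1.p2" who="#u02" replyTo="#d1.p1"><p>A reply.</p></post>
        </div>
      </body>
    </text>
  </TEI>
</teiCorpus>
```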
As shown in Example 9.3.2. Attributes Specific to CMC post above, nested threads of posts may be encoded sequentially, while the indentLevel attribute of post is used to keep track of the original nesting depth. This is especially meant for CMC text obtained from a wiki code or HTML source, where it is not always entirely clear whether the indentation information actually reflects a reply action from a user.
In genres where technical reply information is available for each post, reply links can be encoded using the replyTo attribute on post elements, as shown in the second example of 9.3.2. Attributes Specific to CMC post. The network of all reply links will then also form a threaded structure, and visual indentations can be reconstructed from it and need not be explicitly encoded.
As explained in section 9.2. Basic Units of CMC, the elements post, u, kinesic, and incident are available to encode textual transcriptions of written posts, spoken turns, bodily activities of avatars, and onscreen activity by users that occur in CMC data; and, as discussed in section 9.3.2. Attributes Specific to CMC post, graphics or other media data within posts are encoded in a post with modality set to written. When two or more of these features occur in a CMC interaction, we can speak of multimodal CMC.
Some basic multimodality is available in many private chat systems such as WhatsApp, where spoken and written posts and media posts containing images or video clips can alternate in the sequence of posts. The following shows the suggested encoding of an extended part of the haircut chat example from above, including a spoken post, several written posts, and a post containing a graphic image (adapted from the MoCoDa2 corpus, Beißwenger et al. (eds.), visited 30 March 2022).
Note that the spoken utterance u represents a speaker turn that was transmitted via an audio channel of the application that is continuously open during a session, whereas a spoken post represents a spoken message that has been recorded in private and been posted to the CMC server as a whole. See section 9.2. Basic Units of CMC.
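The contrast between a spoken post and a spoken utterance can be sketched as follows. The content, identifiers, and timestamps are invented, and using p for the transcription of the voice message is an assumption of this sketch rather than a requirement.

```xml
<!-- spoken post: a voice message recorded in private and sent as a whole -->
<post xml:id="v01" modality="spoken" who="#u01" when="2019-11-04T18:02:00">
  <p>hey are you still at the hairdresser</p>
</post>
<!-- written post answering it in the same chat -->
<post xml:id="v02" modality="written" who="#u02" when="2019-11-04T18:03:10" replyTo="#v01">
  <p>Just finished!</p>
</post>
<!-- spoken utterance: a turn in a continuously open audio channel,
     as it might occur in a different kind of session -->
<u who="#u01">okay see you in a bit</u>
```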
The teiHeader of the corpus should contain metadata about the CMC platform(s), e.g. its name, information about its owner (often a company) including their address or location, the URL of the server where the CMC data were collected from, or the filename of a database dump that was used as a source. Metadata about the project responsible for collecting the data and building the corpus, if applicable, should be recorded as well.
A CMC document may be a chat logfile, a discussion page, or a thematic thread of posts as encoded within a TEI element. Among the metadata to be recorded in the sourceDesc of its teiHeader are, if available, its title, author or owner, its URL, the date of its creation and/or the date of its last change (i.e. the time when the last post was added to it).
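One possible way to record such document-level metadata is with a bibl inside the sourceDesc, as in this sketch. The title, user name, URL, and dates are placeholders, and the type values on date are hypothetical labels, not prescribed ones.

```xml
<sourceDesc>
  <bibl>
    <title>Talk: Example article</title>
    <author>ExampleUser</author>
    <ptr target="https://example.org/wiki/Talk:Example_article"/>
    <!-- date the document was created in the CMC environment -->
    <date type="created" when="2013-02-11"/>
    <!-- date the last post was added -->
    <date type="lastChange" when="2014-06-01"/>
  </bibl>
</sourceDesc>
```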
The documentation of how the data were collected, e.g. how they were scraped or sampled from the web, or downloaded from a server, should be recorded in the samplingDecl. Like other metadata, information about sampling should be recorded at the highest level applicable. That is, if the information applies to an entire corpus, the samplingDecl should appear in the teiHeader of the corpus level; if the information is different for each document, it should appear in the teiHeader of the document level texts.
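A corpus-level sampling declaration might be sketched as follows; the sampling procedure described in the prose is entirely invented.

```xml
<encodingDesc>
  <samplingDecl>
    <p>Talk pages were extracted from a database dump of the platform.
       Only threads containing at least two posts were retained.</p>
  </samplingDecl>
</encodingDesc>
```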
A listPerson may be used to maintain an inventory of users and bots taking part in a CMC interaction, along with information about them. As with other such contextual information, it may be kept in the teiHeader (where it would occur in a particDesc within a profileDesc) or in a completely separate document. In either case, an encoded post may then be linked to its author by use of the who attribute.
A prefix definition for the prefix uL: could be used to map the value uL:06 to file:/userList.xml#cmc_user_06. See 17.2.3. Using Abbreviated Pointers for more information on establishing prefix definitions.
This indirection (using a listPerson, particularly one in a separate file, to store information about the users involved in a CMC interaction) is particularly useful when there is both a need to keep such information locally, and to remove it (e.g., to ‘anonymize’ the data) when the data are published or shared with other researchers.
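The pieces involved might fit together as in the following sketch, shown as three separate fragments. The person name, the post identifier, and the exact matchPattern are assumptions; only the uL: prefix and the target file:/userList.xml#cmc_user_06 come from the discussion above.

```xml
<!-- userList.xml: kept separately so that it can be withheld on publication -->
<listPerson>
  <person xml:id="cmc_user_06">
    <persName>Jane Example</persName>
  </person>
</listPerson>

<!-- in the encodingDesc of the corpus header: how uL: pointers are resolved -->
<listPrefixDef>
  <prefixDef ident="uL" matchPattern="([0-9]+)"
             replacementPattern="file:/userList.xml#cmc_user_$1">
    <p>Pointers with the uL: prefix refer to person elements in userList.xml.</p>
  </prefixDef>
</listPrefixDef>

<!-- in the body: the post is linked to its author through the prefix -->
<post xml:id="p42" who="uL:06">
  <p>See you tomorrow.</p>
</post>
```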
Emojis are iconic or symbolic, invariant graphic units which the users of social media applications such as WhatsApp, Instagram, and X (Twitter) can select from a menu or ‘emoji keyboard’ and embed into their written posts. Examples are 😁, 😷, 🌈, 😱, and 🙈. An emoji is encoded by one or more Unicode characters which are intended to be mapped directly to a pictorial symbol.
Emoticons predate emojis and are created as combinations of ASCII punctuation and other characters using the keyboard. Examples are :-), ;-), :-(, :-x, \O/, and Oo. They first occurred on a computer bulletin board system at Carnegie Mellon University (Fahlman, 2021) and then became frequent in chat communications during the mid-1980s. An emoticon typically consists of several Unicode characters (from the ASCII subset) in a row, each of which has an intended use other than as part of an emoticon.
Both emoticons and emojis may be simply transcribed as a sequence of characters. As with any other characters, they may be entered as numeric character references if this is more convenient. (E.g., ❤ might be transcribed as &#x2764; in any XML document, including a TEI document; see Entry of Characters.)
When the text of a post is being tokenized, e.g. for linguistic analysis, it may be useful to encode the emoticon or emoji as a separate token. In such cases elements such as w or c may be used for tokenization, and the pos attribute may be used to indicate that the encoded string is an emoji or an emoticon. (See 18.1. Linguistic Segment Categories.)
The values of pos in the above examples are from the STTS_IBK Tagset for German (see Beißwenger et al. (2015-09-13)), which includes tags for CMC-specific elements such as EMOASC for an ASCII-based emoticon and EMOIMG for an icon-based emoji.
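A tokenized post using these two tags could look roughly like this. The post identifier and text are invented, and the ordinary word is left without a pos value so as not to guess at the rest of the tagset.

```xml
<post xml:id="e01" who="#u01" modality="written">
  <p>
    <w>Super</w>
    <!-- ASCII-based emoticon -->
    <w pos="EMOASC">:-)</w>
    <!-- icon-based emoji, transcribed as a Unicode character -->
    <w pos="EMOIMG">😁</w>
  </p>
</post>
```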
Sometimes, e.g. when the source of the TEI document was a web page in HTML, the emojis may occur only as an icon graphic in the source. In such a case, they may be encoded using figure. The corresponding Unicode character can then be recorded in the desc element by the encoder if desired.
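For instance, an emoji that appears in an HTML source only as an image might be captured as in this sketch. The image URL and the type value are hypothetical, and the choice of Unicode character is the encoder's interpretation.

```xml
<p>See you later
  <figure type="emoji">
    <!-- the icon graphic as found in the source -->
    <graphic url="https://example.org/icons/wink.png"/>
    <!-- corresponding Unicode character, supplied by the encoder -->
    <desc>😉</desc>
  </figure>
</p>
```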
A post in a CMC interaction may contain a graphic in addition to some text or even contain only a graphic (without any text). As explained in 9.3.2. Attributes Specific to CMC post, the modality of such a post should be considered as written. To encode the graphic information, the figure element may be used at the appropriate place.
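A post consisting only of an image might be sketched as follows. The file path, identifiers, and description are placeholders.

```xml
<post xml:id="g01" modality="written" who="#u02" when="2019-11-04T18:10:00">
  <figure>
    <graphic url="media/IMG_0042.jpg"/>
    <figDesc>Photo of the new haircut</figDesc>
  </figure>
</post>
```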
The following recommendations on how to encode features of the circulation of posts, such as IDs, re-posts (retweets), hashtags, and mentions, use X (Twitter) posts (tweets) as an example; these phenomena are not in any way unique to X (Twitter), however.
In the following example, the type of post (in this case, a tweet) is recorded using the type attribute of post. If it were useful to record a particular sub-categorization of tweet, the subtype attribute could also be used. Furthermore, the original unique identifier of the tweet as supplied by X (Twitter) is recorded as part of the value of the xml:id attribute of the post.
Also in the following example a retweet and its corresponding retweeted tweet are encoded as two separate posts each with its own set of attributes. The post representing the retweet itself does not contain or duplicate the content of the retweeted tweet. Instead it refers to the ID of the retweeted tweet via a ptr in the post content. All original content of the retweet goes in the content of the post element as well. In addition, the hashtags found in the body of the source tweets have been encoded using ref elements (with a type of hashtag), as they are links like any other hyperlink.
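The approach described here might be sketched as follows. The tweet identifiers, user pointers, timestamps, hashtag URL, and the subtype and ptr type values are all hypothetical. Note that an xml:id cannot begin with a digit, so a prefix is placed before the platform's numeric identifier.

```xml
<!-- original tweet: platform ID embedded in xml:id, hashtag encoded as a link -->
<post xml:id="tw_1001" type="tweet" who="#userA" when="2014-03-02T09:00:00">
  <p>New release of our corpus tools
     <ref type="hashtag" target="https://example.org/hashtag/cmc">#cmc</ref></p>
</post>
<!-- retweet: points to the retweeted tweet instead of duplicating its content -->
<post xml:id="tw_1002" type="tweet" subtype="retweet" who="#userB" when="2014-03-02T09:12:00">
  <p><ptr type="retweet" target="#tw_1001"/></p>
</post>
```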
Note that in the above example ‘CoMeRe’ style (cf. Thierry et al. (2014)) encoding is used to represent the number of favorites. It would also be reasonable to use a TEI measure element instead of the fs.
For encoding linguistic analyses of CMC text, we may use the dedicated elements and attributes from the analysis module, which is described in 18. Simple Analytic Mechanisms. For example, the tokenization (segmentation into word-like units) of a CMC text should be encoded using the w element.
In many CMC genres, especially in private chat, informal writing abounds, including irregular spellings imitating spoken language, omitted word boundaries, and spurious boundaries that split a token into parts. For encoding these writing phenomena typical of CMC, the TEI attributes norm and join may be used.
In the preceding example, pairs of a gap and a supplied element encode the fact that some substring has been removed and replaced with another string for anonymization purposes. Note that in this example, the name and the w elements and their attributes also provide some categorical information about what has been removed. Using gap and supplied to record the anonymization is especially advisable when the original name or referencing string has been ‘pseudonymized’, i.e. replaced by a different referencing string of the same ontological category (such as replacing the female name Konstanze by the female name Kornelia). In that case, the markup would be the only place where it can be seen that a pseudonymization has been carried out, as in the following version of the example.
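A small sketch of both attributes follows, using invented English chat text. Here norm records a normalized form, and join="right" indicates that no whitespace separates the token from the one to its right in the source.

```xml
<p>
  <!-- irregular spellings imitating spoken language -->
  <w norm="going">goin</w>
  <w norm="home">hoooome</w>
  <!-- omitted word boundary: "seeya" in the source, tokenized as two words -->
  <w norm="see" join="right">see</w>
  <w norm="you">ya</w>
</p>
```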
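In the same spirit, a pseudonymized first name might be marked up roughly as follows. This is a sketch only: the reason and type values are assumptions, and the surrounding text is invented.

```xml
<p>
  <w>Hi</w>
  <name type="firstName">
    <!-- the original name has been removed -->
    <gap reason="pseudonymization"/>
    <!-- a replacement name of the same category has been supplied -->
    <supplied reason="pseudonymization">Kornelia</supplied>
  </name>
</p>
```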
The module described in this chapter makes available the following components:
The selection and combination of modules to form a TEI schema is described in 1.2. Defining a TEI Schema.