Saturday, April 16, 2016

COST ELN STSM on multi-word production (3): A first look at the data

For keystroke-logging the writing sessions in our experiment, we use Inputlog. It is developed at the University of Antwerp and is free for everybody to use. On the Inputlog website you can also find information on how to use it and on how it has been used in other studies.

In the Record tab, you enter the information on the participant and the writing session and press "Record". MS Word opens automatically with a fresh new document. The Inputlog window goes to the background and doesn't disturb your writing. When you have finished writing, you bring the Inputlog window back to the front, press the "Stop recording" button and you're done. Inputlog then switches to the Analyze tab and lets you select analysis scripts, which you can also modify.

Of course, you could run your own scripts on the recorded file. All keystrokes are stored in an idfx file, which is an XML file containing the participant's information as metadata and all information on keys pressed and mouse movements as events. You could load it into an XML database like BaseX and run XQuery scripts.
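If you prefer a general-purpose language to XQuery, the same idea can be sketched in Python. Note that this is only a rough sketch: the element and attribute names in the sample (`log`, `event`, `position`, `key`, `value`) are made up for illustration and are not the real idfx schema, so they would have to be adapted to an actual file. The values are the ones from the Greek-keyboard session discussed further down in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an idfx file.
# The real idfx structure differs; adapt the names below to your file.
SAMPLE = """<log>
  <event id="491" type="keyboard">
    <position>375</position><key>VK_OEM_1</key><value></value>
  </event>
  <event id="492" type="keyboard">
    <position>375</position><key>VK_H</key><value>ή</value>
  </event>
  <event id="493" type="mouse"><position>375</position></event>
</log>"""


def keyboard_events(xml_text):
    """Return one dict per keyboard event, skipping all other event types."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.iter("event"):
        if event.get("type") != "keyboard":
            continue
        rows.append({
            "id": int(event.get("id")),
            "position": int(event.findtext("position")),
            "key": event.findtext("key"),
            # An empty <value> element means no character was produced.
            "value": event.findtext("value") or "",
        })
    return rows


for row in keyboard_events(SAMPLE):
    print(row)
```

For a real file you would read the text from disk instead of using `SAMPLE`, and replace the assumed names with the ones you actually see in the XML.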

So, everything looks ready for processing. But it's better to look more closely at your data first. The main issue when dealing with non-English language data is always encoding. The snippet shown above actually has Greek letters and it is encoded in UTF-8 (Emacs makes this information explicit at the bottom). Students wrote texts in Greek, and all final texts also show Greek letters. So everything should be fine, shouldn't it?

Actually, the information in the idfx file is not taken from MS Word, but directly from the keyboard. So no matter what your setting in MS Word is, it is the keyboard setting in Windows that is relevant. And we discovered that for some sessions, this was set to English and not to Greek. This means that the information in the idfx file is actually the ASCII keys pressed. Because of the setting in MS Word, these keystrokes are converted into the corresponding Greek characters, so the text in MS Word appears in Greek.

The question is: does this affect the analysis? We could simply replace each English letter with the corresponding Greek letter. There are conversion tables available and even the keyboards are labeled accordingly, so this should be easy. But Greek has accented vowels which are not typed as characters of their own, but are constructed much as you would write them by hand: you put an accent on the vowel. This means you press the key for the tonos (the key to the right of P, which would be the "ü" on a German keyboard and the ";" on a US-English keyboard) and then the vowel. The result is a vowel with tonos: one character only, although we pressed two keys. And that's how it is recorded when the keyboard is set to Greek.
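Unicode mirrors this "accent plus vowel" construction, which makes it easy to see the mismatch between characters and key presses. The following small Python sketch uses only the standard `unicodedata` module; it is meant as an illustration of the character side, not as a description of what Inputlog does internally.

```python
import unicodedata

# The precomposed character: small eta with tonos.
eta_tonos = "\u03ae"

# Canonical decomposition splits it into the vowel and a combining accent.
decomposed = unicodedata.normalize("NFD", eta_tonos)
names = [unicodedata.name(ch) for ch in decomposed]

# Composition goes the other way: vowel + combining accent -> one character.
recomposed = unicodedata.normalize("NFC", "\u03b7\u0301")

print(len(eta_tonos), "code point in the composed form")
print(len(decomposed), "code points in the decomposed form:", names)
print("recomposed equals the original:", recomposed == eta_tonos)
```

So one visible character in the final text can correspond to two units in the process, which is exactly what matters when we count key presses and measure pauses between them.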

However, if the keyboard is still set to English, Inputlog records that two keys have been pressed, and the key for the tonos is not treated as a dead key. The following image shows two idfx files.

In both cases, we produce the same character, the small letter eta with tonos: ή. In the right file (GR-03_0.idfx), the key for the tonos (VK_OEM_1) is pressed at position 375 as the 491st event. Then the key VK_H for the small eta is pressed as the 492nd event, but we are still at position 375. The tonos key is treated as a dead key (there is no key value), and after pressing the key for eta (η), the value is "ή". But if we look at the left file (GR-11_0.idfx), we find the production of ή recorded differently: the key for the tonos is pressed as event 38 at position 19 and there actually is a value: ";". Then the key VK_H is pressed as event 39 at position 20 and the value is "h". So the tonos is not treated as a dead key, but like any other key with an actual value. Since no accented value is visible in the idfx file, we cannot simply replace ASCII values with the corresponding UTF-8 characters. A more sophisticated recoding would be needed.
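To give an idea of what such a recoding could look like, here is a minimal Python sketch. It is not what Inputlog does and not the procedure we finally used; it only illustrates the idea of treating ";" as a pending accent. The mapping table is my assumption of the standard Greek layout for lowercase letters and should be checked against the keyboard actually used. Uppercase letters, the dialytika and corrections with backspace are ignored, and a tonos key that is not followed by a vowel is silently dropped here.

```python
import unicodedata

# Assumed US-to-Greek key mapping (lowercase only); verify before real use.
US_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ", "h": "η",
    "u": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ", "n": "ν", "j": "ξ",
    "o": "ο", "p": "π", "r": "ρ", "s": "σ", "w": "ς", "t": "τ", "y": "υ",
    "f": "φ", "x": "χ", "c": "ψ", "v": "ω",
}
TONOS_KEY = ";"              # value recorded when the keyboard is set to English
COMBINING_ACUTE = "\u0301"   # composes with a Greek vowel under NFC
VOWELS = set("αεηιουω")


def recode(values):
    """Turn recorded ASCII key values into Greek text with composed accents."""
    out = []
    pending_tonos = False
    for value in values:
        if value == TONOS_KEY:
            # Remember the accent and wait for the next key.
            pending_tonos = True
            continue
        greek = US_TO_GREEK.get(value, value)  # unknown values pass through
        if pending_tonos and greek in VOWELS:
            greek = unicodedata.normalize("NFC", greek + COMBINING_ACUTE)
        pending_tonos = False
        out.append(greek)
    return "".join(out)


print(recode([";", "h"]))
print(recode(list("kalhm;era")))
```

The first call corresponds to events 38 and 39 in GR-11_0.idfx. Note that the recoded text is one character shorter than the number of recorded events, so the positions in the file would have to be adjusted as well.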

Let's see what this means for our analyses in a later post.

Tuesday, April 12, 2016

COST ELN STSM on multi-word production (2): Data collection

In order to explore how multi-word expressions are produced, we need somebody to write something. We decided to have students come in and write short argumentative essays. In those texts, you would expect to find discourse expressions like "in my opinion", or "on the one hand -- on the other hand". This would give us freely produced MWEs without explicitly triggering them.

Students come to the lab and first get some information about the experiment on paper. They sign a consent form and fill in a short questionnaire. The questionnaire asks about their native language and other languages they speak or write, and about their writing (how many fingers do they use, do they look at the screen or at the keyboard, etc.). I also take observational notes, and we will later see whether or not their self-image is accurate. They get a topic to write about. First they can plan for 5 minutes and make notes on paper; after that they write for 30 to 35 minutes.

Students write a text about one of two topics: "Should students pay tuition for post-graduate studies in Greek Universities?" or "argue in favor or against having the options to be tested in all courses they take at each semester". The target audience is other students, so they write a letter to the editor of an imagined student newspaper.

It's a small lab, so we can have four students at a time. However, they drop in from time to time: sometimes there are four, but most often there is one writing while we start analyzing the incoming data. I will write about this in the next post.

All four computers run Windows, but different versions. Ioannis installed Inputlog for keystroke-logging. You start the logging and MS Word opens. You write as usual; Inputlog does not interfere with MS Word.

According to our plans, we will have around 60 writing sessions with Greek data in the end.

Monday, April 11, 2016

COST ELN STSM on multi-word production (1): The start

At the end of 2014, the COST Action IS1401 "Strengthening Europeans' Capabilities by Establishing the European Literacy Network" (ELN) started. We will explore how to help people (students, adults, novices and experts, and foreign language learners) to write and read better. You can read about the official statement, goals, and working groups on the COST ELN website.

One instrument in COST actions is the STSM (short-term scientific mission). Combining my interests in writing processes and multi-word expressions (which I follow in the COST action PARSEME, "PARSing and Multi-word Expressions. Towards linguistic precision and computational efficiency in natural language processing"), I applied for a research adventure with Ioannis Dimakos from the Department of Primary Education of the University of Patras. He heads the Laboratory of Cognitive Analysis of Learning, Language and Dyslexia. Under Constantin Porpodas this lab participated in the COST action A8 "Learning disorders as a barrier to human development."

For this STSM we work on a multi-lingual study on multi-word expression (MWE) production. A great part of natural language (either spoken or written) consists of MWEs (i.e., sequences of words with special meaning and/or syntactic properties). Those units have to be learned, because their use and meaning cannot be deduced from a simple combination of the words involved. MWEs are rather fixed units and they are typically stored as complete units in dictionaries. It has been shown widely that knowledge of such units plays a key role in reading and listening. However, there is very little research on the production of multi-word expressions. In a pilot study, I was able to show that MWEs of various kinds (idiomatic phrases, terminology, grammatical constructions, etc.) are produced with significantly shorter pauses between the words involved than any other sequence of words. This study was based on texts produced by German university students who wrote short argumentative essays; the writing was recorded using Inputlog. It has been shown in great detail that the use, structure, and semantics of MWEs are similar across European languages and European cultures. In this STSM we will investigate whether this also holds for the production of MWEs in German and Greek.

So, in the second week of April 2016, I travelled to Patras, found a really nice hotel by the sea with a great view, and we started our small project.