Saturday, April 16, 2016

COST ELN STSM on multi-word production (3): A first look at the data

For keystroke-logging the writing sessions in our experiment, we use Inputlog. It's developed at the University of Antwerp and free for everybody to use. On the Inputlog website you can also find information on how to use it and on how it has been used in other studies.

In the Record tab, you enter the information on the participant and the writing session and press "Record". MS Word opens automatically with a new document. The Inputlog window goes to the background and doesn't disturb your writing. When you finish writing, you bring the Inputlog window back to the front, press the "Stop recording" button and you're done. Inputlog switches to the Analyze tab and lets you select analysis scripts, which you can also modify.

Of course, you could run your own scripts on the recorded file. All keystrokes are stored in an idfx file, which is an XML file containing the participant's information as metadata and all information on keys pressed and mouse movements as events. You could load it into an XML database like BaseX and run XQuery scripts.
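If you prefer a general-purpose language over XQuery, reading the keyboard events could look roughly like the following Python sketch. Note that the element and attribute names (`event`, `type`, `position`, `key`, `value`) and the embedded sample are simplifying assumptions for illustration only; the real idfx schema is more detailed and should be checked against an actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an idfx file.
# The real Inputlog schema differs; adapt the element names to your data.
SAMPLE_IDFX = """<log>
  <event id="38" type="keyboard">
    <position>19</position><key>VK_OEM_1</key><value>;</value>
  </event>
  <event id="39" type="keyboard">
    <position>20</position><key>VK_H</key><value>h</value>
  </event>
  <event id="40" type="mouse"><x>10</x><y>20</y></event>
</log>"""


def keyboard_events(xml_text):
    """Return one dict per keyboard event, skipping mouse events."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.iter("event"):
        if event.get("type") != "keyboard":
            continue
        rows.append({
            "id": int(event.get("id")),
            "position": int(event.findtext("position")),
            "key": event.findtext("key"),
            # A dead key may have no value at all, so default to "".
            "value": event.findtext("value", default=""),
        })
    return rows


for row in keyboard_events(SAMPLE_IDFX):
    print(row["id"], row["position"], row["key"], repr(row["value"]))
```

For a real file you would read the XML from disk instead of a string and adjust the lookups to the nesting you actually find there.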

So, everything looks ready for processing. But it's better to look more closely at your data first. The main issue when dealing with non-English language data is always encoding. The snippet shown above actually has Greek letters and it is encoded in UTF-8 (Emacs makes this information explicit at the bottom). Students wrote texts in Greek, and all final texts also show Greek letters. So everything should be fine, shouldn't it?

Actually, the information in the idfx file is not taken from MS Word, but directly from the keyboard. So no matter what your setting in MS Word is, the setting for the keyboard in Windows is what counts. And we discovered that for some sessions, this was set to English and not to Greek. This means that the information in the idfx file is actually the ASCII keys pressed -- because of the setting in MS Word, this information is converted into the corresponding Greek characters and the characters in MS Word appear as Greek characters.

The question is: Does this affect the analysis? We could simply replace each English letter with the corresponding Greek letter. There are conversion tables available and even the keyboards are labeled accordingly, so this should be easy. But then, Greek has accented vowels which are not typed as characters of their own, but are constructed much as you would write them by hand: You put an accent on the vowel. This means you press the key for the tonos (the key to the right of L, which would be the "ö" on a German keyboard and the ";" on a US-English keyboard) and then the vowel. The result is a vowel with tonos: one character only, although we pressed two keys. And that's how it is recorded when the keyboard is set to Greek.

However, if the keyboard is still set to English, Inputlog records that two keys have been pressed; the key for the tonos is not treated as a dead key. The following image shows two idfx files.

In both cases, we produce the same character, the small letter eta with tonos: ή. In the right file (GR-03_0.idfx), the key for the tonos is pressed (VK_OEM_1) at position 375 as the 491st event. Then the key VK_H for the small eta is pressed as the 492nd event, but we are still at position 375. The tonos key is actually treated as a dead key (there is no key value) and after pressing the eta key η, the value is "ή". But if we look at the left file (GR-11_0.idfx), we find the production of ή to be recorded differently: the key for the tonos is pressed as event 38 at position 19 and there actually is a value: ";". Then the key VK_H is pressed as event 39 at position 20 and the value is "h". So the tonos is not treated as a dead key, but as any other key with an actual value. In the idfx file, no accented value is visible, so we cannot simply replace ASCII values with the corresponding UTF-8 characters. A more sophisticated recoding would be needed.
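To give an idea of what such a recoding could involve, here is a minimal Python sketch. It is not what Inputlog does and not a finished solution. It assumes that the recorded values have already been extracted as a list of strings, that an unshifted ";" always stands for the tonos key, and that the small key table (which follows the standard Greek layout but is deliberately incomplete) covers the letters in question. All names are made up for illustration.

```python
import unicodedata

# Partial US-to-Greek key table (lowercase only); an assumption for
# illustration, to be replaced by a complete conversion table.
US_TO_GREEK = {
    "a": "α", "e": "ε", "h": "η", "i": "ι", "o": "ο", "y": "υ", "v": "ω",
    "k": "κ", "l": "λ", "m": "μ", "n": "ν", "p": "π", "r": "ρ", "t": "τ",
}
GREEK_VOWELS = set("αεηιουω")
TONOS_KEY_VALUE = ";"       # value logged for the tonos key under an English layout
COMBINING_ACUTE = "\u0301"  # composes with a Greek vowel into vowel-with-tonos


def recode(values):
    """Turn values logged under an English layout into Greek text."""
    out = []
    pending_tonos = False
    for value in values:
        if value == TONOS_KEY_VALUE:
            # Treat the key as a dead key: remember it, emit nothing yet.
            pending_tonos = True
            continue
        greek = US_TO_GREEK.get(value, value)
        if pending_tonos and greek in GREEK_VOWELS:
            # NFC merges vowel + combining accent into a single character.
            greek = unicodedata.normalize("NFC", greek + COMBINING_ACUTE)
        # Simplification: a tonos before a non-vowel is silently dropped.
        pending_tonos = False
        out.append(greek)
    return "".join(out)


print(recode([";", "h"]))                 # the two events from GR-11_0.idfx
print(recode(["k", "a", "l", ";", "h"]))
```

Note that this only repairs the characters. The two events for the tonos and the vowel would also have to be merged so that positions line up with the sessions recorded under the Greek layout, and uppercase letters and the dialytika are not handled at all.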

Let's see what this means for our analyses in a later post.

Tuesday, April 12, 2016

COST ELN STSM on multi-word production (2): Data collection

In order to explore how multi-word expressions are produced, we need somebody to write something. We decided to have students come to write short argumentative essays. In those texts, you would expect to find discourse expressions like "in my opinion", or "on the one hand -- on the other hand". This would give us freely produced MWEs without explicitly triggering them.

Students come to the lab and first get some information about the experiment on paper. They sign a consent form and fill in a small questionnaire. The questionnaire asks about their native language and other languages they speak/write, and about their writing (how many fingers do they use, do they look at the screen or at the keyboard, etc.). I also take observational notes and we will later see whether or not their self-image is accurate. They get a topic to write about. First they can plan for 5 minutes and make notes on paper; after that they write for 30 to 35 minutes.

Students write a text about one of two topics: "Should students pay tuition for post-graduate studies in Greek Universities?" or "Argue in favor of or against having the option to be tested in all courses they take each semester". The target audience is other students, so they write a letter to the editor of an imagined student newspaper.

It's a small lab, so we can have four students at a time. However, they drop in from time to time; sometimes there are four, but most often there is one writing while we start analyzing the incoming data. I will write about this in the next post.

All four computers run Windows, but different versions. Ioannis installed Inputlog for keystroke-logging. You start the logging and MS Word opens. You write as usual; Inputlog does not interfere with MS Word.

According to our plans, we will have around 60 writing sessions with Greek data in the end.

Monday, April 11, 2016

COST ELN STSM on multi-word production (1): The start

At the end of 2014, the COST Action IS1401 Strengthening Europeans' Capabilities by Establishing the European Literacy Network (ELN) started. We will explore how to help people (students, adults, novices and experts, and foreign language learners) to write and read better. You can read about the official statement, goals, and working groups on the COST ELN website.

One instrument in COST actions is the STSM (short-term scientific mission). Combining my interests in writing processes and multiword expressions (which I follow in the COST action PARSEME (PARSing and Multi-word Expressions): Towards linguistic precision and computational efficiency in natural language processing), I applied for a research adventure with Ioannis Dimakos from the Department of Primary Education of the University of Patras. He heads the Laboratory of Cognitive Analysis of Learning, Language and Dyslexia. Under Constantin Porpodas this lab participated in the COST action A8 "Learning disorders as a barrier to human development."

For this STSM we work on a multi-lingual study on multi-word expression (MWE) production. A great part of natural language (either spoken or written) consists of MWEs (i.e., sequences of words with special meaning and/or syntactic properties). Those units have to be learned; their use and meaning cannot be deduced from a simple combination of the words involved. MWEs are rather fixed units and they are typically stored as complete units in dictionaries. It has been shown widely that knowledge of such units plays a key role in reading and listening. However, there is very little research on the production of multi-word expressions. In a pilot study, I could show that MWEs of various kinds (idiomatic phrases, terminology, grammatical constructions, etc.) are produced with significantly shorter pauses between the words involved than any other sequence of words. This study was based on texts produced by German university students who wrote short argumentative essays; the writing was recorded using Inputlog. It has been shown in great detail that use, structure, and semantics of MWEs are similar across European languages and European cultures. In this STSM we will investigate whether this also holds for the production of MWEs in German and Greek.

So, in the second week of April 2016, I travelled to Patras, found a really nice hotel by the sea with a great view, and we started our small project.

Friday, July 25, 2014

Professor for one year (week 48): Who profits from MOOCs?

Actually, this blog post was scheduled for the first week of March. However, the topic is still relevant even a few weeks later. Just pretend it's early March 2014 (i.e., cold and rainy) while reading.

During our visit to US higher ed institutions last year, we met James P. Honan from Harvard's Grad School of Education. We discussed various things and also touched on e-learning and MOOCs. Honan told us about his experiences as a teacher and consumer of e-learning courses and contents, and then we started musing about the underlying principles of MOOCs. I will briefly follow up here.

From a didactic point of view, massive open online courses (MOOCs) are old wine in new skins. I wrote about this part in an earlier post. E-learning courses hosted on servers of universities started around 2000, and courses supported by current technology are as old as TV. The only new aspect is the "massiveness". At a university, e-learning courses are offered to the students of a particular subject at a certain point of their studies enrolled at that specific university. So there might be several hundred students using the materials of a course.

Going "massive" and "open", those courses skip restrictions -- everybody can take part -- but no change in didactics might be involved. Allowing more than only a few hundred users to access the material may involve changes in server architecture, maybe clustering, but not necessarily in the general technology used for user interaction and the like.

However, someone has to run those servers and someone should be paid for maintenance. The first MOOCs were developed from scratch, not just scaled e-learning courses (there will be another post on this aspect, stay tuned) -- maybe the content providers would need some payment, too. But declaring those courses as "open" doesn't only mean everybody may join, but also means nobody should pay anything for taking part. So where should the money come from to pay for development and maintenance?

Honan gave a hint when he talked about the fear of teaching staff at universities: There may be two main reasons for attending a course. People are just curious about a certain topic (a), or people have to acquire certain knowledge (due to job demands or the like) and that involves getting a certificate (b). For a certificate, attendees would have to do some sort of exam. And this exam would have to be assessed and graded by someone. And guess who is qualified for assessing and grading student work? Right, teaching staff.

So while in the early years of e-learning instructors feared being replaced by machines, the advent of MOOCs makes instructors fear being used for grading only. And in the end, being replaced by cheap grading staff -- why should you need highly qualified academics when you can have people trained to grade certain exams only? MOOCs would not result in replacing humans, but in downgrading educators.

On the one hand, this nightmare might not come true to the extent instructors might expect -- similar to the fear of teachers being replaced by educational TV shows or e-learning courses -- but on the other hand, that's probably part of the business model of companies like Coursera, edX, or Udacity. Participation in MOOCs might be free, but to get a certificate you would have to pay -- part of this money might trickle down to the graders, but most of it will go to the company owners. Those certificates don't have to cost a fortune. Look at prices for apps -- as long as the audience is big enough, small fees are fine.

Of course, by "certificate" I mean any piece of paper stating that you passed the exam of this course. As soon as participants actively demand official certificates of the hosting institutions, e.g., from Stanford or MIT, another question arises: How much is such a certificate worth? As an on-campus student, you would have to pay a lot of money -- if you ever got accepted in the first place. However, nobody would pay several thousand dollars for an on-line course offered or developed by Stanford or MIT staff.

So maybe several hundred dollars? But wouldn't that be tough competition for those elite universities' own on-campus programs? If I could get a prestigious certificate without moving to Stanford and without enormous debts, why should I even send an application to Stanford? But here we're already touching another topic.

Wednesday, June 11, 2014

Professor for one year (week 47): Teaching investment and payoff

This is the 47th post in the series "Professor for one Year."  Initially, I had planned to post something every week.  However, my year is over and I still have some weeks left in the series.  The topics for the missing posts are already planned, so the only thing I need is some time to write ...

Apropos of time:  How much time do you spend on teaching, including preparation, interacting with students, assessment, grading?  As I wrote two weeks ago and also in week 24, teaching did take up a lot of my time.  I argued that the time allocated to teaching -- including preparation and grading -- should be the same as the time students have to invest to take a particular class -- i.e., the ECTS credits should describe the amount of work students and instructors have to invest.  However, for a regular seminar with 9 ECTS credits, this would mean 18 hours per semester week.  So, no more than three courses (54 hours per week) and then you would have to do some of the other work in the non-lecture time aka semester break to stay at least somewhat healthy and within the regulatory framework of labor law (41 hours per week).
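As a back-of-the-envelope check, the arithmetic can be sketched in a few lines of Python. The 30 working hours per ECTS credit and the 15-week lecture period are assumptions chosen because they reproduce the 18 hours mentioned above; they are not fixed values, and the function name is made up for illustration.

```python
HOURS_PER_ECTS = 30      # assumption: one credit corresponds to about 30 hours of work
LECTURE_WEEKS = 15       # assumption: length of the lecture period in weeks
LEGAL_WEEKLY_HOURS = 41  # weekly limit mentioned in the text


def weekly_hours(ects, hours_per_ects=HOURS_PER_ECTS, weeks=LECTURE_WEEKS):
    """Hours per lecture week if a course's total workload is spread evenly."""
    return ects * hours_per_ects / weeks


per_course = weekly_hours(9)
three_courses = 3 * per_course
print(per_course, three_courses, three_courses - LEGAL_WEEKLY_HOURS)
```

With these assumptions, three seminars already exceed the 41-hour week by 13 hours, before any of the other duties are counted.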

Let's have a look at the workload of professors; 39 to 41 hours per week include:

  • administrative work (keeping track of all the different contracts for your PhD students and PostDocs, help with finding new researchers, mentoring your PhD students, hold staff meetings, etc.)
  • committee activities at your local university (attend faculty meetings, serve on appointment committees, attend senate meetings, etc.)
  • committee activities in your scientific community (attend meetings of societies, have some duties there, review for conferences and workshops, review for funding agencies, etc.)
  • write grant proposals (you don't get much state or university money for staff)
  • teaching
  • doing research
  • publish about research
There are studies on professorial activities, showing that professors work more than the 40 hours they get paid for, and that they spend only little time on activities one would usually associate with "being a professor" -- teaching and doing research.

Having a social life, too, and assuming that maybe you don't want to work every weekend, but roughly 50 hours per week -- of course, you think about some issues during your non-working time and you have ideas outside your office -- the question remains: which of the activities are really important and where could you spend less time?  You cannot cut down on administrative work, but you could try to delegate some tasks.  Most of the committee activities at your university are related to the status of a professor, so no chance of delegating something there.  You can delegate tasks for your scientific community like reviewing conference or workshop papers -- however, as an author, you'd rather want to get feedback from senior researchers, not from PhD students, so this is a bit tricky.  You could hire someone for writing grant proposals and you could even let your PhD students and PostDocs write most of the articles on which you appear as co-author.  Even the research you could delegate to members of your group, at least part of it.  So you are the one who has ideas and then somebody else tests whether they are worth investigating more deeply -- for computational linguistics, this means that you find someone who does the programming along the lines of your roughly sketched new approach.  So most of the activities could be delegated to other people, and maybe the quality even improves because you profit from including more people and thus more ideas and more skills than one could have oneself.

And for teaching?  Oh, that's easy: You take the slides and exercises you developed years ago (or even borrowed from somebody) and use them term after term without changes.  You find teaching assistants to do all the tutoring and exercises with students.  You cut back on mentoring: students have to come up with topics for theses themselves, and somehow they should know by then how to write a thesis, shouldn't they?  This way, you can drastically reduce the time spent on teaching.  And to be honest, that's the most obvious way:  I didn't have duties in committees at the university during my year as professor, but even then, I could hardly keep up with my scientific community activities, and I had absolutely no time to write grant proposals, do research, or even publish.  In other words: I had to invest almost all of my time in teaching and I definitely couldn't afford this in a real professorship.  On the positive side, I now have quite a teaching record, from which I can benefit in the future.  But honestly, I also enjoyed mentoring and advising students even though this takes up a lot of time.  And in the end it's the only way to have someone try an idea and report some results I might be able to use for proposals or further research (eventually resulting in publications, too).

However, having a teaching load as high as 9 SWS must result in reduced teaching effort, and thus in lower-quality teaching, as professors cannot afford to invest most of their working hours in teaching. So one solution would be to value teaching more, or to reduce the teaching load -- students then could expect good-quality teaching and mentoring.

Tuesday, May 20, 2014

Professor for one year (week 46): Writing research across borders

After teaching was over, I packed my suitcase and took the TGV to Paris.  I attended the Third Writing Research Across Borders (WRAB) conference which took place at the University of Paris-Ouest Nanterre La Défense.

WRAB takes place every third year; in 2011 it was at George Mason University near Washington, D.C., and in 2008 it was hosted by the University of California, Santa Barbara.  It is the biggest and most international conference on writing research I'm aware of.  The conferences organized by EARLI's SIG Writing (which take place every other year) are also international (i.e., not only European), but much smaller.

The number of participants, the number of submitted and accepted proposals, and the number of concurrent sessions are constantly growing.  They actually had 26 parallel sessions!  It was almost impossible to find out which of the talks/presentations would be the most suitable one for your own interests.  There was a bit of Twitter traffic going on, so I could see that related topics would be discussed at various sessions all taking place at the same time.  The program was so dense and there were so many people that I drew a comparison to LREC (the International Conference on Language Resources and Evaluation), which takes place every other year.  LREC is still growing and you can be sure to meet almost everybody from the NLP community there.  If LREC is the conference to be at for NLP, then WRAB is the conference to be at for writing research.

And WRAB shares another not so nice feature with LREC: Although you know that everybody is there (or you even searched the program for the name of some colleagues), you cannot meet someone during a coffee break or over lunch unless you actively make an appointment.  I like small- or mid-sized conferences better.

As I had no time to submit papers to NLP conferences during this year, those writing research conferences (and also conferences/workshops on linguistics) will be the only conferences I actively attend in 2014 -- you only have to submit a very short abstract, not a full paper.  At WRAB, I presented ongoing work on a systematic analysis of complex writing errors.  I argued for going a step further than current error analysis in writing research, NLP, or (second) language acquisition -- we have to consider the process that caused an error when classifying writing errors.  This way, we could on the one hand distinguish competence errors and performance errors and we could on the other hand come up with actual proposals on how to automatically prevent or correct certain types of errors.  Fortunately, I found a possibility to actually publish this -- I will give the details once the publication is available.

I met colleagues from Europe and the Americas; we exchanged ideas and made loose appointments for SIG Writing's Conference on Writing Research in August.  So yes, the conference was successful.  As I already knew that I would start at IMS in Stuttgart in April, I could tell people about my new affiliation and I made some loose collaboration and cooperation agreements.  I hope I can actually work on that in Stuttgart.

Monday, May 19, 2014

Professor for one year (week 45): Last week of teaching

Finally it's mid February and the last week of teaching is over.  This semester (Winter term 2013/2014), I taught four courses: on Computational Morphology aka XFST, on Computational Semantics aka Prolog, on Natural Language Processing, and on logging writing processes and exploring those data.

Except for the NLP course, I had to prepare everything from scratch.  There was material I could use and it definitely helped to "borrow" ideas and exercises for the two programming courses, but I spent all my time preparing slides (for all courses) and data (especially for the writing processes course) or assessing student solutions.

I also realized that I have to work on my elocution: When I teach two classes back-to-back, my voice is almost gone at the end of the day although I regularly sip some water.  Of course, one solution would be not to talk that much myself, but to let students contribute more.  However, staying focused for 90 minutes, trying to be louder than the 30 computers plus keyboard noise, and keeping students awake puts a strain on my voice.

As for the exams, I did different things:
  • For the Prolog course, I had an exam similar to the Perl course last semester.  The grade is made up of points earned during the semester by submitting solutions for three exercises and then there is a final exam consisting of a more theoretical part to be answered on paper and a more practical part where students actually program.  I could assess the theoretical part while students worked on the programming tasks, nice multitasking.
  • For the XFST course, students earned some points by solving three assignments during the semester, too.  And then they will submit small projects including documentation within 4 weeks.
  • For the NLP course, students earned some points during the semester by submitting solutions for three assignments and then I had a classic written exam in the very last session.  Students had to answer one question per topic.  Looks like the handwriting of most students is more or less readable.
  • For the writing process course, students had to work on a project during the second half of the semester.  They defined a small research question to investigate in groups of two, recorded a writing session for each person, and then explored the logged data and wrote a small report.  I will report on this experimental didactic setting at the next Conference on Writing Research in August. 
I could time the assessments for the three NLP-related courses so I had to come up with an assignment every week and I had to grade an assignment every week.  I preferred a constant but moderate workload to giving assignments to all three courses in the same week.  In every course, students had to read research papers (classics and recent ones) in groups and present the content to their colleagues.  Students also had to relate the paper they presented to the material presented or discussed before and they had to comment on whether or not they agreed with the authors based on their prior knowledge and on being almost full-fledged and approved linguists themselves (they were all master's students except for the XFST course).

So teaching is over, I wait for the project submissions of two courses and I have to assess the written exam for NLP and the Prolog programs.  I hope to submit all grades by the end of March to actually finish all teaching stuff when my appointment in Konstanz ends.