####################################
#                                  #
#   README file for dump_meeting   #
#                                  #
#   Michel Galley                  #
#   Columbia University            #
#   galley@cs.columbia.edu         #
#                                  #
####################################

1 COPYRIGHT

The author abandons the copyright of this program. Everyone is
permitted to copy and distribute the program with no charge and no
restrictions. However, the author would appreciate proper usage and
being notified of bugs or problems.

This software is provided "as is", and the author makes NO warranties,
express or implied.

2 INTRODUCTION

Efforts in the AMI project to annotate the ICSI Meeting Corpus and AMI
Corpus for various properties, and to port the existing annotations
into one framework, the NITE XML Toolkit (NXT), have led to the
creation of a CVS repository accessible by AMI members and
affiliates [1].

"dump_meeting" is a simple tool that exports the XML-encoded meeting
data into various plain text formats. Currently supported:

 - export meeting transcription (with or without punctuation; with or
   without case)
 - export turn and word segmentation
 - export dialog acts (DA) and adjacency pairs (AP)
 - export summary annotation
 - export topic segmentation

Given a meeting identifier (ID), "dump_meeting" will create at least
two files:

 - ID.trs          transcription
 - ID.wsegs        word segmentation

Both files contain exactly one turn per line, i.e. the i-th line of
ID.trs corresponds to the i-th line of ID.wsegs. Turn segmentation is
by default defined by dialog act (DA) boundaries, but "dump_meeting"
can instead read a turn segmentation from file, or automatically
determine a silence-based segmentation.

Extra plain text files can be generated:

 - ID.segs         three columns: 1) speaker ID; 2) turn start time;
                   3) turn end time
 - ID.da           five columns: 1) speaker ID; 2) DA label;
                   3) AP label; 4) ID of A-speaker in the adjacency
                   pair; 5) DA identifier
 - ID.sel          summarization annotation (1=in summary, 0=not)
 - ID.vect.topic   topic segmentation annotation (1=boundary; 0=not)

The only output file that does not abide by the one-turn-per-line
convention (it instead defines one topic segment per line) is:

 - ID.u.segs.topic two columns: 1) segment start time;
                   2) segment end time

3 RESTRICTIONS

"dump_meeting" ...

 - is currently only available under Linux.
 - was only tested with NITE XML Toolkit Version 1.2.9.
 - currently only works with the ICSI meeting data.

4 USAGE

First specify the location of the NXT-formatted data through the shell
environment variable NXTBASEDIR, e.g., under bash:

> export NXTBASEDIR=/usr/local/data/AMI/Data/ICSI/NXT-format

You also need to set MGBASEDIR to point to the directory containing
the "dump_meeting" files, e.g.:

> export MGBASEDIR=/usr/local/dump_meeting

Basic usage:

> dump_meeting -m <ID> -o <output dir>

e.g.

> dump_meeting -m Bmr005 -o .

The above command will create Bmr005.{trs,wsegs} in the current
directory. By default, "dump_meeting" creates ASR-like transcriptions,
i.e., everything in lower case, without punctuation, and with only
"-pau-" and "[laugh]" as extra tokens (fragmentary words, e.g., "r-",
are however still present). To get punctuation and case, add "-p" and
"-c" respectively.

To export extractive summarization annotation, run:

> dump_meeting -m <ID> -e <summary file> -o <output dir>

where <summary file> is the location of the XML file relative to
$NXTBASEDIR/Contributions/Summarization/extractive/ (excluding the
.extsumm.xml extension). That is, <summary file> should be "Bmr005" to
extract annotation from extractive/Bmr005.extsumm.xml, and
"alastair/Bmr005" to extract annotation from
extractive/alastair/Bmr005.extsumm.xml:

> dump_meeting -m Bmr005 -e Bmr005 -o .
> dump_meeting -m Bmr005 -e alastair/Bmr005 -o .

An arbitrary segmentation can be provided as input, e.g.:

> dump_meeting -m Bmr005 -S Bmr005_new -o .

This reads the turn segmentation from ./Bmr005_new.segs (which is
expected to have the same format as the .segs files generated by
"dump_meeting"). Note that turn segments are always specified by
speaker. This provides more flexibility, since it lets us take into
account overlap between speakers.

For more information on how to generate the other files:

> dump_meeting -h

------------------------------------------------------------------

To generate ASR transcripts, run:

> dump_meeting.i686.out -m <ID> -A <ASR dir>

where <ASR dir> is a directory containing the file ID.words (e.g.,
Bmr005.words). The format of this file is one word per line:

<speaker ID> <start time> <end time> <word>

e.g.,

me013 1.407 1.917 so
me013 2.986 3.616 okay
me013 3.616 3.616 so
me013 3.616 3.676 it
me013 3.676 3.886 looked
me013 3.886 4.026 like
me013 4.026 4.076 a
me013 4.076 4.486 crash
me013 4.486 4.676 that's
me013 4.676 4.976 great

5 REFERENCES

[1] http://www.inf.ed.ac.uk/systems/cvs/new
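
------------------------------------------------------------------

As an illustration of the .words format used for ASR transcripts in
section 4, the following is a minimal Python sketch of a reader for
such files. It is not part of "dump_meeting": the names Word and
parse_words are hypothetical, the four-field layout is inferred from
the sample lines, and the times are assumed to be in seconds.

```python
from typing import Iterable, List, NamedTuple


class Word(NamedTuple):
    speaker: str   # speaker ID, e.g. "me013"
    start: float   # word start time (assumed to be seconds)
    end: float     # word end time (assumed to be seconds)
    token: str     # the word itself


def parse_words(lines: Iterable[str]) -> List[Word]:
    """Parse lines of the assumed form: speaker start end word."""
    words = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got: {line!r}")
        speaker, start, end, token = fields
        words.append(Word(speaker, float(start), float(end), token))
    return words


if __name__ == "__main__":
    sample = """\
me013 1.407 1.917 so
me013 2.986 3.616 okay
me013 3.616 3.616 so
"""
    for w in parse_words(sample.splitlines()):
        print(f"{w.speaker} {w.start:.3f}-{w.end:.3f} {w.token}")
```

To read a real file, the same function can be given an open file
object, e.g. parse_words(open("Bmr005.words")).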