JavaNLP meeting notes for 01/30/03

This is a summary of what everyone did last week and outstanding tasks for next time.

The next meeting is planned for Wed, March 5. Dan will come up with proposals for the classify package then.

AUTHOR TAGS

It looks like the question with author tags is being solved. Dan has a script which will be put in the bin directory and things will be coordinated between Dan and Joseph on that.

TASKS FROM LAST TIME AND PROGRESS

Sep:
from last time
- Add 20 news groups to /u/nlp/data and create dbm classes for reading them in. We want to use this as a proof-of-concept for text categorization using the new document and classifier framework.
done
for next time
- same as for last time

Joseph:
from last time
- Fix AbstractDataCollection so that it stores its data in itself (and not in a data field)
- Look at creating a DocumentCollection class for storing text documents, with some of the general-purpose functionality that's currently in Corpus (vocab, splits, etc)
- Consider creating a simple XML schema for documents and document collections that we can convert other file formats to
done
- fixed the DataCollection classes so they store their contents internally, rather than in a private list.
- created a new class DocumentCollection that now sits between AbstractDataCollection and Corpus and provides general functionality for managing collections of Documents.
- defined a simple XML format for storing entire document collections in one file, and DocumentCollection can read in these files.
for next time
- coordinate adding author tags with Dan
- catch up on JavaNLP minutes

Dan:
from last time
- make oldparser use the new interface
- fix API for parse methods
- add a work limit param that returns the PCFG
- add word length range args (like -ms and -mx in java)
- make sure TaggedWords work
- modularize language-dependent features of parser
- fix other misc. files for I/O stuff (readers etc)
done
- wrote a Perl script to add author tags; the script has some non-trivial capabilities
- added TaggedWords to parser
- found out there are problems with the existing organization of dictionaries
for next time
- will work on new organization for parser
- other misc. for parser
- think about the external classify and issues that need to be discussed

Kristina:
from last time
- generalize code for log linear model with text file input
- provide documentation for using log linear code
done
- generalized code for log-linear models with text input files - the classes edu.stanford.nlp.maxent.LoglinearFactory and edu.stanford.nlp.maxent.LoglinearClassifier do the work
for next time
- write documentation for the above and improve the classes

Roger:
from last time
- parser stuff
done
- messed around a lot with FactoredParser and got a pretty good idea of how to modularize it and other parsers for multiple languages. No commits yet; discussion with other parties still necessary
for next time
- still work on modularizing and making language-independent

Huy:
from last time
- HMM experiments
done
- fixed structure learning
- the train methods have been updated to use Properties objects, so we can make the parameters as complicated as we want.
for next time
- still work on HMM, make stable version to be used for Chris's class project (good to be stable by this weekend)

Teg:
from last time
- remove all uses of gnu.regexp from JavaNLP, using java.util.regex instead
- look for potential uses of JavaNLP code to CS276B project
- convert /u/nlp/data/iedata/acquisitions.txt to XML format (to be specified)
done
- Added the XML-parsing functionality to the nlp.ie.hmm.Corpus class.
- Created a utility (not yet in CVS) to convert the files from the ugly format to XML format.
for next time
- the rest

Cindy:
from last time
- fix FrameNet parser XML issues (consider using standard java XML packages)
done
- started looking at JAXP
for next time
- finish XML stuff
- look at designing a richer Sentence class with Roger

Chris:
from last time
- integrate Tim's new PTB code into JavaNLP PTB Tokenizer
done
- Tim did the integration
- Chris changed Tree.java to use Writers for output not streams (complete with that autoflush!).
for next time
- finish author tags
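
NOTE ON WRITER OUTPUT

As a rough illustration of the Writer change listed under Chris's items, here is a minimal sketch of printing through a Writer with autoflush. It is not the actual Tree.java code: the class TreePrintSketch, the method printTree, and the bracketed strings are all made up for this example. The only library pieces relied on are the standard java.io.PrintWriter constructors that take an autoflush flag.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

public class TreePrintSketch {

    // Hypothetical stand-in for a tree-printing method; not the real Tree.java API.
    static void printTree(String bracketed, Writer out) {
        // Second argument true = autoflush: println() flushes the wrapped Writer.
        PrintWriter pw = new PrintWriter(out, true);
        pw.println(bracketed);
    }

    public static void main(String[] args) {
        // Standard output wrapped as a Writer (with autoflush) instead of used as a raw stream.
        printTree("(S (NP I) (VP ran))", new PrintWriter(System.out, true));

        // The same method can target any other Writer, e.g. an in-memory buffer.
        StringWriter sw = new StringWriter();
        printTree("(NP dogs)", sw);
        System.out.print(sw);
    }
}
```

Taking a Writer rather than an OutputStream lets the caller decide where the text goes and how characters are encoded; without autoflush, output written to a buffered Writer may not appear until flush() or close() is called.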