YASMET by Franz Josef Och
(Tutorial written by Deepak Ravichandran)

Introduction
===================

A good technical description of Maximum Entropy is given in:

  Adwait Ratnaparkhi. 1997. A Simple Introduction to Maximum Entropy
  Models for Natural Language Processing. Technical Report 97-08,
  Institute for Research in Cognitive Science, University of
  Pennsylvania.

Definitions:

  Events:            Observation (Input)  -> X
  Classes:           Prediction (Output)  -> Y
  Feature Functions: f(X,Y)

The YASMET format can be confusing for people used to traditional
classifiers such as Decision Trees, Naive Bayes, etc. The difference
comes from two fundamental points:

1. In Maximum Entropy each feature function is conditioned on BOTH
   the input event and the output class, i.e. f(X,Y).
   In a Decision Tree each feature function is conditioned ONLY on
   the input event, i.e. f(X).

2. In Maximum Entropy more than one class may be correct for an
   event. In that case you may enter both classes with weights of
   {0.5, 0.5}, or {0.6, 0.4}, etc.

Because of this difference, people building classifiers with YASMET
may find that they need to input a lot of redundant information.
The advantage is that a given event can have more than one correct
output class. The same software can also be used for re-ranking.
The technical differences between re-ranking and classification are
described in:

  Deepak Ravichandran, Eduard Hovy and Franz Josef Och. 2003.
  Statistical QA - Classifier vs Re-ranker: What's the difference?
  In Proceedings of the ACL Workshop on Multilingual Summarization
  and Question Answering--Machine Learning and Beyond, Sapporo, Japan.

General Rules
=====================

The following rule applies:

  The number of classes for each event has to be constant. If this is
  not the case, some events must be PADDED with filler (dummy) classes.

There are many formats for YASMET. Only the most GENERAL format is
described here.
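The padding rule can be sketched in Python as follows. This is only an
illustration and not part of YASMET: the in-memory event representation,
the function name pad_events and the filler feature name NOCLASS are all
assumptions made for the sketch. Each filler class gets weight 0, so it
can never count as a correct class.

```python
def pad_events(events, filler=("NOCLASS", 1.0)):
    """Pad every event to the same number of classes.

    events: list of events. Each event is a list of (weight, features)
            pairs, one pair per class, where features is a list of
            (name, value) tuples. This layout is an assumption of the
            sketch, not a YASMET data structure.
    Returns a new list in which shorter events are extended with
    zero-weight filler classes.
    """
    width = max(len(event) for event in events)
    padded = []
    for event in events:
        # One fresh filler entry per missing class, each with weight 0.
        dummies = [(0.0, [filler]) for _ in range(width - len(event))]
        padded.append(list(event) + dummies)
    return padded


events = [
    # Three real classes.
    [(1.0, [("f1", 1.0)]), (0.0, [("f2", 1.0)]), (0.0, [("f3", 1.0)])],
    # Only one real class, so two filler classes are needed.
    [(1.0, [("f1", 1.0)])],
]
padded = pad_events(events)
print([len(event) for event in padded])  # [3, 3]
```

The filler feature name is arbitrary; what matters is that every event
ends up with the same number of class blocks.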
The sample GENERAL format for a problem with 3 classes, 2 events and
4 features is:

Classes are  {0, 1, 2}
Features are {f1, f2, f3, f4}

cat train.gis
3
0 @ @ 1 f1 1 f2 89 f3 9.9 f4 8 # @ 0 f2 1 # @ 0 f3 1 #
1 @ @ 0 f1 1 f2 89 f3 2 f4 8 # @ 0.5 f1 1 # @ 0.5 f3 2 #

The first line is the number of classes in each event.
Every other line describes one event.
The first word of each event is the first class that has a non-zero
value (i.e. the class to which the current event belongs). Classes are
numbered 0, 1, ..., (Number_of_Classes - 1).

The string "@ 1 f1 1 f2 89 f3 9.9 f4 8 #" represents the {strength,
features & feature values} that fired for one class of that event,
where
  "@ 1"   defines the strength of that class given the event (the
          strengths across all classes of an event should preferably
          sum to 1)
  "f1 1"  says feature f1 fired and has value 1.
  "f2 89" says feature f2 fired and has value 89.
  .......

Classification example:
=====================================================

Example

  Outlook    Temperature  Windy  Class
  sunny      low          true   Play
  sunny      high         true   Don't Play
  sunny      high         false  Don't Play
  overcast   low          true   Play
  overcast   high         false  Play
  overcast   low          false  Play
  rain       low          true   Don't Play
  rain       low          false  Play

Assumption:
  Basic Features:-  Outlook     {sunny, overcast, rain}
                    Temperature {low, high}
                    Windy       {1, 0}
  Class:-           {Not_Play=0, Play=1}
  YasmetFeatures:-  [(Basic Features)*(Class)]

A feature that does not fire (here Windy = false) is simply left out
of the event.

Let the training file be called train.gis:

cat train.gis
2
1 @ @ 0 outlook_sunny_notplay 1 temperature_low_notplay 1 windy_notplay 1 # @ 1 outlook_sunny_play 1 temperature_low_play 1 windy_play 1 #
0 @ @ 1 outlook_sunny_notplay 1 temperature_high_notplay 1 windy_notplay 1 # @ 0 outlook_sunny_play 1 temperature_high_play 1 windy_play 1 #
0 @ @ 1 outlook_sunny_notplay 1 temperature_high_notplay 1 # @ 0 outlook_sunny_play 1 temperature_high_play 1 #
1 @ @ 0 outlook_overcast_notplay 1 temperature_low_notplay 1 windy_notplay 1 # @ 1 outlook_overcast_play 1 temperature_low_play 1 windy_play 1 #
1 @ @ 0 outlook_overcast_notplay 1 temperature_high_notplay 1 # @ 1 outlook_overcast_play 1 temperature_high_play 1 #
1 @ @ 0 outlook_overcast_notplay 1 temperature_low_notplay 1 # @ 1 outlook_overcast_play 1 temperature_low_play 1 #
0 @ @ 1 outlook_rain_notplay 1 temperature_low_notplay 1 windy_notplay 1 # @ 0 outlook_rain_play 1 temperature_low_play 1 windy_play 1 #
1 @ @ 0 outlook_rain_notplay 1 temperature_low_notplay 1 # @ 1 outlook_rain_play 1 temperature_low_play 1 #

The unseen test file (say test.gis) has the same format as train.gis.

On Linux
==============

Training    : cat train.gis | ~och/bin/i686/YASMET1.out -deltaPP 0.0 -iter 100 >! WT
              Here iter --> Number of training iterations
Unseen Test : cat test.gis | ~och/bin/i686/YASMET1.out WT >! test.result
              Here WT --> WT file generated by training

On Sun
==============

Training    : cat train.gis | ~och/bin/sun4/YASMET1.out -deltaPP 0.0 -iter 100 >! WT
              Here iter --> Number of training iterations
Unseen Test : cat test.gis | ~och/bin/sun4/YASMET1.out WT >! test.result
              Here WT --> WT file generated by training

Re-ranking example:
=====================================================

Consider a QA re-ranking problem consisting of 3 Questions.
Each Question has at most 3 Answers.
There are 3 feature functions {Feature-AE, Feature-DR, Feature-UM}.
Each answer may be correct or wrong.
Question 1:
  Answer 1.1  Correct  Feature-AE: 0.0003  Feature-DR: 0.3      Feature-UM: 800
  Answer 1.2  Correct  Feature-AE: 0.0004  Feature-DR: 0.4      Feature-UM: 850
  Answer 1.3  Wrong    Feature-AE: 0.0001  Feature-DR: 0.01     Feature-UM: 100

Question 2:
  Answer 2.1  Correct  Feature-AE: 0.1     Feature-DR: 0.001    Feature-UM: 700

Question 3:
  Answer 3.1  Correct  Feature-AE: 0.091   Feature-DR: 0.85     Feature-UM: 500
  Answer 3.2  Wrong    Feature-AE: 0.082   Feature-DR: 0.00001  Feature-UM: 150
  Answer 3.3  Wrong    Feature-AE: 0.072   Feature-DR: 0.00012  Feature-UM: 160

Assumption:
  Basic Features:-  Feature-AE {real-valued}
                    Feature-DR {real-valued}
                    Feature-UM {real-valued}
  Classes:-         {0,1,2}
  YasmetFeatures:-  {Basic Features}

cat train.gis
3
0 $ 1 @ @ 0.5 Feature-AE 0.0003 Feature-DR 0.3 Feature-UM 800 # @ 0.5 Feature-AE 0.0004 Feature-DR 0.4 Feature-UM 850 # @ 0 Feature-AE 0.0001 Feature-DR 0.01 Feature-UM 100 #
0 $ 1 @ @ 1 Feature-AE 0.1 Feature-DR 0.001 Feature-UM 700 # @ 0 NOCLASS 1 # @ 0 NOCLASS 1 #
0 $ 1 @ @ 1 Feature-AE 0.091 Feature-DR 0.85 Feature-UM 500 # @ 0 Feature-AE 0.082 Feature-DR 0.00001 Feature-UM 150 # @ 0 Feature-AE 0.072 Feature-DR 0.00012 Feature-UM 160 #
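
The event lines above can be generated mechanically. The Python sketch
below is one possible way to do it and is not part of YASMET; the
function name format_event and its arguments are assumptions. It
splits the weight evenly among the correct answers (the tutorial also
allows uneven splits such as {0.6, 0.4}), pads short events with
zero-weight NOCLASS entries, and copies the "$ 1" token literally from
the listing, since the tutorial does not explain it. Feature values are
kept as strings so that they are written exactly as given.

```python
def format_event(candidates, num_classes, filler="NOCLASS"):
    """Build one re-ranking event line in the layout of the listing.

    candidates:  list of (is_correct, features) pairs, one per answer;
                 features is a list of (name, value_string) pairs.
    num_classes: constant number of classes per event (first file line).
    """
    if len(candidates) > num_classes:
        raise ValueError("more candidates than classes")
    correct = [i for i, (ok, _) in enumerate(candidates) if ok]
    if not correct:
        raise ValueError("an event needs at least one correct candidate")
    # Assumption: correct answers share the weight equally.
    share = 1.0 / len(correct)
    # First word: index of the first class with non-zero weight.
    # "$ 1" is reproduced verbatim from the tutorial's listing.
    parts = [str(correct[0]), "$ 1", "@"]
    for ok, feats in candidates:
        weight = share if ok else 0
        feat_text = " ".join(f"{name} {value}" for name, value in feats)
        parts.append(f"@ {weight:g} {feat_text} #")
    # Pad with dummy classes so every event has num_classes blocks.
    for _ in range(num_classes - len(candidates)):
        parts.append(f"@ 0 {filler} 1 #")
    return " ".join(parts)


def feats(ae, dr, um):
    """Helper for the three features used in this example."""
    return [("Feature-AE", ae), ("Feature-DR", dr), ("Feature-UM", um)]


question1 = [
    (True, feats("0.0003", "0.3", "800")),
    (True, feats("0.0004", "0.4", "850")),
    (False, feats("0.0001", "0.01", "100")),
]
question2 = [(True, feats("0.1", "0.001", "700"))]

line1 = format_event(question1, num_classes=3)
line2 = format_event(question2, num_classes=3)
print(line1)
print(line2)
```

With these inputs the two printed lines match the first two event lines
of the train.gis listing for the re-ranking example.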