| CS 121 Introduction to Artificial Intelligence Summer 2006 |
| Frequently Asked Questions |
| Final |
The final will be offered Saturday, August 19, 12:15-3:15pm, in Skilling 193.
Yes. The alternate final will be offered Friday, August 18, 7-10pm, in Gates 104. If you would like to take the final at the alternate time, you are welcome to do so, as long as you contact us beforehand.
Yes. If you would like to do so, let us know, and we will tell you the procedure.
| Project 3 |
Only as extra credit. If you would like to earn a few extra credit points for the class, we'll allow you to do a simplified version of the project for a maximum of 5 extra credit points toward your final grade in the class. The project is to successfully implement a Pac-Man agent which learns a state evaluation function (a U-function, not a Q-function) using TD learning, and then uses it somehow in action selection. There will not be any special new project handout or starter code; interested students should use the Project 1 starter code, and implement the PacManPlayer interface with a new class called LearningPacManPlayer. The extra credit project is due on Thursday, August 17, and no late days may be used on it!
Up to 5 possible extra points from the project will be added to your grade after we compute the grading curve and letter grade cutoffs, so those students who feel they are already doing well in the class should not feel that they need to do this extra credit project. Not doing the extra credit project will not put you at a disadvantage!
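For a rough idea of what a TD update for a U-function looks like, here is a minimal sketch. It is not part of any handout or starter code: the class name TdUtilitySketch, the use of a string as a state key, and the learning rate alpha and discount gamma are all assumptions made for illustration. A real Pac-Man agent may need to learn weights over state features rather than a table, since the state space is large.

```java
import java.util.HashMap;
import java.util.Map;

/** Tabular TD(0) sketch for a state utility function U(s). All names are hypothetical. */
final class TdUtilitySketch {
    private final Map<String, Double> utility = new HashMap<String, Double>();
    private final double alpha; // learning rate (assumed value, tune as needed)
    private final double gamma; // discount factor (assumed value, tune as needed)

    TdUtilitySketch(double alpha, double gamma) {
        this.alpha = alpha;
        this.gamma = gamma;
    }

    /** Returns the current estimate U(s), defaulting to 0 for unseen states. */
    double get(String state) {
        Double u = utility.get(state);
        return u == null ? 0.0 : u;
    }

    /** Applies U(s) <- U(s) + alpha * (reward + gamma * U(s') - U(s)). */
    void update(String state, double reward, String nextState) {
        double old = get(state);
        double target = reward + gamma * get(nextState);
        utility.put(state, old + alpha * (target - old));
    }

    public static void main(String[] args) {
        TdUtilitySketch td = new TdUtilitySketch(0.5, 0.9);
        td.update("s0", 10.0, "s1"); // one observed transition s0 -> s1 with reward 10
        System.out.println(td.get("s0")); // prints 5.0
    }
}
```

For action selection, one option is to prefer the move whose successor state has the highest learned U value.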
| Homework 3 |
Yes, indeed: specifically, the sigmoid function from part (b). We're setting it up for you to take the derivative in parts (d) and (e), which won't work for the step function since it is discontinuous. For example, you know that
P_w(C = 1 | x) = g(w^T f(x))
where w^T f(x) is the dot product of the weight and feature vectors, equivalent to the summation given in (a). If you substitute the sigmoid from part (b), you get
P_w(C = 1 | x) = 1 / ( 1 + e^{-w^T f(x)} )
The section on logistic regression in the Project 2 handout makes clear the relationship between the update rule we're developing for LR (using sigmoid) and the one we developed for perceptron in lecture. Read it over.
If you've set up your log likelihood expression correctly, this should be fairly straightforward. Here's a big HINT: taking the derivative of this expression should boil down to taking the derivative of a sum of logs with sigmoid-like expressions inside (if yours doesn't, you've probably set it up incorrectly). Taking the derivative of a sigmoid is very straightforward, but to help you out, check out this link.
If you look at that link, you'll see that the derivative of a sigmoid function g(z) is actually defined in terms of g(z). Awesome! Thus, when working through your own math, I suggest you work with g(w^T f(x)) rather than the full expression for it. Don't forget the chain rule for derivatives.
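If you want to convince yourself of the sigmoid derivative identity g'(z) = g(z)(1 - g(z)), a quick numerical check is one way to do it. The sketch below is only an illustration; the class and method names are made up here, and the step size h is an arbitrary choice.

```java
/** Compares the closed-form sigmoid derivative with a finite-difference estimate. */
final class SigmoidSketch {
    /** The sigmoid g(z) = 1 / (1 + e^{-z}). */
    static double g(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Closed form: g'(z) = g(z) * (1 - g(z)). */
    static double gPrime(double z) {
        double s = g(z);
        return s * (1.0 - s);
    }

    /** Central-difference approximation of g'(z) with step h. */
    static double numericDerivative(double z, double h) {
        return (g(z + h) - g(z - h)) / (2.0 * h);
    }

    public static void main(String[] args) {
        double[] points = {-2.0, 0.0, 3.0};
        for (double z : points) {
            System.out.println(z + ": closed form " + gPrime(z)
                    + ", numeric " + numericDerivative(z, 1e-5));
        }
    }
}
```

The two columns should agree to many decimal places at every point.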
Here's a bit of a start. You need to apply the chain rule for derivatives. This is what we are doing:
d/dw_i log[sigmoid(w^T f(x))]
First, define A = sigmoid(w^T f(x)), so that the derivative becomes d/dA log[A] * d/dw_i sigmoid(w^T f(x))
Next, define B = w^T f(x), so that it becomes d/dA log[A] * d/dB sigmoid(B) * d/dw_i w^T f(x)
So now we have the derivative of a log times the derivative of a sigmoid times the derivative of a sum. You know how to take all of those.
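Carrying those three factors through for a single log-sigmoid term gives the following. This is a sketch for one term only, writing g for the sigmoid and f_i(x) for the i-th feature; your full log likelihood will have more terms, and you should check each one yourself.

```latex
\begin{align*}
\frac{\partial}{\partial w_i} \log g\big(w^T f(x)\big)
  &= \frac{d \log A}{dA} \cdot \frac{d\, g(B)}{dB} \cdot \frac{\partial B}{\partial w_i},
  \qquad A = g(B),\; B = w^T f(x) \\
  &= \frac{1}{g(B)} \cdot g(B)\big(1 - g(B)\big) \cdot f_i(x) \\
  &= \big(1 - g(w^T f(x))\big)\, f_i(x)
\end{align*}
```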
Watch the 8/11 Friday section. We covered this in depth.
No, the table is simply a data structure to store probabilities (mt[x]) as you compute them. The difference between part (a) and part (b) is how many of the precomputed values we use to compute the entries for the next timestep.
As for how big the table is...think about what your dimensions are. Obviously, one dimension must be related to timestep (hence the t subscript). The other dimension relates to the x in mt[x], which is one of k possible values in the domain of Xt, right?
The c table is the table you use to reconstruct the most likely state sequence. It has the same dimensions as m, except that it does not have entries for timestep t = 1. You fill it with values from the domain of Xt at the same time that you are filling m with probabilities, and then afterward, you work backward through it to reconstruct the most likely state sequence. Whereas mt[x] contains the probability p(Et = e | Xt = x) p(Xt = x | Xt-1 = x') mt-1[x'], ct[x] contains the value x' that we used to calculate that probability. Now you should be able to figure out how to populate and use it.
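To make the m and c tables concrete, here is a minimal sketch of filling both and then walking backward through c. The array layout (prior, trans, emit), the integer encoding of states and observations, and the class name are assumptions made for this example, not something from the homework. Array index 0 corresponds to timestep t = 1, so row 0 of c is simply left unused.

```java
/** Most-likely-sequence sketch using an m table (probabilities) and a c table (back-pointers). */
final class ViterbiSketch {
    /**
     * prior[x] = P(X1 = x), trans[xp][x] = P(Xt = x | Xt-1 = xp),
     * emit[x][e] = P(Et = e | Xt = x), obs[t] = observed value at step t.
     */
    static int[] mostLikelySequence(double[] prior, double[][] trans,
                                    double[][] emit, int[] obs) {
        int steps = obs.length;
        int k = prior.length;
        double[][] m = new double[steps][k];
        int[][] c = new int[steps][k]; // row 0 is unused: no predecessor at the first step

        for (int x = 0; x < k; x++) {
            m[0][x] = prior[x] * emit[x][obs[0]];
        }
        for (int t = 1; t < steps; t++) {
            for (int x = 0; x < k; x++) {
                int best = 0;
                double bestP = -1.0;
                for (int xp = 0; xp < k; xp++) {
                    double p = emit[x][obs[t]] * trans[xp][x] * m[t - 1][xp];
                    if (p > bestP) {
                        bestP = p;
                        best = xp;
                    }
                }
                m[t][x] = bestP; // probability of the best path ending in x at step t
                c[t][x] = best;  // the predecessor x' that achieved it
            }
        }

        // Pick the best final state, then follow the back-pointers in c.
        int[] path = new int[steps];
        int last = 0;
        for (int x = 1; x < k; x++) {
            if (m[steps - 1][x] > m[steps - 1][last]) {
                last = x;
            }
        }
        path[steps - 1] = last;
        for (int t = steps - 1; t >= 1; t--) {
            path[t - 1] = c[t][path[t]];
        }
        return path;
    }
}
```

For long sequences the products in m can underflow, so working with log probabilities is a common alternative.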
We suggest that you discretize your domain, dividing the 1 square mile map into square feet (5280x5280) or square yards (1760x1760). Another possibility is to discretize it according to your transition model, in which you are traveling roughly 30 feet per move (2mph for ~10 seconds). This makes your map 176x176 and may simplify your transition distribution.
That said, if you understand continuous distributions and would like to formulate this as a continuous domain, go for it!
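The grid sizes above follow from simple unit conversions, spelled out here for reference. The only inputs are the figures already given (2 mph, about 10 seconds per move, a 1 mile by 1 mile map); the rounding to 30 feet is the same approximation used in the text.

```latex
\begin{align*}
2\ \tfrac{\text{mi}}{\text{h}} \times \tfrac{5280\ \text{ft}}{1\ \text{mi}} \times \tfrac{1\ \text{h}}{3600\ \text{s}}
  &\approx 2.93\ \tfrac{\text{ft}}{\text{s}} \\
2.93\ \tfrac{\text{ft}}{\text{s}} \times 10\ \text{s}
  &\approx 29.3\ \text{ft} \approx 30\ \text{ft per move} \\
5280\ \text{ft} \,/\, 30\ \text{ft} &= 176\ \text{cells per side} \\
176 \times 176 &= 30{,}976\ \text{cells in total}
\end{align*}
```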
You are walking blindfolded, so you have two sources of uncertainty: direction and distance. You are walking roughly 2 mph for 10 seconds (roughly 30 feet) before stopping again, presumably in the direction you were facing when you put your blindfold back on. However, you may travel more or less than 30 feet and you may not travel in quite the direction you mean to. Thus, your distribution should be over all of the locations (whether those are discrete squares or continuous "points") you might reach, which should include all squares that are at least partially within a 30 foot radius of your current location. You should probably even assign some probability mass to squares beyond that, to account for faster-than-expected movement.
If you are working in a continuous domain, then this makes a bit more sense since you have a continuous distribution over "point" locations within that radius.
Assume this: you take off your blindfold, measure the apparent height of Hoover Tower, do some calculations involving similar triangles and the Pythagorean theorem, and finally conclude that you are et feet from the tower, where you can assume that et is a multiple of 100 (although you may make different assumptions here, if you'd like...just state them clearly). Don't worry about what the calculations are, specifically. In fact, you can simply assume that you are granted -- magically, even -- a rough estimate of how far Hoover Tower is from you.
Thus, what Et gives you is a rough estimate of how far you are from Hoover Tower, and in turn, what P(Et | Xt) refers to is the "probability that Hoover Tower appears to be Et = et feet away (according to the measurement method described above), given that you are at location Xt = xt." You will want to define a distribution for this that assigns a lot of probability to the case where xt is in fact et (or close) feet from Hoover Tower and less when it's not.
If you'd like, you may assume that the measurements are in 100 (or other) foot increments so that the distribution is discrete, or if you prefer, you can assume that they are continuous and then describe the emission probability with a continuous distribution.
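One way to get a discrete emission distribution with this shape is to weight each possible reading by how close it is to the true distance and then normalize. The sketch below is just one possible choice, not the expected answer: the bell-shaped weighting, the spread sigmaFeet, the number of bins, and all names are assumptions made for illustration.

```java
/** Discrete emission sketch: P(Et = e | true distance), for e = 0, step, 2*step, ... */
final class EmissionSketch {
    /**
     * Returns an array p where p[i] is the probability of measuring i * stepFeet
     * when the true distance to the tower is trueDistanceFeet.
     * Weights fall off with squared error and are normalized to sum to 1.
     * Assumes at least one bin is within a few sigma of the true distance;
     * otherwise every weight underflows to 0 and the result is not meaningful.
     */
    static double[] emission(double trueDistanceFeet, double stepFeet,
                             int numBins, double sigmaFeet) {
        double[] p = new double[numBins];
        double total = 0.0;
        for (int i = 0; i < numBins; i++) {
            double diff = i * stepFeet - trueDistanceFeet;
            p[i] = Math.exp(-(diff * diff) / (2.0 * sigmaFeet * sigmaFeet));
            total += p[i];
        }
        for (int i = 0; i < numBins; i++) {
            p[i] /= total; // normalize so the distribution sums to 1
        }
        return p;
    }

    public static void main(String[] args) {
        // 80 bins of 100 feet cover a bit more than the diagonal of a 1 mile square.
        double[] p = emission(1230.0, 100.0, 80, 150.0);
        System.out.println("P(E = 1200 | true distance 1230) = " + p[12]);
    }
}
```

Whatever form you choose, state the assumptions behind it clearly.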
| Project 2 |
If your Java heap is running out of memory, then you need to change your maximum heap size (which java by default limits to 256M, hardly enough). You do this by passing the "java" command the "-Xmx" or "-mx" option followed immediately (no spaces) by the size you want to specify: 512M for 512 MB, 1024M or 1G for 1 GB, etc. It looks like this:
java -mx512M classify.spam.Tester ...
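To confirm that the option took effect, one simple check is to ask the JVM for its maximum heap size from inside a program. The class name HeapCheck below is made up for this example; Runtime.maxMemory() is part of the standard library. The reported number can be somewhat lower than the value you passed, depending on the VM.

```java
/** Prints the maximum heap size the running JVM will attempt to use. */
final class HeapCheck {
    /** Maximum heap in megabytes, as reported by the JVM. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with and without the heap option and compare the two numbers.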
V is the domain of all features in the training emails (spam and genuine), period. It does not include misspelled words, numbers, or features in the particular test email -- unless they occurred in the training emails.
The semicolon means "parameterized by." Well, the semicolon in the likelihood function L(phi...) is a typo, but for example:
P(y^(j) ; phi_y) means the prior probability of the jth email having label y -- from the distribution defined by the parameter phi_y, such that p(y = 1) = phi_y and p(y = 0) = 1-phi_y
while
P(wi^(j) | y^(j) ; phi_w=k|y=1, phi_w=k|y=0 for all k) means the conditional probability of word i of the jth email (given that the jth email's label is y) having value k -- from the distribution defined by parameters phi_w=k|y=1 and phi_w=k|y=0, such that p(wi = k | y = 1) = phi_w=k|y=1 and p(wi = k | y = 0) = phi_w=k|y=0.
You know this! Think about it for a second. What is a valid distribution? Sums to 1, right? Thus, you need to make sure that the total probability mass assigned by any probability distribution you learn sums to 1! For NB, you have two distributions: the prior over labels P(Y) and the conditional over features given the label P(W | Y). In the case of the vanilla NB, validity should be guaranteed by proper implementation with the Counter and CounterMap classes. Validity becomes more of an issue for your feature conditionals when you use smoothing to give mass to unseen events. Anyway, in short, for each distribution you learn, sum over all the probabilities for all possible events and see if that sum equals (or is very close to) 1.
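The check described above can be sketched in a few lines. This version uses a plain java.util.Map rather than the project's Counter and CounterMap classes, since their exact methods are not reproduced here; the class name, method name, and tolerance are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Checks that a learned distribution is non-negative and sums to (nearly) 1. */
final class DistributionCheck {
    static boolean isValid(Map<String, Double> dist, double tolerance) {
        double total = 0.0;
        for (double p : dist.values()) {
            if (Double.isNaN(p) || p < 0.0) {
                return false; // probabilities must be real and non-negative
            }
            total += p;
        }
        // Allow a small tolerance for floating-point rounding.
        return Math.abs(total - 1.0) <= tolerance;
    }

    public static void main(String[] args) {
        Map<String, Double> prior = new HashMap<String, Double>();
        prior.put("spam", 0.25);
        prior.put("genuine", 0.75);
        System.out.println(isValid(prior, 1e-9)); // prints true
    }
}
```

For the feature conditionals, run the same check once per label, over every feature in V.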
Yes.
Yes, in the abstract, anyway; in other words, the concept of "features" is the same. Which specific features you use can vary and is worth some thought and experimentation.
Simply put, yes.
This is a mistake in the handout. The weights are not constrained in any way (different from your NB distributions).
| Homework 2 |
First, let's revisit exactly what a Bayes Net is (according to Dave's paraphrasing): a graphical representation of a set of independence assumptions about a set of random variables over which we have a joint distribution. In other words, we use the graph structure to represent (or enforce) a set of independencies that we want to assume are true. If the network does not enforce independency between two nodes, then they are allowed to be dependent. This does not mean that they are in fact dependent. That, instead, depends on what sorts of CPDs we specify for our nodes. For example, if X and Y are connected by a directed edge (to Y), they are allowed by the network structure to be dependent; however, we may still specify a CPD P(Y | X) such that Y and X are nonetheless independent (basically, Y ignores any evidence provided by X). Thus, when we ask for which independencies do not necessarily hold, we are asking which dependencies are allowed by the network structure.
Well, you can't really give the tables since I haven't specified the parameters of the network. I expect you to give probability expressions which represent the tables. For instance, P(X) represents the table which has a probability for each possible value of X, summing to 1.
Well, the math does work out so that you can prove independence either in part (d) or part (f) of each problem. However, if you're having trouble, remember that there are two ways to test or prove conditional independence, either of which I'll accept for this problem. The first, which I'm suggesting, uses the fact that P(x, z | y) = P(x | y)P(z | y) for all values of x, y, and z if and only if X and Z are conditionally independent given Y. The second uses the fact that P(z | y, x) = P(z | y) for all values of x, y, and z if and only if X and Z are conditionally independent given Y. The second property is a bit easier to use in question 3 (hint! hint!).
Also, if the variables are not conditionally independent, it's not possible to prove it with the algebra. In this case, just give an informal justification.
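If you want to sanity-check your algebra on concrete numbers, the first property can be tested directly on a joint table. The sketch below is a numerical check for one specific table, not a substitute for the proof; the array layout joint[x][y][z] and all names are assumptions made for this example. To avoid dividing by P(y), it tests the equivalent condition P(x, y, z) P(y) = P(x, y) P(y, z).

```java
/** Numerically tests whether X and Z are conditionally independent given Y. */
final class CondIndependenceCheck {
    /** joint[x][y][z] = P(X = x, Y = y, Z = z). */
    static boolean independentGivenY(double[][][] joint, double tolerance) {
        int nx = joint.length;
        int ny = joint[0].length;
        int nz = joint[0][0].length;
        for (int y = 0; y < ny; y++) {
            double py = 0.0;              // P(y)
            double[] pxy = new double[nx]; // P(x, y)
            double[] pyz = new double[nz]; // P(y, z)
            for (int x = 0; x < nx; x++) {
                for (int z = 0; z < nz; z++) {
                    double v = joint[x][y][z];
                    py += v;
                    pxy[x] += v;
                    pyz[z] += v;
                }
            }
            for (int x = 0; x < nx; x++) {
                for (int z = 0; z < nz; z++) {
                    double lhs = joint[x][y][z] * py;
                    double rhs = pxy[x] * pyz[z];
                    if (Math.abs(lhs - rhs) > tolerance) {
                        return false; // the factorization fails for this x, y, z
                    }
                }
            }
        }
        return true;
    }
}
```

A single failing assignment is enough to show dependence for that table, which fits the informal justification asked for above.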
I mean, if you wrote out a huge table with a probability for each possible assignment to the variables, how many entries would it have?
The latter. You've chosen a network structure, which *requires* some conditional independencies, right? So I want you to list down some of the ones you've made, and say why, as a knowledge engineer, this is a reasonable thing for you to assume in your representation - like "in my opinion, studying and socializing aren't mutually exclusive, or even correlated in any significant way". The reason for this is that I want to force you to check whether the assumptions your model makes are consistent with your thinking about the subject. There's no right or wrong justification, of course.
Actually, the size of the Bayes Net is usually taken to be the sum of the sizes of the CPDs. So what I want you to do is compute the size of each CPD in your network (remember there is one for each variable) and then add them up. Once you've done this, compute the number of independent parameters for each CPD (this is usually half the total number of parameters; why?), and then add those up too. This is the size of your Bayes Net.
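The per-CPD bookkeeping can be sketched as follows, assuming each CPD is stored as a full table with one row per assignment to the parents. The class and method names are made up for this example, and the network in main is a toy, not the one from the homework.

```java
/** Counts table entries and independent parameters for a single CPD P(X | parents). */
final class BayesNetSize {
    /** Number of assignments to the parents (1 if there are none). */
    static long parentAssignments(int[] parentDomainSizes) {
        long rows = 1;
        for (int d : parentDomainSizes) {
            rows *= d;
        }
        return rows;
    }

    /** Total entries: one probability per value of X per parent assignment. */
    static long cpdEntries(int domainSize, int[] parentDomainSizes) {
        return parentAssignments(parentDomainSizes) * domainSize;
    }

    /** Each row must sum to 1, so the last entry of every row is determined by the others. */
    static long independentParameters(int domainSize, int[] parentDomainSizes) {
        return parentAssignments(parentDomainSizes) * (domainSize - 1);
    }

    public static void main(String[] args) {
        // Toy network: binary A and B with no parents, binary C with parents A and B.
        int[] none = {};
        int[] twoBinaryParents = {2, 2};
        long entries = cpdEntries(2, none) + cpdEntries(2, none)
                + cpdEntries(2, twoBinaryParents);
        long independent = independentParameters(2, none) + independentParameters(2, none)
                + independentParameters(2, twoBinaryParents);
        System.out.println(entries + " entries, " + independent + " independent");
        // prints "12 entries, 6 independent"
    }
}
```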
| Project 1 |
Yes, in fact, if you look in pacman.Game.java, you'll see that I constrain the game to have at least defaultNumberOfGhosts (which is set to 4 currently). If you give it only 2, for example, it will create two additional Basic ghosts for you.
This error:
Exception in thread "Thread-3" java.lang.RuntimeException: Invalid move.
    at pacman.Game.getNextState(Game.java:260)
    at player.DFSPacManPlayer.getBestMove(DFSPacManPlayer.java:79)
    at player.DFSPacManPlayer.getBestMove(DFSPacManPlayer.java:80)
    at player.DFSPacManPlayer.chooseMove(DFSPacManPlayer.java:29)
    at pacman.PlayerThread.run(Game.java:968)
The problem here is your call to game.getLegalPacManMoves() inside of the search. This gives the legal moves for the current state internal to the game, NOT the one you are considering in your search. Instead, the correct call should be to the static method that takes a state as a parameter: Game.getLegalPacManMoves(State s).
These algorithms are all deterministic, which means that given an input, you should be able to predict the output. Thus, the best way to test your agent -- as mentioned in the handout -- is to create a contrived input (state) for which you expect a particular output (move), pass it into chooseMove, and then see if you get it back. A framework has been provided for you in the player.TestSearchPlayer class.
As for correctness, my answer for your minimax question still applies.
The one I implemented seems to run without too many timeouts as deep as 9-10.
True, so don't worry too much about it...we care most about correctness, not efficiency. If your DFS player is getting somewhere between 6-12 plies deep, then you're probably fine -- just make sure it works properly and then perhaps consider working on efficiency later. That said, I can get 9-10 plies deep running on my 1.5GHz PowerPC G4 Powerbook, with 1GB RAM. Consider running java with the -Xms, -Xmn, and -Xmx options to change the VM's handling of heap memory and performance. See this link for info about tuning your VM.
This error:
Exception in thread "Thread-177" java.lang.RuntimeException: Can't project a final state.
    at pacman.Game.getNextState(Game.java:257)
is happening inside of getNextState(), and it happens when you pass a final state into getNextState(). I'd suggest refactoring your code such that it won't be calling getNextState() on a final state.
Do it with getNextState() first, then if you have time, do an "extension" that uses getProjectedStates().
| Homework 1 |
How about, if you think it's not complete, give me an example of a problem where it will find the solution anyway.
Don't worry too much, we plan to be generous on grading. There is partial credit given to the individual parts of each question. If you miss one part of one question, it shouldn't hurt you too much. The grading will be based more on correctness than effort though.
Don't overthink this problem. There's always a simple path cost.
Suppose you were doing an "easy" Sudoku problem by hand. You probably wouldn't have to do any backtracking. That is, once you've decided (somehow) that there's only one remaining possible value for some square, you just put that value in the square and don't have to change it again, right? Well, what technique did you use? I promise that it's one of the ones we talked about in class!
OK, here's an example of a space in which hill-climbing wouldn't be useful, just to get you thinking. What if I told you about a lottery game where you guess numbers between 1 and 1,000,000, and if you guess the right number I give you $100. If you guess wrong I give you a random amount of money between $0.01 and $0.05. Clearly hill-climbing in the space of possible numbers, based on the random amounts I'm giving you, wouldn't be useful - it would be equally valuable to just wander around. Can you generalize this?
For example, if you formalize something as a search problem you need a start state, a successor state function, a goal test and a path cost. To formalize something as a CSP you need a set of variables, a domain for each, and a set of constraints. You don't have to use formal logic or anything, but you probably need to use some sort of mathematical notation: variables, equals/not equals, etc.
Fair enough, this is sort of vague. I mean, suppose you were willing to be a little more flexible about the problem. Instead of "disallowing" certain people from sitting next to each other, suppose you just "disprefer" it, so that some seating charts are better than others, but none are completely impossible.
| Prerequisites |
If you have no programming experience whatsoever, then perhaps you should check out CS 106A. If you have some experience with procedural programming languages (C/++/#, Java, Perl, Python, etc.) and familiarity with object oriented design, then you should be fine, provided you are willing to put in the time to learn and adjust. In particular, if you're a C++ expert, I think you'll find the transition to Java fairly painless. To help you along, check out the Sun tutorial, visit the class websites for CS 106A, CS 106B, and CS 108 and download handouts, search Amazon or the bookstore for good references, and do some Google searches like "java tutorial," "java for c++ programmer," etc. If you've done Java before but not 5.0, visit or watch Friday's section. Also see the links at the bottom of the CS 121 index page.
Yes, you may program on the machine of your choice, but at least for now, you'll need to access your leland account on a Sweet Hall machine to download and submit your code, using some combination of ssh, scp, ftp, etc. On Mac OS X and Linux machines, this is fairly simple (using a terminal). On WinXP machines, this may require installing PuTTY or Cygwin. If you're really ambitious, you could set up a CVS account in your afs space and then use Eclipse to access it. :)
As for an IDE, it's up to you, depending on the hardware/OS platform you choose to use. That said, we recommend you consider using Eclipse as your IDE, even if you never have before. It makes programming in Java a billion times easier and more elegant! If nothing else, it compiles as you code, speeding up build-time and detecting syntax errors immediately.
If you have ever done math or programmed a computer, then you have already used propositional logic (PL). It's an artificial language (or formal system) used to describe a limited world. It has constants, variables, functions, relations, etc. It's very simple. The R&N textbook can introduce you to both (ch. 7-8). If you want another physical reference, check out Language, Proof, and Logic; they have it at the library and the bookstore.
If you have ever calculated odds, ratios, or chances (of a hand in poker, flipping coins, etc.), then you've used probability theory. Again, R&N can introduce you to it (ch. 13). If you want another physical reference, the textbook for Stanford's PT intro class is "A First Course in Probability" by Ross. Furthermore, check out the course website for MS&E 120.
You just need to know basic multivariate algebra, and very little of it...you know, vectors, matrices, systems of equations, etc. You'll be fine.
| Comments to CS 121 Staff |