My favorite thing about this project is that it started out as very culturally specific — I wanted to make an agent that played a game I played growing up. But when I moved to the US for school and started telling people about this game, it turned out a variation of it exists almost everywhere in the world.
More people than I expected will walk away with the satisfaction of knowing the optimal moves for this game we all grew up playing.
You can check out the repository here.
High Level Overview
Pusoy Dos is a four player card game in which players race to discard all of their cards. A player takes their turn by playing a combination of cards of greater value than the move preceding them, or by passing.
The goal of the model: given a particular state, determine the best possible action. This means that the model should know the best move to play given any situation.
The state that an agent has access to consists of the following parameters. In parentheses is the number of boolean variables needed to encode each parameter in a one-hot fashion.
- (5) - round type (singles, doubles, triples, hands, or pending)
- (5) - hand type (straight, flush, full house, four of a kind, and straight flush - if it is not a hands round, all of these are zero)
- (4) - who the current player is
- (4) - who the previous player was
- (52) - the list of cards the player currently has
- (52) - the list of cards that was just previously played
- (204) - all cards previously played by each of the four players (52 per player), ordered from player 0 to player 3.
Note that an agent does not have access to the cards that the other players currently have.
There are tons of possible actions that a player could take in a game of Pusoy Dos. You’d have to consider each possible combination of cards one could play — a combinatorial number of actions. Is there any way to produce a smarter, more efficient output vector?
The output vector of my prediction model has 62 logits:
- The first 52 correspond to each of the 52 cards.
- The next 5 correspond to the four round types (singles, doubles, triples, hands) plus passing.
- The last 5 correspond to the different types of hands (straight, flush, full house, four of a kind, and straight flush).
An output vector can be parsed in the following way:
- Go through all of the valid round types from highest to lowest probability. For example, an output vector might indicate a preference for playing a pair, followed by a triple, followed by passing, etc. If it is not the beginning of the round, the only valid moves are to pass or to play the type of move of that round.
- Find the action that corresponds to the valid cards with the highest probability. For example, if I wanted to play pairs, I would look at all of the possible pairs I could play with the cards I have, then choose the pair with the highest assigned probability.
- If I produce a valid move, the model plays that move.
- If it does not, move on to the next possible move.
- There always exists a valid move: passing is always valid, except at the beginning of a round, in which case playing a single is always possible.
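The steps above can be sketched in code. Here `candidate_moves(round_type, hand)` is a hypothetical helper that enumerates the legal card combinations of a given type from the player's hand, and scoring a combination by summing its card logits is my own simplification of "highest assigned probability":

```python
SINGLES, DOUBLES, TRIPLES, HANDS, PASS = range(5)

def pick_move(logits, hand, valid_round_types, candidate_moves):
    # logits: 62 floats (52 cards, 5 round types incl. passing, 5 hand types)
    card_logits = logits[:52]
    round_logits = logits[52:57]
    # Try the valid round types from highest to lowest logit.
    ranked = sorted(valid_round_types, key=lambda rt: round_logits[rt], reverse=True)
    for rt in ranked:
        if rt == PASS:
            return ("pass", [])
        # Among all legal combinations of this type, take the one whose
        # cards the model assigned the highest total score.
        best, best_score = None, float("-inf")
        for combo in candidate_moves(rt, hand):
            score = sum(card_logits[c] for c in combo)
            if score > best_score:
                best, best_score = combo, score
        if best is not None:
            return (rt, best)
    return ("pass", [])  # passing is always a valid fallback
```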
All of the models used are neural in nature, and generally follow a D2RL architecture — a feed-forward network where at every level, the network is re-fed the inputs.
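A minimal NumPy sketch of the D2RL idea — every hidden layer after the first sees the previous activation concatenated with the original input (the dimensions and initialization here are illustrative, not the repo's actual code):

```python
import numpy as np

def init_d2rl(in_dim, hidden_dim, n_layers, rng):
    # Layer 0 sees only the input; later layers see [hidden, input].
    weights, biases = [], []
    for i in range(n_layers):
        d_in = in_dim if i == 0 else hidden_dim + in_dim
        weights.append(rng.standard_normal((d_in, hidden_dim)) * 0.01)
        biases.append(np.zeros(hidden_dim))
    return weights, biases

def d2rl_forward(x, weights, biases):
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        inp = h if i == 0 else np.concatenate([h, x])  # re-feed the input
        h = np.maximum(0.0, inp @ W + b)               # ReLU
    return h
```

A final linear head mapping the last hidden activation to the 62 output logits would sit on top of this.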
There are a few hyperparameters that I tweaked:
- Hidden dimension, number of layers — the exact dimensions of the model’s internals.
- Actor vs. actor-critic models — Actor-critic models have two components: an actor that determines the best move given an input state, and a critic that evaluates the actor’s performance. The difference between the two setups is that the actor-only model is supervised directly by the game’s results, while in the actor-critic model the actor is supervised by the critic, which in turn is supervised by the game’s results.
Here is a summary of how the model was trained. An “epoch” can be thought of as a round or a batch of training.
- Every epoch, the model plays 20 rounds of Pusoy Dos against previous iterations of the model. We store everything that the winners did, and everything that the losers did.
- The training loop randomly selects 80 rounds from the previous 10 epochs, favoring rounds from later batches. It then encourages the model to increase its probability of doing what the winners did, and decrease its probability of doing what the losers did.
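The sampling step might look like this sketch, where the recency weighting is a simple linear ramp (the exact weighting scheme is my assumption):

```python
import random

def sample_rounds(epoch_buffer, k=80, window=10):
    # epoch_buffer: list of epochs, each a list of recorded rounds (oldest first).
    recent = epoch_buffer[-window:]
    pool, weights = [], []
    for age, rounds in enumerate(recent, start=1):
        for r in rounds:
            pool.append(r)
            weights.append(age)  # rounds from later epochs get larger weight
    # random.choices samples with replacement, which is fine for a sketch.
    return random.choices(pool, weights=weights, k=min(k, len(pool)))
```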
The model is updated using Proximal Policy Optimization (which I implemented from scratch). PPO constrains each update so that the model learns quickly enough, but does not make any jarring or sudden adjustments that might throw the training process out of control. I tried a bunch of different learning rates.
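The heart of PPO is its clipped surrogate objective, which reduces to a few lines in a from-scratch version. Here is a NumPy sketch of the per-batch loss, with ε = 0.2 as a typical clipping range; in the setup described above, the advantages could be as simple as +1 for winners' actions and −1 for losers' actions (that mapping is my assumption):

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    # Probability ratio between the updated policy and the policy
    # that originally generated the actions.
    ratio = np.exp(new_logp - old_logp)
    # Clipping the ratio keeps each update small, preventing the
    # "jarring adjustments" that destabilize training.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # We maximize the surrogate objective, so the loss is its negation.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the new and old policies agree exactly, the ratio is 1 everywhere and clipping has no effect; the gradient only shrinks once the policy drifts outside the (1 − ε, 1 + ε) band.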
I am still in the process of training and evaluating, so hang tight!