Move Ordering

Improving move ordering has been on my radar for a while now. I started to suspect that move ordering needed some work when my initial attempts at a PVS search and aspiration windows both failed. I reasoned that, if move ordering is subpar, re-searches would occur too often, causing an overall increase in node counts.

To know, you have to measure, so I added some data to the logfiles and wrote a Python script to aggregate (1) time to depth, (2) effective branching factor, and (3) the percentage of fail-high nodes in which the cutoff came on the first move, and within the first four moves.
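
Collecting the raw fail-high statistics only takes a few counters in the search's cutoff path. Here's a minimal sketch with hypothetical counter and variable names (not the actual Prophet4 logging code):

    // inside the search's move loop, after searching move number moveIndex (0-based)
    if (val >= beta) {
        failHighNodes++;                             // this node failed high
        if (moveIndex == 0) failHighOnFirstMove++;   // ...on the very first move
        if (moveIndex < 4)  failHighInFirstFour++;   // ...within the first four moves
    }

Dividing the last two counters by the first gives the two ordering percentages; the closer they are to 100%, the better the move ordering.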

Once I was able to measure, I changed the move ordering scheme FROM:

PV move -> Hash move -> All captures in MVV/LVA order -> Killer 1 -> Killer 2 -> noncaptures in the order they are generated

TO:

PV move -> Hash Move -> “Good captures” in MVV/LVA order -> Killer 1 -> Killer 2 -> noncaptures -> bad captures in SEE order.

A “good capture” is a promotion, a capture in which the value of the captured piece is at least that of the capturing piece, or a capture with a non-negative SEE value.
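
As a sketch of that classification (using hypothetical helpers like pieceValue() and see(), not the actual Prophet4/chess4j functions), it might look like this:

    // Classify a capture as "good" per the rules above.
    boolean isGoodCapture(Position pos, Move mv) {
        if (mv.isPromotion()) {
            return true;
        }
        // winning or equal capture by material count alone
        if (pieceValue(mv.capturedPiece()) >= pieceValue(mv.movingPiece())) {
            return true;
        }
        // otherwise, ask the static exchange evaluator
        return see(pos, mv) >= 0;
    }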

Incidentally, while researching this I came across a post I myself had made many years ago: http://talkchess.com/forum3/viewtopic.php?f=7&t=15198&hilit=defer+losing+captures&sid=5543f43760ce939ed16deacd59710e06

The change doesn’t seem to add more than a couple of Elo directly. However, in non-tactical test suites all the measured metrics improved: time to solution is down, the effective branching factor is lower, the percentage of fail highs on the first move is improved, and the percentage of fail highs within the first four moves is dramatically improved. Additionally, the number of nodes required to reach a solution is cut by about a third, at the cost of only a 7% or so decrease in speed (nodes per second). I believe this sets the stage to take another stab at PVS and aspiration windows. First, though, I’m going to take a stab at pruning bad captures from the quiescence search.

One potential (probable) area of improvement: knights and bishops have slightly different values in my evaluation (the bishop being slightly more valuable). For the purpose of classifying a capture as “good,” they should probably be treated as equal.

New Testing Rig

Prophet4 finally has a proper testing rig! A few weeks ago, I purchased a Dell Alienware system — an 8 core (16 logical) AMD Ryzen 7 5800 with 32 GB of RAM and an AMD Radeon RX600XT graphics card.

This replaces the single core laptop I have been using. This is pretty exciting as it will allow P4 testing to go at 8x the speed it did before. Just to break the machine in, I ran the first ever gauntlet with P4. Here are the results:

Rank  Name            Elo     Games    Score    Draws
1     plisk 0.2.7d    103     29785    66%      19%
2     tjchess 1.3     81      29787    63%      22%
3     jazz 840        52      29786    58%      21%
4     myrddin 0.87    44      29786    57%      21%
5     Horizon 4.4     15      29785    52%      19%
6     jumbo 0.4.17    -5      29783    49%      21%
7     p3-20181124     -8      29786    48%      25%
8     p4-20210407     -86     17211    36%      32%
9     prophet2_ja     -94     16000    35%      18%
10    tcb 0052        -182    29785    23%      15%

This closely matches the results I obtained a few years ago when I announced the release of Prophet3 20180811. One notable exception is TCB – it seems to have done much worse on this machine, for whatever reason.

So, it seems P4 is already on par with P2, and is within 78 Elo or so of P3. This is encouraging, as the P4 rewrite is still not complete. I’m feeling pretty confident that when it is, it will be at least as strong as P3, and then the real work of improving can begin. And, with the fast feedback from a proper testing rig, I don’t think that should be all that difficult. My goal is to achieve +100 Elo over P3 before doing a release.

Note– the versions of the chess engines used in this gauntlet are quite old by now; many of them likely have updates that could be significantly stronger. During the rewrite process, I’ve tried to keep everything “as is” — the goal is to compare P4 to P3, not to other engines. Once the rewrite is complete and P4 grows in strength, new testing engines will be cycled in as the current set is cycled out.

Extending Lines

“Tree shaping” is an important component of a strong chess program. There are various ways to shape a search tree. One is by extending lines that seem interesting in some way; particularly those that cannot be fully resolved within the search horizon (see https://en.wikipedia.org/wiki/Horizon_effect). Other methods of shaping the search tree include reducing or even pruning lines that seem less promising. The quiescence search is also a form of tree shaping, as the search becomes more selective in which nodes it expands – typically just captures or captures + checks and check evasions.

One of the simplest, and perhaps the most effective, extensions is the “check extension.” Whenever a move gives check to the enemy king, the search depth is increased by one. That is, instead of this:

apply_move(pos, move)
val = -search(pos, depth-1, -beta, -alpha)

Do this:

apply_move(pos, move)
bool gives_check = in_check(pos)
int ext = gives_check ? 1 : 0
val = -search(pos, depth-1+ext, -beta, -alpha)

Disclaimer: that is probably a little too naive, as it doesn’t really guarantee the line will terminate. There are surely positions that would cause the extension as written above to explode the search tree. But, on average, it’s a win.
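
A common safeguard – just a sketch of the idea, not something Prophet4 does yet – is to cap the number of extensions a single line can accumulate, so a long series of checks can’t push the search out indefinitely:

apply_move(pos, move)
bool gives_check = in_check(pos)
// extensions_used would be carried down the search; MAX_EXTENSIONS is a
// hypothetical per-line cap, not an actual Prophet4 constant
int ext = (gives_check && extensions_used < MAX_EXTENSIONS) ? 1 : 0
if (ext) extensions_used = extensions_used + 1
val = -search(pos, depth-1+ext, -beta, -alpha)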

I implemented the check extension in Prophet4 and indeed it seems to be a big win.

Wins    Losses  Draws   Pct     Elo       Error
331     183     402     58.1%   56.6      16.8
Prophet4 with check extension vs without, 5+0.25

In an effort to add some sort of terminating criterion, I also tried limiting the check extension to capturing moves, and while still a win, it didn’t work out as well.

Wins    Losses  Draws   Pct     Elo       Error
3206    3119    4505    50.4%   2.8       5.0
P4 with capture only check ext vs without, 5+0.25

I also tried another extension entirely – extending promoting moves. However, it had no measurable effect at all and may even have been a small loss, so I abandoned it.

After settling on the “naive check extension” I measured P4’s standing against P3.

Wins    Losses  Draws   Pct     Elo       Error
313     929     758     34.6%   -110.6    12.2
Prophet4 vs Prophet3, 10+0.5

The last time I measured, P4 was -164.29 vs P3, so it’s definitely gained some ground.

Next on the list is to take a careful look at move ordering before moving onto another form of tree shaping – reductions. I expect Late Move Reductions (LMR) in particular to give another large jump. But, the algorithm is very sensitive to good move ordering, so it’s worth taking some time to study that first.

Validations – and surprises

I’ve been pretty busy as of late, but in an effort to get some momentum going I picked some items off my to-do list that would take far more processor time than programming time – validating some evaluation terms that have been part of Prophet for many years now. The goal, really, was to confirm what I was sure was true: that these evaluation terms do help, and that removing them would surely hurt performance. As you’ll see, it didn’t quite work out that way.

Here’s a list of evaluation terms that were tested, with results below.

  • Knight tropism – the idea that keeping your knight as close as possible to the enemy’s king is generally a good thing.
  • Rooks on open files, or even half open files.
  • Passed pawns – pawns with no enemy pawn in front or on an adjacent file should be rewarded, since the likelihood of promotion increases dramatically.
  • Isolated pawns – pawns with no friendly pawns on an adjacent file should be penalized, since they are weakened considerably without the supporting pawns.
  • Doubled pawns – pawns that occupy the same file as another friendly pawn are a (small) positional weakness.
  • Major pieces (rooks, queens) on the 7th rank with the enemy king on the back rank are usually very deadly, especially when “connected.”

Knight Tropism

This term works by penalizing a knight by 2 x distance_from_enemy_king centipawns. Distance from the enemy king is the max of difference in ranks and difference in files.
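
The distance calculation is just the Chebyshev distance between the two squares. A minimal sketch, assuming squares numbered 0–63 with rank = square / 8 and file = square % 8 (an illustrative layout, not necessarily Prophet’s):

    // penalty of 2 centipawns per square of (Chebyshev) distance from the enemy king
    static int knightTropismPenalty(int knightSq, int enemyKingSq) {
        int rankDist = Math.abs(knightSq / 8 - enemyKingSq / 8);
        int fileDist = Math.abs(knightSq % 8 - enemyKingSq % 8);
        return 2 * Math.max(rankDist, fileDist);   // subtracted from the knight's score
    }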

I actually started out running this one with chess4j, the Java engine, before deciding to use Prophet4 going forward. As explained in a previous post, the overhead of restarting the JVM between matches is just too high, and not restarting the engine between games isn’t great either, so doing these fast tests with a lightweight executable that can be quickly restarted seems preferable. At any rate, the outcome was nearly the same so I’ve combined the results into one table.

Also, since these are validations of existing terms, player A was the “control” player, and player B the “without” player. The hypothesis is that “Player A is better than Player B.”

Wins    Losses  Draws   Pct     Elo       Error
6312    5982    7674    50.8%   5.7       3.8
2343    2122    2766    51.5%   10.6      6.3
5942    5789    8269    50.4%   2.7       3.7
Control vs “without knight tropism”, 5+0.25

As you can see, the knight tropism term is worth a few Elo. Not a game wrecker by any means, but it does have a small positive effect. We’ll check this one off the list.

Rooks on Open Files

Rooks on open files, or even half-open files, are known to be a strategic advantage. An open file allows the player to move their rooks around easily, projecting strength and penetrating into the opponent’s position.

Rooks on files with no other pieces are given a 25 centipawn bonus. Rooks on files with just enemy pieces are given a 15 centipawn bonus.
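
Reading “pieces” above as pawns (the usual definition of open and half-open files), a bitboard sketch might look like the following. The file-mask constant and square layout are illustrative assumptions, not Prophet’s actual data structures:

    static final long FILE_A = 0x0101010101010101L;   // one bit per rank on the a-file

    // 25 for an open file (no pawns), 15 for a half-open file (enemy pawns only)
    static int rookFileBonus(int rookSq, long friendlyPawns, long enemyPawns) {
        long fileMask = FILE_A << (rookSq % 8);
        if ((fileMask & (friendlyPawns | enemyPawns)) == 0) return 25;
        if ((fileMask & friendlyPawns) == 0) return 15;
        return 0;
    }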

Wins    Losses  Draws   Pct     Elo       Error
573     411     717     54.7%   33.2      12.6
Control vs “No rook on open files”, 5+0.25

This term is obviously doing the job. Check this one off the list.

Passed Pawns

Passed pawns tend to become promoted pawns. Passed pawns are awarded 20 centipawns.
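
Detection can be done with masks of the squares in front of the pawn on its own and adjacent files. A sketch for a white pawn, reusing the FILE_A constant and square layout assumed in the previous sketch:

    // true if no black pawn sits ahead of the white pawn on its own or an adjacent file
    static boolean isPassedWhitePawn(int pawnSq, long blackPawns) {
        int file = pawnSq % 8;
        long files = FILE_A << file;                     // pawn's own file
        if (file > 0) files |= FILE_A << (file - 1);     // adjacent file to the left
        if (file < 7) files |= FILE_A << (file + 1);     // adjacent file to the right
        long inFront = ~0L << (8 * (pawnSq / 8 + 1));    // all squares on higher ranks
        return (files & inFront & blackPawns) == 0;
    }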

Wins    Losses  Draws   Pct     Elo       Error
2143    1930    2656    51.6%   11.0      6.5
Control vs “no passed pawn bonus”, 5+0.25

Another one validated. That’s not to say it’s tuned correctly, but at least we can say it does help.

Isolated Pawns

Isolated pawns don’t have any friendly pawns on adjacent files to give them any support. They are a weakness. In Prophet and chess4j, isolated pawns are penalized 20 centipawns.

Wins    Losses  Draws   Pct     Elo       Error
502     641     741     46.3%   -25.7     12.2
1385    1532    1847    48.5%   -10.7     7.7
Control vs “no isolated pawn penalty”, 5+0.25

That is NOT a good result! I was in such disbelief after the first test that I ran a second test, and though the result doesn’t seem quite as bad, it’s still bad. As it stands, the isolated pawn penalty is hurting. I haven’t disabled it, because the major focus right now is rewriting the engine before improving it, and I want to be able to compare apples to apples after the rewrite. However, I have put an item on the backlog to study this in more detail. The heuristic should work. Either the implementation isn’t quite right, or it’s too expensive, or the weights aren’t right. I’ll have to get to the bottom of this.

Doubled Pawns

Doubled pawns are also known to be a positional weakness. They are penalized 10 centipawns (note this penalty gets “awarded” to each pawn).

Wins    Losses  Draws   Pct     Elo       Error
593     722     950     47.2%   -19.8     10.9
858     1149    1516    45.9%   -28.8     8.7
Control vs “no doubled pawn penalty”, 5+0.25

Another surprising and disappointing result! Investigating this has also been added to the post-rewrite backlog.

Majors on 7th

This evaluation term awards rooks and queens on the 7th rank when the enemy king is on the back rank. If so, 50 centipawns are awarded. Additionally, if connected to another major piece on the 7th rank, an additional 80 centipawns are awarded – the idea being this is likely a deadly / mating attack.
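
A sketch of the term from white’s point of view. Whether the 50 centipawns are awarded per piece or per side isn’t spelled out above, so the per-piece reading here is an assumption, as are the helper names and square layout:

    // +50 per major on the 7th rank when the enemy king is on the back rank,
    // +80 more if two or more majors share the 7th ("connected")
    static int majorsOnSeventh(int[] whiteMajorSquares, int blackKingSq) {
        if (blackKingSq / 8 != 7) return 0;      // black king must be on the 8th rank
        int count = 0;
        for (int sq : whiteMajorSquares) {
            if (sq / 8 == 6) count++;            // white's 7th rank
        }
        int score = count * 50;
        if (count >= 2) score += 80;
        return score;
    }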

I can’t remember how long this term has been around, but it’s been a looooong time. Sadly, the results aren’t so good.

Wins    Losses  Draws   Pct     Elo       Error
1019    1137    1542    48.4%   -11.1     8.5
3178    3509    4483    48.5%   -10.3     5.0
Control vs “without majors on 7th”, 5+0.25

Clearly, not a good outcome! I wondered: what would happen if I disabled just the “connected” part but left the award for having a major piece on the 7th when the enemy king is on the 8th?

Wins    Losses  Draws   Pct     Elo       Error
5705    6024    8271    49.2%   -5.5      3.7
5691    5973    8336    49.3%   -4.9      3.7
Control vs “without CONNECTED term”, 5+0.25

Disabling the “connected” bit may help a little, but still, the heuristic is hurting overall, not helping.

Conclusion

Out of the six evaluation terms to be tested / validated, three of them were found to be helpful and three are actually harmful. The “majors on 7th”, doubled pawn, and isolated pawn heuristics have all been added to the post-rewrite backlog for further study.

The moral of the story here is: do NOT assume anything. Always test! This has been my philosophy for a good while now, but these eval terms were all added at a time when I wasn’t as rigorous in my testing as I am now. (And, frankly, I don’t think many in the computer chess community were, before 10-15 years ago.)

If there is any upshot, it’s that there is a guaranteed strength improvement to be had, even if it’s just removing the terms altogether. But, since I know they really should work, I’ve opted to leave them for now. Also, when the Prophet4 rewrite is complete, I really want to be able to run some benchmarks against Prophet3 and be able to make comparisons. The rewritten codebase should “stand on its own.” Improving the evaluator now would cast some doubt on those comparisons.

Pragmatic Thinking and Learning: Refactor your Wetware

Pragmatic Thinking and Learning: Refactor your Wetware, by Andy Hunt, was published in 2008 and for whatever reason has been sitting on my bookshelf for several years. I probably bought it based on a comment I read or an online review; I don’t remember, but apparently I never made the time to read it – until now.

The premise of the book is to teach one how to — think. And learn. It’s targeted primarily towards software developers, but anyone wanting tips / tricks / advice / insight on how to think critically, and problem solve more effectively would benefit from Andy’s work. There are several metaphors likening the brain to hardware and software that programmers would be familiar with, but the book is really about cognitive science.

Some of my takeaways:

  • Software is created in your head, not your IDE. This makes it appropriate in my view to schedule time to do nothing but think. “Thinking walks” are encouraged, or even kicking your feet up on your desk. This may look like goofing off to non-programmers, and even programmers that are of the mindset that “work means your hands are on the keyboard,” but thinking is an important part of the craft.
  • A corollary to that is, it’s important to invest in your skills. It should be OK to incorporate that into your work schedule – in moderation, of course. Do you expect a doctor to keep up with important advances in medicine? Would you expect him to do that after hours? What about a lawyer? The same holds true for any knowledge worker. There is an unnecessary stigma around reading / learning at work. We need to keep the sword sharp to be effective in the long term. The book points out the need to be deliberate about your learning. It doesn’t happen by accident, and if you relegate it to “when I have time,” you never will. Create a plan and a schedule.
  • R-mode thinking (traditionally called “right brain”) is the more creative side of our brains, and where intuition comes from. It is often underutilized. You can’t force it, but you can foster it. To me, this is a good reason to play chess! Chess exercises both sides of the brain – L-mode (left brain) for analysis / calculation, and R-mode (right brain) for pattern matching and intuition. I’m going to try 10-15 minutes of puzzle solving each morning as I drink my coffee to get the juices flowing.
  • Mind maps! I admit I rarely ever use them, but the book makes a strong case for using them as a tool to organize your thoughts. Apparently kids in Europe are taught this at a young age.
  • Oftentimes, the real value in documentation isn’t the documentation itself, it’s the act of documenting. This is similar to an observation I’ve already made when preparing to give a lesson, sermon, or a talk. I almost always prepare notes, but oftentimes don’t reference them very much.
  • There is no substitute for experience. Don’t be afraid to jump in and try things. Make it safe to fail. Sometimes it’s better to play first and RTM later, once you’ve gained some context.
  • Pressure kills creativity. I’ve always hated deadlines – particularly artificial ones (ones created internally). I think they are usually counterproductive. The book validates my thoughts on this. Pressure (such as deadline pressure) encourages shortcuts and errors. It forces you to get something done, at all costs. The pressure also has negative long-term effects on your health. Avoid deadlines as much as possible. Sometimes it’s a trust issue.
  • So many people sit in a chair for eight hours a day (or more), but that doesn’t mean they are productive the entire time (and likely are not). This is assembly line thinking. When it comes to our work time, we need to emphasize quality over quantity. For knowledge workers, it’s important to have blocks of uninterrupted time to get into a flow. Personally, I’m best in the morning after having that first cup of coffee. Don’t sit like a rat looking for a pellet in front of your email – it is about the most unproductive thing you can possibly do. Shut down the email client, turn off notifications, and just get into a flow. Email, conversations with co-workers, and especially meetings are expensive. That’s not to say they aren’t important (well, sometimes), but they can almost always wait. In other words — schedule your interruptions so you can be more productive.

One of the things that really struck a chord with me was the notion of a Pragmatic Investment Plan. Andy introduces the notion of a “knowledge portfolio,” and likens it to a financial portfolio. It’s important to have a goal, with short term and long term objectives to reach that goal. The plan may change, and that’s OK, but always have a plan. Diversify your investment – it’s important to not put all your eggs in one basket. And, invest actively. That is, stop and reevaluate your portfolio from time to time to see if your objectives need to change. I have recently set a goal of becoming more proficient in the area of Machine Learning. This is an area I did some work in during my grad school days but not a lot since, so there is a considerable learning curve in front of me. Andy’s work has convinced me of how important it is to be purposeful about creating objectives towards that goal.

All in all, a great read, and I’ve gotten a lot of practical…. err, pragmatic tips from it. I’m glad I finally got around to it.

PVS – Another Fast Fail

After the failure with aspiration windows I decided to implement PVS. The idea here is that, with proper move ordering, the first move is usually the best. That being the case, it makes sense to search all of the moves after the first with a minimal alpha/beta window, the goal being to prove or disprove that the first move was best rather than to get a usable score. If the first move is in fact best, you’ve saved some time. If not, you have to re-search with the “regular” alpha/beta window.
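
In (rough) code, the move loop of a PVS search looks something like the sketch below. The search(), applyMove() and undoMove() calls are hypothetical stand-ins, not the actual chess4j routines:

    boolean firstMove = true;
    for (Move mv : orderedMoves) {
        applyMove(board, mv);
        int val;
        if (firstMove) {
            // the first move gets a full-window search
            val = -search(board, depth - 1, -beta, -alpha);
            firstMove = false;
        } else {
            // remaining moves: try to prove they are no better than the first
            val = -search(board, depth - 1, -alpha - 1, -alpha);
            if (val > alpha && val < beta) {
                // the proof failed; re-search with the regular window
                val = -search(board, depth - 1, -beta, -alpha);
            }
        }
        undoMove(board, mv);
        if (val >= beta) return beta;     // fail high
        if (val > alpha) alpha = val;     // new best move
    }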

Wins    Losses  Draws   Pct     Elo       Error
351     487     598     45.3%   -33.0     13.7
533     670     796     46.6%   -23.8     11.8
With PVS vs without, 5+0.25

This result is a little surprising and depressing. I’m not going to give up on PVS, but clearly the time isn’t quite right for it. I’ll have to revisit it later. This leads me to believe I should take a step back and look at move ordering. It seems to me that if PVS is a loss, that could mean the “first move is best” assertion is wrong more often than we’d like, meaning move ordering is not where it should be. There are some tasks on my board related to move ordering, such as deferring losing captures until after non-captures and implementing a static exchange evaluator to aid in ordering the quiescence moves. I’ll leave the branch open and revisit this when move ordering has improved.

Aspiration Windows – a Fast Fail

Earlier this week I implemented aspiration windows in chess4j. It’s a very simple concept. In the iterative deepening loop, the depth 1 iteration is processed as usual, with an alpha bound of “negative infinity” and a beta bound of “infinity.” On subsequent iterations, the alpha bound is initialized to the previous score minus some margin (1/3 pawn in this case), and the beta bound is initialized to the previous score plus some margin (again, 1/3 pawn). If all goes well and the score is within those bounds, the search should finish faster, as there are more cutoffs. However, if the real score is not within that search window, a re-search must be done.

My intention was to start with the 1/3 pawn windows, falling back to a “wide open” search on a fail. Once testing confirmed that was a win, I could play with the margins or incrementally expand the window. Unfortunately, this was a pretty fast fail.
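
A sketch of that strategy in the iterative deepening loop (search() and INFINITY are illustrative stand-ins, not the actual chess4j code):

    final int WINDOW = 33;   // roughly 1/3 of a pawn, in centipawns

    int score = search(board, 1, -INFINITY, INFINITY);   // depth 1: wide open
    for (int depth = 2; depth <= maxDepth; depth++) {
        int alpha = score - WINDOW;
        int beta  = score + WINDOW;
        score = search(board, depth, alpha, beta);
        if (score <= alpha || score >= beta) {
            // the true score fell outside the window; re-search wide open
            score = search(board, depth, -INFINITY, INFINITY);
        }
    }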

My first run was terminated after 2002 games, and the second after 2965 games.

Wins    Losses  Draws   Pct     Elo       Error
549     689     763     46.5%   -24.5     12.0
886     1026    1053    47.6%   -16.4     10.0
With aspiration windows vs. without, 5+0.25

I did expect this to be a win, so I’m a little puzzled by the result, but that’s why we test. Perhaps when the search matures a little more and can hit deeper depths, the scores between iterations will stabilize? I’ll just queue this up to revisit later on.

On the plus side, SPRT saved a lot of time by terminating a 20,000 game match after 2-3k games. That’s a big time saver.

On to PVS.

Stopping at half time – and on making small changes

Great things are done by a series of small things brought together. — Vincent Van Gogh.

The last few changes — adding null move pruning, hashing, and a quiescence search — have had major impacts on playing strength, which was expected. But inevitably, not all changes can be big winners, or even big losers. Some will have a very minor effect, either for the good or bad. The change I deal with in this post is one of those. The idea being tested was — should the search iterator be stopped early if, after completing an iteration, more than half the allotted time has been used? The rationale is that, due to the exponential growth of search trees, if you have less than half your time left when starting a new iteration, you’re unlikely to complete the iteration. Perhaps it would be better to save the time for future use. (This would not apply for games with a fixed time-per-move.)

The idea is so simple I hesitate to even show the (pseudo) code below:

        boolean stopSearching = false;
        do {
            ++depth;
            // DO SEARCH
            score = search.search(board, depth ...)
            if (search.isStopped()) {
                break;
            }
            long elapsed = System.currentTimeMillis() - startTime;
            // .... OTHER STOP CONDITIONS ....
            // if we've used more than half our time, don't start a new iteration
            if (earlyExitOk && !skipTimeChecks && (elapsed > maxTimeMs / 2)) {
                stopSearching = true;
            }
        } while (!stopSearching);

Up until this point I’ve been running 2,000 game self-play matches to gauge if a change is good or not. 2,000 games is enough to get an Elo difference with error bars of around 12-13 points, which has been “good enough.” So, my starting point here was the same – 2 matches, running concurrently on my 2 core testing machine, each match being 2000 games.

Wins    Losses  Draws   Pct     Elo       Error
634     589     777     51.1%   +7.82     11.90
634     614     752     50.5%   +3.47     12.02
Test 1 – 2 x 2000 game self play matches

Judging by the results of this test, it appears likely the change is a good change, albeit a minor one. The problem is the error bars. If the error bars are to be believed, it’s possible the change is actually neutral, or even a minor loss. The only way to tell would be to play more games – a lot more. People that have done the math state that to measure a 5 Elo change, you need something like 10,000 games. To measure a 2 Elo change requires 60,000 games! I decided I would run 2 20,000 game matches concurrently, for a combined 40,000 games. That should be enough to make a decision. The results of that experiment are below.

Wins    Losses  Draws   Pct     Elo       Error
6492    5782    7726    51.8%   +12.34    3.77
6334    6151    7515    50.5%   +3.18     3.80
Test 2 – 2 x 20,000 game self play matches

These results were very confusing. If you take the Elo for each match and use the error bars to create an interval, the two intervals don’t even overlap! I have enough of an appreciation of the statistics to know this is not impossible, but it seems extremely suspect. I ended up creating a post about it on a computer chess forum – http://talkchess.com/forum3/viewtopic.php?f=7&t=74685 . One poster, H.G. Muller, gave a possible explanation. Since I am not restarting the engines between matches (chess4j is a Java engine, and the JVM has a “warm up” cost that I’m avoiding by not restarting it), it’s possible that memory (cache) is being mapped in a way that is harmful to one process, and not restarting causes that condition to persist. I don’t know if that’s the case or not, but it does make some sense, and it raises the question of whether it really is OK to run two matches concurrently (or more generally, N matches on an N core machine, especially if you don’t restart between matches). I decided to run another test, but just one match this time, not two.

Wins    Losses  Draws   Pct     Elo       Error
6179    6137    7684    50.1%   +0.73     3.77
Test 3 – 1 x 20,000 game self play match

My conclusion is that running two matches concurrently is not as stable as I originally thought. Restarting between matches could help, but again, for a JVM engine that’s not very appealing either. I could switch to testing first with Prophet, as it’s written in C and could be restarted between matches, but one of the points of having a Java engine was quick prototyping. It’s also possible that a machine with more cores would resolve the issue, if at least one core is left free. I’ll revisit this in the future. For now, it’s one match at a time.

As far as the effect of the change on strength goes, it appears to be worth a very small but positive amount of Elo. I think it’s a keeper, but just barely. But running tens of thousands of games (especially when I’m only able to use a single core!) is not tenable. This would be a good time to look into some statistical methods. Many engine authors use the Sequential Probability Ratio Test (SPRT). If all you want to know is “is this engine better than that engine?”, and not necessarily the Elo difference between the two, SPRT could give a faster result.

The starting point for SPRT is to form a hypothesis. If player PA is my new version, and player PB is the current, best known version, my hypothesis is “PA is stronger than PB”. That is almost complete; that statement is really equivalent to “PA is stronger than PB by at least 1 Elo”.

You also need a “null hypothesis” – the hypothesis that there is no significant difference between players. My null hypothesis is “PA is NOT stronger than PB by at least 5 Elo points.”

Finally, since this is a statistical test you need to determine your tolerance for error. There are two types of errors to consider. One is the likelihood of accepting a change that is actually bad (a type I error – rejecting the null hypothesis when it is true). The other is the likelihood of rejecting a change that is actually good (a type II error). I decided to accept a 5% likelihood of each.

Putting all that together:

H1: PA is stronger than PB by at least 1 Elo.

H0: PA is not stronger than PB by at least 5 Elo.

alpha: 0.05

beta: 0.05

Note that SPRT does not necessarily mean the match will terminate early. You can read about the math behind it at https://en.wikipedia.org/wiki/Sequential_probability_ratio_test , but the idea is that, if either H1 or H0 is “accepted,” the match will terminate. If not, it will continue until the maximum number of games is reached.

As it turns out, the program I use to drive the testing process, Cutechess, has the capability to terminate a match using SPRT. I had to experiment with it a little bit; my first attempt ended in an 18,000 game match where H0 was accepted! – not good. But I had some parameters flipped, so the results are invalid. I decided it would be more productive to experiment with a very large and obvious change, like disabling null move. The correct flags are:

-sprt elo0=1 elo1=5 alpha=0.05 beta=0.05

Given the previous results, I am not going to repeat the experiment yet again to see if it would terminate early, as it’s very likely it wouldn’t and I’d end up losing a week of processor time. I will however be using SPRT going forward.

Since the change was a good one in chess4j I’ve ported it to Prophet4 as well, and for confirmation ran a 2,000 game P4 vs P3 match.

Wins    Losses  Draws   Pct     Elo       Error
233     1153    614     27.0%   -172.8    13.4
P4 vs P3, 2000 games 10.0+0.5

That is marginally worse than the last run by 1.0%, but well within the margin of error.

Edit: I ran the P4 vs P3 match again out of curiosity:

Wins    Losses  Draws   Pct     Elo       Error
250     1139    611     27.8%   -166.0    13.4
P4 vs P3, 2000 games 10.0+0.5

Once again, the error bars are large enough in a 2000 game match that there is a lot of variability between runs, so I’m not making decisions based on these results, but I do like to see how P4 is measuring up against P3 as changes are made.

Next up – aspiration windows.

Null move pruning

The null move heuristic has been added and yielded another pretty solid increase in Elo.

For the uninitiated, a “null move” isn’t a real over-the-board move of course, it’s just a heuristic that can be used to terminate a line of search early, before any moves are even generated. The heuristic is based on the “Null Move Observation,” which says that it is almost always better to do something rather than nothing.

The basic idea is to actually try “doing nothing” and see if that would allow the opponent to improve their position. If they can’t, odds are that when you really do make a move your position will be too good (the opponent will do something else which prevents this entire line anyway). An example might be capturing the opponent’s queen with a pawn, leaving none of my pieces hanging (en prise). No matter what the opponent does in response, he’s probably losing, so he’ll make a move which avoids leaving his queen hanging in the first place.

For this to actually save time, the search that is done after “doing nothing” (allowing the opponent a second move) is done to a reduced depth. How much to reduce is the big question – too much and you will make some very unsafe and risky cutoffs. Too little and you’re not really saving much time. A very conservative value would be R=2 or R=3, but with increasing processor speeds and greater and greater search depths many programmers push this value to extremes. It should all be backed by testing. Currently I am using R=3 but I do plan to revisit this in the future.

I am also reducing that reduction towards the leaves to ensure there is at least one ply of full width depth search remaining before dropping into the quiescence search. The idea here is to give the search an opportunity to resolve checks (the qsearch is currently capture only). My programs also avoid using this heuristic in positions that are likely zugzwang, meaning any move could potentially be harmful (which violates the Null Move Observation the heuristic is based on).
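
Putting those pieces together, a null move pruning sketch might look like this. The helper names and the exact way the reduction is trimmed near the leaves are illustrative, not the actual Prophet4 code:

    final int R = 3;   // base null move reduction

    if (!inCheck(pos) && !zugzwangLikely(pos)) {
        int r = Math.min(R, depth - 2);     // keep at least one ply of full-width search
        if (r > 0) {
            applyNullMove(pos);             // give the opponent a free move
            int nullVal = -search(pos, depth - 1 - r, -beta, -beta + 1);
            undoNullMove(pos);
            if (nullVal >= beta) {
                return beta;                // "doing nothing" already refutes this line
            }
        }
    }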

You’d have to know a little about the alpha-beta algorithm for a more technical description to make sense, but there’s really no reason for me to do that when you can read all about it on CPW – https://www.chessprogramming.org/Null_Move_Pruning or on this old-but-excellent description by Bruce Moreland – http://web.archive.org/web/20070708232843/http://www.brucemo.com/compchess/programming/nullmove.htm .

The important question – how much is it worth?

Wins    Losses  Draws   Pct     Elo       Error
780     516     704     56.6%   +46.13    +/- 12.29
812     504     684     57.7%   +53.93    +/- 12.41
chess4j self play match, with null move vs. without, 5+0.25

… and …

Wins    Losses  Draws   Pct     Elo       Error
237     1118    645     28.0%   -164.29   +/- 13.13
Prophet 4 vs Prophet 3, 10+0.5

So once again a pretty large jump, and of course I’m very pleased. Prophet4 is now within 200 Elo of Prophet3, and there is still a pretty good list of enhancements to implement.

The next four planned changes are going to be smaller and perhaps more difficult to measure: (1) stopping the iterator early if less than half the allotted time remains, (2) proper treatment of mate scores in hash (rather than the “cheat” that’s in place now), (3) aspiration windows within the iterative deepening function, and (4) re-examination of the three-fold rep: is a two-fold repetition good enough? I don’t expect any of those will be big changes on their own, but I’m hoping that cumulatively they’ll be worth a few points. But, you never know, which is why we test.

Hashing

Both chess4j and Prophet now have hashing. Maybe I should say that chess4j has hashing again, because it did have hashing before as I wrote about here. As stated in a previous post, I basically “tore down” the search in chess4j in order to build it back up alongside P4, so I could test equivalence as I go.

The replacement strategy is very simple and certainly an area for future work. For every single node in the full width search where moves are expanded and searched (so these wouldn’t include quiescence nodes, leaf nodes, or nodes where a draw caused an early exit), the result is stored in the hash table. For every node of the full width search except the root, the hash table is probed. The hash entry includes not just the score for the position, but the depth of the search that produced it, and the type of node (PV, ALL, CUT) as well so we know if the score represents a bound or an exact value. In the case of PV and CUT nodes, the “best move” is also stored to aid in move ordering. The hash table is not “bucketed” – an entry consists of just a single slot, so anytime something is hashed any previous values are overwritten. This is referred to in the literature as the “always replace” strategy. It’s very simple, but very effective. In the future I intend to experiment with probing from the qsearch, different replacement strategies (such as “depth preferred”) and expanding the number of values that can be mapped to a single key.

When a hash probe is done and a value found, if the depth associated with the entry is at least as big as the depth we intend to search, we may be able to immediately exit the search without expanding any moves. If not, we at least have a move to try. If we do have to search and there is a suggested move from the hash table, it is tried first, before any moves are generated.
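
The probing logic described above, sketched out (TableEntry and the NodeType constants are hypothetical names, not the actual chess4j classes):

    TableEntry entry = table.probe(pos.getZobristKey());
    Move hashMove = null;
    if (entry != null) {
        if (entry.depth >= depth) {
            // the stored result came from a search at least as deep as this one
            if (entry.type == NodeType.PV) return entry.score;                    // exact score
            if (entry.type == NodeType.CUT && entry.score >= beta) return beta;   // lower bound is enough to fail high
            if (entry.type == NodeType.ALL && entry.score <= alpha) return alpha; // upper bound is enough to fail low
        }
        hashMove = entry.bestMove;   // no cutoff, but try this move first
    }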

The results of two runs of chess4j with hash vs. without:

Wins    Losses  Draws   Pct     Elo       Error
959     439     602     63.0%   +92.46    12.97
981     469     550     62.8%   +90.97    13.22

… and Prophet4 vs Prophet3:

Wins    Losses  Draws   Pct     Elo       Error
145     1360    495     19.6%   -244.92   14.94

A very solid jump in strength; I’ll take it. 🙂

I’ve been putting some thought into my testing regime. In the past I’ve tested changes by running gauntlets of 10+0.5, which can take quite a long time on a single processor if you really want to get the error bars down. I have an older laptop that I’ve dedicated to testing. It has two cores, so it would be nice to utilize both of them. So, I did some self play tests with both of my programs, running two games concurrently, and happily the results came out pretty close to 50%. Additionally, I also experimented with reducing game times in self play matches from 10+0.5 down to 5+0.25. The results were still around 50%, and without any timeouts. So, at the moment my testing regime is to run a self play match, and then to run a P4 vs P3 match as verification. I think this will work until I’ve reached parity with P3, at which point I intend to go back to gauntlet testing. It would be good to introduce some statistical methods to measure the confidence a change is good (and to use for early termination), particularly as the elo become harder to come by and more and more games are needed to measure the change. I imagine at some point I’ll need more cores for testing as well.