I couldn’t stop thinking about optimisation

In my last post I was complaining about the slowness caused by the cost of the IO emulation routines. I decided to stop developing it.

But the next morning, I could not stop reflecting on the speed problem, and I found a method: replacing the PPUADDR and PPUWR call system with an array of routines indexed by the address, Array[PPUADDR >> 2] = routine address. This array is big but it fits in WRAM.
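The lookup can be modelled in a few lines of Python. This is only a sketch of the idea, not the upernes code: the handler names are made up for the illustration, and the address ranges are the standard NES PPU memory map rather than anything taken from the upernes sources.

```python
# Hypothetical model of the dispatch array: one handler per 4-byte slice of
# the 16 KiB NES PPU address space, looked up with table[PPUADDR >> 2].
PPU_SPACE = 0x4000

def write_pattern(log, addr, value):
    log.append(("pattern", addr, value))

def write_nametable(log, addr, value):
    log.append(("nametable", addr, value))

def write_attribute(log, addr, value):
    log.append(("attribute", addr, value))

def write_palette(log, addr, value):
    log.append(("palette", addr, value))

def handler_for(addr):
    """Pick the handler for a PPU address (standard NES PPU memory map)."""
    if addr < 0x2000:
        return write_pattern
    if addr < 0x3F00:
        # Each 1 KiB nametable ends with a 64-byte attribute table.
        return write_attribute if (addr & 0x3FF) >= 0x3C0 else write_nametable
    return write_palette

def build_table():
    """Build the table once; every region boundary is a multiple of 4."""
    return [handler_for(slot << 2) for slot in range(PPU_SPACE >> 2)]

def ppu_write(table, log, ppuaddr, value):
    """Dispatch a write with a single table lookup, no address comparisons."""
    addr = ppuaddr & (PPU_SPACE - 1)
    table[addr >> 2](log, addr, value)
```

With this granularity the table has 4096 entries, so about 8 KiB if each entry is a 16-bit routine pointer (an assumption about the entry size).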

This made it possible to remove all the PPUADDR incrementing code that changed the IO routine depending on the address. It gained 20 rendering lines by moving the PPUADDR write IO emulation out of bank zero, where the emulation code is, and replacing it with a short set of routines in RAM.

I believed that the gain in cycles was not enough because of sound emulation, but it looks like the sound emulation in the SPC700 only needs to be updated once per frame. In Super Mario Bros, it can be done between line 80 and line 240, where the game does nothing. In fact plenty of cycles are available for the sound emulation update.

All in all, it will be possible to run Super Mario Brothers on the snes with automatic conversion.

Upernes, conclusion

I have been developing this software for a total of 1.5 years and it is finished.

I had many problems with the scrolling and I fixed them thanks to the NES community. But I still had glitches. They came from missing the end of vblank because the IO emulation took too much time: it ran past the end of vblank while the smb code was waiting for it. On average the emulation does not take that long (it calls only 10 routines), but it calls up to 60 IO routines when updating the background. Even with optimising, it caused one last glitch. Maybe there is room for more optimising to remove this last glitch. With a little more optimising (I spent a whole day on it to remove 2 of the 3 missing frames), it could work with smb1, but it would leave no room for improvement or stabilisation.

Therefore, I reverted it to a working Balloon Fight and an SMB1 with more glitches. Donkey Kong also works.

It looks like a NES game, it feels like a NES game much more than “smb all stars”, but it is not 100% perfect. The Snes PPU is too different and the CPU is not fast enough to handle the IO emulation cycle cost. Upernes uses a ton of tricks to be able to play games like in the picture below, and the console has some design compatibility (the first one being hardware CPU emulation) but it does not fit at 100%. It is short by a few details, just a few CPU cycles, but that is still not enough.

Anyway, it works with non-scrolling games, and Super Mario Bros can be played directly from the conversion.

The project was interesting and very exotic. It went further than I expected, but it is not an aesthetic conversion where everything fits (that was my goal). However it is fast, and despite the few missing frames it really feels like the NES. I am not looking forward to squeezing the cycle count per IO access, so I will leave it like this.

I will just take a look at how to integrate Memblers' work, but I will probably not add it, given the problems with graphics.

Super Mario Brothers first successful conversion


In my previous post I was reflecting on a way to preserve the PRG ROM in order to execute the jump engine properly. In the end, I chose to patch the PRG ROM by using a JSR to a block of jump routines in RAM. Those RAM routines go to bank 0 and come back to the caller in bank 1 by using a return long (RTL). It was quicker than BRK because the BRK method requires pulling too many bytes from the stack.
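To illustrate why a JSR patch keeps the rest of the PRG ROM at its original addresses, here is a small Python sketch. It is not the upernes patcher: the trampoline addresses, the opcode list and the `instruction_offsets` input are all assumptions made for the example. The point is only that an absolute store to a port and a JSR are both 3 bytes long, so the patch can be done in place.

```python
# Hypothetical sketch: replace 3-byte absolute stores to PPU ports with a
# 3-byte JSR to a RAM trampoline, leaving every other byte where it was.
JSR = 0x20       # 6502 opcode: JSR absolute
STA_ABS = 0x8D   # 6502 opcode: STA absolute

# Made-up trampoline addresses in RAM, one per (opcode, port) pair.
TRAMPOLINES = {
    (STA_ABS, 0x2006): 0x0800,
    (STA_ABS, 0x2007): 0x0810,
}

def patch_prg(prg, instruction_offsets):
    """Patch known instruction starts; returns (patched bytes, patch count).

    instruction_offsets must come from a real disassembly pass: scanning raw
    bytes blindly could mistake data for an instruction.
    """
    patched = bytearray(prg)
    count = 0
    for off in instruction_offsets:
        if off + 2 >= len(patched):
            continue  # not enough room for a 3-byte instruction
        opcode = patched[off]
        operand = patched[off + 1] | (patched[off + 2] << 8)
        target = TRAMPOLINES.get((opcode, operand))
        if target is None:
            continue
        patched[off:off + 3] = bytes([JSR, target & 0xFF, target >> 8])
        count += 1
    return bytes(patched), count
```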

Until today I had many problems with SMB. While balloon fight could be played, I had only a black screen on SMB.

In fact only the first instructions were executing. The indirect jump engine was not reading the correct addresses. Also, the programmers used a trick to jump over variable initialisations: a BIT opcode byte swallows the 2 following bytes as its operand, while a jump straight to those 2 bytes executes them as an immediate load such as LDA #$XX. This was difficult to disassemble because the same bytes have 2 meanings depending on the entry address.
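The double meaning is easy to reproduce with a toy disassembler. The byte sequence below is invented for the illustration (it is not copied from the SMB ROM) and the decoder only knows the four opcodes it needs.

```python
# Toy decoder: only the opcodes needed for the example (name, mode, size).
OPCODES = {
    0x2C: ("BIT", "abs", 3),
    0xA9: ("LDA", "imm", 2),
    0x85: ("STA", "zp", 2),
    0x60: ("RTS", "imp", 1),
}

def disassemble(code, start):
    """Decode linearly from `start` until RTS or the end of the buffer."""
    out = []
    pc = start
    while pc < len(code):
        name, mode, size = OPCODES[code[pc]]
        if mode == "imm":
            text = f"{name} #${code[pc + 1]:02X}"
        elif mode == "zp":
            text = f"{name} ${code[pc + 1]:02X}"
        elif mode == "abs":
            text = f"{name} ${code[pc + 2]:02X}{code[pc + 1]:02X}"
        else:
            text = name
        out.append(text)
        pc += size
        if name == "RTS":
            break
    return out

# Invented example: the $2C byte hides the second LDA when falling through.
CODE = bytes([0xA9, 0x01, 0x2C, 0xA9, 0x02, 0x85, 0x00, 0x60])

fall_through = disassemble(CODE, 0)  # the second load is seen as a BIT operand
jump_entry = disassemble(CODE, 3)    # entering at offset 3 executes that load
```

Entering at offset 0 stores 1, entering at offset 3 stores 2, and both paths share the final store. A linear disassembler started at offset 0 never sees the second LDA.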

However, because the jump engine routine needs a patched rom instead of recompiled code, it was not a problem anymore. And after adding some indirect jumps to the indirect jump init file, SMB1 began to work.

It is slow but it works more or less. Now it will be a question of speed and sprite zero hit flag simulation.



The Super Mario Bros jump engine.

The following code was taken from the smb commented disassembly on GitHub. It shows how indirect jumps are used in Super Mario Bros.

lda OperMode_Task
jsr JumpEngine
.dw SetupGameOver
.dw ScreenRoutines
.dw RunGameOver
jsr JumpEngine
.dw IncrementColumnPos
.dw RenderAreaGraphics
.dw RenderAreaGraphics
.dw AreaParserCore
.dw IncrementColumnPos
.dw RenderAreaGraphics
.dw RenderAreaGraphics
.dw AreaParserCore
lda GameEngineSubroutine  ;run routine based on number (a few of these routines are
jsr JumpEngine            ;merely placeholders as conditions for other routines)
.dw Entrance_GameTimerSetup
.dw Vine_AutoClimb
.dw SideExitPipeEntry
.dw VerticalPipeEntry
.dw FlagpoleSlide
.dw PlayerEndLevel
.dw PlayerLoseLife
.dw PlayerEntrance
.dw PlayerCtrlRoutine
.dw PlayerChangeSize
.dw PlayerInjuryBlink
.dw PlayerDeath
.dw PlayerFireFlower

And so on, with more indirect addresses…
And the JumpEngine routine:

;$04 - address low to jump address
;$05 - address high to jump address
;$06 - jump address low
;$07 - jump address high
asl          ;shift bit from contents of A
tay          ;use the doubled index as the Y offset
pla          ;pull saved return address from stack
sta $04      ;save to indirect
pla
sta $05
iny          ;return address points one byte before the table
lda ($04),y  ;load pointer from indirect
sta $06      ;note that if an RTS is performed in next routine
iny          ;it will return to the execution before the sub
lda ($04),y  ;that called this routine
sta $07
jmp ($06)    ;jump to the address we loaded

Thus a parameter is set in A and then we call the jump engine. It reads the jump target from the table that follows the call, at the return address plus an index derived from A, hence calling a routine dynamically depending on the state of the game.
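The lookup can be modelled in Python as follows. This is a sketch of the mechanism as I understand it from the disassembly, with the 6502 stack represented by a Python list (an assumption of the model); the addresses in the usage example are invented.

```python
def jump_engine(memory, stack, a):
    """Return the address the jump engine would jump to.

    memory: indexable by address (for example a 64 KiB bytearray).
    stack:  Python list modelling the 6502 stack, last pushed item at the end.
    a:      routine index passed in the accumulator.
    """
    y = (a << 1) & 0xFF           # each table entry is a 2-byte address
    lo = stack.pop()              # JSR pushed high then low, so low comes first
    hi = stack.pop()
    ret = lo | (hi << 8)          # address of the last byte of the JSR
    y = (y + 1) & 0xFF            # the table starts one byte after that
    target_lo = memory[ret + y]
    target_hi = memory[ret + y + 1]
    return target_lo | (target_hi << 8)

# Invented layout: JSR at $8000, table of three .dw entries right after it.
memory = bytearray(0x10000)
memory[0x8003:0x8009] = bytes([0x00, 0x90, 0x23, 0x91, 0x56, 0xA4])
stack = [0x80, 0x02]              # return address $8002, high byte pushed first
target = jump_engine(memory, stack, 1)
```

The important detail is that the table is found through the return address on the stack, so the result is only right if the call sits at the same address as in the original ROM.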
I do not know how to emulate this. What upernes would do here is disassemble the routine call and then get lost in the routine address data below it. The generated code would also take its base address from the stack, that is, from the recompiled code address, at an offset very different from the original nes one. It would then add to it the data from the original PRG rom at the wrong address, and it would not work.


  • The jump engine routine could be specified in the indirect jump files, all the data following the call could be disassembled, and this specific structure could be kept in the source at the same address in order to be able to jump like this.
  • At least the routines could be at their original addresses.
  • Another solution would be to keep the original code running without changing anything but the reads and writes to IO ports. They would be replaced with BRK and two bytes indicating the port, read or write direction, and A, X or Y register. The BRK vector would point to an area out of the original rom where the io routines would be called in native mode from another bank. And therefore the original routines would work with this indirect jump mechanism.
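For the BRK idea, one possible encoding of the two bytes could look like the Python sketch below. The bit layout is entirely hypothetical, invented for this example. It only shows that port, direction and register fit in two bytes, so that BRK plus two bytes has the same length as the 3-byte absolute instruction it replaces.

```python
BRK = 0x00  # 6502 opcode
REGS = {"A": 0, "X": 1, "Y": 2}
REG_NAMES = {number: name for name, number in REGS.items()}

def encode_io_patch(port, is_write, reg):
    """Hypothetical layout: BRK, port byte, flags byte.

    port byte:  bit 7 set for $40xx ports, clear for $20xx; bits 0-4 = offset.
    flags byte: bit 7 set for a write; bits 0-1 = register number.
    """
    port_byte = (0x80 if port & 0x4000 else 0x00) | (port & 0x1F)
    flags = (0x80 if is_write else 0x00) | REGS[reg]
    return bytes([BRK, port_byte, flags])

def decode_io_patch(patch):
    """Inverse of encode_io_patch, as a BRK handler would need it."""
    _, port_byte, flags = patch
    base = 0x4000 if port_byte & 0x80 else 0x2000
    return base | (port_byte & 0x1F), bool(flags & 0x80), REG_NAMES[flags & 0x03]
```

A real handler would also have to skip the extra byte on return, since the return address pushed by BRK only accounts for one byte after the opcode.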

The solution is not obvious. For now it means that the indirect jumps do not work because when it reads an address on the snes using the stack it will not be the corresponding nes address. It would be the recompiled code address.

I will discuss that on the nesdev forum to see if anyone has an idea.

Balloon Fight beginning to run on the super nes.

Balloon Fight on bsnes plus

This week the automatic conversion of “Balloon Fight” to a snes rom using upernes started to work. The previous developments were conducted on Super Sleuth, and video memory was accessed through registers.

Now I use the new bsnes-plus v073+2, a special version with a debugger, and it is great. But its register emulation behaves differently and using DMA is mandatory. So I had to make some changes in the ram to vram transfers. This week, sprites worked for the first time and it was already playable. It is complex and the results are all or nothing.

I am working on the backgrounds. It is not so simple because the background memories are very different. On the nes, attribute memory gives the palettes of 16 tiles in one byte, while on the snes you must update each of the 16 tilemap entries. Therefore it must be optimised, but first it must work.
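The fan-out from one attribute byte to 16 tiles can be sketched in Python. This follows the standard NES attribute layout (2 bits per 2x2 tile quadrant) and the standard SNES tilemap entry layout (palette in bits 10 to 12). The function names are made up and this is not the upernes routine.

```python
def expand_attribute(index, value):
    """Return [(tile_x, tile_y, palette)] for attribute byte `index` (0..63).

    Each attribute byte covers a 4x4 tile block; every 2x2 quadrant gets
    2 bits: top-left bits 0-1, top-right 2-3, bottom-left 4-5, bottom-right 6-7.
    """
    base_x = (index % 8) * 4
    base_y = (index // 8) * 4
    tiles = []
    for dy in range(4):
        for dx in range(4):
            y = base_y + dy
            if y >= 30:
                continue  # the nametable is only 30 tiles high
            shift = ((dy >> 1) << 2) | ((dx >> 1) << 1)
            tiles.append((base_x + dx, y, (value >> shift) & 0x03))
    return tiles

def snes_tilemap_entry(tile, palette):
    """Build a 16-bit SNES tilemap entry: tile in bits 0-9, palette in 10-12."""
    return (tile & 0x3FF) | ((palette & 0x07) << 10)
```

So a single write to the attribute table turns into up to 16 tilemap entry updates on the snes side, which is where the cost comes from.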

The nes nametable becomes a buffer in CPU ram and is moved to vram using DMA. This is simple and effective; it was a mess with the vram ports.

I will use the HDMA with indirect access to update a list of tiles and if too many tiles must be updated, it will be a full screen DMA update.
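The decision between the two update paths could be as simple as the Python sketch below. The threshold value and the function name are assumptions, since the post does not say how many tiles count as too many.

```python
# Hypothetical threshold: above this many changed tiles, send the whole buffer.
FULL_SCREEN_THRESHOLD = 32

def plan_update(dirty_tiles, threshold=FULL_SCREEN_THRESHOLD):
    """Choose between a per-tile update list and a full screen DMA transfer.

    dirty_tiles: iterable of nametable offsets changed during the frame.
    Returns ("tile_list", sorted unique offsets) or ("full_dma", []).
    """
    unique = sorted(set(dirty_tiles))
    if len(unique) > threshold:
        return ("full_dma", [])
    return ("tile_list", unique)
```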

Now the attributes table.


Automating the rom conversion process


In the automated rom conversion process, upernes does not do the whole job. It breaks the rom into chr and prg data and rewrites the prg code into a big assembly file called recomp.asm. It also takes care of creating a file describing the indirect jumps. The unknown indirect jump addresses must be added by the user until all of them are found (by playing the game on the snes).
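The first step, splitting the rom, can be sketched in Python, assuming the input is a standard iNES file (the post does not name the format, so this is an assumption). The function name is made up and upernes itself may do this differently.

```python
def split_ines(rom):
    """Split an iNES image into (prg, chr) byte strings.

    Header layout (standard iNES): 4-byte magic, PRG size in 16 KiB units,
    CHR size in 8 KiB units, flags; bit 2 of byte 6 marks a 512-byte trainer.
    """
    if rom[:4] != b"NES\x1a":
        raise ValueError("not an iNES file")
    prg_size = rom[4] * 16 * 1024
    chr_size = rom[5] * 8 * 1024
    offset = 16 + (512 if rom[6] & 0x04 else 0)
    prg = rom[offset:offset + prg_size]
    chr_data = rom[offset + prg_size:offset + prg_size + chr_size]
    if len(prg) != prg_size or len(chr_data) != chr_size:
        raise ValueError("truncated rom")
    return prg, chr_data
```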

The last step combines the files produced by upernes with assembly files containing the emulation routines and the rom bank specifications. The rom is assembled with wla-65816. The result is the final snes rom.

Today I added a shell script + makefile to convert the roms with a single shell call. They are in /source/workdir/

It will reduce development time: there is no need to copy everything from folder to folder and run make everywhere.

Upernes – a nes to supernes game recompiler: passing the tests.


In 2010 I learnt that the 65C816 cpu could run native 6502 code. And hence it could be possible to run nes games like “super mario” on the snes.

Plus I wanted to test what would be required to recompile a program from one architecture to another. I was thinking about how assembly code could be re-optimised by dynamic profiling on each target CPU.

In 2010 and 2011 I built a system able to disassemble a nes rom and build a snes rom. It is called upernes and can be found on github. But it only passed simple test roms.

Disassembling to recompile a program is a slightly more detailed process than everyday disassembling. But the game code remains nearly the same, like “only” replacing the calls to the Audio Processing Unit and Picture Processing Unit.

In 2011, as I added complexity to the assembly routines, it got messy and I lost my grip on it. This week I will start again from the first, simplest test rom with the goal of running SMB on the super nes.

The difficulty is to keep the first tests passing while adding stuff to the PPU emulation. Test driven development seems difficult on this. I will split the source code into independent functions when possible and make testing more automatic. The progress will be documented on this blog.