I couldn’t stop thinking about optimisation

In my last post I was complaining about how slow the development process was because of the cost of the IO emulation routines. I decided to stop developing it.

But the next morning I could not stop reflecting on the speed problem, and I found a method: replace the PPUADDR and PPUWR call system with an array of routines indexed by the address, Array[PPUADDR >> 2] = routine address. This array is big but it fits in WRAM.
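To make the idea concrete, here is a minimal C sketch of such a dispatch table. It is not the upernes code (which is 65816 assembly): the handler names and the counters are hypothetical, and the region boundaries are simply the standard NES PPU memory map. With one entry per 4 addresses, the table has 4096 entries.

```c
#include <stdint.h>

#define PPU_SPACE  0x4000u            /* NES PPU address space: $0000-$3FFF */
#define TABLE_SIZE (PPU_SPACE >> 2)   /* one entry per 4 PPU addresses */

typedef void (*ppu_write_fn)(uint16_t addr, uint8_t value);

/* Hypothetical handlers: here they only count the calls per region. */
unsigned pattern_hits, nametable_hits, attribute_hits, palette_hits;

void write_pattern(uint16_t addr, uint8_t value)   { (void)addr; (void)value; pattern_hits++; }
void write_nametable(uint16_t addr, uint8_t value) { (void)addr; (void)value; nametable_hits++; }
void write_attribute(uint16_t addr, uint8_t value) { (void)addr; (void)value; attribute_hits++; }
void write_palette(uint16_t addr, uint8_t value)   { (void)addr; (void)value; palette_hits++; }

ppu_write_fn write_table[TABLE_SIZE];

/* Fill the table once; after that a write needs no address comparison. */
void ppu_table_init(void)
{
    for (uint32_t addr = 0; addr < PPU_SPACE; addr += 4) {
        ppu_write_fn fn;
        if (addr < 0x2000u)
            fn = write_pattern;                 /* pattern tables */
        else if (addr >= 0x3F00u)
            fn = write_palette;                 /* palettes */
        else if ((addr & 0x03FFu) >= 0x03C0u)
            fn = write_attribute;               /* last 64 bytes of each nametable */
        else
            fn = write_nametable;
        write_table[addr >> 2] = fn;
    }
}

/* The emulated data port write: one shift, one indexed call. */
void ppu_write(uint16_t ppuaddr, uint8_t value)
{
    write_table[(ppuaddr & 0x3FFFu) >> 2](ppuaddr, value);
}
```

The shift by 2 works because every region boundary in the PPU map is a multiple of 4, so a group of 4 addresses never straddles two regions.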

It allowed me to remove all the PPUADDR incrementing code that changed the IO routine depending on the address. It gained 20 rendering lines by removing the PPUADDR write IO emulation from bank zero, where the emulation code is, and replacing it with a short set of routines in RAM.

I believed that the gain in cycles would not be enough because of sound emulation, but it looks like sound emulation on the SPC700 only needs to be updated once per frame. In Super Mario Bros, it can be done between line 80 and line 240, where the game does nothing. In fact plenty of cycles are available for the sound emulation update.

All in all, it will be possible to run Super Mario Brothers on the snes with automatic conversion.

Upernes, conclusion

I have been developing this software for a total of 1.5 years and it is finished.

I had many problems with the scrolling and I fixed them thanks to the NES community. But I still had glitches. They came from missing the end of vblank because the IO emulation took too much time: it ran past the end of vblank while the SMB code was waiting for it. On average the emulation does not take that long (it calls only 10 routines), but it calls up to 60 IO routines when updating the background. Even with optimising, it caused one last glitch. Maybe there is room for more optimising to remove this last glitch. By optimising a little more (I spent a whole day on it to remove 2 of the 3 missing frames), it could work with SMB1, but it would leave no room for improvement or stabilisation.

Therefore, I reverted it to a working Balloon Fight and an SMB1 with more glitches. Donkey Kong also works.

It looks like a NES game and it feels like a NES game, much more than “smb all stars”, but it is not 100% perfect. The SNES PPU is too different and the CPU is not fast enough to absorb the cycle cost of the IO emulation. Upernes uses a ton of tricks to be able to play games like in the picture below, and the console has some design compatibility (the first being hardware 6502 emulation in the CPU), but it does not fit 100%. It is only short of a few details, just a few CPU cycles, but that is enough to keep it from being perfect.

Anyway, it works with non-scrolling games and Super Mario Bros can be played directly from the conversion.

The project was interesting and very exotic. It went further than I expected, but it is not an aesthetic conversion where everything fits (that was my goal). However it is fast, and despite the few missing frames it really feels like the NES. I am not looking forward to squeezing the cycle count per IO access, so I am leaving it like this.

I will just take a look at how to integrate Memblers’ work, but I will probably not add it, given the problems with graphics.

Super Mario Brothers first successful conversion


In my previous post I was reflecting on a way to preserve the PRG ROM in order to execute the jump engine properly. Finally, I chose to patch the PRG ROM by using a JSR to a block of jump routines in RAM. Those RAM routines go to bank 0 and come back to the caller in bank 1 by using a return long (RTL). It was quicker than BRK because the BRK method requires pulling too many bytes from the stack.

Until today I had many problems with SMB. While Balloon Fight could be played, I had only a black screen on SMB.

In fact only the first instructions were executing. The indirect jump engine was not reading the correct addresses. Also, the programmers used a trick to jump over variable initialisations: a BIT opcode that takes the 2 following bytes as its operand, while those same 2 bytes form a complete instruction such as LDA #$XX when they are entered directly. This was difficult to disassemble, because the bytes have 2 meanings depending on the entry address.

However, because the jump engine routine needs a patched ROM instead of recompiled code, it was not a problem anymore. And after adding some indirect jumps to the indirect jump init file, SMB1 began to work.

It is slow but it works more or less. Now it will be a question of speed and sprite zero hit flag simulation.



The Super Mario Bros jump engine.

The following code was taken from the SMB commented disassembly on GitHub. It shows how indirect jumps are used in Super Mario Bros.

lda OperMode_Task
jsr JumpEngine
.dw SetupGameOver
.dw ScreenRoutines
.dw RunGameOver

jsr JumpEngine
.dw IncrementColumnPos
.dw RenderAreaGraphics
.dw RenderAreaGraphics
.dw AreaParserCore
.dw IncrementColumnPos
.dw RenderAreaGraphics
.dw RenderAreaGraphics
.dw AreaParserCore

lda GameEngineSubroutine  ;run routine based on number (a few of these routines are
jsr JumpEngine            ;merely placeholders as conditions for other routines)
.dw Entrance_GameTimerSetup
.dw Vine_AutoClimb
.dw SideExitPipeEntry
.dw VerticalPipeEntry
.dw FlagpoleSlide
.dw PlayerEndLevel
.dw PlayerLoseLife
.dw PlayerEntrance
.dw PlayerCtrlRoutine
.dw PlayerChangeSize
.dw PlayerInjuryBlink
.dw PlayerDeath
.dw PlayerFireFlower

And so on, with more indirect addresses…
And the JumpEngine routine:

;$04 - address low to jump address
;$05 - address high to jump address
;$06 - jump address low
;$07 - jump address high
asl          ;shift bit from contents of A
tay          ;use the doubled value as the table index
pla          ;pull saved return address from stack
sta $04      ;save to indirect
pla
sta $05
iny          ;the return address points one byte before the table
lda ($04),y  ;load pointer from indirect
sta $06      ;note that if an RTS is performed in next routine
iny          ;it will return to the execution before the sub
lda ($04),y  ;that called this routine
sta $07
jmp ($06)    ;jump to the address we loaded

Thus a parameter is set in A and then we call the jump engine. It reads the address table that follows the caller's JSR, indexed by A, hence calling a routine dynamically depending on the state of the game.
I do not know how to emulate this. What upernes would do here is disassemble the routine call and then get lost in the routine address data below it. The generated code would take its base address from the stack, that is, from the recompiled code address, at an offset very different from the original NES one. It would then read the data of the original PRG ROM at the wrong address. And it would not work.
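To show why the return address matters so much, here is a small C model of what the jump engine does. It is only a sketch: the 64 KB memory array, the stack helpers and the function names are my own assumptions for the illustration, and it returns the target address instead of jumping to it.

```c
#include <stdint.h>

/* Hypothetical model of the NES CPU memory and of the 6502 stack pointer. */
uint8_t mem[0x10000];
uint8_t sp = 0xFF;

void    push(uint8_t v) { mem[0x0100 + sp] = v; sp--; }
uint8_t pull(void)      { sp++; return mem[0x0100 + sp]; }

/* What "jsr JumpEngine" located at jsr_addr leaves on the stack:
 * the address of the last byte of the JSR, high byte first. */
void jsr_push_return(uint16_t jsr_addr)
{
    uint16_t ret = (uint16_t)(jsr_addr + 2);
    push((uint8_t)(ret >> 8));
    push((uint8_t)(ret & 0xFF));
}

/* Returns the address the real routine would jump to. */
uint16_t jump_engine(uint8_t a)
{
    uint8_t  y   = (uint8_t)(a << 1);        /* asl, tay: 2 bytes per entry */
    uint16_t ret = pull();                   /* pla, sta $04 */
    ret |= (uint16_t)(pull() << 8);          /* pla, sta $05 */
    y++;                                     /* iny: skip to the first table byte */
    uint8_t lo = mem[(uint16_t)(ret + y)];   /* lda ($04),y */
    y++;
    uint8_t hi = mem[(uint16_t)(ret + y)];
    return (uint16_t)(lo | (hi << 8));       /* jmp ($06) */
}
```

The table is found only through the return address pulled from the stack. If the JSR is executed from recompiled code at another address, `ret` points somewhere else and the bytes read are not the table anymore.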


  • The jump engine routine could be specified in the indirect jump files, and all the data following the call could be disassembled. This specific structure would be kept in the source at the same address in order to be able to jump like this.
  • At least the routines could be at their original addresses.
  • Another solution would be to keep the original code running without changing anything but the reads and writes to IO ports. They would be replaced with a BRK and two bytes indicating the port, the read or write direction, and the A, X or Y register. The BRK vector would point to an area outside the original rom where the IO routines would be called in native mode from another bank. The original routines would therefore work with this indirect jump mechanism.
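As an illustration of the last option, the two bytes after the BRK could be packed as below. This bit layout is purely hypothetical (nothing like it is fixed anywhere in upernes): first byte is the low byte of the port address, second byte holds the port page, the direction and the register.

```c
#include <stdint.h>

typedef enum { REG_A = 0, REG_X = 1, REG_Y = 2 } io_reg;

typedef struct {
    uint16_t port;   /* NES IO address, $20xx (PPU) or $40xx (APU, pads) */
    int      write;  /* 1 = store to the port, 0 = load from it */
    io_reg   reg;    /* register used by the original instruction */
} io_access;

/* Hypothetical layout of the second byte:
 * bit 7 = port page ($40xx when set), bit 2 = write, bits 1-0 = register. */
void io_encode(io_access a, uint8_t out[2])
{
    out[0] = (uint8_t)(a.port & 0xFF);
    out[1] = (uint8_t)(((a.port >> 8) == 0x40 ? 0x80 : 0x00)
                     | (a.write ? 0x04 : 0x00)
                     | (unsigned)a.reg);
}

/* What the BRK handler would do with the two bytes it finds after the BRK. */
io_access io_decode(const uint8_t in[2])
{
    io_access a;
    a.port  = (uint16_t)(((in[1] & 0x80) ? 0x4000 : 0x2000) | in[0]);
    a.write = (in[1] & 0x04) != 0;
    a.reg   = (io_reg)(in[1] & 0x03);
    return a;
}
```

A 3 byte instruction like STA $2006 would then become BRK, $06, $04, so the patched code keeps its size and every other address in the rom stays where it was.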

The solution is not obvious. For now it means that the indirect jumps do not work, because when the code reads an address from the stack on the snes, it is not the corresponding nes address but the recompiled code address.

I will discuss that on the nesdev forum to see if anyone has an idea.

Balloon Fight beginning to run on the super nes.

Balloon Fight on bsnes plus

This week the automatic conversion of “Balloon Fight” to a snes rom using upernes started to work. The previous developments were conducted on Super Sleuth, and video memory was accessed through registers.

Now I use the new bsnes-plus v073+2, a special version with a debugger, and it is great. But register access does not behave the same way there and using DMA is mandatory, so I had to make some changes to the RAM to VRAM transfers. This week, sprites worked for the first time and it was already playable. It is complex and the results are all or nothing.

I am working on the backgrounds. It is not so simple because the background memories are very different. On the nes, one byte of attribute memory gives the palettes for a block of 16 tiles, while on the snes the palette is stored in each tile entry, so you must update the 16 tiles. Therefore it must be optimised, but first it must work.
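Here is a minimal C sketch of that conversion, assuming a 32x32 snes tilemap of 16 bit entries with the palette in bits 10 to 12, and the usual nes layout where each attribute byte holds four 2 bit palettes, one per 2x2 tile quadrant. The function name is hypothetical and this is not the upernes routine.

```c
#include <stdint.h>

#define MAP_W 32   /* tiles per row, same on the nes nametable and the snes tilemap */

/* Spread one nes attribute byte over the 16 snes tilemap entries it covers.
 * attr_index is the offset of the byte in the 64 byte attribute table. */
void apply_attribute(uint16_t tilemap[MAP_W * 32], unsigned attr_index, uint8_t attr)
{
    if (attr_index >= 64)
        return;
    unsigned x0 = (attr_index % 8) * 4;   /* top-left tile of the 4x4 block */
    unsigned y0 = (attr_index / 8) * 4;

    for (unsigned dy = 0; dy < 4; dy++) {
        unsigned y = y0 + dy;
        if (y >= 30)                      /* the nes nametable has only 30 rows */
            continue;
        for (unsigned dx = 0; dx < 4; dx++) {
            /* quadrant order in the byte: top-left, top-right, bottom-left, bottom-right */
            unsigned shift   = ((dy >> 1) << 2) | ((dx >> 1) << 1);
            unsigned palette = (attr >> shift) & 3u;
            uint16_t *entry  = &tilemap[y * MAP_W + x0 + dx];
            *entry = (uint16_t)((*entry & ~0x1C00u) | (palette << 10));
        }
    }
}
```

One attribute write on the nes side becomes up to 16 read-modify-write operations here, which is why it has to be optimised later.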

The nes nametable becomes a buffer in CPU RAM and is moved to VRAM using DMA. This is simple and effective; it was a mess with the VRAM ports.

I will use HDMA with indirect access to update a list of tiles, and if too many tiles must be updated, it will be a full screen DMA update.
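The bookkeeping for this could look like the C sketch below. The threshold of 32 tiles, the structure and the function names are assumptions for the illustration; the real transfer would be done by the DMA hardware, so here the flush only reports how many tiles would be sent.

```c
#include <stdint.h>

#define NAMETABLE_TILES 960   /* 32 x 30 tiles */
#define MAX_DIRTY       32    /* hypothetical limit before a full update */

typedef struct {
    uint8_t  tiles[NAMETABLE_TILES];  /* nametable copy in CPU RAM */
    uint16_t dirty[MAX_DIRTY];        /* indexes of the modified tiles */
    unsigned dirty_count;
    int      full_update;             /* set when the list overflowed */
} nt_buffer;

/* Emulated nametable write: update the buffer and remember what changed. */
void nt_write(nt_buffer *nt, uint16_t index, uint8_t tile)
{
    if (index >= NAMETABLE_TILES)
        return;
    nt->tiles[index] = tile;
    if (nt->full_update)
        return;                       /* already sending everything */
    if (nt->dirty_count == MAX_DIRTY) {
        nt->full_update = 1;          /* too many tiles: fall back to full DMA */
        nt->dirty_count = 0;
        return;
    }
    nt->dirty[nt->dirty_count++] = index;
}

/* Called once per frame: returns how many tiles the transfer would send. */
unsigned nt_flush(nt_buffer *nt)
{
    unsigned n = nt->full_update ? NAMETABLE_TILES : nt->dirty_count;
    nt->dirty_count = 0;
    nt->full_update = 0;
    return n;
}
```

The list does not merge two writes to the same tile, so a tile written twice counts twice; that keeps the write path short.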

Now the attributes table.


Automating the rom conversion process


In the automated rom conversion process, upernes does not do the whole job. It breaks the rom into chr and prg data and rewrites the prg code into a big assembly file called recomp.asm. It also takes care of creating a file describing the indirect jumps. The unknown indirect jump addresses must be added by the user until all of them are found (by playing the game on the snes).

The last step combines the files output by upernes with assembly files containing the emulation routines and the rom bank specifications. The rom is assembled with wla-65816. This is the final snes rom.

Today I added a shell script + makefile to convert the roms with a single shell call. They are in /source/workdir/

It will reduce development time: no need to copy everything from folder to folder and make calls everywhere.

Upernes – a nes to supernes game recompiler: passing the tests.


In 2010 I learnt that the 65C816 cpu could run native 6502 code. And hence it could be possible to run nes games like “super mario” on the snes.

Plus I wanted to test what would be required to recompile a program from one architecture to another. I was thinking about how assembly code could be re-optimised by dynamic profiling on each target CPU.

In 2010 and 2011 I built a system able to disassemble a nes rom and build a snes rom. It is called upernes and can be found on github. But it only passed simple test roms.

Disassembling to recompile a program is a slightly more detailed process than everyday disassembling. But the game code remains nearly the same, like “only” replacing the calls to the Audio Processing Unit and Picture Processing Unit.

In 2011, by adding complexity to the assembly routines it got messy and I lost grip on it. This week I will work on it from the first simplest test rom with the goal to run SMB on the super nes.

The difficulty is to keep the first tests passing while adding stuff to the PPU emulation. Test driven development seems difficult on this. I will split the source code into independent functions when possible and make testing more automatic. The progress will be documented on this blog.

Porting Peter Fleury’s STK500V2 bootloader to a custom avr board.

mega128 isp
mega128 accelerometer board with stk200 compatible isp programmer and usb to serial port adapter (rs232 to ttl circuit not shown).

I had problems porting the bootloader to my boards twice: once with the atmega128, and now with the atmega168p even though it already worked on the atmega128.

And I need a bootloader for boards hidden from the user in a bag or in a box, where I can’t use the SPI. Plus the classic SPI connector takes a huge surface on the board compared to the chip size. I prefer to program a bootloader once with SPI and then upload the firmware through a serial port.

In this article I will describe the steps to port the bootloader created by Peter Fleury to a custom board. It is a compact bootloader with many stk500v2 functions. It has a lot of configuration options selected by macro definitions.


1/ Get the source code:

Many versions can be found. You can take my copy at github.com/mandraga/SombreroBms in the firmwarebms/stk500v2bootloader/ section (or you can find different bootloaders at github.com/arduino/Arduino).


2/ Change the makefile:

Change the µcontroller name to what you are using:

# MCU name
MCU = atmega128

Change the clock to your clock:

#         F_CPU =  1000000
#         F_CPU =  1843200
#         F_CPU =  2000000
#         F_CPU =  3686400
#         F_CPU =  4000000
#         F_CPU =  7372800
#         F_CPU =  8000000
#         F_CPU = 11059200
#         F_CPU = 14745600
#         F_CPU = 16000000
#         F_CPU = 18432000
#         F_CPU = 20000000
F_CPU = 7372800

You may notice that most clocks are not round numbers like 8000000 Hz. This is because avrdude uses 115200 baud for the serial communication, and in order to transmit data at this baud rate without any error you need a clock of 1843200 Hz or a multiple of it. With this kind of frequency, the UART divider gives exactly the ideal baud rate (0.0% error). But with a 16 MHz clock the closest achievable rate is off by 3.5% in normal mode (2.1% in double speed mode). An error of that size is enough to mess with the communication on a serial line (there is a good chance of uploading garbage). In this case I picked a 7372800 Hz crystal.
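The error can be computed on a PC with a few lines of C. This is a sketch using the normal speed formula from the AVR datasheets, baud = F_CPU / (16 * (UBRR + 1)); the function names are my own.

```c
#include <stdint.h>

/* UBRR register value for normal (16x) mode, rounded to the nearest integer. */
uint32_t uart_ubrr(uint32_t f_cpu, uint32_t baud)
{
    return (f_cpu + 8u * baud) / (16u * baud) - 1u;
}

/* Difference between the baud rate really produced and the requested one, in percent. */
double uart_error_percent(uint32_t f_cpu, uint32_t baud)
{
    double actual = (double)f_cpu / (16.0 * (uart_ubrr(f_cpu, baud) + 1u));
    return (actual - baud) * 100.0 / baud;
}
```

For example `uart_error_percent(7372800, 115200)` gives exactly 0, while `uart_error_percent(16000000, 115200)` gives about -3.5.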

Change the bootloader area start address:

# Bootloader
# Please adjust if using a different AVR
# 0x0e00*2=0x1C00 for ATmega8  512 words Boot Size
# Atmega168PA 0x1E00 512W but avrgcc takes bytes so   3C00
# Atmega128   0xF000 4096W but avrgcc takes bytes so 1E000


In this table from the atmega128 datasheet you can find the flash sections selected with the BOOTSZ1 and BOOTSZ0 fuses. For my atmega128 board I chose the 4096 word (8 KByte) bootloader size. And as you can see here, the bootloader section begins at $F000. But this address is in words and gcc needs it in bytes. Therefore you must shift it left once to get the address given to gcc: $F000 becomes $1E000.
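The conversion is a single shift, but it is easy to forget. A tiny C helper (hypothetical name) shows it with the values from the makefile comments above:

```c
#include <stdint.h>

/* The datasheet gives flash addresses in 16 bit words; gcc wants bytes. */
uint32_t boot_word_to_byte_addr(uint32_t word_addr)
{
    return word_addr << 1;
}
```

So 0x0E00 becomes 0x1C00 on the ATmega8, 0x1E00 becomes 0x3C00 on the ATmega168PA and 0xF000 becomes 0x1E000 on the ATmega128.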

I made a mistake here and it took me several days to find it (that’s why I am posting this article). I used $F000 as the bootloader start address in gcc, but my bootloader was somehow executing anyway while sitting in the middle of the 128 KBytes of flash. I could test the stk500v2 protocol, read the fuses… But when it came to using the SPM instruction to erase/program the flash, it failed. I had this message:

avrdude: verifying ...
avrdude: verification error, first mismatch at byte 0x0000
0x0c != 0xff
avrdude: verification error; content mismatch

First I thought that the lock fuses were wrong, but they were 0xFF (SPM can read/write anything when executing in the boot section). Then I thought that the power went down during programming, but no. And then I thought that the avr-libc boot.h headers of the atmega168pa were not compatible, so I wrote the function in assembler. What misled me is that when downloading the flash memory with the isp programmer, it missed the end of the flash and stopped at the end of the boot code. The downloaded flash content looked as if the bootloader was at the end of the flash. Therefore I assumed that my boot address was the right one. This is where I began to lose time double checking everything.

By checking everything I finally changed the gcc code address to 0x1E000. And at this point SPM did erase the memory and write the page content to the flash.

3/ Changing stk500boot.c:

Choose the uart: some atmegas like the atmega128 have 2 uarts. By default it uses UART0, and if you add “#define USE_USART1”, you select the second uart.

The bootloader can use a led to indicate that it is working and an input button to indicate that you want to program the chip. The ports for it are specified here:

/*
 * Pin "PROG_PIN" on port "PROG_PORT" has to be pulled low
 * (active low) to start the bootloader
 * uncomment #define REMOVE_PROG_PIN_PULLUP if using an external pullup
 */
#define PROG_DDR   DDRD
#define PROG_IN    PIND
#define PROG_PIN   PIND2

/*
 * Active-low LED on pin "PROGLED_PIN" on port "PROGLED_PORT"
 * indicates that bootloader is active
 */
#define PROGLED_PIN  7

I do not use the prog pin, and everyone can change this to suit their own board. Change to your led port and pin here. Then, depending on whether the led pin sources or sinks current, write PROGLED_PORT |= (1<<PROGLED_PIN); or PROGLED_PORT &= ~(1<<PROGLED_PIN); at the start of main.

The flash address:

/*
 * Calculate the address where the bootloader starts from FLASHEND and BOOTSIZE
 * (adjust BOOTSIZE below and BOOTLOADER_ADDRESS in Makefile if you want to change the size of the bootloader)
 */
//#define BOOTSIZE 512
#define BOOTSIZE 4096
#define APP_END  (FLASHEND -(2*BOOTSIZE) + 1)

Here BOOTSIZE is defined in words, but APP_END comes out in bytes, with the same value as in the makefile.
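A quick way to check that both places agree is to evaluate the APP_END formula on a PC. In this sketch the FLASHEND values are hard coded assumptions matching the flash sizes (0x1FFFF for the 128 KB of the atmega128, 0x3FFF for the 16 KB of the atmega168); on the AVR they come from the avr-libc device headers.

```c
/* Same formula as the APP_END macro: bootsize in words, result in bytes. */
unsigned long app_end(unsigned long flashend, unsigned long bootsize_words)
{
    return flashend - (2UL * bootsize_words) + 1UL;
}
```

With these values `app_end(0x1FFFFUL, 4096)` gives 0x1E000 and `app_end(0x3FFFUL, 512)` gives 0x3C00, the byte addresses listed in the makefile comments.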

4/ In circuit programming

Once you have changed all the parameters and compiled the bootloader, pick your spi programmer and do two things (assuming the clock fuses are already set up):

  • Set the fuses: BOOTSZ1/0 to the boot size you chose, and BOOTRST enabled (programmed to 0) in order to boot to the bootloader memory instead of $0000 on reset.
  • Program the bootloader with the SPI programmer.

At this point your bootloader should be running on the board. If you call avrdude like

avrdude -p atmega128 -P /dev/ttyUSB0 -c stk500v2 -U flash:w:main.hex

it should program the flash.


5/ What if avrdude does not work?

If avrdude fails at this point, you will ask yourself how to check the protocol. By running “avrdude --help” you see that adding -v -v -v to the command line gives a lot of information, but not enough to debug the protocol.

I had a problem with an stk500v2 command, SPI_MULTI did not pass. A good solution to find a protocol problem is to download avrdude’s source code and in stk500v2.c change this:

#if 0
#define DEBUG(...) fprintf(stderr, __VA_ARGS__)
#else
#define DEBUG(...)
#endif

Change #if 0 to #if 1, recompile (make) and reinstall (sudo make install). Now your avrdude will print all the data sent to and received from the serial port. You will be able to compare the data to what is specified in the protocol here: http://www.atmel.com/images/doc2591.pdf

And if avrdude fails somewhere, you know on which command. In my case SPI_MULTI was disabled because #define REMOVE_CMD_SPI_MULTI was active. You must comment out this define, otherwise avrdude will fail when sending the SPI_MULTI command. And if you use the default avrdude you will have no way to tell that it was this command that failed.

The STK500v2 command and parameter codes can be found in the include file “command.h”.

About binary code size.

On the atmega168pa I need a bootloader size of 512 words, but with the same code as on the atmega128 I get the output below. My binary is 1068 bytes long, so it does not fit in the 512 words (1024 bytes).

Creating Symbol Table: stk500boot.sym
avr-nm -n stk500boot.elf > stk500boot.sym

Size after:
   text	   data	    bss	    dec	    hex	filename
   1068	      4	      0	   1072	    430	stk500boot.elf

-------- end --------

To go down to 512 words, I uncommented “#define REMOVE_CMD_SPI_MULTI”. It removes part of the protocol, but it works with avrdude 6.3.

Creating Symbol Table: stk500boot.sym
avr-nm -n stk500boot.elf > stk500boot.sym

Size after:
   text	   data	    bss	    dec	    hex	filename
    962	      4	      0	    966	    3c6	stk500boot.elf

-------- end --------

In this case it was just because I needed all the memory: the BMS serial protocol takes a lot of program space (about 14700 B out of 16 KB). On the next version I will use an atmega32C1 and the bootloader size will not be a problem.


Configuring a bootloader for your avr may be difficult, but it could be the best way to program your board. With the condensed info here it should be easier to adapt it.


Battery voltage values over serial port over usb.

Screenshot from 2016-03-07 21:52:41

Just added a display of each battery voltage. It will be more useful than I thought for the development of the balancing algorithm. If it is possible to balance before charging, it could be the best solution, because it completely removes the capacitive effect. However, if a battery is too low, the passive draining of the batteries must not go down to this level.

By balancing before charging, you wait for the balancing to finish when maybe you need to charge the car quickly. But if the pack is not too unbalanced or too low, it may be a reliable solution.

Also added a CRC to the transmitted frames in order to be more confident in the transferred data. I do not trust a serial port that much. On the linux side, it has its own thread and a stop char, but a CRC does no harm.
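For readers who want to do the same, here is a minimal C sketch of a framed CRC. The post does not say which CRC the BMS uses, so this is an assumption: CRC-16/CCITT (polynomial 0x1021, initial value 0xFFFF) appended big endian at the end of the frame, with hypothetical function names.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT: small code size, no lookup table. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Sender: append the CRC after len payload bytes (the buffer needs 2 spare bytes).
 * Returns the new frame length. */
size_t frame_append_crc(uint8_t *frame, size_t len)
{
    uint16_t crc = crc16_ccitt(frame, len);
    frame[len]     = (uint8_t)(crc >> 8);
    frame[len + 1] = (uint8_t)(crc & 0xFF);
    return len + 2;
}

/* Receiver: returns 1 when the last two bytes match the CRC of the payload. */
int frame_check(const uint8_t *frame, size_t len)
{
    if (len < 2)
        return 0;
    uint16_t expected = (uint16_t)((frame[len - 2] << 8) | frame[len - 1]);
    return crc16_ccitt(frame, len - 2) == expected;
}
```

The bitwise version is slower than a table, but it costs very little flash, which matters when the firmware already fills most of the 16 KB.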


BMS simulation

Screenshot from 2016-03-04 14:49:15

How to make a BMS simulator to test the AVR program? I wanted to keep the original AVR sources but test them as a linux app. I decided to separate the hardware and pure software sides of the BMS program. I modified the source code such that every time the hardware is used, it is through a C function with no hardware includes. Therefore replacing the hardware C function calls with software C++ simulations and calling the state machine in a loop makes a simulator. I needed to add a graphic view of the battery packs (made with the SDL2 graphic calls) and voilà.

The BMS has a USB serial port to be able to update the firmware and to communicate the battery information. This could be simulated with the tty simulator interface included in Linux. But I stayed with my first idea of using a named pipe, because when it is created with mkfifo you choose the name of the pipe, so there is no need to pass it to the second app. I just needed two named fifos in /tmp/, one for each direction. I implemented them in a child class of the serial communication class. That way the “BMS manager” app can talk to the simulator or to the real hardware with the same functions.
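Creating the two fifos takes only a few lines with the POSIX mkfifo call. The sketch below is in C rather than in the C++ class of the project, and the function names and paths are hypothetical. An already existing fifo is accepted, so both apps can call it in any order.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create one named pipe; an already existing fifo is not an error. */
int make_fifo(const char *path)
{
    struct stat st;

    if (mkfifo(path, 0600) == 0)
        return 0;
    if (errno != EEXIST)
        return -1;
    /* Something is already there: accept it only if it is a fifo. */
    return (stat(path, &st) == 0 && S_ISFIFO(st.st_mode)) ? 0 : -1;
}

/* One fifo per direction, as a serial link has two independent lines. */
int make_fifo_pair(const char *to_sim, const char *from_sim)
{
    return (make_fifo(to_sim) == 0 && make_fifo(from_sim) == 0) ? 0 : -1;
}

/* Remove both fifos when the simulator exits. */
void remove_fifo_pair(const char *to_sim, const char *from_sim)
{
    remove(to_sim);
    remove(from_sim);
}
```

Each app then opens one fifo for reading and the other for writing, for example after `make_fifo_pair("/tmp/bms_to_sim", "/tmp/bms_from_sim")`.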

Simulation speeds up the development process, because it is easier to recompile and run from the PC than to turn switches on and off on the battery pack test setup. I had a ton of bugs and still have a lot to clean up. I prefer bugs in the simulation to fearing a balancing error killing batteries on the real hardware.

As you can see in the picture, the simulator lets you plug/unplug the charger, run/stop the motor, and change the temperature. It is easy to check the various safety cases.

The GUI is coded with FLTK. From the internet pages about FLTK it looked like it only supports OpenGL 1.5, but in fact I ported the 2D drawing library using OpenGL 2.0 from Scoreview and it worked fine. So FLTK has this point in common with the SDL: it gives an OpenGL context directly.

And finally I burnt my 2 AD7280A when installing the board on the test battery pack. I had reverse currents going up the AD7280A and down to the µController (I confused a via silkscreen circle with the pin 1 dot and mounted the chips rotated by -90°). I changed the design anyway: the fact that it is powered by the battery itself makes subtle changes from a standard design for a car.