Exploring the NEC V20 CPU

A V20 CPU, courtesy of Konstantin Lanzet (CC)

The NEC V20 was a 16-bit CPU released in 1984. It is pin-compatible with the Intel 8088, and clones the 8088's instruction set. It also includes newer instructions that had been introduced two years prior by the Intel 80186 and 80286, but does not include the latter's protected-mode features.

Besides the 186 instruction set, NEC also added new opcodes and instruction prefixes of their own, enabling performance enhancements for software that could detect or require a V20.

NEC also incorporated various other improvements Intel had made in the 80186, namely hardware support for address calculation and for division and multiplication. The former was a big benefit overall, as it could decrease the execution time of any instruction utilizing a memory address operand.

The V20 was one of the first 3rd-party, drop-in CPU replacements for the PC. An owner of an original IBM 5150 could pull the 8088 out of their motherboard's CPU socket, install a V20, and enjoy improved software support and a decent performance increase at a rather attractive price. Upgrade kits were sold for as little as $15.95. Of course, this made Intel unhappy, and lawsuits followed.

In ideal situations, the V20 could achieve performance somewhere between an 8086 and a 286 - making it an attractive upgrade indeed. For owners of the 8086, the V30 was offered as a similar upgrade.

Over the past few years I have done extensive research on the 8088. I thought I would spend some time now poking around at the V20, which has yet to be so thoroughly explored. I am not Ken Shirriff, so no in-depth, gate level analysis from die photos here. Just the information I can glean from controlling the CPU via an Arduino microcontroller. Speaking of which...

Improving the Arduino8088

One of the things that spurred my curiosity over the V20 is an improved Arduino8088 board. The board has been modified for 3V operation, and now includes a speaker and status LEDs. The software that drives it has been rewritten for a much faster Arduino - the Arduino DUE. (The GIGA has since been released and is much faster in turn than the DUE, but that's a future project.)

Arduino8088 v1.1

The MEGA that I used to generate the original 8088 test suite used an ATmega2560 CPU clocked at 16 MHz. By comparison, the DUE has an 84 MHz SAM3X8E ARM Cortex M3. This by itself is a nice perk, but the biggest benefit of switching to the DUE is support for native USB serial. This runs at 480 Mbps compared to the maximum of 2 Mbps possible via the MEGA's USB to serial conversion chip. Considering a single byte has to be sent and five bytes received as acknowledgment to clock the CPU and retrieve its state, the serial control protocol was the main bottleneck. At 2M baud, the protocol itself limits us to a theoretical maximum clock speed of 333 kHz, and that's not counting latency, GPIO state change delays, and other issues which make the effective clock rate far lower still.
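The 333 kHz bound appears to come from dividing the baud rate by the six bytes exchanged per clock. The short Python sketch below reproduces that arithmetic. As an assumption that is not stated in the text, it also shows what the bound would become if each byte costs 10 bit times on the wire (8N1 framing). The function and constant names are made up for illustration.

```python
# Back-of-the-envelope bound on the CPU clock rate imposed by the serial protocol.
BYTES_PER_CYCLE = 1 + 5  # one command byte out, five state bytes back (from the text)


def max_clock_hz(baud, bit_times_per_byte):
    """Upper bound on CPU cycles per second if the serial link is the only cost."""
    return baud / (bit_times_per_byte * BYTES_PER_CYCLE)


# Dividing the baud rate directly by six bytes reproduces the quoted figure.
quoted = max_clock_hz(2_000_000, 1)

# Assumption: 8N1 framing, i.e. start bit + 8 data bits + stop bit per byte.
framed = max_clock_hz(2_000_000, 10)

print(f"baud / 6 bytes:        {quoted / 1000:.1f} kHz")
print(f"with 8N1 framing cost: {framed / 1000:.1f} kHz")
```

Either way the point stands: on the MEGA the link, not the CPU under test, sets the ceiling.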

Overall, the DUE ends up being about 7 times faster running validation tasks than the MEGA, and I believe even more optimization is possible. Generating a test set of 10,000 opcode executions used to take over 20 minutes; now it takes only 3.

V20 Reset Vector

A quick note about the V20 reset vector. The original 8088 has a reset vector of FFFF:0000. Intel changed the reset vector to F000:FFF0 on the 286. The V20 keeps the same reset vector as the 8088, at FFFF:0000. This is important for implementing proper wrapping behavior. A popular set of '186' test ROMs by Artlav actually assumes a 286 reset vector, so keep that in mind.
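To see why the choice of reset vector matters for wrapping, here is a small Python sketch of real-mode address arithmetic on a 20-bit bus. The helper names are hypothetical. Both vectors point at the same first byte, but they diverge once execution runs off the end of the segment.

```python
def physical(seg, off):
    """20-bit physical address for a segment:offset pair (8088-style bus)."""
    return ((seg << 4) + off) & 0xFFFFF


def advance(seg, ip, n):
    """Advance IP by n bytes. IP wraps at 16 bits; CS is left unchanged."""
    return seg, (ip + n) & 0xFFFF


# Both reset vectors name the same physical byte.
assert physical(0xFFFF, 0x0000) == physical(0xF000, 0xFFF0) == 0xFFFF0

# Sixteen bytes later they no longer agree.
v20_style = physical(*advance(0xFFFF, 0x0000, 0x10))  # address bus wraps to 0
i286_style = physical(*advance(0xF000, 0xFFF0, 0x10))  # IP wraps to F000:0000

print(hex(v20_style), hex(i286_style))
```

A ROM image written for F000:FFF0 can therefore behave differently when started at FFFF:0000, even though the first fetch lands on the same byte.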

V20 Mnemonics

NEC defined their own mnemonics for the 8088 ISA, probably for legal reasons; we will ignore NEC's naming conventions except when discussing V20-specific instructions. If you're curious as to what NEC named things, here is a translation table from Intel to NEC naming conventions:

Fuzzing the V20

The NEC is less resilient to undefined instruction forms than the 8088. The 8088 will not stop executing even if fed a random stream of bytes (assuming we filter HLT). The 8088 has no concept of an illegal instruction, although the operations it performs for certain invalid instruction encodings may be of questionable usefulness. The V20, in contrast, is a bit more fussy. It doesn't have an illegal instruction exception either, but certain instruction forms can simply cause the V20 to halt, so we must do a bit of masking when fuzzing.

With our new "ArduinoV20" in hand, we can explore the behavior of V20 instructions, both defined and undefined.

V20 Opcode Notes

The V20 does not perform the dubious POP CS like the 8088 does. As Intel did on the 286, the V20 repurposes 0F as the first byte of a set of extended two-byte opcodes. We'll take a look at those in more detail below.

60-6F

On the 8088, 60-6F are aliases for the relative jump opcodes 70-7F. Not so on the V20. Several new 186+ instructions live here, as well as a few new instructions and two prefixes unique to the V20.

60 PUSHA

(186+) Pushes the 8 main 16-bit registers to the stack. The value pushed for SP is the value of SP before any register is pushed.

61 POPA

(186+) The complement to PUSHA, popping the registers off the stack (except for SP, which is ignored).
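A minimal Python model of the pair may help. It assumes the push order Intel documents for the 186 (AX, CX, DX, BX, SP, BP, SI, DI) and models the stack as a hypothetical dictionary keyed by offset. It is a sketch of the semantics described above, not of how either CPU implements them.

```python
# Push order as documented by Intel for PUSHA; POPA uses the reverse.
PUSHA_ORDER = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def pusha(regs, stack):
    """Push all eight registers; the SP slot holds SP as it was before PUSHA."""
    sp_before = regs["SP"]
    for name in PUSHA_ORDER:
        value = sp_before if name == "SP" else regs[name]
        regs["SP"] = (regs["SP"] - 2) & 0xFFFF
        stack[regs["SP"]] = value


def popa(regs, stack):
    """Pop in reverse order; the stored SP word is skipped, not loaded."""
    for name in reversed(PUSHA_ORDER):
        value = stack[regs["SP"]]
        regs["SP"] = (regs["SP"] + 2) & 0xFFFF
        if name != "SP":
            regs[name] = value


regs = {"AX": 1, "CX": 2, "DX": 3, "BX": 4, "SP": 0x100, "BP": 5, "SI": 6, "DI": 7}
stack = {}
pusha(regs, stack)
print(hex(regs["SP"]), hex(stack[0xF6]))  # SP after PUSHA, and the saved SP word
```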

62 BOUND

(186+) BOUND takes a modr/m byte, and its memory operand consists of two signed words, giving this instruction a unique operand type. The value of the register operand is interpreted as a signed word index. The two signed words of the memory operand are interpreted as the starting and ending bounds. If the index falls outside those bounds, INT 5 is executed.
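The check can be sketched in a few lines of Python. This assumes the bounds are inclusive, as Intel documents for BOUND; I have not verified the edge cases on the V20, and the function names are illustrative only.

```python
def to_signed16(word):
    """Reinterpret a 16-bit value as a signed word."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def bound_traps(index, lower, upper):
    """Return True if BOUND would execute INT 5 for these raw 16-bit values.

    Assumption: both bounds are inclusive (Intel's documented behaviour).
    """
    i, lo, hi = (to_signed16(v) for v in (index, lower, upper))
    return i < lo or i > hi


print(bound_traps(0x0005, 0x0000, 0x000A))  # inside the bounds
print(bound_traps(0xFFFF, 0x0000, 0x000A))  # 0xFFFF is -1 when signed
```

Note that the comparison is signed, so an index of FFFF is treated as -1 rather than 65535.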

63 Undefined

63 is an undefined opcode that takes a 16-bit modr/m operand, which it reads and then spends approximately 60 cycles doing nothing. It does not modify any registers or flags. This turns out to be pretty useful for test generation, as during those 60 cycles the instruction queue is filled without any side effects. This allowed me to execute and generate tests for instructions from a fully prefetched state.

64 REPNC

(V20) The V20 defines a new prefix for use with string operations. With this prefix, a string operation will run until CX is exhausted as normal, but with an additional exit if the carry flag is set. This prefix, along with its twin REPC, is intended for use with the string comparison instruction CMPSB, although it will affect all string operations except for INS and OUTS, for which it acts like a plain REP prefix. If carry is set at the start of the instruction, one iteration will still be performed - carry is only checked after each iteration.

65 REPC

(V20) Similar to REPNC but with inverted carry flag logic, REPC will repeat the string operation as long as carry is set. One iteration will always be performed - carry is only checked after each iteration. Acts like a regular REP prefix when attached to INS or OUTS.
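The loop structure of both prefixes can be sketched as follows. This is a Python model of the behaviour described above, with a hypothetical `step` callback standing in for one iteration of the string operation. One assumption goes beyond the text: that CX = 0 on entry skips the operation entirely, as it does for a plain REP.

```python
def rep_carry(cx, step, repeat_while_carry):
    """Model REPC (repeat_while_carry=True) or REPNC (repeat_while_carry=False).

    step() performs one string-op iteration and returns the carry flag it
    produced. Returns the value left in CX when the loop exits.
    """
    while cx != 0:  # assumption: CX == 0 means no iterations, as with plain REP
        carry = step()          # the incoming carry flag is never consulted
        cx = (cx - 1) & 0xFFFF
        if carry != repeat_while_carry:
            break               # carry is only checked after each iteration
    return cx


# REPNC: keep going while carry stays clear; the third compare sets carry.
carries = iter([False, False, True, False])
print(rep_carry(10, lambda: next(carries), repeat_while_carry=False))
```

Because the carry test sits after the call to `step`, the state of the flag before the instruction starts has no influence on the first iteration.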

66-67 FPO2

(V20) The V20 defines two additional ESC opcodes here for use with a floating point math coprocessor, which it calls FPO2. An 8087 won't know what to do with these; but NEC had planned its own math coprocessor, the UPD72191, which might have been designed to work with these additional opcodes.

68, 6A PUSH imm

(186+) Pushes an immediate value to the stack, either as 16-bits (68) or 8-bits (6A). In 8-bit mode, the immediate operand is sign-extended.
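Sign extension here means the top bit of the byte is copied into the upper eight bits of the word that lands on the stack. A tiny Python illustration, with function names of my own choosing:

```python
def sign_extend8(imm8):
    """Widen an 8-bit immediate to 16 bits by replicating bit 7."""
    imm8 &= 0xFF
    return imm8 | 0xFF00 if imm8 & 0x80 else imm8


def pushed_word(opcode, imm):
    """Word written to the stack by PUSH imm for opcode 68 or 6A."""
    if opcode == 0x68:
        return imm & 0xFFFF      # 16-bit immediate, used as-is
    if opcode == 0x6A:
        return sign_extend8(imm)  # 8-bit immediate, sign-extended
    raise ValueError("not a PUSH imm opcode")


print(hex(pushed_word(0x6A, 0x7F)), hex(pushed_word(0x6A, 0x80)))
```

So `6A 80` pushes FF80, not 0080, which is worth remembering when using the short form to save a byte.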

69, 6B IMUL imm

(186+) This form of multiplication marks the first appearance of three-operand instructions, taking both a modr/m and an immediate. The product is constrained to a single register.

6C, 6D INS

(186+) When utilized with a REP prefix, these instructions act like the string operation MOVS, except using an IO port specified by DX as the source and ES:DI as the destination. DI is updated per iteration.

6E, 6F OUTS

(186+) When utilized with a REP prefix, these instructions act like string operations with a source of DS:SI (segment-overridable) and an IO port specified by DX as the destination. SI is updated per iteration.

82 Group 1: Bitwise Operations

Like on the 8088, 82 appears aliased to 80, performing bitwise operations on an immediate byte operand.

8E MOV sreg, r/m16

POP CS may not be implemented, but the V20 happily overwrites CS if specified as the register destination in this form. NEC documentation states such a form is undefined.

8F POP

The register forms of this instruction are undefined. On the 8088, their behavior is strange - on the V20, their behavior is completely broken. The POP instruction itself will appear to act as a NOP, but it will likely break the following instruction by injecting the stack memory reads that didn't occur during POP itself. Bizarre. This was known about in the day; see this text file. Some further discussion of this can be found here.

A6, A7 CMPS

One peculiar difference from the 8088 here is the order in which CMPS on the V20 accesses its operands. When prefixed by a REPE/REPNE, ES is read first, and then the overridable DS segment is read second. When not prefixed, CMPS behaves like an 8088 and reads from DS first. This is an odd quirk - if it always read from ES first, one might propose that it was due to shared microcode between CMPS and SCAS. Does the V20 have different microcode for prefixed and non-prefixed CMPS instructions? Without decoding the V20 microcode, we can only speculate.

C0-C1 Bitwise Operations

(186+) C0 and C1 are now new group instructions for bitwise operations with an immediate operand. Unlike on the 80186, the immediate byte operand is not masked. Extension 6 performs SHL.
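The practical effect of masking shows up with counts of 32 or more. The Python sketch below compares a 16-bit SHL with the count masked to its low five bits (the 80186 behaviour referred to above) against one that uses the count as given. It models only the result value, not flags or timing, and the function name is illustrative.

```python
def shl16(value, count, mask_count):
    """16-bit shift left, optionally masking the count to 5 bits first."""
    if mask_count:
        count &= 0x1F  # masked: only the low five bits of the count are used
    return (value << count) & 0xFFFF


count = 33
print(hex(shl16(0x0001, count, mask_count=True)))   # shifts by 33 & 31 = 1
print(hex(shl16(0x0001, count, mask_count=False)))  # shifts by 33, nothing left
```

For counts below 32 the two agree, which is why well-behaved software rarely notices the difference.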

C8 ENTER

(186+) ENTER was designed as support for creating stack frames in higher-level languages such as Pascal. Intel documentation states that the nesting level is determined by the second immediate operand modulo 32. The V20 apparently ignores this detail and uses the unmasked value of the immediate.

D0-D3 Bitwise Operations

Similar to C0-C1, the V20 does not mask the value of CL used as a count with D2-D3. The 8088 has undocumented instructions at extension 6 here; SETMO for D0-D1, and SETMOC for D2-D3. The V20 does not perform either; extension 6 is aliased to SHL.

D4 AAM, D5 AAD

These BCD operations take an immediate byte operand representing the number base. Most assemblers will assume a value of 0A (decimal 10) to represent base 10 or traditional decimal values. The 8088 can actually accept any value for the immediate base for both D4 and D5. The V20 honors the immediate value for AAM (D4) but ignores it for AAD (D5). If an immediate value of 0 is provided to AAM, the AH register is set to FF and AL is unchanged, but the Sign, Zero and Parity flags are updated against the current value of AL.
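As a rough Python model of the V20 behaviour just described: AAM splits AL by the supplied base, while AAD always recombines with base 10 regardless of its immediate. Flags are left out, and the function names are mine.

```python
def aam_v20(ah, al, base):
    """AAM on the V20 per the description above. Returns the new (AH, AL)."""
    if base == 0:
        return 0xFF, al            # AH forced to FF, AL left alone
    return al // base, al % base   # AH = quotient, AL = remainder


def aad_v20(ah, al, base):
    """AAD on the V20: the immediate base is ignored and 10 is always used."""
    return 0x00, (al + ah * 10) & 0xFF


print(aam_v20(0x00, 0x3F, 0x10))  # split 0x3F in base 16
print(aad_v20(0x06, 0x03, 0x10))  # base 16 requested, base 10 applied
```

Software that relies on the base-16 form of AAD as a quick nibble-packing trick will therefore get different results on a V20.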

D6 XLAT

The 8088 has an undocumented instruction SALC here. The V20 does not implement it. At first glance it would appear D6 is an alias to D7, as it seems to perform XLAT correctly. However this is not a normal XLAT; for some reason D6 takes 14-18 more cycles than D7. If anyone has a clue why this might be, please let me know.

F6-F7.1

Opcode extension 1 of F6-F7 is aliased to extension 0 and performs TEST.

F6-F7.7 IDIV

It was only recently discovered that prepending a REP prefix to IDIV on the 8088 inverts the sign of the quotient. Whether or not NEC was aware of this particular quirk, they did not copy it. REP prefixes have no effect on IDIV on the V20.

FE.3-7

You may have wondered why FE is a group opcode with only two instructions. The explanation is that on the 8088, it shares its microcode with FF, and the V20 seems to do the same. The width bit (0) is valid for both instructions, making FE simply an 8-bit version of FF.

It is no problem to increment or decrement a byte value, so those are the only two extensions officially defined for FE. Performing a CALL or JUMP with a byte value is dubious, and attempting a far jump or call with just a byte is completely nonsensical. Nevertheless, both the 8088 and V20 will attempt it if extensions 3-7 of FE are provided.

The 8088 muddles through FE regardless of form provided, doing odd things like updating half of registers and pushing single bytes to the stack. I haven't fully explored the behavior of V20 but one immediate difference is that FE.3 and FE.5 will halt the CPU if a register addressing mode is used, which I suppose is a fair response when asked to do the impossible.

FF.7

FF.7 is an alias for FF.6, and also performs PUSH.

0F Extended Opcodes

The V20 has several NEC-specific instructions defined as two-byte opcodes, with the first byte being 0F. 0F isn't simply treated as a new prefix. When the CPU reads a prefix and then a normal, non-extended opcode, both bytes are tagged as "First Byte" reads. When the CPU reads 0F and then the second opcode byte, the second opcode byte is tagged "Subsequent Byte" instead. However, even if 0F was a prefix internally, this behavior would be required so as not to confuse the 8087.

There is a one-cycle delay after 0F is read before the next opcode byte is read.

The 0F opcode space is pretty sparse. NEC avoided the first 16 instructions to prevent conflicts with the 80286 ISA, but did not fully pack the opcode space otherwise. If you're curious how NEC and Intel coordinated to share the 0F opcode space, the answer is: they did not. Intel would largely ignore NEC's instructions and reuse several of these opcodes for their own purposes on the 80386.

0F10-0F17 TEST1, CLR1, SET1, NOT1

The first 8 extended instructions either test, clear, set or invert bits in their modr/m operand, with the bit number to target specified by the CL register. The 'W' bit (0) is valid to determine either 8-bit or 16-bit operation.