This is a text transcription of the slides from the "Windows: a software engineering odyssey" talk given on Microsoft culture by Mark Lucovsky in 2000. This is hosted here because I wanted to link to the slides, but the only formats available online were PowerPoint and slide-per-page HTML where each page is basically a screenshot of a PowerPoint slide. If you're looking for something on current Microsoft culture, try these links.

Agenda

History of NT

Design Goals/Culture

NT 3.1 vs. Win2k

The next 10 years

NT timeline: first 10 years

2/89: Coding begins

7/93: NT 3.1 ships

9/94: NT 3.5 ships

5/95: NT 3.51 ships

7/96: NT 4.0 ships

12/99: NT 5.0 a.k.a. Windows 2000 ships

Unix timeline: first 20 years

69: coding begins

71: first edition -- PDP 11/20

73: fourth edition -- rewritten in C

75: fifth edition -- leaves Bell Labs, basis for BSD 1.x

79: seventh edition -- one of the best

82: System III

84: 4.2 BSD

89: SVR4 unification of Xenix, BSD, System V

89: NT development begins

History of NT

Team forms 11/89

Six guys from DEC

One guy from MS

Built from the ground up: advanced PC OS; designed for desktop & server; secure, scalable, SMP design; all new code

Schedule: 18 months (only missed our date by 3 years)

History of NT, cont.

Initial effort targeted at the Intel i860, code-named N10; hence the name NT, which doubled as N-Ten and New Technology

Most dev done on i860 simulator running OS/2 1.2

Microsoft built a single board i860 computer code-named Dazzle, including the supporting chipset; ran full kernel, memory management, etc. on the machine

Compiler came from Metaware with weekly UUCP updates sent to my Sun-4/200

MS wrote a PE/COFF linker and a graphical cross debugger

Design longevity

OS code has a long lifetime

You have to base your OS on solid design principles

You have to set goals; not everything can be at the top of the list

You have to design for evolution in hardware, usage patterns, etc.

Only way to succeed is to base your design on a solid architectural foundation

Development environments never get enough attention

Goal setting

First job was to establish high level goals:

Portability: ability to target more than one processor, avoid assembler, abstract away machine dependencies. Purposely started the i386 port very late to avoid falling into a typical Microsoft x86-centric design.

Reliability: nothing should be able to crash the OS. Anything that crashes the OS is a bug. Very radical thinking inside MS, considering Win16 was co-operative multi-tasking in a single address space, and OS/2 had similar attributes with respect to memory isolation.

Extensibility: ability to extend OS over time.

Compatibility: with DOS, OS/2, POSIX, or other popular runtimes; this is the foundation work that allowed us to invent Windows two years into NT OS/2 development.

Performance: all of the above are more important than raw speed!

NT OS/2 design workbook

Design of executive captured in functional specs

Written by engineers, for engineers

Every functional interface was defined and reviewed

Small teams can do this efficiently

Making this process scale is an almost impossible challenge:

Senior developers are inundated with spec reviews and the value of their feedback becomes meaningless

You have to spread review duties broadly and everyone must share the culture

Developing a culture

To scale a dev team, you need to establish a culture:

Common way of evaluating designs, making tradeoffs, etc.

Common way of developing code and reacting to problems (build breaks, critical bugs, etc.)

Common way of establishing ownership of problems

Goal setting can be the foundation for the culture

Keeping culture alive as a team grows is a huge challenge

The NT culture

Portability, reliability, security, and extensibility ingrained as the team's top priorities; every decision was made in the context of these design goals

Everyone owns all the code, so whenever something is busted anyone has a right and a duty to fix it. Works in small groups (< 150 people) where people cover for each other; fails miserably in large groups

Sloppiness is not tolerated. Great idea, but very difficult to nurture as group grows. Abuse and intimidation get way out of control; can't keep calling people stupid and expect them to listen

A successful culture has to accept that mistakes will happen

NT 3.1 vs. Windows 2000

Dev teams

Source control

Process management

Serialized development

Defects

Development team

NT 3.1:

Starts small (6), slowly grows to 200 people

NT culture was commonly understood by all

Windows 2000:

Mass assimilation of other teams into the NT team

NT 4.0 had 800 developers, Windows 2000 had 1400

Original NT culture practiced by the old timers in the group, but keeping the culture alive was difficult due to growth, physical separation, etc.

Diluted culture leads to conflict: accountability (I don't "own" the code that is busted, see Mark!); reliability vs. new features; 64-bit portability vs. new features

Source control system (NT 3.1)

Internally developed, maintained by a non-NT tools team

No branch capability, but not needed for small team

10-12 well isolated source "projects", 6M LOC

Informal project separation worked well; minimal obscure source level dependencies

Small hard drive could easily hold entire source tree

Developer could easily stay in sync with changes made to the system

Source control system (Windows 2000)

Windows team takes ownership of source control system, which is on life support

Branch capability sorely needed, tree copies used as substitutes, so merging is a nightmare

180 source "projects", 29M LOC

No project separation, reaching "up and over" was very common as developers tried to minimize what they had to carry on their machines to get their jobs done

Full source base required about 50GB of disk space

To keep a machine in sync was a huge chore (1 week to set up, 2 hours per day to sync)

Process management (NT 3.1)

Safe sync period in effect for 4 hours each day; all other times, the rule is check-in when ready

Build lab syncs during morning safe sync period, which starts a complete build; build breaks are corrected manually during the build process (1-2 breaks were normal)

Complete build time is 5 hours on 486/50

Build is boot tested with some very minimal testing before release to stress testing; defects corrected with incremental build fixes

4pm, stress testing on ~100 machines begins

Process management (Windows 2000)

Developers not allowed to change source tree without explicit email/written permission; build lab manually approves each check-in using a combination of email, web, and a bug tracking database

Build lab approves about 100 changes each day and manually issues the appropriate sync and build commands

Build breaks are corrected manually; when they occur, all further build processing is halted

A developer that mistypes a build instruction can stop the build lab, which stops over 5000 people

Complete build time is 8 hours on 4-way PIII Xeon 550 with 50GB disk and 512k cache

Build is boot tested and, assuming we get a boot, extensive baseline testing begins

Testing is a mostly manual, semi-automated process

Defects occurring in the boot or test phase must be corrected before the build is "released" for stress testing

4pm, stress testing on ~1000 machines begins

Team size

Product    Devs    Testers
NT 3.1      200        140
NT 3.5      300        230
NT 3.51     450        325
NT 4.0      800        700
Win2k      1400       1700

Serialized Development

The model from NT 3.1 to 2000

All developers on team check in to a single main line branch

Master build lab syncs to main branch and builds releases from that branch

Checked in defect affects everyone waiting for results

Defect rates and serialization

Compile time or run time bugs that occur in a dev's office only affect that dev

Once a defect is checked in, the number of people affected by the defect increases

Best devs are going to check in a runtime or compile time mistake at least twice a year

Best devs will be able to cope with a checked in compile time or run time break very quickly (20 minutes end-to-end)

As the code base gets larger, and as the team gets larger, these numbers typically double

Defect rates data

With serialized development:

Good, small teams operate efficiently

Even the absolute best large teams are always broken and always serialized

Product    Team size    Defects/dev-yr    Fix time/defect    Defects/day    Total fix time
NT 3.1           200                 2                20m              1               20m
NT 3.5           300                 2                25m            1.6               41m
NT 3.51          450                 2                30m            2.5              1.2h
NT 4.0           800                 3                35m            6.6              3.8h
Win2k           1400                 4                40m           15.3             10.2h
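The slide does not say how the last two columns were derived, but they appear to follow from the first three. A minimal sketch, assuming defects/day = team size x defects per dev-year / 365 and total fix time = defects/day x fix time per defect (the function names and the 365-day year are assumptions, not from the talk):

```python
# Assumed model (not stated on the slide):
#   defects/day        = devs * defects_per_dev_year / 365
#   total fix time/day = defects/day * fix minutes per defect
ROWS = [
    # (product, devs, defects per dev-year, fix minutes per defect)
    ("NT 3.1", 200, 2, 20),
    ("NT 3.5", 300, 2, 25),
    ("NT 3.51", 450, 2, 30),
    ("NT 4.0", 800, 3, 35),
    ("Win2k", 1400, 4, 40),
]


def defects_per_day(devs, defects_per_dev_year):
    """Average number of checked-in defects per calendar day."""
    return devs * defects_per_dev_year / 365


def fix_minutes_per_day(devs, defects_per_dev_year, fix_minutes):
    """Minutes per day the serialized main line spends broken."""
    return defects_per_day(devs, defects_per_dev_year) * fix_minutes


if __name__ == "__main__":
    for product, devs, rate, fix in ROWS:
        per_day = defects_per_day(devs, rate)
        hours = fix_minutes_per_day(devs, rate, fix) / 60
        print(f"{product:8} {per_day:5.1f} defects/day {hours:5.1f} h broken/day")
```

Under these assumptions the NT 3.5 through Win2k rows match the table after rounding. The NT 3.1 row comes out slightly higher than the slide, which seems to round defects/day down to 1 before multiplying.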

Dev environment summary

NT 3.1:

Fast and loose; lots of fun & energy

Few barriers to getting work done

Defects serialized parts of the process, but didn't stop the whole machine; minimal downtime

Windows 2000:

Source control system bursting at the seams

Excessive process management serialized the entire dev process; 1 defect stops 1400 devs, 5000 team members

Resources required to build a complete instance of NT were excessive, giving few developers a way to be successful

Focused fixes

Source control

Source code restructuring

Make the large team work like a set of small teams:

Windows is already organized into reasonable sized dev teams

Goal is to allow these teams to work as a team when contributing source code changes rather than as a group of individuals that happen to work for the same VP

Parallel development, team level independence

Automated builds

Source control system

New system identified 3/99 (SourceDepot)

Native branch support

Scalable high speed client-server architecture

New machine setup 3 hours vs. 1 week

Normal sync 5 minutes vs. 2 hours

Transition to SourceDepot done on live Win2k code base

Hand built SLM -> SourceDepot migration system allowed us to keep in sync with the old system while transitioning to SourceDepot without changing the code layout.

Source code restructuring

16 depots, covering each major area of source code

Organization is focused on:

Minimizing cross project dependencies to reduce defect rate

Sizing projects to compile in a reasonable amount of time

To build a project, all you need is the code for that project and the public/root project

Cross project sharing is explicit

New tree layout

The new tree layout features:

Root project houses public

15 additional projects hang off the root

No nested projects

All projects build independently

Cross project dependencies resolved via public, public/internal using checked in interfaces
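The layout rules above can be read as a small set of checkable constraints. The following is only an illustrative sketch, not the tooling the Windows team used: the project names, the `ROOT` constant, and the `check_layout` function are all hypothetical.

```python
ROOT = "root"  # hypothetical name for the root project that houses public


def check_layout(projects):
    """Check a tree layout against the rules listed on the slide.

    `projects` maps a project name to the set of projects it depends on.
    Returns a list of human-readable rule violations (empty if none).
    """
    problems = []
    if ROOT not in projects:
        problems.append("missing root project")
    for name, deps in projects.items():
        # Rule: no nested projects; every project hangs directly off the root.
        if "/" in name:
            problems.append(f"{name}: nested projects are not allowed")
        # Rule: cross project sharing goes through root/public only.
        for dep in sorted(deps):
            if dep != ROOT:
                problems.append(
                    f"{name}: depends on {dep}; share via {ROOT}/public instead"
                )
    return problems


if __name__ == "__main__":
    layout = {
        "root": set(),
        "base": {"root"},
        "shell": {"root", "base"},  # reaches sideways into another project
    }
    for problem in check_layout(layout):
        print(problem)
```

In this sketch a project may depend only on the root, which is one way to make "all you need is the code for that project and the public/root project" hold for every build.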

Team level independence

Each team determines its own check-in policy, enabling rapid, frequent check-ins

Teams are isolated from mistakes by other teams: when errors occur, only the team causing the error is affected; a build, boot, or test break only affects a small subset of the product group

Each team has their own view of the source tree, their own mini build lab, and builds an entire installable build

Any developer with adequate resources can easily duplicate a mini build lab, then build and release a completely installable Windows system

Teams integrate their changes into the "main" trunk one at a time, so there is a high degree of accountability when something goes wrong in "main"

Build breaks will happen, but they are easily localized to the branch level, not the main product codeline

Teams are isolated from mistakes made by other teams: when errors occur, they affect smaller teams; a build, boot, or test break only affects a small subset of the Windows development team

Each team has their own view of the source tree and their own mini build lab: each team's lab is enlisted in all projects and builds all projects; each team needs resources able to build an NT system

Each team's build lab builds, tests, and mini-bvt's a complete standalone system

Automated builds

Build lab runs 100% hands off

10am and 10pm full sync and full build: build failures are auto detected and mailed to the team; successful builds are automatically released with automatic notification to the team

Each VBL can build: 4 platforms (x86 fre/chk, ia64 fre/chk) = 8 builds/day, 56/week

No manual steps at all

7 VBLs in Win2k group

Majority of builds work, but failures, when they occur, are isolated to a single team
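The build counts follow from the schedule, assuming the 8 builds/day figure is the 4 platforms multiplied by the two daily runs (10am and 10pm) and that builds run seven days a week. A minimal sketch of that build matrix, with hypothetical names throughout:

```python
from itertools import product

ARCHES = ("x86", "ia64")
FLAVORS = ("fre", "chk")   # build flavors as abbreviated on the slide
RUNS = ("10am", "10pm")    # assumed: one full build of every platform per run
DAYS_PER_WEEK = 7          # assumed: builds run every day of the week


def daily_builds():
    """Enumerate every (run, arch, flavor) build one VBL produces in a day."""
    return [
        f"{run} {arch} {flavor}"
        for run, arch, flavor in product(RUNS, ARCHES, FLAVORS)
    ]


if __name__ == "__main__":
    builds = daily_builds()
    print(len(builds), "builds/day per VBL")
    print(len(builds) * DAYS_PER_WEEK, "builds/week per VBL")
```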

Productivity gains