Adapted by Istvan Albert from Software Carpentry Best Practices in Scientific Computing
Background
- Software is lab equipment for the 21st Century
- Scientists spend a lot of time writing it
- But over 90% are self-taught
- They don't know what "good" looks like
- So we describe 24 practices in 8 groups
Good programmers are 10X more productive than average
Good practices are 10X more productive than average
Rule 1: Write Programs for People not Computers
- Hard to tell if code that's difficult to understand is doing what it's supposed to
- Hard for other scientists to re-use it...
- ...including your future self
Rule 1.1: Keep it simple
- Short-term memory can hold 7±2 items
- So break programs into short, readable functions, each taking only a few parameters
Rule 1.2: Make names consistent, distinctive, and meaningful.
- p doesn't help the reader's short-term memory as much as pressure
- Don't use temp for both "temporary" and "temperature"
- i, j are OK for indices in small scopes
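A minimal sketch of Rules 1.1 and 1.2, assuming Python. The function names, parameter names, and the gas-law example are hypothetical and chosen only for illustration: each function is short, takes a few parameters, and uses names that say what they hold.

```python
GAS_CONSTANT = 8.314  # J / (mol K)


def ideal_gas_pressure(moles, temperature_kelvin, volume_m3):
    """Return pressure in pascals from the ideal gas law, P = nRT / V."""
    return moles * GAS_CONSTANT * temperature_kelvin / volume_m3


def total_pressure(pressure_grid):
    """Sum every cell of a small 2-D grid of pressures (a list of rows)."""
    total = 0.0
    # Short index names are fine here because the scope is only a few lines.
    for i in range(len(pressure_grid)):
        for j in range(len(pressure_grid[i])):
            total += pressure_grid[i][j]
    return total


# "pressure" tells the reader more than "p" would.
pressure = ideal_gas_pressure(moles=1.0, temperature_kelvin=300.0, volume_m3=0.025)
```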
Rule 1.3: Make code style and formatting consistent.
- Which rules don't matter -- having rules does
- Brain assumes all differences are significant
- Every inconsistency slows comprehension
Rule 2: Let the Computer Do the Work
- Computers exist to repeat things quickly
- 99% accuracy ⇒ 63% chance of at least one error per hundred repetitions
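The 63% figure can be checked with a few lines of Python, assuming each repetition succeeds independently with probability 0.99. The variable names are illustrative.

```python
accuracy_per_step = 0.99
repetitions = 100

# Probability that all 100 repetitions are correct.
p_all_correct = accuracy_per_step ** repetitions

# Probability that at least one repetition goes wrong.
p_at_least_one_error = 1 - p_all_correct

print(f"{p_at_least_one_error:.1%}")  # prints 63.4%
```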
Rule 2.1: Make the computer repeat tasks.
- Write little programs for everything
- Even if they're called scripts, macros, or aliases
- Easier to do this with text-based programming systems than with GUIs
Rule 2.2: Save recent commands in a file for re-use.
- Most text-based interfaces do this automatically
- Repeat recent operations using history
- "Reproducibility in the small"
- Saving history supports "reproducibility in the large"
- An accurate record of how a result was produced
- If everything can be captured
Rule 2.3: Use a build tool to automate workflows.
- Originally developed for compiling programs
- Can be used whenever some files depend on others
- Makes workflow explicit
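Build tools such as Make decide what to re-run by comparing file timestamps. The sketch below shows only that core idea in Python. It is not a replacement for a real build tool, and the function name needs_rebuild is hypothetical.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # Any source modified after the target means the target is stale.
    return any(os.path.getmtime(source) > target_time for source in sources)
```

For example, a call like needs_rebuild("figure.png", ["results.csv", "plot.py"]) (hypothetical file names) would report whether the figure has to be regenerated after the data or the plotting script changed.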
Rule 3: Make Incremental Changes
- Most scientists don't have "requirements"
- They are their own users
- Code evolves in tandem with research
- Closest fit from industry is agile development
Rule 3.1: Small steps with frequent feedback
- People can concentrate for 45-90 minutes without a break
- So size each burst of work to fit that
- Longer cycle should be a week or two
Rule 3.2: Use a version control system.
- Tracks changes
- Allows them to be undone
- Supports independent parallel development
- Essential for collaboration
Rule 3.3: Version control EVERYTHING
- Not just software: papers, raw images, ...
- Not gigabytes...
- ...but metadata about those gigabytes
- Leave out things generated by the computer
- Use build tools to reproduce those instead
- Unless they take a very long time to create
Rule 4: Don't Repeat Yourself (or Others)
- Anything repeated in two or more places will eventually be wrong in at least one
- If it's faster to re-create than to discover or understand, fix it
Rule 4.1: There can be only one
- Every piece of data must have a single authoritative representation in the system.
- Define constants exactly once
- Ditto file formats, geographical locations, ...
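One way to apply this in Python, sketched under the assumption that shared values live in a single module that everything else imports. The constant and function names are illustrative.

```python
# Each value is defined exactly once; all other code refers to these names
# instead of retyping the numbers.
AVOGADRO = 6.02214076e23  # 1/mol, exact by SI definition
SECONDS_PER_DAY = 24 * 60 * 60


def molecules(moles):
    """Convert an amount in moles to a number of molecules."""
    return moles * AVOGADRO


def days_to_seconds(days):
    """Convert a duration in days to seconds."""
    return days * SECONDS_PER_DAY
```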
Rule 4.2: Modularize code rather than copying and pasting.
- Reducing code cloning reduces error rates
- Cuts the amount of testing needed
- And increases comprehension
Rule 4.3: Re-use code instead of rewriting it.
- It takes experts years to build high-quality numerical or statistical software
- Your time is better spent doing science on top of that
Rule 5: Plan for Mistakes
- No single practice catches everything
- So practice defense in depth
Note: improving quality increases productivity
Rule 5.1: Don't trust. Verify
- Add assertions to programs to check their operation.
- "This must be true here or there is an error"
- Like diagnostic circuits in hardware
- No point proceeding if the program is broken...
- ...and they serve as executable documentation
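A small sketch of assertions in Python. The function normalize is a hypothetical example: each assert states something that must be true at that point, so a violation stops the program immediately and the checks double as documentation.

```python
def normalize(counts):
    """Convert non-negative counts into fractions that sum to 1."""
    # Preconditions: what the caller must provide.
    assert len(counts) > 0, "counts must not be empty"
    assert all(c >= 0 for c in counts), "counts must be non-negative"

    total = sum(counts)
    assert total > 0, "at least one count must be positive"

    fractions = [c / total for c in counts]

    # Postcondition: what this function promises to return.
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    return fractions
```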
Rule 5.2: Use an off-the-shelf unit testing library.
- Manages setup, execution, and reporting
- Re-run unit tests after every change to the code to check for regression
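As one possible illustration, here is a sketch using Python's standard unittest library. The function gc_content and its tests are hypothetical examples, not part of the original material.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class TestGCContent(unittest.TestCase):
    def test_half_gc(self):
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_lowercase_input(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a file, these tests can be re-run after every change with python -m unittest followed by the file name; the library handles discovery, execution, and reporting.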
Testing is Hard
- "If I knew what the right answer was, I'd have published by now."
- Compare to experimental data
- Or to analytic solutions of simple problems
- Or to old (trusted) programs
- If nothing else, forces scientists to document what "errors" are acceptable
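A sketch of testing against an analytic solution of a simple problem, in Python. The integrator, the choice of test problem, and the tolerance are all assumptions made for illustration. Writing the tolerance down is what documents which errors are acceptable.

```python
import math


def trapezoid_integral(f, a, b, steps):
    """Approximate the integral of f from a to b with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * width)
    return total * width


# The integral of sin(x) from 0 to pi is exactly 2.
numeric = trapezoid_integral(math.sin, 0.0, math.pi, steps=1000)
analytic = 2.0

# The acceptable error is stated explicitly instead of left implicit.
TOLERANCE = 1e-5
assert abs(numeric - analytic) < TOLERANCE
```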
Rule 5.3: Turn bugs into test cases.
- Write a test that fails when the bug is present
- Then work on the code until that test passes...
- ...and no others are failing
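A sketch of the idea in Python, using a hypothetical bug: suppose an earlier version of median returned the wrong value for lists of even length. The first test below would have failed while that bug was present, and it stays in the test suite so the bug cannot quietly return.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    # The hypothetical buggy version returned ordered[middle] here too.
    return (ordered[middle - 1] + ordered[middle]) / 2


def test_median_of_even_length_list():
    # Regression test written from the bug report.
    assert median([1, 2, 3, 4]) == 2.5


def test_median_of_odd_length_list():
    # Existing behaviour that must keep passing after the fix.
    assert median([3, 1, 2]) == 2
```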
Test-Driven Development
- Why wait? Always write the tests, then the code
- Improves focus
- Encourages writing testable code
- And ensures tests actually get written...
- "Red, green, refactor"
Rule 5.4: Use a symbolic debugger.
- Explore the program as it runs
- Better than print statements
- You don't have to re-run...
- ...or guess in advance what you'll need to know
- Use breakpoints to stop program at particular points or when particular things are true
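A minimal sketch of a conditional breakpoint in Python, using the built-in breakpoint() function, which opens the standard pdb debugger by default. The DEBUG flag and the function running_total are illustrative assumptions.

```python
DEBUG = False  # set to True to stop in the debugger


def running_total(values):
    """Add up values, optionally pausing when a negative value appears."""
    total = 0
    for index, value in enumerate(values):
        if DEBUG and value < 0:
            # Stops only when the condition holds; index, value, and total
            # can then be inspected interactively.
            breakpoint()
        total += value
    return total
```

Most debuggers can also set the same kind of conditional breakpoint from outside, without editing the code.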
Rule 6: Optimize Software Only After It Works Correctly
- Even experts find it hard to predict performance bottlenecks
- Small changes to code often have dramatic impact on performance
- So get it right, then make it fast
Rule 6.1: Use a profiler to identify bottlenecks.
- Reports how much time is spent on each line of code
- Re-check on new computers or when switching libraries
- Summarize across unit tests
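A sketch using Python's standard cProfile and pstats modules. Note that cProfile reports time per function call rather than per line, so line-level detail would need a different tool. The function being profiled is a hypothetical stand-in for real analysis code.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately simple loop to give the profiler something to measure."""
    total = 0
    for k in range(n):
        total += k * k
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(100_000)
profiler.disable()

# Collect the report as text, sorted by cumulative time, top 5 entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```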
Rule 6.2: Write code in the highest-level language possible.
- People write the same number of lines of code per hour regardless of language
- So use the most expressive language available to get the "right" version...
- ...then rewrite core pieces (possibly in a lower-level language) to get the "fast" version
Rule 7: Document Design and Purpose not Mechanics
- Goal is to make the next person's life easier
- Focus on things the code doesn't say
- Or doesn't say clearly
- E.g., file formats
- An example is worth a thousand words...
Rule 7.1: Document interfaces and reasons not implementations.
- Interfaces and reasons change more slowly than implementation details, so documenting them is better economics
- And most people care about using code more than understanding it
Rule 7.2: Refactor code in preference to explaining how it works.
- Good code can be understood when read aloud
- Good programmers build libraries so that solving their problem is straightforward
- Again, "red, green, refactor"
Rule 7.3: Embed the documentation for a piece of software in that software.
- Specially-formatted comments or strings
- More likely to be kept up to date
- More accessible to interactive help
- Many modern tools embed code in documentation rather than vice versa
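In Python, embedded documentation usually takes the form of a docstring. The sketch below uses a hypothetical function: its docstring describes the interface, is available through the interactive help() function, and contains an example that the standard doctest module can run as a test.

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence.

    Only the uppercase bases A, C, G, and T are supported.

    >>> reverse_complement("ATGC")
    'GCAT'
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(sequence))
```

Saved in a file, the embedded example can be checked with python -m doctest followed by the file name, which helps keep the documentation in step with the code.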
Rule 8: Collaborate
- Computers were invented to calculate
- The web was invented to collaborate
- Science is more fun when it's shared
Rule 8.1: Use pre-merge code reviews.
- Have someone else review changes before merging in version control
- Significantly reduces errors
- Good way to share knowledge
- It's what makes open source possible
Rule 8.2: Use pair programming
- Code in pairs when bringing someone new up to speed and when tackling particularly tricky problems.
- Two people, one keyboard, one screen
- An extreme form of code review
- Can get a bit tiring if done all the time...
Rule 8.3: Use an issue tracking tool.
- A shared to-do list
- Items can be assigned to people
- Supports comments, links to code and papers, etc.
- "Version control is where we've been, the issue tracker is where we're going"
Gosh, That's a Lot
One step at a time.
- Use text-based interfaces
- Turn history into scripts
- Put everything in version control
- Use test-driven development
Citation: "Best Practices for Scientific Computing",
PLOS Biology, Jan. 2014.