Adapted from Software Carpentry Best Practices in Scientific Computing
Background
- Software is lab equipment for the 21st Century
- Scientists spend a lot of time writing it
- But over 90% are self-taught
- They don't know what "good" looks like
- So we describe 24 practices in 8 groups
Good programmers are 10X more productive than average
Good practices are 10X more productive than average
Rule 1: Write Programs for People not Computers
- Hard to tell if code that's difficult to understand is doing what it's supposed to
- Hard for other scientists to re-use it...
- ...including your future self
Rule 1.1: Keep it simple
- Short-term memory can hold 7±2 items
- So break programs into short, readable functions, each taking only a few parameters
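As a minimal sketch of what "short, readable functions" can look like, the example below splits a small calculation into two single-purpose functions. The function names and the temperature example are illustrative assumptions, not part of the original rules.

```python
from statistics import mean


def celsius_to_kelvin(celsius):
    """Convert one temperature from degrees Celsius to kelvin."""
    return celsius + 273.15


def mean_kelvin(readings_celsius):
    """Average a list of Celsius readings and return the result in kelvin."""
    return celsius_to_kelvin(mean(readings_celsius))


# Each function takes one parameter and does one thing,
# so a reader only has to hold one idea in mind at a time.
print(mean_kelvin([20.0, 22.0, 24.0]))
```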
Rule 1.2: Make names consistent, distinctive, and meaningful.
- p doesn't help the reader's short-term memory as much as pressure
- Don't use temp for both "temporary" and "temperature"
- i, j are OK for indices in small scopes
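To illustrate the naming rule, here is the same calculation written twice. The ideal gas example and every name in it are hypothetical, chosen only to show the contrast; the gas constant value is approximate.

```python
# Hard to read: the reader has to guess what p, v, and n mean.
def f(p, v, n):
    return p * v / (n * 8.314462618)


# Easier to read: names carry both meaning and units.
GAS_CONSTANT_J_PER_MOL_K = 8.314462618  # approximate value of R


def ideal_gas_temperature_k(pressure_pa, volume_m3, amount_mol):
    """Temperature in kelvin from the ideal gas law, T = PV / (nR)."""
    return pressure_pa * volume_m3 / (amount_mol * GAS_CONSTANT_J_PER_MOL_K)


# Short index names such as i are fine in a scope this small.
pressures_pa = [100000.0, 101325.0]
for i in range(len(pressures_pa)):
    print(i, ideal_gas_temperature_k(pressures_pa[i], 0.0224, 1.0))
```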
Rule 1.3: Make code style and formatting consistent.
- Which rules don't matter -- having rules does
- Brain assumes all differences are significant
- Every inconsistency slows comprehension
Rule 2: Let the Computer Do the Work
- Computers exist to repeat things quickly
- 99% accuracy per repetition ⇒ 63% chance of at least one error per hundred repetitions
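The 63% figure can be checked with a few lines of arithmetic, assuming each repetition succeeds independently with probability 0.99. The variable names are illustrative.

```python
per_step_accuracy = 0.99
repetitions = 100

# Probability that every one of the 100 repetitions is correct.
p_all_correct = per_step_accuracy ** repetitions

# Probability that at least one repetition goes wrong.
p_at_least_one_error = 1 - p_all_correct

print(f"{p_at_least_one_error:.1%}")  # prints 63.4%
```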
Rule 2.1: Make the computer repeat tasks.
- Write little programs for everything
- Even if they're called scripts, macros, or aliases
- Easier to do this with text-based programming systems than with GUIs
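A "little program" can be very small indeed. The sketch below applies one operation to every matching file in a directory instead of doing it by hand; the function names and the "*.csv" pattern are assumptions made for the example.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in one text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def count_lines_in_directory(directory, pattern="*.csv"):
    """Apply count_lines to every matching file, in sorted order."""
    return {
        path.name: count_lines(path)
        for path in sorted(Path(directory).glob(pattern))
    }
```

Calling `count_lines_in_directory("data")` on a hypothetical `data` directory would return a dictionary mapping each file name to its line count, the same way every time.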
Rule 2.2: Save recent commands in a file for re-use.
- Most text-based interfaces do this automatically
- Repeat recent operations using history
- "Reproducibility in the small"
- Saving history supports "reproducibility in the large"
- An accurate record of how a result was produced
- If everything can be captured
Rule 2.3: Use a build tool to automate workflows.
- Originally developed for compiling programs
- Can be used whenever some files depend on others
- Makes workflow explicit
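The core idea behind Make-style build tools is a timestamp rule: redo a step only if its output is missing or older than one of its inputs. The function below is a rough sketch of that rule for illustration, not a replacement for a real build tool; `needs_rebuild` and the file names in the comment are hypothetical.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


# Example use with hypothetical file names:
# if needs_rebuild("figure.png", ["results.csv", "plot.py"]):
#     ...regenerate figure.png...
```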
Rule 3: Make Incremental Changes
- Most scientists don't have "requirements"
- They are their own users
- Code evolves in tandem with research
- Closest fit from industry is agile development
Rule 3.1: Small steps with frequent feedback
- People can concentrate for 45-90 minutes without a break
- So size each burst of work to fit that
- Longer cycle should be a week or two
Rule 3.2: Use a version control system.
- Tracks changes
- Allows them to be undone
- Supports independent parallel development
- Essential for collaboration
Rule 3.3: Version control EVERYTHING
- Not just software: papers, raw images, ...
- Not gigabytes...
- ...but metadata about those gigabytes
- Leave out things generated by the computer
- Use build tools to reproduce those instead
- Unless they take a very long time to create
Rule 4: Don't Repeat Yourself (or Others)
- Anything repeated in two or more places will eventually be wrong in at least one
- If it's faster to re-create than to discover or understand, fix it
Rule 4.1: There can be only one
- Every piece of data must have a single authoritative representation in the system.
- Define constants exactly once
- Ditto file formats, geographical locations, ...
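One way to follow this rule for constants is to define each one once, by name, and have every function refer to that name. The sketch below uses the Avogadro constant as an example; the function names are illustrative assumptions.

```python
# The single authoritative definition; everything else refers to this name.
AVOGADRO_PER_MOL = 6.02214076e23  # exact by the 2019 SI definition


def molecules_from_moles(amount_mol):
    """Number of molecules in the given amount of substance."""
    return amount_mol * AVOGADRO_PER_MOL


def moles_from_molecules(molecule_count):
    """Amount of substance, in moles, for the given number of molecules."""
    return molecule_count / AVOGADRO_PER_MOL
```

If the value ever needs correcting, there is exactly one place to change it.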
Rule 4.2: Modularize code rather than copying and pasting.
- Reducing code cloning reduces error rates
- Cuts the amount of testing needed
- And increases comprehension
Rule 4.3: Re-use code instead of rewriting it.
- It takes experts years to build high-quality numerical or statistical software
- Your time is better spent doing science on top of that
Rule 5: Plan for Mistakes
- No single practice catches everything
- So practice defense in depth
Note: improving quality increases productivity
Rule 5.1: Don't trust. Verify
- Add assertions to programs to check their operation.
- "This must be true here or there is an error"
- Like diagnostic circuits in hardware
- No point proceeding if the program is broken...
- ...and they serve as executable documentation
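A minimal sketch of assertions used as both checks and executable documentation follows. The `normalize` function is a hypothetical example, and the tolerance of 1e-9 is an assumption.

```python
def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    # Preconditions: what must be true of the input.
    assert len(values) > 0, "need at least one value"
    assert all(v >= 0 for v in values), "values must be non-negative"
    total = sum(values)
    assert total > 0, "values must not all be zero"

    result = [v / total for v in values]

    # Postcondition: what must be true of the output.
    assert abs(sum(result) - 1.0) < 1e-9, "normalized values must sum to 1"
    return result


print(normalize([1, 1, 2]))  # prints [0.25, 0.25, 0.5]
```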
Rule 5.2: Use an off-the-shelf unit testing library.
- Manages setup, execution, and reporting
- Re-run unit tests after every change to the code to check for regression
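As one possible illustration, the sketch below uses Python's standard `unittest` library; the rule itself does not name a library, and `rescale` is a hypothetical function under test. The suite is run programmatically here so the example stays in one file.

```python
import unittest


def rescale(values):
    """Linearly rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


class TestRescale(unittest.TestCase):
    def test_endpoints_map_to_zero_and_one(self):
        self.assertEqual(rescale([2.0, 4.0, 6.0]), [0.0, 0.5, 1.0])

    def test_constant_input_is_rejected(self):
        # All-equal input has zero range, so rescaling is undefined.
        with self.assertRaises(ZeroDivisionError):
            rescale([3.0, 3.0])


# The library handles discovery of test methods, execution, and reporting.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRescale)
result = unittest.TextTestRunner(verbosity=1).run(suite)
```

In day-to-day use the tests would more likely live in their own file and be re-run with `python -m unittest` after each change.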
Testing is Hard
- "If I knew what the right answer was, I'd have published by now."
- Compare to experimental data
- Or to analytic solutions of simple problems
- Or to old (trusted) programs
- If nothing else, forces scientists to document what "errors" are acceptable
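One way to test against an analytic solution is sketched below: a simple trapezoid-rule integrator is checked against the known integral of sin(x) from 0 to π, which is exactly 2. The integrator, the step count, and the tolerance are all assumptions chosen for the example; picking the tolerance is the "document what errors are acceptable" step.

```python
import math


def trapezoid(f, a, b, n):
    """Approximate the integral of f from a to b using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


# The acceptable error is stated explicitly rather than left implicit.
TOLERANCE = 1e-4

estimate = trapezoid(math.sin, 0.0, math.pi, 1000)
assert abs(estimate - 2.0) < TOLERANCE
```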
Rule 5.3: Turn bugs into test cases.
- Write a test that fails when the bug is present
- Then work on the code until that test passes...
- ...and no others are failing
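The sketch below shows what such a test might look like. The bug is hypothetical: imagine an earlier `median` that returned the upper middle element for even-length input. The test pins the correct behaviour so the bug cannot quietly return.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Even length: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


def test_median_even_length_regression():
    # Would have failed against the hypothetical buggy version,
    # which returned 3 for this input.
    assert median([1, 2, 3, 4]) == 2.5


test_median_even_length_regression()
```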
Test-Driven Development
- Why wait? Always write the tests, then the code
- Improves focus
- Encourages writing testable code
- And ensures tests actually get written...
- "Red, green, refactor"
Rule 5.4: Use a symbolic debugger.
- Explore the program as it runs
- Better than print statements
- You don't have to re-run...
- ...or guess in advance what you'll need to know
- Use breakpoints to stop program at particular points or when particular things are true
Rule 6: Optimize Software Only After It Works Correctly
- Even experts find it hard to predict performance bottlenecks
- Small changes to code often have dramatic impact on performance
- So get it right, then make it fast
Rule 6.1: Use a profiler to identify bottlenecks.
- Reports how much time is spent in each function or on each line of code
- Re-check on new computers or when switching libraries
- Summarize across unit tests
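As a rough sketch, Python's standard `cProfile` module can be used as shown below. Note that `cProfile` reports time per function rather than per line; line-level timing needs a separate tool. The function being profiled is a made-up example.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately simple loop to give the profiler something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum_of_squares(200_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```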
Rule 6.2: Write code in the highest-level language possible.
- People write the same number of lines of code per hour regardless of language
- So use the most expressive language available to get the "right" version...
- ...then rewrite core pieces (possibly in a lower-level language) to get the "fast" version
Rule 7: Document Design and Purpose not Mechanics
- Goal is to make the next person's life easier
- Focus on things the code doesn't say
- Or doesn't say clearly
- E.g., file formats
- An example is worth a thousand words...
Rule 7.1: Document interfaces and reasons not implementations.
- Interfaces and reasons change more slowly than implementation details, so documenting them is better economics
- And most people care about using code more than understanding it
Rule 7.2: Refactor code in preference to explaining how it works.
- Good code can be understood when read aloud
- Good programmers build libraries so that solving their problem is straightforward
- Again, "red, green, refactor"
Rule 7.3: Embed the documentation for a piece of software in that software.
- Specially-formatted comments or strings
- More likely to be kept up to date
- More accessible to interactive help
- Many modern tools embed code in documentation rather than vice versa
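In Python, embedded documentation usually takes the form of a docstring, which interactive help can display and which the standard `doctest` module can check against the code. The `dilute` function below is a hypothetical example.

```python
import doctest


def dilute(concentration, factor):
    """Return the concentration after diluting by the given factor.

    >>> dilute(10.0, 4)
    2.5
    """
    return concentration / factor


# The docstring travels with the function and is what help(dilute) shows.
print(dilute.__doc__)

# Run the example embedded in the docstring; prints nothing if it passes.
doctest.run_docstring_examples(dilute, {"dilute": dilute})
```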
Rule 8: Collaborate
- Computers were invented to calculate
- The web was invented to collaborate
- Science is more fun when it's shared
Rule 8.1: Use pre-merge code reviews.
- Have someone else review changes before merging in version control
- Significantly reduces errors
- Good way to share knowledge
- It's what makes open source possible
Rule 8.2: Use pair programming
- Code in pairs when bringing someone new up to speed and when tackling particularly tricky problems.
- Two people, one keyboard, one screen
- An extreme form of code review
- Can get a bit tiring if done all the time...
Rule 8.3: Use an issue tracking tool.
- A shared to-do list
- Items can be assigned to people
- Supports comments, links to code and papers, etc.
- "Version control is where we've been, the issue tracker is where we're going"
Gosh, That's a Lot
One step at a time.
- Use text-based interfaces
- Turn history into scripts
- Put everything in version control
- Use test-driven development
Citation: "Best Practices for Scientific Computing",
PLOS Biology, Jan. 2014.
Annotated Best Practices
Edit the markdown document: web/2016/day2/docs/best_practices_keep_or_toss.md
Add your choices below. Write them in the following format.
by ialbert
KEEP Rule 1.1: Keep it simple
Simplicity is the most powerful virtue that any process
can have. There is only one problem: it is kind of
difficult to keep it simple.
KEEP Rule 3.1: Small steps with frequent feedback
There is great value in keeping the entire pipeline working at most times.
Save often, commit often. Rerun often.
KEEP Rule 3.3: Version control EVERYTHING
While git was designed for software you should keep everything (other
than large datasets) in it. You get free backup and replication with it!
TOSS Rule 5.4: Use a symbolic debugger.
There is nothing wrong with print statements.
Symbolic debuggers promote writing complex programs.
If you can't debug a program with simple print
statements, your program may already be too complicated.
TOSS Rule 8.1: Use pre-merge code reviews
This is a concept borrowed from software engineering, where
it is assumed that all people on a team work on the exact same
and relatively simple problem. This rarely happens in the sciences.
This rule is one of those "feel good" rules that are just unrealistic in
scientific practice.
TOSS Rule 8.2: Use pair programming
Pair programming is again a concept borrowed from
software engineering. But it disregards the fact that most
software engineers need to solve
far simpler and far better defined problems than
scientists do. It is sort of a pipe dream that we can do this.