Bootcamp


BEST PRACTICES

Adapted from Software Carpentry Best Practices in Scientific Computing

Background

Software is lab equipment for the 21st Century
Scientists spend a lot of time writing it
But over 90% are self-taught
They don't know what "good" looks like
So we describe 24 practices in 8 groups

Good programmers are 10X more productive than average

Good practices are what make them 10X more productive

Rule 1: Write Programs for People not Computers

Hard to tell if code that's difficult to understand is doing what it's supposed to
Hard for other scientists to re-use it...
...including your future self

Rule 1.1: Keep it simple

Short-term memory can hold 7±2 items
So break programs into short, readable functions, each taking only a few parameters

Rule 1.2: Make names consistent, distinctive, and meaningful.

"p" doesn't help the reader's short-term memory as much as "pressure"
Don't use temp for both "temporary" and "temperature"
i, j are OK for indices in small scopes
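
As a small illustration of the naming advice above, here is a minimal sketch that writes the same calculation twice. The ideal gas law and every name in it (`GAS_CONSTANT`, `p`, `pressure`, and the parameter names) are hypothetical choices for illustration only.

```python
# Hypothetical example: the ideal gas law, P = nRT / V.

GAS_CONSTANT = 8.314  # J / (mol K)


def p(n, t, v):
    # Terse names: the reader has to remember what n, t, and v stand for.
    return n * GAS_CONSTANT * t / v


def pressure(moles, temperature_kelvin, volume_m3):
    # Descriptive names carry the meaning and the units with them.
    return moles * GAS_CONSTANT * temperature_kelvin / volume_m3
```

Both functions return the same number; only the second can be read without looking anything up.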

Rule 1.3: Make code style and formatting consistent.

Which rules you pick doesn't matter; having rules does
Brain assumes all differences are significant
Every inconsistency slows comprehension

Rule 2: Let the Computer Do the Work

Computers exist to repeat things quickly
99% accuracy per step ⇒ 63% chance of at least one error per hundred repetitions
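
The 63% figure can be checked with a couple of lines, assuming each repetition succeeds or fails independently of the others. The function name below is made up for this sketch.

```python
def chance_of_at_least_one_error(accuracy, repetitions):
    """Probability that at least one of `repetitions` independent steps fails."""
    # All steps succeed with probability accuracy ** repetitions,
    # so at least one fails with the complement of that.
    return 1.0 - accuracy ** repetitions


print(round(chance_of_at_least_one_error(0.99, 100), 3))  # 0.634
```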

Rule 2.1: Make the computer repeat tasks.

Write little programs for everything
Even if they're called scripts, macros, or aliases
Easier to do this with text-based programming systems than with GUIs

Rule 2.2: Save recent commands in a file for re-use.

Most text-based interfaces do this automatically
- Repeat recent operations using history
- "Reproducibility in the small"
Saving history supports "reproducibility in the large"
- An accurate record of how a result was produced
- If everything can be captured

Rule 2.3: Use a build tool to automate workflows.

Originally developed for compiling programs
Can be used whenever some files depend on others
Makes workflow explicit
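
The core idea behind a build tool can be sketched in a few lines of Python: rebuild a file only when it is missing or older than something it depends on. This is a toy under that one assumption, not a replacement for a real tool such as Make, and `is_stale` and `build` are hypothetical helper names.

```python
import os


def is_stale(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


def build(target, dependencies, action):
    """Run `action` only when `target` is out of date; report whether it ran."""
    if is_stale(target, dependencies):
        action()
        return True
    return False
```

A call such as `build("plot.txt", ["data.csv"], make_plot)` (with hypothetical file names and a hypothetical `make_plot` function) states the dependency in the code itself, which is what makes the workflow explicit.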

Rule 3: Make Incremental Changes

Most scientists don't have "requirements"
- They are their own users
- Code evolves in tandem with research
Closest fit from industry is agile development

Rule 3.1: Small steps with frequent feedback

People can concentrate for 45-90 minutes without a break
So size each burst of work to fit that
Longer cycle should be a week or two

Rule 3.2: Use a version control system.

Tracks changes
Allows them to be undone
Supports independent parallel development
Essential for collaboration

Rule 3.3: Version control EVERYTHING

Not just software: papers, raw images, ...
- Not gigabytes...
- ...but metadata about those gigabytes
Leave out things generated by the computer
- Use build tools to reproduce those instead
- Unless they take a very long time to create

Rule 4: Don't Repeat Yourself (or Others)

Anything repeated in two or more places will eventually be wrong in at least one
If it's faster to re-create than to discover or understand, fix it

Rule 4.1: There can be only one

Every piece of data must have a single authoritative representation in the system.
Define constants exactly once
Ditto file formats, geographical locations, ...

Rule 4.2: Modularize code rather than copying and pasting.

Reducing code cloning reduces error rates
Cuts the amount of testing needed
And increases comprehension
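
A minimal sketch of Rules 4.1 and 4.2 together: one constant defined once, and one shared function instead of a formula pasted at each call site. The temperature-conversion scenario and all names are hypothetical.

```python
# Hypothetical analysis code: summarize temperature readings in kelvin.

ABSOLUTE_ZERO_CELSIUS = -273.15  # defined exactly once


def celsius_to_kelvin(celsius):
    """Single shared conversion; every caller goes through this function."""
    return celsius - ABSOLUTE_ZERO_CELSIUS


def summarize(readings_celsius):
    """Return the mean of the readings, converted to kelvin."""
    kelvin = [celsius_to_kelvin(value) for value in readings_celsius]
    return sum(kelvin) / len(kelvin)
```

If the conversion ever needs to change, there is one place to change it and one place to test it.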

Rule 4.3: Re-use code instead of rewriting it.

It takes experts years to build high-quality numerical or statistical software
Your time is better spent doing science on top of that

Rule 5: Plan for Mistakes

No single practice catches everything
So practice defense in depth

Note: improving quality increases productivity

Rule 5.1: Don't trust. Verify

Add assertions to programs to check their operation.
"This must be true here or there is an error"
Like diagnostic circuits in hardware
No point proceeding if the program is broken...
...and they serve as executable documentation
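
As a sketch of what such assertions can look like in Python, the hypothetical `normalize` function below checks its inputs on the way in and its result on the way out.

```python
def normalize(values):
    """Scale `values` so that they sum to 1."""
    # Preconditions: stop immediately if the input makes no sense.
    assert len(values) > 0, "need at least one value"
    assert all(v >= 0 for v in values), "values must be non-negative"
    total = sum(values)
    assert total > 0, "values must not all be zero"

    result = [v / total for v in values]

    # Postcondition: this must be true here or there is an error.
    assert abs(sum(result) - 1.0) < 1e-9
    return result
```

Read top to bottom, the assertions also tell the next reader what the function expects and what it promises.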

Rule 5.2: Use an off-the-shelf unit testing library.

Manages setup, execution, and reporting
Re-run unit tests after every change to the code to check for regression
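
One off-the-shelf option in Python is the standard library's `unittest` module. The sketch below assumes a hypothetical `mean` function as the code under test; `run_tests` is a made-up helper that runs the suite from within Python.

```python
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence (hypothetical code under test)."""
    if not values:
        raise ValueError("mean() of empty sequence")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_single_value(self):
        self.assertEqual(mean([4.0]), 4.0)

    def test_several_values(self):
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The library takes care of discovering the test methods, running each one, and reporting which failed, so re-running everything after a change is a single call.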

Testing is Hard

"If I knew what the right answer was, I'd have published by now."
Compare to experimental data
Or to analytic solutions of simple problems
Or to old (trusted) programs
If nothing else, forces scientists to document what "errors" are acceptable
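
A sketch of the "analytic solution of a simple problem" approach: a hypothetical trapezoid-rule integrator is checked against an integral whose exact value is known, with the acceptable error written down as a named tolerance. The value of `TOLERANCE` is an arbitrary choice for this example.

```python
import math


def trapezoid(f, a, b, steps):
    """Approximate the integral of f over [a, b] with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


# Analytic result: the integral of sin(x) from 0 to pi is exactly 2.
TOLERANCE = 1e-4  # the documented "acceptable error" for this check
estimate = trapezoid(math.sin, 0.0, math.pi, 1000)
assert abs(estimate - 2.0) < TOLERANCE
```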

Rule 5.3: Turn bugs into test cases.

Write a test that fails when the bug is present
Then work on the code until that test passes...
...and no others are failing
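
A sketch of the workflow, using an invented bug: suppose an earlier version of `median` returned the upper-middle element for even-length input. The regression test pins that case down, and the existing odd-length test confirms nothing else broke.

```python
def median(values):
    """Median of a non-empty sequence.

    An earlier (hypothetical) version returned ordered[middle] for every
    input, which is wrong when the input has an even number of items.
    """
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def test_median_even_length_regression():
    # Fails against the buggy version (which gives 3), passes after the fix.
    assert median([1, 2, 3, 4]) == 2.5


def test_median_odd_length_still_works():
    assert median([3, 1, 2]) == 2
```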

Test-Driven Development

Why wait? Always write the tests, then the code
Improves focus
Encourages writing testable code
And ensures tests actually get written...
"Red, green, refactor"

Rule 5.4: Use a symbolic debugger.

Explore the program as it runs
Better than print statements
- You don't have to re-run...
- ...or guess in advance what you'll need to know
Use breakpoints to stop program at particular points or when particular things are true

Rule 6: Optimize Software Only After It Works Correctly

Even experts find it hard to predict performance bottlenecks
Small changes to code often have dramatic impact on performance
So get it right, then make it fast

Rule 6.1: Use a profiler to identify bottlenecks.

Reports how much time is spent on each line of code
Re-check on new computers or when switching libraries
Summarize across unit tests
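
As one possible starting point in Python, the standard library's `cProfile` and `pstats` modules can be used as sketched below. Note that `cProfile` reports time per function rather than per line; line-level timing needs a separate tool. The workload `slow_sum_of_squares` and the helper `profile_report` are hypothetical.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop used as the (hypothetical) workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_report(n=100_000, top=5):
    """Profile one call; return (result, report text sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = slow_sum_of_squares(n)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Printing the second element of `profile_report()` shows which functions dominate the run time, which is the evidence to collect before optimizing anything.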

Rule 6.2: Write code in the highest-level language possible.

People write the same number of lines of code per hour regardless of language
So use the most expressive language available to get the "right" version...
...then rewrite core pieces (possibly in a lower-level language) to get the "fast" version

Rule 7: Document Design and Purpose not Mechanics

Goal is to make the next person's life easier
Focus on things the code doesn't say
- Or doesn't say clearly
- E.g., file formats
An example is worth a thousand words...

Rule 7.1: Document interfaces and reasons not implementations.

Interfaces and reasons change more slowly than implementation details, so documenting them is better economics
And most people care about using code more than understanding it

Rule 7.2: Refactor code in preference to explaining how it works.

Good code can be understood when read aloud
Good programmers build libraries so that solving their problem is straightforward
Again, "red, green, refactor"

Rule 7.3: Embed the documentation for a piece of software in that software.

Specially-formatted comments or strings
More likely to be kept up to date
More accessible to interactive help
Many modern tools embed code in documentation rather than vice versa
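
In Python, the "specially-formatted string" is the docstring. A minimal sketch, with a hypothetical `reverse_complement` function that assumes uppercase A/C/G/T input only:

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence.

    Assumes `sequence` contains only the uppercase letters A, C, G, and T.

    >>> reverse_complement("ATGC")
    'GCAT'
    """
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(sequence))
```

Calling `help(reverse_complement)` in an interactive session displays this text, and the standard library's `doctest` module can run the `>>>` example as a test, which helps keep the documentation honest.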

Rule 8: Collaborate

Computers were invented to calculate
The web was invented to collaborate
Science is more fun when it's shared

Rule 8.1: Use pre-merge code reviews.

Have someone else review changes before merging in version control
Significantly reduces errors
Good way to share knowledge
It's what makes open source possible

Rule 8.2: Use pair programming.

Code in pairs when bringing someone new up to speed and when tackling particularly tricky problems.
Two people, one keyboard, one screen
An extreme form of code review
Can get a bit tiring if done all the time...

Rule 8.3: Use an issue tracking tool.

A shared to-do list
- Items can be assigned to people
- Supports comments, links to code and papers, etc.
"Version control is where we've been, the issue tracker is where we're going"

Gosh, That's a Lot

One step at a time.

Use text-based interfaces
Turn history into scripts
Put everything in version control
Use test-driven development

Citation: "Best Practices for Scientific Computing", PLOS Biology, Jan. 2014.

Annotated Best Practices

Edit the markdown document: web/2016/day2/docs/best_practices_keep_or_toss.md

Add your choices below. Write them in the following format.

by ialbert

KEEP Rule 1.1: Keep it simple

Simplicity is the most powerful virtue that any process can have. There is only one problem: it is kind of difficult to keep it simple.

KEEP Rule 3.1: Small steps with frequent feedback

There is great value in keeping the entire pipeline working at most times. Save often, commit often, rerun often.

KEEP Rule 3.3: Version control EVERYTHING

While git was designed for software, you should keep everything (other than large datasets) in it. You get free backup and replication with it!

TOSS Rule 5.4: Use a symbolic debugger.

There is nothing wrong with print statements. Symbolic debuggers promote writing complex programs. If you can't debug a program with simple print statements, your program may already be too complicated.

TOSS Rule 8.1: Use pre-merge code reviews

This is a concept borrowed from software engineering, where it is assumed that all people on a team work on the exact same and relatively simple problem. This rarely happens in the sciences. This rule is one of those "feel good" rules that are just unrealistic in scientific practice.

TOSS Rule 8.2: Use pair programming

Pair programming is again a concept borrowed from software engineering. But it disregards the fact that most software engineers need to solve far simpler and far better defined problems than scientists do. It is sort of a pipe dream that we can do this.

Penn State • 2016 • bootcamp-central via pyblue

Software Carpentry - How good are the Best Practices?