Adapted by Istvan Albert from Software Carpentry [Best Practices in Scientific Computing][best]

[best]: http://swcarpentry.github.io/slideshows/best-practices/index.html

---

## Background

* Software is lab equipment for the 21st Century
* Scientists spend a lot of time writing it
* But over 90% are self-taught
* They don't know what "good" looks like
* So we describe 24 practices in 8 groups

**Good programmers are 10X more productive than average**

**Good practices are 10X more productive than average**

---

## Rule 1: Write Programs for People not Computers

* Hard to tell if code that's difficult to understand is doing what it's supposed to
* Hard for other scientists to re-use it...
* ...including your future self

### Rule 1.1: Keep it simple

* Short-term memory can hold 7±2 items
* So break programs into short, readable functions, each taking only a few parameters

### Rule 1.2: Make names consistent, distinctive, and meaningful.

* `p` doesn't help the reader's short-term memory as much as `pressure`
* Don't use `temp` for both "temporary" and "temperature"
* `i`, `j` are OK for indices in small scopes

### Rule 1.3: Make code style and formatting consistent.

* _Which_ rules don't matter -- _having_ rules does
* Brain assumes all differences are significant
* Every inconsistency slows comprehension

---

## Rule 2: Let the Computer Do the Work

* Computers exist to repeat things quickly
* 99% accuracy ⇒ 63% chance of at least one error per hundred repetitions

### Rule 2.1: Make the computer repeat tasks.

* Write little programs for everything
* Even if they're called scripts, macros, or aliases
* Easier to do this with text-based programming systems than with GUIs

### Rule 2.2: Save recent commands in a file for re-use.

* Most text-based interfaces do this automatically
* Repeat recent operations using `history`
* "Reproducibility in the small"
* Saving history supports "reproducibility in the large"
* An accurate record of how a result was produced
* _If_ everything can be captured

### Rule 2.3: Use a build tool to automate workflows.

* Originally developed for compiling programs
* Can be used whenever some files depend on others
* Makes workflow explicit

---

## Rule 3: Make Incremental Changes

* Most scientists don't have "requirements"
* They are their own users
* Code evolves in tandem with research
* Closest fit from industry is _agile development_

### Rule 3.1: Small steps with frequent feedback

* People can concentrate for 45-90 minutes without a break
* So size each burst of work to fit that
* Longer cycle should be a week or two

### Rule 3.2: Use a version control system.

* Tracks changes
* Allows them to be undone
* Supports independent parallel development
* Essential for collaboration

### Rule 3.3: Version control EVERYTHING

* Not just software: papers, raw images, ...
* Not gigabytes...
* ...but metadata _about_ those gigabytes
* Leave out things generated by the computer
* Use build tools to reproduce those instead
* Unless they take a very long time to create

---

## Rule 4: Don't Repeat Yourself (or Others)

* Anything repeated in two or more places will eventually be wrong in at least one
* If it's faster to re-create than to discover or understand, _fix it_

### Rule 4.1: There can be only one

* Every piece of data must have a single authoritative representation in the system.
* Define constants exactly once
* Ditto file formats, geographical locations, ...

### Rule 4.2: Modularize code rather than copying and pasting.

* Reducing code cloning reduces error rates
* Cuts the amount of testing needed
* And increases comprehension

### Rule 4.3: Re-use code instead of rewriting it.

* It takes experts years to build high-quality numerical or statistical software
* Your time is better spent doing science on top of that

---

## Rule 5: Plan for Mistakes

* No single practice catches everything
* So practice _defense in depth_

_Note: improving quality increases productivity_

### Rule 5.1: Don't trust. Verify

* Add assertions to programs to check their operation.
* "This must be true here or there is an error"
* Like diagnostic circuits in hardware
* No point proceeding if the program is broken...
* ...and they serve as _executable documentation_

### Rule 5.2: Use an off-the-shelf unit testing library.

* Manages setup, execution, and reporting
* Re-run unit tests after every change to the code to check for _regression_

### Testing is Hard

* "If I knew what the right answer was, I'd have published by now."
* Compare to experimental data
* Or to analytic solutions of simple problems
* Or to old (trusted) programs
* If nothing else, forces scientists to document what "errors" are acceptable

### Rule 5.3: Turn bugs into test cases.

* Write a test that fails when the bug is present
* Then work on the code until that test passes...
* ...and no others are failing

### Test-Driven Development

* Why wait? Always write the tests, then the code
* Improves focus
* Encourages writing testable code
* And ensures tests actually get written...
* "Red, green, refactor"

### Rule 5.4: Use a symbolic debugger.

* Explore the program as it runs
* Better than print statements
* You don't have to re-run...
* ...or guess in advance what you'll need to know
* Use _breakpoints_ to stop program at particular points or when particular things are true

---

## Rule 6: Optimize Software Only After It Works Correctly

* Even experts find it hard to predict performance bottlenecks
* Small changes to code often have dramatic impact on performance
* So get it right, _then_ make it fast

### Rule 6.1: Use a profiler to identify bottlenecks.

* Reports how much time is spent on each line of code
* Re-check on new computers or when switching libraries
* Summarize across unit tests

### Rule 6.2: Write code in the highest-level language possible.

* People write the same number of lines of code per hour regardless of language
* So use the most expressive language available to get the "right" version...
* ...then rewrite core pieces (possibly in a lower-level language) to get the "fast" version

---

## Rule 7: Document Design and Purpose not Mechanics

* Goal is to make the next person's life easier
* Focus on things the code _doesn't_ say
* Or doesn't say clearly
* E.g., file formats
* An example is worth a thousand words...

### Rule 7.1: Document interfaces and reasons not implementations.

* Interfaces and reasons change more slowly than implementation details, so documenting them is better economics
* And most people care about using code more than understanding it

### Rule 7.2: Refactor code in preference to explaining how it works.

* Good code can be understood when read aloud
* Good programmers build libraries so that solving their problem is straightforward
* Again, "red, green, refactor"

### Rule 7.3: Embed the documentation for a piece of software in that software.

* Specially-formatted comments or strings
* More likely to be kept up to date
* More accessible to interactive help
* Many modern tools embed code in documentation rather than vice versa

---

## Rule 8: Collaborate

* Computers were invented to calculate
* The web was invented to collaborate
* Science is more fun when it's shared

### Rule 8.1: Use pre-merge code reviews.

* Have someone else review changes _before_ merging in version control
* Significantly reduces errors
* Good way to share knowledge
* It's what makes open source possible

### Rule 8.2: Use pair programming

* Code in pairs when bringing someone new up to speed and when tackling particularly tricky problems.
* Two people, one keyboard, one screen
* An extreme form of code review
* Can get tiring if done all the time...

### Rule 8.3: Use an issue tracking tool.

* A shared to-do list
* Items can be assigned to people
* Supports comments, links to code and papers, etc.
* "Version control is where we've been, the issue tracker is where we're going"

---

## Gosh, That's a Lot

One step at a time.

1. Use text-based interfaces
2. Turn history into scripts
3. Put everything in version control
4. Use test-driven development

Citation: ["Best Practices for Scientific Computing", PLOS Biology, Jan. 2014](http://dx.doi.org/10.1371/journal.pbio.1001745).
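
As a check on the figure quoted under Rule 2, the chance of at least one error in a hundred repetitions at 99% accuracy can be worked out in a few lines. This is only a sketch: it assumes every repetition succeeds or fails independently, and the function name `chance_of_any_error` is made up for illustration.

```python
def chance_of_any_error(accuracy: float, repetitions: int) -> float:
    """Probability that at least one of `repetitions` steps goes wrong.

    Assumes each step is independent and succeeds with probability `accuracy`.
    """
    # P(no error at all) = accuracy ** repetitions, so take the complement.
    return 1.0 - accuracy ** repetitions


# 99% accuracy, repeated 100 times by hand
print(f"{chance_of_any_error(0.99, 100):.0%}")  # prints 63%
```

The same function shows how quickly things get worse: at 1000 repetitions an error is all but certain, which is the argument for letting the computer do the repeating.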

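Rules 5.1 to 5.3 can be illustrated together in one small example. The sketch below uses Python's standard `unittest` library as the off-the-shelf testing tool; the function `mean_pressure` and the "empty input" bug are hypothetical and chosen only for illustration.

```python
import unittest


def mean_pressure(readings):
    """Return the arithmetic mean of a non-empty list of pressure readings."""
    # Rule 5.1: say what must be true here, and stop early if it is not.
    assert len(readings) > 0, "mean_pressure() needs at least one reading"
    return sum(readings) / len(readings)


class TestMeanPressure(unittest.TestCase):
    def test_known_answer(self):
        # Compare against a simple problem with a known analytic answer.
        self.assertAlmostEqual(mean_pressure([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_is_rejected(self):
        # Rule 5.3: a (hypothetical) past bug kept as a regression test.
        with self.assertRaises(AssertionError):
            mean_pressure([])


# Rule 5.2: let the library handle setup, execution, and reporting.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMeanPressure)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that Python skips `assert` statements when run with the `-O` flag, so assertions like this one are a development-time safety net rather than input validation.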