1. Minimizing mistakes. There is no upside in analyzing the wrong data set, using the wrong parameters, including the wrong figure, or reporting the wrong statistics. These mistakes are, in my view, unacceptable in science. Minimizing them is the highest priority.
2. Knowing what we did. Some time in the future, way in the future, we or someone else will revisit what we did. Can we figure out what happened? I'd like to plan on the time scale of decades rather than months or years.
3. Planning for human fallibility. Some people think science is for those who are meticulous. Then count me out. I am messy, careless, and chronically clueless. A good organization anticipates human mistakes.
4. Easy to learn. I collaborate with a lot of people. The organizational structure should be fairly intuitive and self-explanatory.
What we do:
1. Data acquisition and curation. I think we have this wired. We use a born-open data model where data are collected, logged, versioned, and uploaded nightly to GitHub automatically. We also automatically populate local MySQL tables including information on subjects and sessions, and have additional tables for experiments, experimenters, computers, and IRB info. We even have an adverse-events table to record and address any flaws in the organizational system. The basic unit of organization is the dataset, and it works well.
2. Outputs. We have the usual outputs: papers, talks, grant proposals, dissertations, etc. Some are collaborative; some are individual; some are important; some go nowhere. The basic unit here is pretty obvious---we know exactly where each paper, talk, dissertation, etc., begins and ends.
3. Value-added endeavors. A value-added endeavor (VAE) is a small unit of intellectual contribution. It could be a proof, a simulation, a specific analysis, or (on occasion) a verbal argument. VAEs, as important as they are, are ill-defined in size and scope. And it is sometimes unclear (perhaps arbitrary) where one ends and another begins.
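To make the bookkeeping in item 1 a bit more concrete, here is a minimal sketch of how subject, session, experiment, and adverse-event tables might relate to each other. It is an illustration only: every table and column name is hypothetical, and SQLite stands in for MySQL so that the example runs without a server.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# and SQLite is used here as a stand-in for a local MySQL server.
SCHEMA = """
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE subject (
    subject_id INTEGER PRIMARY KEY
);
CREATE TABLE session (
    session_id    INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
    subject_id    INTEGER NOT NULL REFERENCES subject(subject_id),
    run_date      TEXT NOT NULL
);
CREATE TABLE adverse_event (
    event_id   INTEGER PRIMARY KEY,
    session_id INTEGER REFERENCES session(session_id),
    note       TEXT NOT NULL
);
"""


def build_db():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def log_session(conn, experiment_id, subject_id, run_date):
    """Record one session and return its generated id."""
    cur = conn.execute(
        "INSERT INTO session (experiment_id, subject_id, run_date) "
        "VALUES (?, ?, ?)",
        (experiment_id, subject_id, run_date),
    )
    return cur.lastrowid


def sessions_per_experiment(conn):
    """Map each experiment name to the number of sessions logged for it."""
    rows = conn.execute(
        "SELECT e.name, COUNT(s.session_id) "
        "FROM experiment e "
        "LEFT JOIN session s ON s.experiment_id = e.experiment_id "
        "GROUP BY e.experiment_id ORDER BY e.name"
    )
    return dict(rows.fetchall())
```

In a setup like this, the logging call would be made automatically by the experiment software at the end of each session rather than by hand, which is the point of the born-open model.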
The Current System, The Good:
Perhaps the strongest element of my lab's organization is that we use really good tools for open and high-integrity science. Pretty much everything is script based, and scripts are in many ways self-documenting, especially when compared to menu-driven alternatives. Our analyses are done in R, our papers in LaTeX and Markdown, and the two are integrated with R Markdown and knitr. Moreover, we use a local git server and curate all development in repositories.
The Current System, The Bad and Ugly:
We use projects as our basic organizational unit. Projects are basically repositories on our local git server. But what a project encompasses and how its files are organized is ad hoc, unstandardized, and idiosyncratic. Here are the issues:
1. There is no natural relation between projects and the three things we do: acquire and curate data, produce outputs, and produce VAEs. One VAE might serve several different papers; likewise, one dataset might serve several different papers. Papers and talks usually encompass several different experiments and VAEs.
2. Projects have no systematic relations to VAEs, outputs, or datasets. This is why I am unhappy. Does a project mean one paper? Does it mean one analysis? One development? A collection of related papers? A paper and all talks and the supporting dissertation? We have done all of these.
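The many-to-many relation described in these two issues can be sketched in a few lines. This is only an illustration of the problem, not a proposed fix, and all of the output and VAE names are made up.

```python
from collections import defaultdict

# Hypothetical names throughout: each pair says "this output uses this VAE".
USES = [
    ("paper_A", "vae_model_proof"),
    ("paper_A", "vae_simulation"),
    ("talk_B", "vae_simulation"),
    ("dissertation_C", "vae_simulation"),
    ("dissertation_C", "vae_reanalysis"),
]


def outputs_by_vae(uses):
    """Invert the (output, VAE) pairs into a VAE -> set-of-outputs index."""
    index = defaultdict(set)
    for output, vae in uses:
        index[vae].add(output)
    return dict(index)


def shared_vaes(uses):
    """Return the VAEs used by more than one output, sorted by name.

    These are the awkward cases: they have no single owning output.
    """
    index = outputs_by_vae(uses)
    return sorted(vae for vae, outputs in index.items() if len(outputs) > 1)


print(shared_vaes(USES))  # ['vae_simulation']
```

Any VAE that shows up in the shared list belongs equally to several outputs, which is exactly why a project cannot map cleanly onto any one of them.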
What do you do? Are there good standards? What should be the basic organizational unit? Stay with projects? I am thinking about a strict output model in which every output is its own repository and serves as the main organizing unit. The problem is what to do about VAEs that span several outputs. Say I have an analysis or graph that is common to a paper, a dissertation, and a talk. I don't think I want this VAE repeated in three places. I don't want symbolic links or hard-coded paths because they make it difficult to publicly archive. That is why projects were so handy. VAEs themselves are too small and too ill-defined to be organizing units. Ideas?