rOpenSci | Building Reproducible Data Packages with DataPackageR
> Data from a single study will lead to multiple manuscripts by different principal investigators, and dozens of reports, talks, and presentations. There are many different consumers of data sets, and results and conclusions must be consistent and reproducible.

> Data processing pipelines tend to be study specific. Even though we have great tidyverse tools that make data cleaning easier, every new data set has idiosyncrasies and unique features that require some bespoke code to convert it from raw to tidy data.

> Best practices tell us that raw data are read-only, and that analysis is done on some form of processed, tidied data set. Conventional wisdom is that this data tidying takes a substantial amount of the time and effort in a data analysis project, so once it's done, why should every consumer of a data set repeat it?

> One important reason is that we don't always know what a collaborator did to go from raw data to their version of an analysis-ready data set, so the instinct is to ask for the raw data and do the processing ourselves, which involves shipping around inconveniently large files.
Tags: r, data_management
19 hours ago by sharon_howard
