recentpopularlog in

nhaliday : data-science   260

« earlier  
Is the bounty system effective? - Meta Stack Exchange
https://math.meta.stackexchange.com/questions/20155/how-effective-are-bounties
could do some kinda econometric analysis using the data explorer to determine this once and for all: https://pinboard.in/u:nhaliday/b:c0cd449b9e69
maybe some kinda RDD in time, or difference-in-differences?
I don't think answer quality/quantity by time meets the common trend assumption for DD, tho... Questions that eventually receive bounty are prob higher quality in the first place, and higher quality answers accumulate more and better answers regardless. Hmm.
q-n-a  stackex  forum  community  info-foraging  efficiency  cost-benefit  data  analysis  incentives  attention  quality  ubiquity  supply-demand  multi  math  causation  endogenous-exogenous  intervention  branches  control  tactics  sleuthin  hmm  idk  todo  data-science  overflow  dbs  regression  shift  methodology  econometrics 
november 2019 by nhaliday
Ask HN: Getting into NLP in 2018? | Hacker News
syllogism (spaCy author):
I think it's probably a bad strategy to try to be the "NLP guy" to potential employers. You'd do much better off being a software engineer on a project with people with ML or NLP expertise.

NLP projects fail a lot. If you line up a job as a company's first NLP person, you'll probably be setting yourself up for failure. You'll get handed an idea that can't work, you won't know enough about how to push back to change it into something that might, etc. After the project fails, you might get a chance to fail at a second one, but maybe not a third. This isn't a great way to move into any new field.

I think a cunning plan would be to angle to be the person who "productionises" models.
...
.--
...

Basically, don't just work on having more powerful solutions. Make sure you've tried hard to have easier problems as well --- that part tends to be higher leverage.

https://news.ycombinator.com/item?id=14008752
https://news.ycombinator.com/item?id=12916498
https://algorithmia.com/blog/introduction-natural-language-processing-nlp
hn  q-n-a  discussion  tech  programming  machine-learning  nlp  strategy  career  planning  human-capital  init  advice  books  recommendations  course  unit  links  automation  project  examples  applications  multi  mooc  lectures  video  data-science  org:com  roadmap  summary  error  applicability-prereqs  ends-means  telos-atelos  cost-benefit 
november 2019 by nhaliday
Ask HN: What's a promising area to work on? | Hacker News
hn  discussion  q-n-a  ideas  impact  trends  the-bones  speedometer  technology  applications  tech  cs  programming  list  top-n  recommendations  lens  machine-learning  deep-learning  security  privacy  crypto  software  hardware  cloud  biotech  CRISPR  bioinformatics  biohacking  blockchain  cryptocurrency  crypto-anarchy  healthcare  graphics  SIGGRAPH  vr  automation  universalism-particularism  expert-experience  reddit  social  arbitrage  supply-demand  ubiquity  cost-benefit  compensation  chart  career  planning  strategy  long-term  advice  sub-super  commentary  rhetoric  org:com  techtariat  human-capital  prioritizing  tech-infrastructure  working-stiff  data-science 
november 2019 by nhaliday
How good are decisions?
A statement I commonly hear in tech-utopian circles is that some seeming inefficiency can’t actually be inefficient because the market is efficient and inefficiencies will quickly be eliminated. A contentious example of this is the claim that companies can’t be discriminating because the market is too competitive to tolerate discrimination. A less contentious example is that when you see a big company doing something that seems bizarrely inefficient, maybe it’s not inefficient and you just lack the information necessary to understand why the decision was efficient.

Unfortunately, arguments like this are difficult to settle because, even in retrospect, it’s usually not possible to get enough information to determine the precise “value” of a decision. Even in cases where the decision led to an unambiguous success or failure, there are so many factors that led to the result that it’s difficult to figure out precisely why something happened.

One nice thing about sports is that they often have detailed play-by-play data and well-defined win criteria which lets us tell, on average, what the expected value of a decision is. In this post, we’ll look at the cost of bad decision making in one sport and then briefly discuss why decision quality in sports might be the same or better as decision quality in other fields.

Just to have a concrete example, we’re going to look at baseball, but you could do the same kind of analysis for football, hockey, basketball, etc., and my understanding is that you’d get a roughly similar result in all of those cases.

We’re going to model baseball as a state machine, both because that makes it easy to understand the expected value of particular decisions and because this lets us talk about the value of decisions without having to go over most of the rules of baseball.

exactly the kinda thing Dad likes
techtariat  dan-luu  data  analysis  examples  nitty-gritty  sports  street-fighting  automata-languages  models  optimization  arbitrage  data-science  cost-benefit  tactics  baseball  low-hanging 
august 2019 by nhaliday
Three best practices for building successful data pipelines - O'Reilly Media
Drawn from their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines, and those are making your analysis:
1. Reproducible
2. Consistent
3. Productionizable

...

Science that cannot be reproduced by an external third party is just not science — and this does apply to data science. One of the benefits of working in data science is the ability to apply the existing tools from software engineering. These tools let you isolate all the dependencies of your analyses and make them reproducible.

Dependencies fall into three categories:
1. Analysis code ...
2. Data sources ...
3. Algorithmic randomness ...

...

Establishing consistency in data
...

There are generally two ways of establishing the consistency of data sources. The first is by checking-in all code and data into a single revision control repository. The second method is to reserve source control for code and build a pipeline that explicitly depends on external data being in a stable, consistent format and location.

Checking data into version control is generally considered verboten for production software engineers, but it has a place in data analysis. For one thing, it makes your analysis very portable by isolating all dependencies into source control. Here are some conditions under which it makes sense to have both code and data in source control:
Small data sets ...
Regular analytics ...
Fixed source ...

Productionizability: Developing a common ETL
...

1. Common data format ...
2. Isolating library dependencies ...

https://blog.koresoftware.com/blog/etl-principles
Rigorously enforce the idempotency constraint
For efficiency, seek to load data incrementally
Always ensure that you can efficiently process historic data
Partition ingested data at the destination
Rest data between tasks
Pool resources for efficiency
Store all metadata together in one place
Manage login details in one place
Specify configuration details once
Parameterize sub flows and dynamically run tasks where possible
Execute conditionally
Develop your own workflow framework and reuse workflow components

more focused on details of specific technologies:
https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7

https://www.cloudera.com/documentation/director/cloud/topics/cloud_de_best_practices.html
techtariat  org:com  best-practices  engineering  code-organizing  machine-learning  data-science  yak-shaving  nitty-gritty  workflow  config  vcs  replication  homo-hetero  multi  org:med  design  system-design  links  shipping  minimalism  volo-avolo  causation  random  invariance  structure  arrows  protocol-metadata  interface-compatibility  project-management 
august 2019 by nhaliday
Interview with Donald Knuth | Interview with Donald Knuth | InformIT
Andrew Binstock and Donald Knuth converse on the success of open source, the problem with multicore architecture, the disappointing lack of interest in literate programming, the menace of reusable code, and that urban legend about winning a programming contest with a single compilation.

Reusable vs. re-editable code: https://hal.archives-ouvertes.fr/hal-01966146/document
- Konrad Hinsen

https://www.johndcook.com/blog/2008/05/03/reusable-code-vs-re-editable-code/
I think whether code should be editable or in “an untouchable black box” depends on the number of developers involved, as well as their talent and motivation. Knuth is a highly motivated genius working in isolation. Most software is developed by large teams of programmers with varying degrees of motivation and talent. I think the further you move away from Knuth along these three axes the more important black boxes become.
nibble  interview  giants  expert-experience  programming  cs  software  contrarianism  carmack  oss  prediction  trends  linux  concurrency  desktop  comparison  checking  debugging  stories  engineering  hmm  idk  algorithms  books  debate  flux-stasis  duplication  parsimony  best-practices  writing  documentation  latex  intricacy  structure  hardware  caching  workflow  editors  composition-decomposition  coupling-cohesion  exposition  technical-writing  thinking  cracker-prog  code-organizing  grokkability  multi  techtariat  commentary  pdf  reflection  essay  examples  python  data-science  libraries  grokkability-clarity  project-management 
june 2019 by nhaliday
Should I go for TensorFlow or PyTorch?
Honestly, most experts that I know love Pytorch and detest TensorFlow. Karpathy and Justin from Stanford for example. You can see Karpthy's thoughts and I've asked Justin personally and the answer was sharp: PYTORCH!!! TF has lots of PR but its API and graph model are horrible and will waste lots of your research time.

--

...

Updated Mar 12
Update after 2019 TF summit:

TL/DR: previously I was in the pytorch camp but with TF 2.0 it’s clear that Google is really going to try to have parity or try to be better than Pytorch in all aspects where people voiced concerns (ease of use/debugging/dynamic graphs). They seem to be allocating more resources on development than Facebook so the longer term currently looks promising for Google. Prior to TF 2.0 I thought that Pytorch team had more momentum. One area where FB/Pytorch is still stronger is Google is a bit more closed and doesn’t seem to release reproducible cutting edge models such as AlphaGo whereas FAIR released OpenGo for instance. Generally you will end up running into models that are only implemented in one framework of the other so chances are you might end up learning both.
q-n-a  qra  comparison  software  recommendations  cost-benefit  tradeoffs  python  libraries  machine-learning  deep-learning  data-science  sci-comp  tools  google  facebook  tech  competition  best-practices  trends  debugging  expert-experience  ecosystem  theory-practice  pragmatic  wire-guided  static-dynamic  state  academia  frameworks  open-closed 
may 2019 by nhaliday
python - Does pandas iterrows have performance issues? - Stack Overflow
Generally, iterrows should only be used in very very specific cases. This is the general order of precedence for performance of various operations:

1) vectorization
2) using a custom cython routine
3) apply
a) reductions that can be performed in cython
b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)
q-n-a  stackex  programming  python  libraries  gotchas  data-science  sci-comp  performance  checklists  objektbuch  best-practices  DSL  frameworks 
may 2019 by nhaliday
Burrito: Rethinking the Electronic Lab Notebook
Seems very well-suited for ML experiments (if you can get it to work), also the nilfs aspect is cool and basically implements exactly one of the my project ideas (mini-VCS for competitive programming). Unfortunately gnarly installation instructions specify running it on Linux VM: https://github.com/pgbovine/burrito/blob/master/INSTALL. Linux is hard requirement due to nilfs.
techtariat  project  tools  devtools  linux  programming  yak-shaving  integration-extension  nitty-gritty  workflow  exocortex  scholar  software  python  app  desktop  notetaking  state  machine-learning  data-science  nibble  sci-comp  oly  vcs  multi  repo  paste  homepage  research 
may 2019 by nhaliday
A Recipe for Training Neural Networks
acmtariat  org:bleg  nibble  machine-learning  deep-learning  howto  tutorial  guide  nitty-gritty  gotchas  init  list  checklists  expert-experience  abstraction  composition-decomposition  gradient-descent  data-science  error  debugging  benchmarks  programming  engineering  best-practices  dataviz  checking  plots  generalization  regularization  unsupervised  optimization  ensembles  random  methodology  multi  twitter  social  discussion  techtariat  links  org:med  pdf  visualization  python  recommendations  advice  devtools 
april 2019 by nhaliday
Workshop Abstract | Identifying and Understanding Deep Learning Phenomena
ICML 2019 workshop, June 15th 2019, Long Beach, CA

We solicit contributions that view the behavior of deep nets as natural phenomena, to be investigated with methods inspired from the natural sciences like physics, astronomy, and biology.
unit  workshop  acm  machine-learning  science  empirical  nitty-gritty  atoms  deep-learning  model-class  icml  data-science  rigor  replication  examples  ben-recht  physics 
april 2019 by nhaliday
Stack Overflow Developer Survey 2018
Rust, Python, Go in top most loved
F#/OCaml most high paying globally, Erlang/Scala/OCaml in the US (F# still in top 10)
ML specialists high-paid
editor usage: VSCode > VS > Sublime > Vim > Intellij >> Emacs
ranking  list  top-n  time-series  data  database  programming  engineering  pls  trends  stackex  poll  career  exploratory  network-structure  ubiquity  ocaml-sml  rust  golang  python  dotnet  money  jobs  compensation  erlang  scala  jvm  ai  ai-control  risk  futurism  ethical-algorithms  data-science  machine-learning  editors  devtools  tools  pro-rata  org:com  software  analysis  article  human-capital  let-me-see  expert-experience  complement-substitute 
december 2018 by nhaliday
The Gelman View – spottedtoad
I have read Andrew Gelman’s blog for about five years, and gradually, I’ve decided that among his many blog posts and hundreds of academic articles, he is advancing a philosophy not just of statistics but of quantitative social science in general. Not a statistician myself, here is how I would articulate the Gelman View:

A. Purposes

1. The purpose of social statistics is to describe and understand variation in the world. The world is a complicated place, and we shouldn’t expect things to be simple.
2. The purpose of scientific publication is to allow for communication, dialogue, and critique, not to “certify” a specific finding as absolute truth.
3. The incentive structure of science needs to reward attempts to independently investigate, reproduce, and refute existing claims and observed patterns, not just to advance new hypotheses or support a particular research agenda.

B. Approach

1. Because the world is complicated, the most valuable statistical models for the world will generally be complicated. The result of statistical investigations will only rarely be to  give a stamp of truth on a specific effect or causal claim, but will generally show variation in effects and outcomes.
2. Whenever possible, the data, analytic approach, and methods should be made as transparent and replicable as possible, and should be fair game for anyone to examine, critique, or amend.
3. Social scientists should look to build upon a broad shared body of knowledge, not to “own” a particular intervention, theoretic framework, or technique. Such ownership creates incentive problems when the intervention, framework, or technique fail and the scientist is left trying to support a flawed structure.

Components

1. Measurement. How and what we measure is the first question, well before we decide on what the effects are or what is making that measurement change.
2. Sampling. Who we talk to or collect information from always matters, because we should always expect effects to depend on context.
3. Inference. While models should usually be complex, our inferential framework should be simple enough for anyone to follow along. And no p values.

He might disagree with all of this, or how it reflects his understanding of his own work. But I think it is a valuable guide to empirical work.
ratty  unaffiliated  summary  gelman  scitariat  philosophy  lens  stats  hypothesis-testing  science  meta:science  social-science  institutions  truth  is-ought  best-practices  data-science  info-dynamics  alt-inst  academia  empirical  evidence-based  checklists  strategy  epistemic 
november 2017 by nhaliday
self study - Looking for a good and complete probability and statistics book - Cross Validated
I never had the opportunity to visit a stats course from a math faculty. I am looking for a probability theory and statistics book that is complete and self-sufficient. By complete I mean that it contains all the proofs and not just states results.
nibble  q-n-a  overflow  data-science  stats  methodology  books  recommendations  list  top-n  confluence  proofs  rigor  reference  accretion 
october 2017 by nhaliday
The Downside of Baseball’s Data Revolution—Long Games, Less Action - WSJ
After years of ‘Moneyball’-style quantitative analysis, major-league teams are setting records for inactivity
news  org:rec  trends  sports  data-science  unintended-consequences  quantitative-qualitative  modernity  time  baseball  measure 
october 2017 by nhaliday
Atrocity statistics from the Roman Era
Christian Martyrs [make link]
Gibbon, Decline & Fall v.2 ch.XVI: < 2,000 k. under Roman persecution.
Ludwig Hertling ("Die Zahl de Märtyrer bis 313", 1944) estimated 100,000 Christians killed between 30 and 313 CE. (cited -- unfavorably -- by David Henige, Numbers From Nowhere, 1998)
Catholic Encyclopedia, "Martyr": number of Christian martyrs under the Romans unknown, unknowable. Origen says not many. Eusebius says thousands.

...

General population decline during The Fall of Rome: 7,000,000 [make link]
- Colin McEvedy, The New Penguin Atlas of Medieval History (1992)
- From 2nd Century CE to 4th Century CE: Empire's population declined from 45M to 36M [i.e. 9M]
- From 400 CE to 600 CE: Empire's population declined by 20% [i.e. 7.2M]
- Paul Bairoch, Cities and economic development: from the dawn of history to the present, p.111
- "The population of Europe except Russia, then, having apparently reached a high point of some 40-55 million people by the start of the third century [ca.200 C.E.], seems to have fallen by the year 500 to about 30-40 million, bottoming out at about 20-35 million around 600." [i.e. ca.20M]
- Francois Crouzet, A History of the European Economy, 1000-2000 (University Press of Virginia: 2001) p.1.
- "The population of Europe (west of the Urals) in c. AD 200 has been estimated at 36 million; by 600, it had fallen to 26 million; another estimate (excluding ‘Russia’) gives a more drastic fall, from 44 to 22 million." [i.e. 10M or 22M]

also:
The geometric mean of these two extremes would come to 4½ per day, which is a credible daily rate for the really bad years.

why geometric mean? can you get it as the MLE given min{X1, ..., Xn} and max{X1, ..., Xn} for {X_i} iid Poissons? some kinda limit? think it might just be a rule of thumb.

yeah, it's a rule of thumb. found it it his book (epub).
org:junk  data  let-me-see  scale  history  iron-age  mediterranean  the-classics  death  nihil  conquest-empire  war  peace-violence  gibbon  trivia  multi  todo  AMT  expectancy  heuristic  stats  ML-MAP-E  data-science  estimate  magnitude  population  demographics  database  list  religion  christianity  leviathan 
september 2017 by nhaliday
All models are wrong - Wikipedia
Box repeated the aphorism in a paper that was published in the proceedings of a 1978 statistics workshop.[2] The paper contains a section entitled "All models are wrong but some are useful". The section is copied below.

Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an "ideal" gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.

For such a model there is no need to ask the question "Is the model true?". If "truth" is to be the "whole truth" the answer must be "No". The only question of interest is "Is the model illuminating and useful?".
thinking  metabuch  metameta  map-territory  models  accuracy  wire-guided  truth  philosophy  stats  data-science  methodology  lens  wiki  reference  complex-systems  occam  parsimony  science  nibble  hi-order-bits  info-dynamics  the-trenches  meta:science  physics  fluid  thermo  stat-mech  applicability-prereqs  theory-practice  elegance  simplification-normalization 
august 2017 by nhaliday
trees are harlequins, words are harlequins — bayes: a kinda-sorta masterpost
lol, gwern: https://www.reddit.com/r/slatestarcodex/comments/6ghsxf/biweekly_rational_feed/diqr0rq/
> What sort of person thinks “oh yeah, my beliefs about these coefficients correspond to a Gaussian with variance 2.5″? And what if I do cross-validation, like I always do, and find that variance 200 works better for the problem? Was the other person wrong? But how could they have known?
> ...Even ignoring the mode vs. mean issue, I have never met anyone who could tell whether their beliefs were normally distributed vs. Laplace distributed. Have you?
I must have spent too much time in Bayesland because both those strike me as very easy and I often think them! My beliefs usually are Laplace distributed when it comes to things like genetics (it makes me very sad to see GWASes with flat priors), and my Gaussian coefficients are actually a variance of 0.70 (assuming standardized variables w.l.o.g.) as is consistent with field-wide meta-analyses indicating that d>1 is pretty rare.
ratty  ssc  core-rats  tumblr  social  explanation  init  philosophy  bayesian  thinking  probability  stats  frequentist  big-yud  lesswrong  synchrony  similarity  critique  intricacy  shalizi  scitariat  selection  mutation  evolution  priors-posteriors  regularization  bias-variance  gwern  reddit  commentary  GWAS  genetics  regression  spock  nitty-gritty  generalization  epistemic  🤖  rationality  poast  multi  best-practices  methodology  data-science 
august 2017 by nhaliday
« earlier      
per page:    204080120160

Copy this bookmark:





to read