
nhaliday : dataset   46

Why is Google Translate so bad for Latin? A longish answer. : latin
hmm:
> All it does is correlate sequences of up to five consecutive words in texts that have been manually translated into two or more languages.
That sort of system ought to be perfect for a dead language, though. Dump all the Cicero, Livy, Lucretius, Vergil, and Oxford Latin Course into a database and we're good.

We're not exactly inundated with brand new Latin to translate.
--
> Dump all the Cicero, Livy, Lucretius, Vergil, and Oxford Latin Course into a database and we're good.
What makes you think that the Google folks haven't done so and used that to create the language models they use?
> That sort of system ought to be perfect for a dead language, though.
Perhaps. But it will be bad at translating novel English sentences to Latin.
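A rough sketch of what "sequences of up to five consecutive words" means in practice. It covers only the phrase-extraction step; how the phrases are then aligned across languages or scored is not described in the thread, so nothing here should be read as Google's actual pipeline. The function name `ngrams_up_to` is made up for illustration.

```python
from collections import Counter


def ngrams_up_to(words, max_n=5):
    """Yield every run of 1..max_n consecutive words as a tuple."""
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            yield tuple(words[start:start + n])


# Opening line of the Aeneid, punctuation dropped.
sentence = "arma virumque cano Troiae qui primus ab oris".split()

# Count each phrase; a real system would count over a large parallel corpus.
counts = Counter(ngrams_up_to(sentence))
```

With a corpus as small as the surviving classical texts, most longer phrases would show up once or not at all, which fits the reply's point about novel English sentences.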
foreign-lang  reddit  social  discussion  language  the-classics  literature  dataset  measurement  roots  traces  syntax  anglo  nlp  stackex  links  q-n-a  linguistics  lexical  deep-learning  sequential  hmm  project  arrows  generalization  state-of-art  apollonian-dionysian  machine-learning  google 
june 2019 by nhaliday
classification - ImageNet: what is top-1 and top-5 error rate? - Cross Validated
Now, in the case of top-1 score, you check if the top class (the one having the highest probability) is the same as the target label.

In the case of top-5 score, you check if the target label is one of your top 5 predictions (the 5 ones with the highest probabilities).
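A minimal sketch of the two checks described above, turned into error rates. The probabilities and targets are toy values invented for illustration, and the helper names are hypothetical.

```python
def top_k_correct(probs, target, k):
    """Return True if `target` is among the k highest-probability classes."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return target in ranked[:k]


def top_k_error(all_probs, targets, k):
    """Fraction of examples whose target is NOT in the top-k predictions."""
    misses = sum(
        not top_k_correct(p, t, k) for p, t in zip(all_probs, targets)
    )
    return misses / len(targets)


# Toy scores over 6 classes (illustrative values only).
all_probs = [
    [0.05, 0.50, 0.20, 0.10, 0.10, 0.05],  # target 1: top-1 hit
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # target 3: top-1 miss, top-5 hit
    [0.40, 0.30, 0.15, 0.10, 0.04, 0.01],  # target 5: miss for both
]
targets = [1, 3, 5]

top1_error = top_k_error(all_probs, targets, 1)
top5_error = top_k_error(all_probs, targets, 5)
```

Top-5 error can never exceed top-1 error, since any top-1 hit is also a top-5 hit.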
nibble  q-n-a  overflow  machine-learning  deep-learning  metrics  comparison  ranking  top-n  classification  computer-vision  benchmarks  dataset  accuracy  error  jargon 
june 2019 by nhaliday
[1803.00085] Chinese Text in the Wild
We introduce Chinese Text in the Wild, a very large dataset of Chinese text in street view images.

...

We give baseline results using several state-of-the-art networks, including AlexNet, OverFeat, Google Inception and ResNet for character recognition, and YOLOv2 for character detection in images. Overall Google Inception has the best performance on recognition with 80.5% top-1 accuracy, while YOLOv2 achieves an mAP of 71.0% on detection. Dataset, source code and trained models will all be publicly available on the website.
nibble  pdf  papers  preprint  machine-learning  deep-learning  deepgoog  state-of-art  china  asia  writing  language  dataset  error  accuracy  computer-vision  pic  ocr  org:mat  benchmarks  questions 
may 2019 by nhaliday
Land, history or modernization? Explaining ethnic fractionalization: Ethnic and Racial Studies: Vol 38, No 2
Ethnic fractionalization (EF) is frequently used as an explanatory tool in models of economic development, civil war and public goods provision. However, if EF is endogenous to political and economic change, its utility for further research diminishes. This turns out not to be the case. This paper provides the first comprehensive model of EF as a dependent variable.
study  polisci  sociology  political-econ  economics  broad-econ  diversity  putnam-like  race  concept  conceptual-vocab  definition  realness  eric-kaufmann  roots  database  dataset  robust  endogenous-exogenous  causation  anthropology  cultural-dynamics  tribalism  methodology  world  developing-world  🎩  things  metrics  intricacy  microfoundations 
december 2017 by nhaliday
Global Evidence on Economic Preferences
- Benjamin Enke et al

This paper studies the global variation in economic preferences. For this purpose, we present the Global Preference Survey (GPS), an experimentally validated survey dataset of time preference, risk preference, positive and negative reciprocity, altruism, and trust from 80,000 individuals in 76 countries. The data reveal substantial heterogeneity in preferences across countries, but even larger within-country heterogeneity. Across individuals, preferences vary with age, gender, and cognitive ability, yet these relationships appear partly country specific. At the country level, the data reveal correlations between preferences and bio-geographic and cultural variables such as agricultural suitability, language structure, and religion. Variation in preferences is also correlated with economic outcomes and behaviors. Within countries and subnational regions, preferences are linked to individual savings decisions, labor market choices, and prosocial behaviors. Across countries, preferences vary with aggregate outcomes ranging from per capita income, to entrepreneurial activities, to the frequency of armed conflicts.

...

This paper explores these questions by making use of the core features of the GPS: (i) coverage of 76 countries that represent approximately 90 percent of the world population; (ii) representative population samples within each country for a total of 80,000 respondents, (iii) measures designed to capture time preference, risk preference, altruism, positive reciprocity, negative reciprocity, and trust, based on an ex ante experimental validation procedure (Falk et al., 2016) as well as pre-tests in culturally heterogeneous countries, (iv) standardized elicitation and translation techniques through the pre-existing infrastructure of a global polling institute, Gallup. Upon publication, the data will be made publicly available online. The data on individual preferences are complemented by a comprehensive set of covariates provided by the Gallup World Poll 2012.

...

The GPS preference measures are based on twelve survey items, which were selected in an initial survey validation study (see Falk et al., 2016, for details). The validation procedure involved conducting multiple incentivized choice experiments for each preference, and testing the relative abilities of a wide range of different question wordings and formats to predict behavior in these choice experiments. The particular items used to construct the GPS preference measures were selected based on optimal performance out of menus of alternative items (for details see Falk et al., 2016). Experiments provide a valuable benchmark for selecting survey items, because they can approximate the ideal choice situations, specified in economic theory, in which individuals make choices in controlled decision contexts. Experimental measures are very costly, however, to implement in a globally representative sample, whereas survey measures are much less costly. Selecting survey measures that can stand in for incentivized revealed preference measures leverages the strengths of both approaches.

The Preference Survey Module: A Validated Instrument for Measuring Risk, Time, and Social Preferences: http://ftp.iza.org/dp9674.pdf

Table 1: Survey items of the GPS

Figure 1: World maps of patience, risk taking, and positive reciprocity.
Figure 2: World maps of negative reciprocity, altruism, and trust.

Figure 3: Gender coefficients by country. For each country, we regress the respective preference on gender, age and its square, and subjective math skills, and plot the resulting gender coefficients as well as their significance level. In order to make countries comparable, each preference was standardized (z-scores) within each country before computing the coefficients.

Figure 4: Cognitive ability coefficients by country. For each country, we regress the respective preference on gender, age and its square, and subjective math skills, and plot the resulting coefficients on subjective math skills as well as their significance level. In order to make countries comparable, each preference was standardized (z-scores) within each country before computing the coefficients.

Figure 5: Age profiles by OECD membership.

Table 6: Pairwise correlations between preferences and geographic and cultural variables

Figure 10: Distribution of preferences at individual level.
Figure 11: Distribution of preferences at country level.
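The within-country standardization in the Figure 3 and Figure 4 captions can be sketched as below. This is a simplification under stated assumptions: the data are made up, the regression uses gender alone (the paper also includes age, age squared, and subjective math skills), and the population standard deviation is used because the captions do not say which variant the authors chose.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ols_slope(x, y):
    """Slope of a one-regressor least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up rows: country -> list of (female indicator, raw preference score).
data = {
    "A": [(0, 2.0), (0, 4.0), (1, 6.0), (1, 8.0)],
    "B": [(0, 10.0), (1, 10.0), (0, 30.0), (1, 30.0)],
}

gender_coef = {}
for country, rows in data.items():
    female = [f for f, _ in rows]
    pref_z = zscore([p for _, p in rows])  # standardized within this country
    gender_coef[country] = ols_slope(female, pref_z)
```

Because each preference is rescaled inside its own country, the coefficients are in within-country standard deviations and can be compared across countries even when the raw scales differ (country B's raw scores are much larger than A's here).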

interesting digression:
D Discussion of Measurement Error and Within- versus Between-Country Variation
study  dataset  data  database  let-me-see  economics  growth-econ  broad-econ  microfoundations  anthropology  cultural-dynamics  culture  psychology  behavioral-econ  values  🎩  pdf  piracy  world  spearhead  general-survey  poll  group-level  within-group  variance-components  🌞  correlation  demographics  age-generation  gender  iq  cooperate-defect  time-preference  temperance  labor  wealth  wealth-of-nations  entrepreneurialism  outcome-risk  altruism  trust  patience  developing-world  maps  visualization  n-factor  things  phalanges  personality  regression  gender-diff  pop-diff  geography  usa  canada  anglo  europe  the-great-west-whale  nordic  anglosphere  MENA  africa  china  asia  sinosphere  latin-america  self-report  hive-mind  GT-101  realness  long-short-run  endo-exo  signal-noise  communism  japan  korea  methodology  measurement  org:ngo  white-paper  endogenous-exogenous  within-without  hari-seldon 
october 2017 by nhaliday
Comprehensive Military Power: World’s Top 10 Militaries of 2015 - The Unz Review
gnon  military  defense  scale  top-n  list  ranking  usa  china  asia  analysis  data  sinosphere  critique  russia  capital  magnitude  street-fighting  individualism-collectivism  europe  germanic  world  developing-world  latin-america  MENA  india  war  meta:war  history  mostly-modern  world-war  prediction  trends  realpolitik  strategy  thucydides  great-powers  multi  news  org:mag  org:biz  org:foreign  current-events  the-bones  org:rec  org:data  org:popup  skunkworks  database  dataset  power  energy-resources  heavy-industry  economics  growth-econ  foreign-policy  geopolitics  maps  project  expansionism  the-world-is-just-atoms  civilization  let-me-see  wiki  reference  metrics  urban  population  japan  britain  gallic  allodium  definite-planning  kumbaya-kult  peace-violence  urban-rural  wealth  wealth-of-nations  econ-metrics  dynamic  infographic 
june 2017 by nhaliday
SDA: Survey Documentation and Analysis
preliminary summary of how to use:

Archive -> cumulative datafile
programs:
'Tables' = see frequencies by crosstab
'Means' = see means by crosstab
'Regression' = multivar regression

to subset on range: var(x-y); for more detail: http://sda.berkeley.edu/sdaweb/helpfiles/helpan.htm#filter
recoding: http://sda.berkeley.edu/sdaweb/helpfiles/helpan.htm#recode
computing new variables using arithmetic expressions: http://sda.berkeley.edu/sdaweb/helpfiles/helpnewv.htm#compute

when variables are behaving funkily, can be useful to check out availability by year in the Chicago data explorer: https://gssdataexplorer.norc.org/variables/

IAP = "inapplicable" (not asked that year, or N/A)

how to do line plots: http://www.ssric.org/node/601
use 'bar chart' instead of 'stacked bar chart' to get histogram

Razib: http://blogs.discovermagazine.com/gnxp/2011/08/how-to-use-the-general-social-survey/
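For intuition, a local analogue of what 'Tables' and 'Means' compute, including dropping IAP rows. This is not SDA's interface or syntax; the rows and variable names are invented and are not real GSS variables.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up respondent rows; variable names are hypothetical, not GSS variables.
rows = [
    {"region": "north", "party": "x", "hours": 40},
    {"region": "north", "party": "y", "hours": 30},
    {"region": "south", "party": "x", "hours": 50},
    {"region": "south", "party": "x", "hours": "IAP"},  # inapplicable
]

# 'Tables': frequency of each (row variable, column variable) cell.
freq = Counter((r["region"], r["party"]) for r in rows)

# 'Means': mean of a numeric variable within each group, skipping IAP.
by_region = defaultdict(list)
for r in rows:
    if r["hours"] != "IAP":
        by_region[r["region"]].append(r["hours"])
means = {region: mean(vals) for region, vals in by_region.items()}
```

Note that the IAP respondent still counts in the frequency table but not in the mean, which is one way a variable can look funky across years.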
general-survey  database  crosstab  data  org:data  org:edu  sociology  let-me-see  poll  usa  culture  society  values  ideology  elections  coalitions  demographics  tools  2016-election  dynamic  todo  dataset  social-science  multi  documentation  howto  gnxp  scitariat  info-foraging  calculator  data-science 
may 2017 by nhaliday
Fertility trends by social status
The study reveals that as fertility declines, there is a general shift from a positive to a negative or neutral status-fertility relation. Those with high income/wealth or high occupation/social class switch from having relatively many to fewer or the same number of children as others. Education, however, depresses fertility for as long as this relation is observed (from early in the 20th century).

- good survey with trends for different regions, including UK+North America
- Figure 4: quadratic for UK+NA, crossing zero around 1800 or so and quickly leveling off
http://imgur.com/a/xjwO1
- also Figure 5: fertility differential by total TFR (quadratic trend), so worst dysgenics in middle of demographic transition
- dataset: http://www.demographic-research.org/volumes/vol18/5/files/StatusFertilityDataset.xls

This article discusses how fertility relates to social status with the use of a new dataset, several times larger than the ones used so far. The status-fertility relation is investigated over several centuries, across world regions and by the type of status-measure. The study reveals that as fertility declines, there is a general shift from a positive to a negative or neutral status-fertility relation. Those with high income/wealth or high occupation/social class switch from having relatively many to fewer or the same number of children as others. Education, however, depresses fertility for as long as this relation is observed (from early in the 20th century).
pdf  study  demographics  sociology  fertility  correlation  dysgenics  britain  anglo  usa  history  early-modern  mostly-modern  trends  iq  education  status  compensation  money  class  gender  social-structure  🎩  🌞  world  demographic-transition  plots  science-anxiety  multi  pic  visualization  data  developing-world  deep-materialism  new-religion  stylized-facts  age-generation  s:*  nonlinearity  wealth  s-factor  chart  biophysical-econ  broad-econ  solid-study  rot  the-bones  meta-analysis  database  dataset  curvature  pre-ww2  modernity  time-series  convexity-curvature  hari-seldon 
march 2017 by nhaliday
Was the Wealth of Nations Determined in 1000 BC?
Our most interesting, strong, and robust results are for the association of 1500 AD technology with per capita income and technology adoption today. We also find robust and significant technological persistence from 1000 BC to 0 AD, and from 0 AD to 1500 AD.

migration-adjusted ancestry predicts economic growth and technology adoption today

https://economix.blogs.nytimes.com/2010/08/02/was-todays-poverty-determined-in-1000-b-c/

Putterman-Weil:
Post-1500 Population Flows and the Long Run Determinants of Economic Growth and Inequality: http://www.nber.org/papers/w14448
Persistence of Fortune: Accounting for Population Movements, There Was No Post-Columbian Reversal: http://sci-hub.tw/10.1257/mac.6.3.1
Extended State History Index: https://sites.google.com/site/econolaols/extended-state-history-index
Description:
The data set extends and replaces previous versions of the State Antiquity Index (originally created by Bockstette, Chanda and Putterman, 2002). The updated data extends the previous Statehist data into the years before 1 CE, to the first states in Mesopotamia (in the fourth millennium BCE), along with filling in the years 1951 – 2000 CE that were left out of past versions of the Statehist data.
The construction of the index follows the principles developed by Bockstette et al (2002). First, the duration of state existence is established for each territory defined by modern-day country borders. Second, this duration is divided into 50-year periods. For each half-century from the first period (state emergence) onwards, the authors assign scores to reflect three dimensions of state presence, based on the following questions: 1) Is there a government above the tribal level? 2) Is this government foreign or locally based? 3) How much of the territory of the modern country was ruled by this government?
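A hedged sketch of the construction steps just described: split a territory's state duration into 50-year periods and score each period on the three questions. The description above does not say how the three answers are coded or combined, so the 0-to-1 coding, the multiplication, and the plain average are all assumptions of this sketch, and the example scores are invented.

```python
from statistics import mean


def half_century_periods(start_year, end_year, length=50):
    """Split [start_year, end_year) into consecutive fixed-length periods."""
    return [
        (y, min(y + length, end_year))
        for y in range(start_year, end_year, length)
    ]


def period_score(above_tribal, locally_based, territory_share):
    """Combine the three answers, each coded in [0, 1].

    Assumption: the answers are multiplied; the source text does not
    specify the aggregation rule.
    """
    return above_tribal * locally_based * territory_share


periods = half_century_periods(1, 201)  # four periods covering 1-200 CE

# Invented answers per period: (government above tribal level?,
# locally based?, share of modern territory ruled).
answers = [(1.0, 1.0, 0.5), (1.0, 0.5, 1.0), (1.0, 1.0, 1.0), (0.0, 1.0, 1.0)]
scores = [period_score(*a) for a in answers]

# Assumption: an unweighted average across periods, with no discounting.
index = mean(scores)
```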

Creators: Oana Borcan, Ola Olsson & Louis Putterman

State History and Economic Development: Evidence from Six Millennia: https://drive.google.com/file/d/1cifUljlPpoURL7VPOQRGF5q9H6zgVFXe/view
The presence of a state is one of the most reliable historical predictors of social and economic development. In this article, we complete the coding of an extant indicator of state presence from 3500 BCE forward for almost all but the smallest countries of the world today. We outline a theoretical framework where accumulated state experience increases aggregate productivity in individual countries but where newer or relatively inexperienced states can reach a higher productivity maximum by learning from the experience of older states. The predicted pattern of comparative development is tested in an empirical analysis where we introduce our extended state history variable. Our key finding is that the current level of economic development across countries has a hump-shaped relationship with accumulated state history.

nonlinearity confirmed in this other paper:
State and Development: A Historical Study of Europe from 0 AD to 2000 AD: https://ideas.repec.org/p/hic/wpaper/219.html
After addressing conceptual and practical concerns on its construction, we present a measure of the mean duration of state rule that is aimed at resolving some of these issues. We then present our findings on the relationship between our measure and local development, drawing from observations in Europe spanning from 0 AD to 2000 AD. We find that during this period, the mean duration of state rule and the local income level have a nonlinear, inverse U-shaped relationship, controlling for a set of historical, geographic and socioeconomic factors. Regions that have historically experienced short or long duration of state rule on average lag behind in their local wealth today, while those that have experienced medium-duration state rule on average fare better.

Figure 1 shows all borders that existed during this period
Figure 4 shows quadratic fit
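What a quadratic fit with an inverse-U shape looks like mechanically, as a self-contained sketch. The data points are synthetic (generated from a known parabola, not taken from the paper), and `fit_quadratic` is a hypothetical helper written for this note.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Normal equations (X^T X) beta = X^T y with design columns [x^2, x, 1].
    s = [sum(x ** k for x in xs) for k in range(5)]  # power sums of x
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Synthetic points on y = -x^2 + 4x + 1 (x: duration of rule, y: income,
# both in arbitrary made-up units).
xs = [0, 1, 2, 3, 4]
ys = [1, 4, 5, 4, 1]

a, b, c = fit_quadratic(xs, ys)
peak_x = -b / (2 * a)  # turning point; a < 0 means an inverse U
```

A negative coefficient on the squared term is what makes the curve hump-shaped, and the turning point marks the "medium duration" at which the fitted outcome is highest.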

I wonder if the inverse-U shape is due to an Ibn Khaldun-Turchin style effect on asabiya? They suggest sunk costs and ossified institutions.
study  economics  growth-econ  history  antiquity  medieval  cliometrics  macro  path-dependence  hive-mind  garett-jones  spearhead  biodet  🎩  🌞  human-capital  divergence  multi  roots  demographics  the-great-west-whale  europe  china  asia  technology  easterly  definite-planning  big-picture  big-peeps  early-modern  stylized-facts  s:*  broad-econ  track-record  migration  assimilation  chart  frontier  prepping  discovery  biophysical-econ  cultural-dynamics  wealth-of-nations  ideas  occident  microfoundations  news  org:rec  popsci  age-of-discovery  expansionism  conquest-empire  pdf  piracy  world  developing-world  deep-materialism  dataset  time  data  database  time-series  leviathan  political-econ  polisci  iron-age  mostly-modern  government  institutions  correlation  curvature  econ-metrics  wealth  geography  walls  within-group  nonlinearity  convexity-curvature  models  marginal  wire-guided  branches  cohesion  organizing  hari-seldon 
march 2017 by nhaliday
Information Processing: Big, complicated data sets
This Times article profiles Nick Patterson, a mathematician whose career wandered from cryptography, to finance (7 years at Renaissance) and finally to bioinformatics. “I’m a data guy,” Dr. Patterson said. “What I know about is how to analyze big, complicated data sets.”

If you're a smart guy looking for something to do, there are 3 huge computational problems staring you in the face, for which the data is readily accessible.

1) human genome: 3 GB of data in a single genome; most data freely available on the Web (e.g., Hapmap stores patterns of sequence variation). Got a hypothesis about deep human history (evolution)? Test it yourself...

2) market prediction: every market tick available at zero or minimal subscription-service cost. Can you model short term movements? It's never been cheaper to build and test your model!

3) internet search: about 10^3 Terabytes of data (admittedly, a barrier to entry for an individual, but not for a startup). Can you come up with a better way to index or search it? What about peripheral problems like language translation or picture or video search?

The biggest barrier to entry is, of course, brainpower and a few years (a decade?) of concentrated learning. But the necessary books are all in the library :-)

Patterson has worked in 2 of the 3 areas listed above! Substituting crypto for internet search is understandable given his age, our cold war history, etc.
hsu  scitariat  quotes  links  news  org:rec  profile  giants  stories  huge-data-the-biggest  genomics  bioinformatics  finance  crypto  history  britain  interdisciplinary  the-trenches  🔬  questions  genetics  dataset  search  web  internet  scale  commentary  apollonian-dionysian  magnitude  examples  open-problems  big-surf  markets  securities  ORFE  nitty-gritty  quixotic  google  startups  ideas  measure  space-complexity  minimum-viable  move-fast-(and-break-things) 
february 2017 by nhaliday
