Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Daily Archives: October 12, 2015

Paper – Characterizing the Google Books Corpus

Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. Eitan Adam Pechenick, Christopher M. Danforth, Peter Sheridan Dodds. PLOS ONE – Published: October 7, 2015. DOI: 10.1371/journal.pone.0137041.

“It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.”
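
As a rough illustration of what comparing decades "via a divergence measure" can look like, here is a minimal sketch. The abstract does not name the measure, so the Jensen-Shannon divergence is assumed here as one common information-theoretic choice. The function names and word counts are invented for illustration and are not taken from the paper or the corpus.

```python
import math


def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def jsd_contributions(p, q):
    """Per-word contributions, in bits, to the Jensen-Shannon divergence.

    Assumption: JSD stands in for the unnamed "divergence measure".
    """
    contributions = {}
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # probability of the word in the 50/50 mixture
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = term
    return contributions


def jsd(p, q):
    """Total Jensen-Shannon divergence: the sum of per-word contributions."""
    return sum(jsd_contributions(p, q).values())


# Hypothetical toy counts: the second "decade" gains citation-style tokens.
decade_a = normalize({"the": 50, "love": 30, "sea": 20})
decade_b = normalize({"the": 50, "love": 10, "et": 20, "al": 20})

# Rank words by how much each one drives the divergence between the decades.
ranked = sorted(
    jsd_contributions(decade_a, decade_b).items(),
    key=lambda item: item[1],
    reverse=True,
)
for word, bits in ranked:
    print(f"{word}: {bits:.4f} bits")
```

Ranking the per-word terms is what lets a reader see which words account for a shift between two periods, for example tokens typical of academic citations.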

Public Collection – public art and literacy project

“The Public Collection is a public art and literacy project developed by Rachel M. Simon to improve literacy, foster a deeper appreciation of the arts, and raise awareness for education and social justice in our community. Through a curated process, Indiana-based artists were commissioned to design unique book share stations or lending libraries that are installed in… Continue Reading

Quality of Death Index 2015 – Ranking palliative care across the world

The Economist: “The UK ranks first in the 2015 Quality of Death Index, a measure of the quality of palliative care in 80 countries around the world released today by The Economist Intelligence Unit (EIU). Its ranking is due to comprehensive national policies, the extensive integration of palliative care into the National Health Service, a… Continue Reading

Constitutional Bad Faith

Pozen, David, Constitutional Bad Faith (October 12, 2015). 129 Harvard Law Review (forthcoming 2016). Available for download at SSRN: “The concepts of good faith and bad faith play a central role in many areas of private law and international law. Typically associated with honesty, loyalty, and fair dealing, good faith is said to supply… Continue Reading

launches user privacy initiative

blog: “Our new Privacy Manager puts privacy choices at your fingertips. Your privacy is serious business. That’s why we always make sure we have important safeguards in place to protect the information you provide when you visit We’ve now introduced a tool that lets you easily control some of the information we may… Continue Reading