Clarity Software

Unique state-of-the-art statistical software package which answers the requirements for joint analysis of complex data sets from various fields

Whatever aspect of language is compared with parallel data sets, joint analysis with genetics or history lacks a suitable comparison metric. Language variation is often represented as a tree, which could be compared with trees of genetic, cultural or historical variation. Therefore, in preparation for the collaborations made possible by the OCSEAN mobility proposal, we have developed a unique state-of-the-art statistical software package (CLARITY), which answers these requirements for joint analysis of data sets and permits a true synthesis of disciplines in the pursuit of knowledge of the human past.


The CLARITY approach that we discuss is freely available for download at https://github.com/danjlawson/CLARITY. It exploits the idea that modelling similarity matrices obtained from different data sources may be difficult, but that they can still be compared using their ability to predict one another. Further, the relative importance of various “structures” – loosely, clusters – in each dataset can be compared by examining how many dimensions of a low-dimensional representation of one dataset are required to predict a structure in the second dataset.
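The idea of comparing similarity matrices through prediction can be illustrated with a minimal sketch. It is not the CLARITY package's own code or interface. It assumes a plain eigendecomposition as the low-dimensional representation, and the function names (`low_rank_basis`, `prediction_residuals`), the variable names (`Y_gen`, `Y_lang`) and the toy data are all hypothetical. A k-dimensional basis is learned from one similarity matrix, the second matrix is fitted using that basis only, and whatever is left over is the residual, i.e. the part of the second dataset that the first cannot predict.

```python
import numpy as np

def low_rank_basis(Y, k):
    """Return the k eigenvectors of a symmetric similarity matrix
    that have the largest absolute eigenvalues (an orthonormal basis)."""
    Y = (Y + Y.T) / 2.0                       # enforce symmetry
    vals, vecs = np.linalg.eigh(Y)
    order = np.argsort(-np.abs(vals))[:k]
    return vecs[:, order]

def prediction_residuals(Y_ref, Y_target, k):
    """Residuals of Y_target after predicting it from the
    k-dimensional structure learned from Y_ref."""
    A = low_rank_basis(Y_ref, k)
    # With orthonormal A, the least-squares fit of Y_target ~ A X A^T
    # is X = A^T Y_target A.
    X = A.T @ Y_target @ A
    return Y_target - A @ X @ A.T

# Toy data (assumption): 12 samples in 3 clusters of 4.
labels = np.repeat([0, 1, 2], 4)
Z = np.eye(3)[labels]                         # cluster membership matrix
Y_gen = Z @ Z.T                               # "genetic" similarity
Y_lang = Y_gen.copy()                         # "linguistic" similarity ...
Y_lang[0, 11] = Y_lang[11, 0] = 1.0           # ... plus one extra sharing event

# Total unexplained signal as more dimensions of Y_gen are used.
norms = {k: np.linalg.norm(prediction_residuals(Y_gen, Y_lang, k))
         for k in (1, 2, 3)}
for k, value in norms.items():
    print(f"k={k}: residual norm {value:.3f}")

# Locate the pair of samples that the genetic structure predicts worst.
R = prediction_residuals(Y_gen, Y_lang, k=3)
i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)
print("largest residual at pair", (int(i), int(j)))
```

In this toy setting the residual shrinks as k grows, and what remains at k=3 is concentrated on the one pair of samples whose linguistic sharing has no genetic counterpart. A pair of this kind is what would be followed up in an individual analysis.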


This approach is not an explicit model combining different datasets but is instead a hypothesis-generating way to compare arbitrary data. It can, for example, provide a probability that a particular language sharing result is surprising given the genetic sharing patterns. This would then be investigated in an individual analysis, for example by identifying which word cognates are responsible for an additional signal of language sharing compared with genetic sharing. Where appropriate we will develop additional methodology to address important questions.


We will make inferences about demographic parameters, using a model of the joint evolution of language and genomes. Our approach will initially be based on that of Thouzeau et al. (2017), who have modified the language evolutionary model of Gray and Atkinson (2002) to create a package, PopLingSim, for simulating languages under different demographic histories. Genome evolution can be simulated with the same demographic history using e.g. SLiM or msprime. As our study progresses we will consider further summary statistics and approximate Bayesian computation (ABC) approaches, making use of machine learning methods (Beaumont, 2019), as described above.
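The logic of an ABC analysis can be sketched in a few lines. The example below is a minimal rejection sampler under strong simplifying assumptions. The simulator `simulate_summary` is a hypothetical toy stand-in for a real simulator such as PopLingSim, SLiM or msprime, and it does not use any of their interfaces. The single parameter (a split time), the uniform prior, the rate and the tolerance are likewise illustrative choices only.

```python
import math
import random

def simulate_summary(split_time, n_sites=200, rate=0.01, rng=random):
    """Toy simulator (hypothetical): fraction of sites, or cognates, that
    differ between two populations that split `split_time` generations ago."""
    p_diff = 1.0 - math.exp(-2.0 * rate * split_time)
    diffs = sum(rng.random() < p_diff for _ in range(n_sites))
    return diffs / n_sites

def abc_rejection(observed, n_draws=5000, tolerance=0.02,
                  prior=(0.0, 100.0), seed=1):
    """Draw split times from a uniform prior and keep those whose simulated
    summary statistic lies within `tolerance` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        t = rng.uniform(*prior)
        if abs(simulate_summary(t, rng=rng) - observed) <= tolerance:
            accepted.append(t)
    return accepted

# "Observed" data generated with a known split time of 30 (assumption).
observed = simulate_summary(30.0, rng=random.Random(0))
posterior = abc_rejection(observed)
posterior_mean = sum(posterior) / len(posterior)
print(f"accepted {len(posterior)} draws, posterior mean {posterior_mean:.1f}")
```

The accepted draws approximate the posterior distribution of the split time. In a joint analysis the summary statistics would be computed from both simulated languages and simulated genomes generated under the same demographic history.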


A key barrier to the use of archaeological data is unevenness in sampling: some studies provide well-provenienced materials, collected using modern protocols and with good chronological controls, whilst others are known only through limited and occasionally tangential information. This results in missing data that are usually not missing at random (Schafer 1999). Archaeological data can be treated as a third type of information on cultural processes, and the use of hierarchical Bayesian modelling may allow imputation of unobserved data (e.g. to infer the type of ceramics expected in unsampled cultural horizons).


It is, therefore, possible to quantify statistically how strongly genetic, linguistic and archaeological data agree, and specifically whether we can conclude that they disagree and, hence, that material culture and ideas have moved where genes did not. This approach will need to consider the likelihood that archaeological material survives, given climate and geological history, and population densities.
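One standard way to put a number on the agreement between two distance matrices is a Mantel permutation test. It is shown here as an illustration only. It is a different and simpler technique from the CLARITY approach, and the function name `mantel_test`, the matrices `genetic` and `linguistic` and the simulated data are hypothetical.

```python
import numpy as np

def mantel_test(A, B, n_perm=999, seed=0):
    """Correlation between the off-diagonal entries of two symmetric
    distance matrices, with a one-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(A, k=1)         # upper triangle, no diagonal
    observed = np.corrcoef(A[iu], B[iu])[0, 1]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(A.shape[0])
        B_perm = B[np.ix_(perm, perm)]        # relabel the samples of B
        if np.corrcoef(A[iu], B_perm[iu])[0, 1] >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Simulated example (assumption): linguistic distances are a noisy
# copy of genetic distances among 15 samples.
rng = np.random.default_rng(42)
coords = rng.normal(size=(15, 2))
genetic = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
noise = rng.normal(scale=0.3, size=genetic.shape)
linguistic = genetic + (noise + noise.T) / 2.0
np.fill_diagonal(linguistic, 0.0)

r, p = mantel_test(genetic, linguistic)
print(f"Mantel r = {r:.2f}, permutation p = {p:.3f}")
```

A small p-value indicates that the two data types agree more than expected by chance. A test of this kind does not by itself account for uneven sampling or differential survival of material, which would need to be modelled separately.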