Clarity Software
Unique state-of-the-art statistical software package which answers the requirements for joint analysis of complex data sets from various fields
Regardless of the aspect of language being compared to parallel data sets,
joint analysis with genetics or history lacks a suitable comparison metric. Language variation is
often represented by a tree, which could be compared to trees of genetic, cultural or historical
variation. Therefore, in preparation for the collaborations made possible by the OCSEAN
mobility proposal, we have developed a unique state-of-the-art statistical software package
(CLARITY), which answers these requirements for joint analysis of data sets and permits a true
synthesis of disciplines in the pursuit of knowledge of the human past.
The CLARITY approach that we discuss is freely available for download at https://github.com/danjlawson/CLARITY. It exploits the idea that modelling
similarity matrices obtained from different data sources may be difficult, but that they can still
be compared using their ability to predict one another. Further, the relative importance of
various “structures” – loosely, clusters – in each dataset can be compared by examining how
many dimensions of a low-dimensional representation of one dataset are required to predict a
structure in the second dataset.
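
The idea of predicting one similarity matrix from a low-dimensional representation of another can be illustrated with a minimal sketch. It does not call the CLARITY package itself. It assumes, purely for illustration, that the low-dimensional representation is the top k singular vectors of the reference matrix, and the function names (cluster_similarity, low_rank_basis, prediction_residuals, residual_curve) are hypothetical.

```python
import numpy as np

def cluster_similarity(labels):
    """Toy similarity matrix: 1 if two samples share a cluster label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def low_rank_basis(Y, k):
    """Top-k left singular vectors of a similarity matrix (assumed representation)."""
    U, _, _ = np.linalg.svd(Y)
    return U[:, :k]

def prediction_residuals(Y_ref, Y_target, k):
    """Predict Y_target as A X A^T, where A is learned from Y_ref only."""
    A = low_rank_basis(Y_ref, k)
    # A has orthonormal columns, so the least-squares X is A^T Y_target A.
    X = A.T @ Y_target @ A
    return Y_target - A @ X @ A.T

def residual_curve(Y_ref, Y_target, max_k):
    """Sum of squared residuals for each number of dimensions 1..max_k."""
    return [float(np.sum(prediction_residuals(Y_ref, Y_target, k) ** 2))
            for k in range(1, max_k + 1)]
```

For example, if the reference data contain three clusters and the second dataset merges two of them, residual_curve(cluster_similarity([0, 0, 0, 1, 1, 2, 2, 2, 2]), cluster_similarity([0, 0, 0, 0, 0, 1, 1, 1, 1]), 3) decreases as dimensions are added and reaches numerical zero once all three reference dimensions are used. The number of dimensions at which a structure becomes predictable is the quantity the comparison is concerned with.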
This approach is not an explicit model combining different datasets but is instead a
hypothesis-generating way to compare arbitrary data. It can, for example, provide a probability
that a particular language sharing result is surprising given the genetic sharing patterns. Such a
result would then be investigated in an individual analysis, for example by identifying which word
cognates are responsible for an additional signal of language sharing compared to genetic sharing.
Where appropriate we will develop additional methodology to address important questions.
We will make inferences about demographic parameters, using a model of the joint
evolution of language and genomes, initially basing our approach on that of Thouzeau et al.
(2017), who have modified the language evolutionary model of Gray and Atkinson (2002) to
create a package, PopLingSim, for simulating languages under different demographic histories.
Genome evolution can be simulated with the same demographic history using e.g. SLiM or
msprime. As our study progresses we will consider further summary statistics and ABC
approaches, making use of machine learning methods (Beaumont, 2019), as described above.
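
As a rough illustration of how a simulation-based ABC analysis proceeds, the sketch below implements plain rejection ABC for a single demographic parameter. Everything in it is an assumption made for the example: the simulator is a toy stand-in for a joint language and genome simulation (it does not call PopLingSim, SLiM or msprime), the parameter is a hypothetical split time with a flat prior, and the summary statistic, noise level and tolerance are arbitrary.

```python
import random
import statistics

def simulate_summary(split_time, rng):
    """Hypothetical toy simulator: one summary statistic that grows with split time."""
    return 0.01 * split_time + rng.gauss(0.0, 0.5)

def abc_rejection(observed, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lies within `tolerance` of the observed one."""
    accepted = []
    for _ in range(n_draws):
        split_time = rng.uniform(0.0, 1000.0)  # assumed flat prior on split time
        if abs(simulate_summary(split_time, rng) - observed) <= tolerance:
            accepted.append(split_time)
    return accepted

rng = random.Random(1)  # fixed seed so the sketch is reproducible
posterior = abc_rejection(observed=5.0, n_draws=20000, tolerance=0.2, rng=rng)
posterior_mean = statistics.mean(posterior)
```

The accepted draws approximate the posterior distribution of the parameter given the observed summary. In a real analysis the toy simulator would be replaced by paired language and genome simulations under a shared demographic history, and the single statistic by a vector of summary statistics.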
A key barrier to the use of archaeological data is unevenness in sampling, with some
studies providing well-provenienced materials, collected using modern protocols and with good chronological controls, whilst others are known only through limited and occasionally tangential
information. This results in missing data that are usually not missing at random (Schafer 1999).
Archaeological data can be treated as a third type of information on cultural processes, and the
use of hierarchical Bayesian modelling may allow imputation of unobserved data (e.g. to infer the types of ceramics expected in unsampled cultural horizons).
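
To make the imputation idea concrete, the following minimal sketch fits a hierarchical beta-binomial model by grid approximation and predicts the proportion of one ceramic type in an unsampled horizon. It is an illustrative assumption, not a description of the model that will be used: the counts, the grids, the flat hyperpriors and the function names are all hypothetical.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(k, n, a, b):
    """Log beta-binomial likelihood of k of n sherds, up to the binomial coefficient."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def predict_unsampled(counts, mu_grid, kappa_grid):
    """Posterior mean proportion for an unsampled horizon.

    Each sampled horizon has its own proportion drawn from Beta(mu*kappa, (1-mu)*kappa);
    mu and kappa get flat priors over the supplied grids (an assumption).
    """
    log_weights, mus = [], []
    for mu in mu_grid:
        for kappa in kappa_grid:
            a, b = mu * kappa, (1.0 - mu) * kappa
            log_weights.append(sum(log_marginal(k, n, a, b) for k, n in counts))
            mus.append(mu)
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - top) for lw in log_weights]
    # The expected proportion in a new horizon equals the posterior mean of mu.
    return sum(w * m for w, m in zip(weights, mus)) / sum(weights)

mu_grid = [i / 100 for i in range(1, 100)]
kappa_grid = [1, 2, 5, 10, 20, 50, 100]
# Hypothetical (count of type, total sherds) for three sampled horizons.
estimate = predict_unsampled([(12, 40), (9, 30), (15, 50)], mu_grid, kappa_grid)
```

The prediction for the unsampled horizon borrows strength from the sampled ones through the shared hyperparameters, which is the sense in which a hierarchical model can impute unobserved data. A realistic model would also need to represent the non-random missingness noted above.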
It is therefore possible to quantify statistically the evidence for agreement between
genetic, linguistic and archaeological data, and specifically to assess
whether we can conclude that they disagree and, hence, that material culture and ideas have moved where genes did not. This approach will need to consider the likelihood of survival according to climate and geological history, and population densities.