RESEARCH ARTICLE

Critical data modeling and the basic representation model

Karen M. Wickett


School of Information Sciences, University of Illinois at Urbana-Champaign, Urbana-Champaign, Illinois, USA

Correspondence

Karen M. Wickett, School of Information Sciences, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA.

Email: [email protected]

First published: 17 February 2023

Abstract

The increasing role and impact of information systems in modern life calls for new types of information studies that examine sociotechnical factors at play in the development and use of information systems. This article proposes critical data modeling—the use of data modeling and systems analysis techniques to build critical interrogations of information systems—as a method for bridging between social factors and technical systems, presents the Basic Representation Model as an analytical tool for critical data modeling, and discusses the results of critical data modeling of a police arrest record dataset. The Basic Representation Model is a conceptual model of information objects that supports a detailed examination of data modeling and information representation within and across information systems, and functions as a synthesizing concept for existing critical work on information systems. Critical data modeling adds an essential complement to existing approaches to critical information studies by grounding the analysis of an information system in both the technical realities of computational systems and the social realities of our communities.

1 INTRODUCTION

Information systems are now an integral part of our lives—socializing, learning, finance, commerce, government, and criminal justice all happen in interaction with information systems. As the technical platforms that realize an “information society” become pervasive, the information representation decisions that structure those systems play out in the lives of our communities. These impacts can be life-altering and can extend across generations, as we can see with examples of harm from misclassifications in medical information systems (Bowker & Star, 2000), criminal justice databases (Angwin et al., 2016), and systems that track eligibility for public benefits (Eubanks, 2018).

In order to perform its functions, an information system needs data models. Data models instantiate the information representation decisions that let an information system represent the real world, perform data processing, and take actions. Within an information system, multiple levels of representation and encoding provide a path between our messy real world and the binary logic of computing. Information system design decisions may appear neutral and disconnected from the real world, especially when viewed in isolation from the wider sociotechnical context where they will play out for people and communities. However, these decisions are made by humans who are actively creating a lens on the world that structures the information within a system. The resulting system actions and data objects are therefore shaped by the worldviews of information system builders.

This article introduces critical data modeling as an analytical approach to sociotechnical studies of information systems and information objects. I present the Basic Representation Model as a tool for critical data modeling, and discuss results of critical data modeling of a publicly available dataset of police arrest records. The Basic Representation Model is a conceptual model of information objects that supports detailed interrogations of data modeling and information representation within and across information systems. In addition to structuring interrogation of information objects, the Basic Representation Model functions as a synthesizing concept that brings together existing critical studies of information. Unpacking information representation and highlighting data modeling decisions bridges a gap between current critiques of information systems that focus on social impacts and the technically-focused work of building computing systems.

2 CRITICAL DATA MODELING

The increasing role and impact of information systems in modern life calls for new types of information studies that examine sociotechnical factors at play in information system development and use, along with the evolution of methods and models to support that work. I am proposing critical data modeling—the use of data modeling and systems analysis techniques to closely examine the creation and implementation of information systems—as a method for bridging between social factors and technical systems. Using these techniques with a critical lens will let us bring broader social critiques “under the hood” of technical systems, to explain precisely how these systems perpetuate injustices or do harm. The primary goals of critical data modeling are to expose unjust or biased assumptions in data models or algorithms and to highlight the technical roles of those assumptions within a system. Exposing and examining these assumptions in the language of the technical systems in which they are embedded will validate and expand existing critiques and enable the development of more equitable information systems.

2.1 Critical analysis of information systems

As defined by Siva Vaidhyanathan, critical information studies “considers the ways in which culture and information are regulated by their relationship to commerce, creativity, and other human affairs” (2006). This area of study operates at an intersection of many fields, and “interrogates the structures, functions, habits, norms, and practices that guide global flows of information and cultural elements” (Vaidhyanathan, 2006). In recent years, critical information studies have focused attention on the interaction of race and technology. In addition to tracing cultural influence and information flows in the creation and deployment of information systems and policies, critiques arising from critical information studies provide tangible evidence of how structural racism is enacted in modern society. For example, Virginia Eubanks (2018) has shown how algorithmic decision making in public services has reinforced racism. Even more starkly, investigative journalists have shown how a widely-used machine learning system designed to inform criminal justice sentencing and parole decisions is heavily biased in harmful and racist ways (Angwin et al., 2016).

While the legal and economic impacts of racially biased information systems are deeply alarming, it is worth noting that information systems, and the algorithms that drive them, play a significant role in our social and cultural lives. Safiya Noble argues that the racism found in search suggestions and results on popular search engines, even when used for casual topics, has a significant impact on communities and on individual identity formation (Noble, 2018). Search and retrieval tools in academic contexts, which often use library knowledge organization systems, have also been critiqued for racism (Berman, 1993), xenophobia (Baron et al., 2016), homophobia (M. A. Adler, 2015), and ableism (M. Adler et al., 2017). Information systems are pervasive—they appear in school, work, finance, commerce, government, health care, politics, law, and criminal justice. The decision making behind their designs is typically opaque to their users. Even when users are experts in a domain, they are typically not also experts in the technical aspects of the information systems they use daily in their professional or personal lives.

Analytical techniques with an awareness of the technical aspects of information systems have been developed to support these critiques. Geoffrey Bowker and Susan Leigh Star laid the foundations of research on the impact of classifications within information systems in their landmark book “Sorting Things Out” (Bowker & Star, 2000), and scholars have continued this thread by interrogating the history and impact of classification categories (see, e.g., M. Adler et al. (2017), Baron et al. (2016)). Scholars working at the intersection of computer science and health policy are developing tools and methods to audit machine learning code and input datasets to examine racial disparities in health care information systems (Obermeyer et al., 2019). Costanza-Chock (2020) and the authors of Data Feminism (D'Ignazio & Klein, 2020) have both proposed system design criteria to guide data science projects to be more equitable (and less racist and sexist) in their representation of people and their communities. Methods that closely examine metadata and data structures in application programming interfaces have been shown to be a key element of fully analyzing the role of social media platforms in critical internet studies (Acker & Donovan, 2019). While revealing, these analyses tend to focus on a single level of representation in a computing system, and therefore do not capture the full picture of digital objects and computing systems.

In her book Race After Technology (Benjamin, 2019), Ruha Benjamin shows how the cumulative effect of modern information systems has created an insidious system of oppression that is obfuscated by the technical nature of the systems involved. The complex relationships and layers of encodings involved in the realization of any digital object are an additional obfuscating factor for the analysis of modern information objects and systems. By structuring critiques in alignment with the levels of representation inherent in any digital object, critical data modeling can reveal these injustices in both technical and social terms.

2.2 Models in information systems

Every information system uses data models. In order to critically examine information systems in the context of their technical commitments and operations, we need to interrogate those data models and how they are used or adapted. While certain kinds of data models, such as database schemas, are often explicitly documented, many of the data modeling decisions that go into algorithm design or data processing pipelines are not documented explicitly. Additionally, scholars working in a critical space need methods that allow us to work from available data objects, without depending on close relationships with corporations or governmental agencies to gain access to system documentation.

By “data model,” I mean a set of labels for categories, and a set of assumptions about how those labeled categories will be handled in a computational system. These assumptions include data types and potential inferences or relationships between categories. The data models used in information system design and implementation vary with the type of technologies used to build a system, the types of data being used or produced, and the uses or audience of the system. For example, the COMPAS recidivism prediction system uses demographic and criminal incidence data to produce a “score” that functions as a prediction of the likelihood that an incarcerated individual will undertake illegal actions after release from prison (Angwin et al., 2016). In this case, the information system is using data that already exists, integrating it with new data about an individual, and creating new data objects that are processed into a set of scores to be used by a judge in a parole or release hearing. The data is collected and processed according to a set of data models that define categories to classify individuals according to age, racial identity, and other features for demographic data. The machine learning algorithm then creates the score for an individual based on patterns in the existing data and the data about the individual.
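To make this definition concrete, the following sketch shows a data model as a set of labeled categories paired with handling assumptions. The labels and datatypes here are illustrative inventions that loosely echo the COMPAS example; they are not drawn from any actual system documentation.

```python
# A data model as labels plus handling assumptions (illustrative only).
# Each category carries a datatype and an assumed role in processing.
model = {
    "age":             {"type": int,   "assumed_use": "input feature for the score"},
    "racial_identity": {"type": str,   "assumed_use": "demographic category"},
    "prior_arrests":   {"type": int,   "assumed_use": "input feature for the score"},
    "risk_score":      {"type": float, "assumed_use": "output, read as a likelihood"},
}
```

Even in this toy form, the model takes positions: age is a whole number, racial identity fits a single coded value, and “risk” is expressible as one number.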

Machine learning algorithms seem to promise a type of objectivity by relying on data and computation over human judgment. But all data are created by humans according to a set of models. Data models necessarily create simplified categories that allow us to map the complexities of the real world into a form that will allow processing. Data models are a series of choices about what is recorded, what is left out, and how those things are recorded. What is recorded and how constrains what can then be done with that information. Calculating a numeric score that aims to predict whether a person will undertake illegal actions within a certain timeframe requires many modeling choices. For one example, the use of percentage points to indicate likelihood of particular future actions is a modeling choice that carries a connotation of scientific certainty, even while the exact semantics of that percentage are unclear. While this simplified score is easy to interpret intuitively, all of the assumptions and modeling choices that went into it are obfuscated, making it difficult to interpret critically.

2.3 What we need for critical data modeling

The work of the scholars and investigative journalists described above is an important element of the evolving space around critical information studies. The critical data modeling techniques proposed in this article complement and enhance the existing research landscape by bringing data modeling into the conversation. In addition to building a greater understanding of data models and their consequences, these analyses can inform critical studies of algorithms by aiding in “unpacking the full socio-technical assemblage of algorithms” (Kitchin, 2016).

One essential aspect of data modeling is the assignment of labeled categories to real-world things or events. For example, an enrollment management system used in a university context will use categories like “student” to represent people who have enrolled in their school. This category will have certain attributes associated with it, such as a student's name and mailing address. While some of these attributes will have meaning outside of the context of the system, many attributes (such as tuition payment status, enrollment date, or cumulative GPA) will be defined in ways that are tightly tied to the operation of the university and its information systems. This modeling activity reduces real-world entities to the set of attributes defined by a data model. Of course, reduction is necessary to get anything done with a computing system, because a functioning data model cannot represent the multitudes of information around us. But it is important to recognize that data modeling takes a position on what matters about those entities by elevating some aspects and leaving others out.
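A minimal sketch of the “student” category, with hypothetical field names, makes this reduction visible: the model elevates a handful of attributes and leaves everything else about the person outside the system.

```python
from dataclasses import dataclass

@dataclass
class Student:
    # Hypothetical fields for an enrollment management system.
    name: str               # a single string: collapses scripts and name orders
    mailing_address: str
    tuition_paid: bool      # meaningful only within the university's billing system
    enrollment_date: str    # defined by the system's own calendar and processes
    cumulative_gpa: float   # defined by institutional grading rules
```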

In order to perform a critical analysis of an information system, we need to be able to examine the categories, labels, datatypes, and processing code that gives data structure within the information system. The goal of critical data modeling is to ground analyses in the technical reality of information systems, in order to create bridges between sociocultural arguments and the operational context of information systems. These analyses will strengthen sociotechnical critiques by revealing how system design decisions that may appear neutral have real impacts on individual lives and our communities.

The creation of categories in a data model and the development of information processing procedures require what are fundamentally translations. In data modeling, this translation operates between the real world and the categories created by the system designers. In the development of information processing, the translations proliferate as all computing systems operate on the basis of translations between “machine-level” operations or data objects (like flipping bits), and “human-level” operations or data objects (like saving a file or executing a program). Therefore, critical data modeling requires the ability to navigate the levels of representation within computing systems and data objects. This account of representation and encoding is necessary to tell the full story of the translations, actions, and interpretations (by systems and by humans) that are involved in the creation and use of an information system. The following section introduces the Basic Representation Model, which lets us separate out levels of representation in data objects and information systems, to account for these translations and interpretations.

3 THE BASIC REPRESENTATION MODEL

This section introduces the Basic Representation Model and demonstrates how an everyday information object can be analyzed with the model. The model is presented as an entity-relationship model with three entity types and three relationship types (Figure 1). The Basic Representation Model was originally developed to support digital preservation of scientific data, and it provides a general model for information representation and encoding in digital objects (Wickett et al., 2012).

Figure 1. The Basic Representation Model.

One of the achievements of the Basic Representation Model is that it identifies the genuine entity types (following Guarino and Welty (2000)) involved in the recording and expression of semantic content, and distinguishes those from the contingent roles those entities take on in particular contexts. As a criterion for distinguishing types from roles, one can adapt Guarino and Welty and apply this rule: If it is possible that something that is an F might not have been an F, then being an F is a role that things have; otherwise F is a type of thing. So, using their example, since it is possible that someone who is a student might not have been a student (i.e., might not have enrolled this year), student is a role. But since it is not possible that something that is a person might not have been a person (and still exist), person is a type of thing.
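In modal terms, this criterion contrasts rigid properties (types) with anti-rigid properties (roles). A rough formalization, adapted from Guarino and Welty's notion of rigidity (their full account includes further refinements), is:

$$\text{rigid (a type):}\quad \forall x\,\big(F(x) \rightarrow \Box F(x)\big) \qquad\qquad \text{anti-rigid (a role):}\quad \forall x\,\big(F(x) \rightarrow \Diamond\neg F(x)\big)$$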

The separation of entity types and roles means that the Basic Representation Model is adaptable for critical data modeling of a range of information objects and systems, since it provides a method that clarifies entry points for analysis and supports tracking entities across different contexts. The Basic Representation Model proposes three fundamental types of things that participate in the representation of semantic content as information, be it in the form of digital objects or in more traditional forms such as printed text. These entity types are Propositional Content, Symbol Structure, and Patterned Matter and Energy.

3.1 A guiding example

As an example to guide the description and application of the model, consider a roster for a class, which was copied as tab-separated values from a web browser into a text editor and then saved as a file on the instructor's computer. This file is viewable with a text editor or in spreadsheet software, and contains information about enrollment in the class at a particular time. The first line of the file lists column headers such as “Record Number,” “Student Name,” “ID,” “Reg Status,” and “Credits.” The following 35 lines go on to list the corresponding information for each of the students who were enrolled in the class when the instructor copied the information from their web browser.

3.2 Entities and relationships

3.2.1 Propositional content

Propositions are defined in the Basic Representation Model as the bearers of truth values. In other words, propositions are statements that may be true or false. While the Basic Representation Model on its own does not account for whether the propositions expressed by an information object are true or false, critical data modelers are likely to include assessments of veracity in their analyses. (See the discussion of conformance with datatypes overriding accuracy in the analysis of an arrest dataset below for an example.) The key position of the Basic Representation Model is that propositions are semantic content that is not bound to particular expressive modes, which aligns with Floridi's definition of information as semantic content (Floridi, 2013). Expressions and encodings of information in various forms are what convey propositional content to some audience.

The propositional content of the class roster consists of a series of statements about students and their enrollment in the instructor's class section. Each student, indicated by their name and ID number, is enrolled in the class for a certain number of credits, and enrolled using a certain registration method. Many of the facts that can be learned from this roster are meaningful beyond the context of the roster, such as a student with a particular name having a particular ID number. Other facts, such as the record numbers assigned to each line of data, are only meaningful within the context of the roster. While the roster expresses this information as a table of data, these same propositions could be expressed in a variety of forms; for example, narratively with a series of English-language sentences or as a structured XML document.

3.2.2 Symbol structure

In the Basic Representation Model, symbol structures are the arrangements of symbols that express semantic content in a given context, or encode other symbol structures within a computational system. Individual symbols such as letters, numbers and shapes are the atomic components of a symbol structure. Those atomic components are assembled into symbol structures like strings of characters, or mathematical expressions like graphs, relations, and sequences.

3.2.3 Is expressed by

This relationship stands between propositional content and the primary symbol structures that realize that content. The distinction between primary and nonprimary symbol structures allows a separation between the realization of content, in which an author chooses a structural form (e.g., text written in English) to express some information, and the lower-level encodings of those structures in a computational system.

In the case of the class roster, the primary symbol structure is a table of data—a set of rows, where the first row gives labels for the values in the 35 rows that follow. So the fact that the student with the ID number 671351470 is enrolled in the course for 4 credit hours is expressed by the appearance of a line of data where “671351470” appears in the position labeled “ID” and “4.000” appears in the position labeled “Credits.” The labels and other textual information are written in English, and the names of the enrolled students are listed in “LastName, FirstName” form. The choice of language and form of name are part of the expressive act that creates the primary symbol structure.
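To make the primary symbol structure concrete, here is a hypothetical fragment of the roster as it might be stored; the student name and registration status are invented, while the ID and credit values come from the example above. The \t and \n escapes mark the tab characters that delimit values and the line breaks that delimit rows.

```python
# Hypothetical first lines of the roster file (tab-separated values).
header = "Record Number\tStudent Name\tID\tReg Status\tCredits\n"
row_1  = "1\tDoe, Jane\t671351470\tWeb Registered\t4.000\n"
```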

3.2.4 Is encoded by

Mappings and encoding relationships are a fundamental aspect of our computing systems. The processing machinery that drives our computers operates on the basis of information encoded into a binary digital signal—that is, a sequence of 0s and 1s. Therefore, any semantic content stored or displayed by a computing system must, at some level, be translated into a binary digital representation. The graphical characters we see on our screens through software are a visual representation, created by processing a corresponding binary digital representation of the character.

The is Encoded By relationship describes the encodings and mappings that we use to create and manage digital objects in computational systems. The layers of encoding and representation in a digital system are modeled in the Basic Representation Model by a series of is Encoded By relationships between Symbol Structures. Since it is a relationship between symbol structures, it is shown as recursive on the entity-relationship diagram.

In the class roster, the table of data is stored as tab-separated values. This format uses alphanumeric characters to store data, and is readable in many different kinds of software, including text editors, spreadsheet software applications and word processors. The end of each value in each row is indicated by a tab character, and the end of each row is indicated by a line break character. In the specific case of the class roster, the tab and line break characters, as well as all the numbers, letters and special characters that appear as data, are encoded using the Unicode character encoding system.

The first character that appears in the class roster is the “R” in “Record Number.” The Unicode character encoding system assigns hexadecimal numeric codes to characters, and “R” is represented by the numeric “code point” “U+0052.” These assignments give us our initial is Encoded By relationships for the class roster: between every letter, number, or special character that appears in the file and the corresponding Unicode code points. The numeric code points are in turn mapped to a binary digital representation (a sequence of 0s and 1s) according to a particular Unicode translation format, which is modeled with another is Encoded By relationship. The class roster uses the UTF-8 translation format, in which code points are translated into sets of 8-bit bytes, the length of which depends on the position of the character within a specified range of code points. The character “R” is mapped to the single 8-bit byte “01010010.”
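The chain of is Encoded By relationships described above can be traced directly in a few lines of Python:

```python
ch = "R"                      # first character of "Record Number"
print(f"U+{ord(ch):04X}")     # code point assignment: prints "U+0052"
utf8 = ch.encode("utf-8")     # UTF-8 translation format
print(f"{utf8[0]:08b}")       # single 8-bit byte: prints "01010010"
```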

3.2.5 Patterned matter and energy

While information, semantic content and symbol structures are all abstract entities, in order to encounter information we must encounter some material objects that record it in a way we can interpret. In other words, there must be some inscription of symbol structures in the form of patterned matter and energy. The name for this entity type takes inspiration from Marcia Bates' Fundamental Forms of Information (Bates, 2006), but separates the matter or energy that is patterned from the patterning, since the patterning is an abstract arrangement of some physical material.

3.2.6 Is inscribed in

The Is Inscribed In relationship stands between symbol structures that express or encode content and concrete arrangements of matter or energy. The concrete arrangement of matter might be ink on a piece of paper, carved stone, or etched metal. In the context of computing systems, the binary digital representations that encode symbol structures are typically recorded as magnetic charges on a hard disk, carved pits on an optical disk, or electrical charges on a flash memory device.

In the case of the class roster, the flash memory device that serves as the internal memory storage on the instructor's laptop has some set of logic gates that use electrical charges to record the binary digital representation of the class roster file. The instructor's laptop uses file management software that manages the inscription of the encoded data onto the storage device, and the electrical charges recording the bits that make up the binary digital representation may not be in physical proximity to each other since their placement may be managed to maximize overall storage capacity on the device. Inscription of symbol structures onto computational memory devices involves complexities that are specific to operating systems and physical memory devices. The ability to integrate a detailed account of computational inscription with the data modeling choices made by a user who is unaware of the complexities of their computing system is a strength of the Basic Representation Model.

The Basic Representation Model describes the entities and relationships that are involved in the encoding, storage and retrieval of information objects. The model focuses on identifying entities in terms of these essential types and lets us separate symbol structures or pieces of matter from the meanings that might be assigned to them in some particular context. This separation supports critical data modeling by acknowledging and accounting for the layers of representation in digital objects. As discussed in the next section, the Basic Representation Model serves as a conceptual backbone for critical data modeling.

4 DISCUSSION

The Basic Representation Model supports technical close readings of information objects, where the lens of qualitative content analysis is applied to the layers of representation and meaning that instantiate digital objects. This methodology draws from qualitative content analysis and close reading, and has been used recently to examine the techniques of information representation used in the creation and management of an information object in studies by Acker and Donovan (2019), Thomer and Wickett (2020) and Poirier (2021). How information is represented shapes how an information object is interpreted, how it can be used, and how and whether it can be preserved for long-term reuse. Critical readings of information objects ask questions about data models, file structures and metadata, and the characters that appear in a file. The Basic Representation Model serves as an organizing framework for these questions by giving names for the entities and relationships involved at any level of representation or encoding, and by drawing attention to the interconnections between information representation choices at the various levels. This section gives examples of the kinds of observations and analysis that are the results of critical data modeling with the Basic Representation Model, presented according to the levels of representation involved.

Organizing critical data modeling around the levels of representation that are present in a digital object is complementary to many existing approaches to sociotechnical critiques of information systems. For instance, the detailed accounting of representation and encoding decisions supports Lindsay Poirier's reading strategies for datasets (Poirier, 2021). Poirier's connotative reading strategy focuses on “tracing the socio-political provenance of data semantics” and critical data modeling embeds a rich exposition of those data semantics into the analytical work of reading a dataset. Critical data modeling is similarly complementary to Jill Walker Rettberg's methods for situated data analysis (Rettberg, 2020), which operate by identifying levels of data aggregation and constructing critical readings of data visualizations and interactions available to users of social media platforms and applications. Similarly, identifying levels of representation in information systems can provide analytical entry points for leveraging André Brock's critical technocultural discourse analysis, which is attuned to technical affordances and rhetorical actions of users, as well as the impact of the choice of particular protocols for transmitting content (Brock, 2018).

The examples and discussion that follow are drawn from previously published work and from an ongoing critical reading of the “Arrest Data from 2020 to Present” dataset published through the Los Angeles Open Data platform (Los Angeles Police Department, 2022). The critical data modeling process for the Arrest Dataset is based on close readings of the dataset versions and data model documentation provided through the data catalog. The data model documentation consists of metadata for the entire dataset and a table that describes each of the 25 columns of data provided in the dataset. The discussion below leverages the entities and relationships from the Basic Representation Model that are most relevant for analyses of digital content and networked information systems: propositional content, expression, symbol structures, and encoding. While some of the critiques reside entirely at a single level of representation, critical data modeling frequently reveals inconsistencies and tensions across the levels of representation.

4.1 Critiques of propositional content

Critiques of the propositional content of a digital object ask questions that center on content. These are questions like: what is included in (or left out of) a dataset; how the data model of a digital object divides the world into individuals, properties, and categories; and what entities the rows in a dataset represent. The analytical focus is on content and how content is realized through primary symbol structures. Recent literature that critiques information systems in terms of their role and impact in society frequently focuses on the propositional content of databases and systems. For some examples, Obermeyer et al. (2019) critique a data model that uses health care spending as a proxy for risk; Costanza-Chock (2020) critiques the use of gender in data models, and Eubanks (2018) critiques data collection around eligibility for public assistance. These works all leverage data modeling in their critiques by examining how those data models pick out certain entities and relationships in order to represent some aspect of the world.

The propositional content of the Arrest Dataset is a set of assertions about arrests carried out by the LAPD between January 1, 2020 and the latest update to the dataset (August 31, 2022 as of this writing). The framing of the data model documentation in terms of columns and rows indicates that the tabular representation of the data is the primary symbol structure for this propositional content. Therefore the propositional content of the Arrest Dataset is constrained by the data model, which separates the content into 25 columns representing the 25 attributes of an arrest event that are made public through this dataset.

A critical data modeling approach leads us to interrogate the selection of events that are recorded as arrests in the Arrest Dataset, and how that selection influences the use of the dataset to draw inferences about the real world. The dataset metadata states clearly that “Each row represents an arrest.” In other words, arrests are the primary entity that is the object of the description in the dataset, and the Arrest Dataset is labeled with the term “Public Safety” in the data catalog. A common use case for arrest data is to map the incidence of criminal activity in a community (Jefferson, 2020). Taking the general positioning of the dataset as public safety information together with the inclusion of several attributes linking an arrest to the location of a crime (e.g., the attribute Address is described as “Street address of crime incident rounded to the nearest hundred block to maintain anonymity” and Location is described as “Location where the crime incident occurred.”) uncovers an assumed correspondence between a row in the dataset and a criminal incident.

However, arrests do not necessarily correspond with criminal activity, and certainly not with a one-to-one correspondence that could support accurate counting. First, many individuals might be arrested in connection to a single incidence of criminal activity. Moreover, individuals may have been arrested by police without a real-world connection to any criminal activity, although it is difficult to detect direct evidence of such cases via the Arrest Dataset, since only arrests are given the status of a first-class entity via unique identification. Examination of values for the Charge Description attribute does give direct evidence that not all arrest events occurred on the basis of the criminal activity of the arrested individual described by a row in the dataset. Filtering the dataset for the value “PARENT IN CUSTODY, NO CARETAKER AVAILABLE” gives rows with Age values between 0 and 17. These rows represent children taken into custody by the LAPD and referred to social services when one or both of their parents were arrested. While the rows for children in the Arrest Dataset tend to be sparse (e.g., 66 of the 98 rows with Age of 0 have no Charge Description listed), these rows still tend to have location information (2 of the 98 rows with Age of 0 have location given as (0,0), which indicates missing data; the other 96 rows list a latitude and longitude for location) and date information. It is important to note that Location is intended to represent the location of a criminal incident, not the location of an arrest. Examination of rows and values shows that if a parent is arrested and taken into custody in connection with a criminal incident, the location of that criminal incident appears in the dataset in connection with the event of their child being taken into custody. This raises important questions around the ways the Arrest Dataset might be used to make policy arguments around public safety.
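These row-level observations can be reproduced with a short pandas sketch. The file name below is hypothetical, and the column labels are taken from the dataset documentation quoted above, so the exact identifiers in a given export may differ.

```python
import pandas as pd

# Hypothetical file name for a CSV export of the Arrest Dataset.
df = pd.read_csv("Arrest_Data_from_2020_to_Present.csv")

# Rows recording children taken into custody rather than arrested suspects.
kids = df[df["Charge Description"] == "PARENT IN CUSTODY, NO CARETAKER AVAILABLE"]
print(kids["Age"].between(0, 17).all())            # Age values fall between 0 and 17

# Sparseness of the rows with Age of 0: missing Charge Description values.
infants = df[df["Age"] == 0]
print(infants["Charge Description"].isna().sum())  # 66 of 98 at the time of this analysis
```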

The dataset summary for the Arrest Dataset states that “data is transcribed from original arrest reports that are typed on paper and therefore there may be some inaccuracies within the data.” It is clear from this statement and a common-sense understanding of police operations that the Arrest Dataset includes a limited subset of the information that appears in the original arrest reports. The personal information on arrested individuals is limited to Age, Sex and Descent Code and does not include any other identifying information. There is no information made available in the Arrest Dataset on the police officer or officers who carried out an arrest. In contrast, information on places is expansively covered in the Arrest Dataset, with 10 of the 25 available attributes offering information about geographic location, at varying levels of granularity (Area ID, Area Name, Reporting District, Address, Cross Street, LAT, LON, Location, Booking Location, Booking Location Code).

4.2 Critiques of the expression of content

Critical data modeling that focuses on the expression of content is centered on the primary symbol structure that realizes some propositional content. These critiques ask questions about aspects of expression such as the formats made available for datasets, the differences between a dataset expressed in a tabular format as opposed to a document format, and the use and construction of controlled vocabularies for data values. Critiques of the choice of terms and the hierarchical structures in library classifications operate at this level of representation. For examples, see M. Adler et al. (2017), Berman (1993), Baron et al. (2016). The methods of metadata repair to identify and describe works by or about African American women that were insufficiently described in large full-text corpora described by Brown et al. (2019) operate at the level of expression of content. The examination of how misinformation campaigns “leverage metadata to manipulate discourse in platforms” from Acker and Donovan (2019) also operates at the level of expression, as does the investigation by Thomer and Wickett (2020) of how the available formats of a dataset impact how that dataset can be processed and curated.

The selection and structuring of information to be included in a publicly available dataset speaks to the intended role of such a dataset. The emphasis on geography can be seen for the Arrest Dataset by examining the formats made available for export. The data catalog listing includes an “Export” button, which gives three main formats for downloading the Arrest Dataset: CSV, KML, and Shapefile. KML is “an XML language focused on geographic visualization, including annotation of maps and images” originally developed by Google and currently maintained as a standard by the Open Geospatial Consortium (Open Geospatial Consortium, 2022). Shapefile is a spatial data format for geographic information systems (GIS) developed by the Environmental Systems Research Institute and intended for use in GIS software such as ArcGIS (Environmental Systems Research Institute, 1998). While CSV is a general data format for tabular data, KML and Shapefile were specifically designed to handle geographic information and to create data visualizations in the form of maps. The foregrounding of GIS data formats in the presentation of this dataset, along with the emphasis on place and location in the dataset attributes, positions this dataset as geographic information. The emphasis on geographic data formats for the Arrest Dataset aligns with Brian Jefferson's argument that the information systems we use to represent information about crime and criminal justice transform that information into geographic information, and that this transformation has a significant impact on how we understand places and people in our communities (Jefferson, 2020).

4.3 Critiques of symbol structures and encodings

Symbol structures are encoded in computing systems by other symbol structures. Critical data modeling draws attention to the encoding of data values for detailed investigation of the representational choices that went into the creation of a digital object. For example, in the class roster spreadsheet, the “Student Name” field uses Latin characters to encode name information. This data modeling decision will have the consequence that many students will never see the university express their name using the characters it was originally written in. These decisions can have longer term consequences when formerly active records become historical evidence. For example, Han and Han (2021) argue that the varying use of Romanized Chinese names in historical archival materials in the United States has had a negative impact on the discoverability of historical individuals. Arguments around the development of Unicode as a standard to allow scripts from around the globe to be shared widely on the internet include analyses of encodings (John, 2013). Understanding and accounting for the impact of errors that occur in optical character recognition processes, as performed by Traub et al. (2015), similarly falls at the level of encodings. Data transmission protocols are a form of encoding, and therefore the explication of the impact of the Short Message Service (SMS) protocol in the growth of Black Twitter from Brock (2012) operates at this level of representation.

Examining data values and encoding in the Arrest Dataset, we can see further emphasis on geographic information through the selection of the geo-referenced “Point” datatype for the Location attribute. The dataset documentation notes that Point “represents a location on the Earth as a WGS84 Latitude and Longitude” and that “The location is encoded as a GeoJSON ‘point’” (Socrata, n.d.-a). WGS84 is a data standard that assigns coordinates to places on the globe and serves as the basis for global positioning and mapping systems (National Geospatial-Intelligence Agency, 2014), and GeoJSON is a data format that is designed for encoding and exchanging geographic information (Butler et al., 2016). Data standards provide a structure that supports the computational processing of data to create visualizations and analyses. They also inform the interpretation of data and shape meaning by making some processes easier and some more difficult. The use of these standards for encoding data in the Arrest Dataset reinforces the assumption that information about arrests should be understood and processed as geographic information.
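As a concrete illustration, a GeoJSON “Point” encoding of a location value looks like the following; the coordinates here are invented for a point in Los Angeles and are not drawn from the dataset.

```python
import json

# A GeoJSON "Point": WGS84 coordinates in (longitude, latitude) order.
location = {"type": "Point", "coordinates": [-118.2437, 34.0522]}
print(json.dumps(location))
```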

The Point datatype for Location is structured as a pair of coordinates listed as “(longitude, latitude).” The dataset summary states, “Some location fields with missing data are noted as (0.0000°, 0.0000°).” Despite its use here as a missing data indicator, (0.0000°, 0.0000°) is a real geo-referenced point on the globe, in the Atlantic Ocean off the western coast of Africa. This point appears in 4640 rows in the dataset. Clearly these arrests did not take place there. Examination of the dataset and reading the dataset documentation show that every row in the dataset has a value for Location that matches the Point datatype, even in rows that were missing location data in the original arrest record. This is a case where maintaining conformance with the datatype has taken precedence over accuracy of the data.
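Continuing the earlier pandas sketch (same assumed column labels), the sentinel rows can be counted directly through the dataset's separate LAT and LON columns:

```python
# Rows where the (0, 0) sentinel stands in for missing location data.
sentinel = df[(df["LAT"] == 0.0) & (df["LON"] == 0.0)]
print(len(sentinel))   # 4640 rows at the time of this analysis
```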

A similar dominance of datatype conformance over accuracy can be seen with attributes in the dataset that use the Floating Timestamp datatype, namely Arrest Date and Booking Date. Although the local dataset metadata for both of these attributes describes a “MM/DD/YYYY” format, the datatype metadata links to the API platform documentation, which states “Floating timestamps represent an instant in time with millisecond precision, with no timezone value, encoded as ISO8601 Times with no timezone offset” (Socrata, n.d.-b). In other words, in order to conform with the datatype, values for this attribute must include a time, not just a date. Examination of the dataset shows that every row of the dataset has values that match the datatype for Arrest Date and Booking Date and include a time for these attributes. However, every time value listed for both of these attributes is “12:00:00 AM.” Common sense tells us that those time values, repeated exactly for every row, are the likely result of overgeneralized automatic processing and are unlikely to be accurate reflections of the time an arrest or a booking occurred. In fact, each of these date attributes has a corresponding time attribute (Time corresponds with Arrest Date, and Booking Time corresponds with Booking Date) that gives times with a distribution of values suggesting that they are accurate transcriptions from the original arrest records. Therefore, the times given in the Time and Booking Time attributes can be taken as accurate time information. The “12:00:00 AM” times given in Arrest Date and Booking Date conflict with that time information, indicating that this is another case of conformance with a datatype overriding a commitment to accuracy within the dataset.
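The midnight artifact can be checked in the same sketch, again assuming the column labels from the dataset documentation:

```python
# Parse Arrest Date and ask whether every value sits exactly at midnight,
# i.e., whether the time component carries any real information.
dates = pd.to_datetime(df["Arrest Date"])
print(dates.eq(dates.dt.normalize()).all())   # True per the analysis above
```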

Location, Arrest Date and Booking Date all have data values that are implausible when taken in the context of the Arrest Dataset as whole. The Basic Representation Model provides a way to structure critical data modeling of a dataset that lets us analyze these errors in terms of the interaction between propositional content and encodings. In these cases the rigid conformance to the data encodings has led to the introduction of new, inaccurate, propositional content carried by the dataset.

5 CONCLUSION

Information science is called as a field to examine the ways in which social injustices are built into our information systems through data models. Critical data modeling is a method for analyzing data models and existing data objects in order to examine the sociotechnical commitments, underlying assumptions, and social and cultural consequences of information systems. As the analyses in the previous section demonstrate, the Basic Representation Model provides a logical framework for a critical interrogation of an information object, and is one of many possible tools for conducting this kind of examination.

The goal of critical data modeling is to generate critiques that forge information systems analyses together with social and cultural analyses. Critical data modeling is intended to operate in collaboration with critiques of information systems from race and ethnic studies, community informatics, and women, gender, and sexuality studies. Future work will continue to use critical data modeling and related techniques to examine educational systems, medical information systems, criminal justice and policing databases, and carceral information systems. In particular, I plan to expand the analysis of arrest records described in the previous section, bringing in consideration of broader factors and agents in the development of civic open data systems, and extending the analysis to the level of inscriptions of information.

Every information system uses data models, and no data model is inherent in the world it represents. Information systems and data models are constructed and used by humans who will necessarily embed their perspectives of the world into these systems. Therefore information systems will naturally replicate, enact, and reproduce structural injustices in our society. By grounding the analysis of an information system in the technical realities of computational systems and the social realities of our communities, critical data modeling adds an essential complement to existing approaches to critical information studies.