Text and Image Mining: From Mass Observation to Stage Magic

Wolfram R&D
6 Sept 2024 12:48

TLDR: The speaker describes using Mathematica for text and image mining in humanities research. Projects include the Old Bailey Online, where a visualization of criminal trial lengths revealed an unexpected pattern; the Mass Observation project on mid-20th-century British life; a history of stage magic; and tools for understanding circuit diagrams. The talk surveys a range of text and image mining techniques and shows how computational tools can uncover historical patterns and insights.

Takeaways

  • 📚 The speaker discusses the application of Mathematica in text and image mining for humanities research, highlighting the differences in methodology between humanities and STEM fields.
  • 👨‍⚖️ The Old Bailey Online project is mentioned as a significant example, showcasing a searchable database of criminal trials from 1674 to 1913, with a unique visualization of trial length over time.
  • 📊 The visualization reveals a previously unnoticed bifurcation of trial lengths in the 1800s, likely due to the rise of plea bargaining and guilty pleas.
  • 📝 The Mass Observation project is introduced, detailing the collection of diaries and questionnaires to understand the lives of ordinary Britons in the mid-20th century.
  • 🔍 The use of Distributional Concept Analysis (DCA) in text analysis is explained, focusing on the context surrounding a word of interest rather than just the word itself.
  • 🎩 A collaborative project on stage magic involves text and image mining, including the extraction and identification of images from magic-related periodicals.
  • 🔗 The development of tools to understand and label circuit diagrams from historical texts is discussed, aiming to identify meaningful design patterns.
  • 🌉 Another project focuses on creating a database of historical bridge images with metadata, combining Machine Vision with linked open data for civil engineering analysis.
  • 🌐 The speaker mentions using Mathematica to compile archives from the open web, particularly texts related to the history of electronics and computation.
  • 🔑 Techniques like RAKE and TF-IDF are used for keyword extraction and text relevance assessment, aiding in research question exploration and discovery within large text collections.

Q & A

  • What is the main focus of research in the humanities according to the transcript?

    -The main focus of research in the humanities is on close reading and interpretation of sources, which include texts, images, artifacts, and media.

  • What is the significance of the Old Bailey Online project mentioned in the transcript?

    -The Old Bailey Online project is significant as it is a fully searchable database of all criminal trials held at London's Central Criminal Court between 1674 and 1913, providing valuable historical data on criminal trials during that period.

  • How does the graph of criminal trial lengths in the Old Bailey Online project reveal changes over time?

    -The graph shows a bifurcation in trial lengths starting in the 1800s, with some trials becoming shorter and others longer, reflecting factors such as the rise of plea bargaining and the guilty plea.

  • What is the Mass Observation project discussed in the transcript?

    -The Mass Observation project was an initiative launched in 1937 that recruited volunteers to write diaries and answer questionnaires, and paid investigators to record public conversations and behaviors, with the aim of understanding the lives of ordinary Britons.

  • How does distributional concept analysis (DCA) differ from other text analysis methods?

    -DCA differs from other text analysis methods by relying on small n-gram windows positioned some distance before and after the word of interest, rather than n-gram windows centered on the word of interest.

  • What was the goal of the stage magic research project involving text and image mining?

    -The goal of the stage magic research project was to develop an experimental approach to history by applying techniques like desktop fabrication, physical computing, and text and image mining to analyze historical materials related to stage magic.

  • What is the purpose of the circuit diagram analysis project mentioned in the transcript?

    -The purpose of the circuit diagram analysis project is to develop tools that can understand and automatically label components in circuit diagrams, with the ultimate aim of identifying meaningful assemblies of components or design idioms in schematics.

  • How does the speaker use Mathematica to compile an archive of texts from the open web?

    -The speaker uses Mathematica to write crawlers that compile an archive of millions of pages of text from the open web, focusing on texts related to the histories of electronics, computation, and scientific instrumentation.

  • What is the objective of the historical bridge images project?

    -The objective of the historical bridge images project is to create a large database of bridge images with metadata, combine these records with linked open data, and develop a machine vision system to extract features of interest in the history of civil engineering.

  • How does the RAKE (Rapid Automatic Keyword Extraction) method assist in research as described in the transcript?

    -The RAKE method assists in research by extracting phrasal keywords, particularly in shorter texts, and can be combined with other methods like TF-IDF to assess the relevance of a text to a query and measure the similarity between texts.

  • What insights can be gained from the random indexing method discussed in the transcript?

    -The random indexing method can provide insights into the semantic characteristics of text by creating vectors of co-occurrence events and can be used to find words that pattern together, measure semantic density, study semantic evolution over time, and monitor ongoing discourse for anomalies.
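
The TF-IDF relevance scoring mentioned in the RAKE answer above can be illustrated with a minimal Python sketch. It is not the speaker's Mathematica code: the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the function names `tfidf_vectors` and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (illustrative weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "old bailey criminal trials",
    "criminal trials and guilty pleas",
    "stage magic apparatus",
]
vectors = tfidf_vectors(docs)
similarity_trials = cosine(vectors[0], vectors[1])  # share "criminal" and "trials"
similarity_magic = cosine(vectors[0], vectors[2])   # share no terms
```

The same cosine score can rank documents against a query by treating the query as one more short document.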

Outlines

00:00

📜 Text and Image Mining in Humanities Research

The speaker discusses their experiences using Mathematica for text and image mining in humanities research, highlighting the differences between humanities and science/engineering disciplines. They emphasize the importance of close reading and interpretation of sources like texts, images, and artifacts. The speaker shares their involvement in the Old Bailey Online project, which digitized criminal trials from 1674 to 1913. They present a graph showing the length of trials over time, revealing a bifurcation in trial length during the 1800s, which was previously unnoticed and is attributed to the rise of plea bargaining. Another project mentioned is a collaboration with an expert on mid-20th century British cultural history, focusing on the Mass Observation Project, which collected diaries and observations of daily life to understand ordinary Britons. They employ distributional concept analysis (DCA) to study the data. The speaker also talks about a project on stage magic, using text and image mining to develop an experimental approach to history, including the extraction of images and identification of magic-related items.

05:02

🔍 Advanced Text and Image Mining Techniques

The speaker elaborates on various text and image mining projects they have been involved in, starting with the creation of a searchable archive of millions of pages related to the history of electronics, computation, and scientific instrumentation. They discuss developing tools to understand circuit diagrams, including automated labeling of components and identifying meaningful assemblies of components in schematics. Another project involves building a database of historical bridge images with metadata, aiming to extract features of interest using machine vision and image processing techniques. The speaker also mentions using Mathematica to crawl related records from the WorldCat Identities API and to categorize and resolve entities from crawled texts to linked open data sources. They discuss the use of RAKE for rapid automatic keyword extraction and TF-IDF for assessing text relevance and similarity. Additionally, they introduce a clustering algorithm based on Kolmogorov complexity for discovering relationships between historical figures, and the use of random indexing for mining terminology and understanding semantic characteristics of text.
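
Kolmogorov complexity cannot be computed directly, so compression-based clustering usually substitutes the output size of a real compressor. The Python sketch below shows one common formulation, the normalized compression distance. The summary does not say which compressor or distance the speaker used, so the choice of `zlib` and the function names here are assumptions.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a stand-in for complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower values suggest more shared structure."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative texts only; real inputs would be documents about historical figures.
text_a = "the quick brown fox jumps over the lazy dog " * 20
text_b = "magicians conceal apparatus beneath velvet tablecloths " * 20
distance_same = ncd(text_a, text_a)
distance_other = ncd(text_a, text_b)
```

A matrix of pairwise distances computed this way can be passed to any standard hierarchical clustering routine.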

10:04

🔎 Semantic Analysis and Text Clustering

In this section, the speaker delves into the use of random indexing, a method developed by Magnus Sahlgren and colleagues, for semantic analysis of text. This method creates vectors of co-occurrence events within a context window around a specific word, allowing for efficient analysis even in large text collections. The speaker provides an example of how random indexing can reveal authorial preferences in phrasing, using Lewis Carroll's contrasting use of 'said' and 'replied' with character names. They also explain how random indexing can be used to identify words that pattern together and to measure semantic density and evolution over time. The speaker emphasizes the utility of this method for ongoing discourse analysis and anomaly detection.
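
The idea of building co-occurrence vectors within a context window can be sketched in a few lines of Python. This follows the general random indexing recipe (a sparse random index vector per word, summed into the context vectors of its neighbors) and is not the speaker's implementation. The dimension, the number of nonzero entries, the window size, and all function names are assumptions.

```python
import math
import random
from collections import defaultdict


def index_vector(word, dim=512, nonzero=8):
    """Sparse ternary vector for a word, reproducible because it is seeded by the word."""
    rng = random.Random(word)
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def context_vectors(tokens, window=2, dim=512, nonzero=8):
    """Sum the index vectors of neighbors within +/- window of each occurrence."""
    contexts = defaultdict(lambda: [0] * dim)
    for i, word in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue
            neighbor = index_vector(tokens[j], dim, nonzero)
            target = contexts[word]
            for k, value in enumerate(neighbor):
                if value:
                    target[k] += value
    return dict(contexts)


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tokens = ("the cat sat on the mat the dog sat on the mat "
          "a stone fell into deep water").split()
vectors = context_vectors(tokens)
similar = cosine(vectors["cat"], vectors["dog"])      # shared contexts
dissimilar = cosine(vectors["cat"], vectors["stone"])  # unrelated contexts
```

Because the vectors have a fixed dimension and are updated incrementally, new text can be added without rebuilding a full co-occurrence matrix, which is what makes the approach practical for large collections.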

Keywords

💡Text and Image Mining

Text and image mining refers to the process of extracting useful information from textual and visual data using computational methods. In the context of the video, this involves using tools like Mathematica for research in the humanities, which traditionally focuses on close reading and interpretation of sources. The speaker discusses how they have applied these methods to historical documents and images to uncover patterns and insights that were not previously evident.

💡Humanities

The humanities are academic disciplines that study human culture through methods in critical theory, historical analysis, and the interpretative analysis of texts. In the video, the speaker contrasts the humanities with science and engineering, highlighting the humanities' focus on close reading and interpretation of sources like texts, images, and artifacts. The application of computational tools in the humanities is a central theme of the speaker's work.

💡Close Reading

Close reading is a method of textual analysis that involves a careful, detailed examination of a text, often focusing on elements like word choice, syntax, and structure. The video discusses how this method is a cornerstone of humanities research and how computational tools can augment this process by analyzing large volumes of text and identifying patterns that may not be apparent through manual reading.

💡Old Bailey Online

The Old Bailey Online is a digital archive mentioned in the video that contains fully searchable records of criminal trials held at London's Central Criminal Court between 1674 and 1913. The speaker uses this archive to demonstrate how text and image mining can reveal historical patterns, such as the bifurcation in trial length during the 19th century, which reflects changes in legal practices like the rise of plea bargaining.

💡Mass Observation Project

The Mass Observation Project, initiated in 1937, is a significant source discussed in the video. It involved recruiting volunteers to write diaries and answer questionnaires to better understand the lives of ordinary Britons. The speaker is working on a research project using this archive, applying text analysis methods like distributional concept analysis to uncover insights into mid-20th century British culture.

💡Distributional Concept Analysis (DCA)

Distributional Concept Analysis is a text analysis method developed by Peter de Bolla and his colleagues. Unlike methods that use n-gram windows centered on a word of interest, DCA uses small windows positioned some distance before and after that word. The speaker is applying DCA to the Mass Observation Project to analyze the cultural history of mid-20th century Britain, demonstrating how this method can provide nuanced understandings of text data.
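
Only the windowing step described above is sketched here, in Python: for each occurrence of a target word, count the tokens that fall in a small window some distance before and after it, skipping the words immediately adjacent. This is not de Bolla's full method or the speaker's code. The function name `offset_window_counts` and the `distance` and `width` parameters are hypothetical.

```python
from collections import Counter


def offset_window_counts(tokens, target, distance=5, width=2):
    """Count tokens in windows of `width` tokens placed `distance` away from target."""
    counts = Counter()
    n = len(tokens)
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Window before the target ends `distance` tokens before it.
        before = range(i - distance - width + 1, i - distance + 1)
        # Window after the target starts `distance` tokens after it.
        after = range(i + distance, i + distance + width)
        for j in list(before) + list(after):
            if 0 <= j < n:
                counts[tokens[j]] += 1
    return counts


# Toy token stream w0 .. w10 with the target in the middle.
tokens = [f"w{k}" for k in range(11)]
counts = offset_window_counts(tokens, "w5", distance=3, width=2)
```

With these settings the immediate neighbors `w4` and `w6` are excluded, which is the contrast with a window centered on the target.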

💡Stage Magic

Stage magic is a performing art that involves various techniques to create the illusion of impossible feats. In the video, the speaker discusses a project that involved text and image mining to study the history of stage magic in the late 19th and early 20th centuries. They used techniques like desktop fabrication and physical computing, along with text and image mining, to develop an experimental approach to historical research.

💡Machine Vision

Machine vision is a field of artificial intelligence that enables computers to interpret and understand the visual world. The speaker mentions using machine vision techniques to automatically label and identify components in circuit diagrams, which is part of a larger project to develop a system that can understand meaningful assemblies of components in schematics.

💡Linked Open Data

Linked open data refers to a set of practices for publishing structured data so that it can be interlinked and become more useful through being connected to other related data. In the video, the speaker discusses how they are combining historical bridge images with linked open data using Mathematica's support for SPARQL queries, aiming to create a system that can extract features of interest in the history of civil engineering.

💡Rapid Automatic Keyword Extraction (RAKE)

RAKE is a tool for extracting keywords from text. It is particularly useful for shorter texts and can help in identifying phrasal keywords. The speaker uses RAKE in their research to find keywords that can be linked to entities and used for discovery within large collections of sources, illustrating how it can assist in finding relevant information and making connections within a body of text.
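
A minimal Python sketch of the RAKE idea follows, assuming the commonly described scoring: candidate phrases are the runs of words between stopwords and punctuation, each word scores degree divided by frequency, and a phrase scores the sum of its word scores. The tiny stopword list and the function names are assumptions, and this is not the speaker's Mathematica implementation.

```python
import re
from collections import defaultdict

# Deliberately tiny stopword list for illustration; real use needs a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\[\]\"\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text, stopwords)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keywords = rake("stage magic and circuit diagrams are topics of the talk")
```

Longer phrases tend to score higher under this scheme, which is why the method surfaces phrasal keywords rather than single words.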

Highlights

Experiences using Mathematica for text and image mining in humanities research.

Humanities research focuses on close reading and interpretation of sources like texts, images, artifacts, and media.

The Old Bailey Online project: a searchable database of criminal trials from 1674 to 1913.

Visualization of criminal trial length over time reveals a bifurcation in the 1800s.

The rise of plea bargaining and guilty pleas in the 19th century affected trial lengths.

Collaboration with Amy Bell on mid-20th century British cultural history.

Mass Observation project: analyzing diaries and questionnaires for understanding ordinary Britons' lives.

Distributional Concept Analysis (DCA) method for text analysis.

Stage magic research combining desktop fabrication, physical computing, and text/image mining.

Extraction of images and identification of magic apparatus from historical texts.

Creating wobble images from early 20th-century seance photography.

Crawling the open web to compile an archive of texts related to electronics and computation history.

Developing tools to understand and label circuit diagrams.

Semantic search for the history of electronics aiming to identify design idioms.

Creating a database of historical bridge images with metadata for civil engineering research.

Using Machine Vision to extract features of interest in historical bridge images.

Crawling WorldCat Identities API for related records and metadata.

Using RAKE for rapid automatic keyword extraction in texts.

Combining keyword extraction with entity recognition for text discovery.

Using compression clustering algorithms to discover relationships in historical data.

Random Indexing method for mining terminology and semantic characteristics of text.