Text and Image Mining: From Mass Observation to Stage Magic
TLDR
The speaker discusses their experiences with Mathematica in text and image mining for humanities research. They highlight projects like the Old Bailey Online, whose trial records they use to visualize criminal trial lengths over time, and the Mass Observation project, which documented mid-20th century British life. They also explore stage magic history and develop tools for understanding circuit diagrams. The talk covers various text and image mining techniques, emphasizing the importance of computational tools in uncovering historical patterns and insights.
Takeaways
- 📚 The speaker discusses the application of Mathematica in text and image mining for humanities research, highlighting the differences in methodology between humanities and STEM fields.
- 👨‍⚖️ The Old Bailey Online project is mentioned as a significant example, showcasing a searchable database of criminal trials from 1674 to 1913, with a unique visualization of trial length over time.
- 📊 A previously unnoticed bifurcation of trial lengths appears in the 1800s, likely due to the rise of plea bargaining and guilty pleas.
- 📝 The Mass Observation project is introduced, detailing the collection of diaries and questionnaires to understand the lives of ordinary Britons in the mid-20th century.
- 🔍 The use of Distributional Concept Analysis (DCA) in text analysis is explained, focusing on the context surrounding a word of interest rather than just the word itself.
- 🎩 A collaborative project on stage magic involves text and image mining, including the extraction and identification of images from magic-related periodicals.
- 🔗 The development of tools to understand and label circuit diagrams from historical texts is discussed, aiming to identify meaningful design patterns.
- 🌉 Another project focuses on creating a database of historical bridge images with metadata, combining Machine Vision with linked open data for civil engineering analysis.
- 🌐 The speaker mentions using Mathematica to compile archives from the open web, particularly texts related to the history of electronics and computation.
- 🔑 Techniques like RAKE and TF-IDF are used for keyword extraction and text relevance assessment, aiding in research question exploration and discovery within large text collections.
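The TF-IDF weighting named in the last takeaway can be illustrated with a minimal sketch. The talk itself used Mathematica; this Python version, the function names `tfidf_vectors` and `cosine`, the `tf * log(N / df)` weighting variant, and the three sample strings are all assumptions made for illustration, not the speaker's code or data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

# Made-up sample documents, not taken from the speaker's archive.
docs = [
    "vacuum tube amplifier circuit",
    "transistor amplifier circuit design",
    "stage magic apparatus catalogue",
]
vecs = tfidf_vectors(docs)
sim_electronics = cosine(vecs[0], vecs[1])  # the two electronics texts share terms
sim_unrelated = cosine(vecs[0], vecs[2])    # no shared terms
```

Texts that share distinctive terms score higher than texts that share none, which is the property that makes the measure useful for ranking documents against a query.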
Q & A
What is the main focus of research in the humanities according to the transcript?
-The main focus of research in the humanities is on close reading and interpretation of sources, which include texts, images, artifacts, and media.
What is the significance of the Old Bailey Online project mentioned in the transcript?
-The Old Bailey Online project is significant as it is a fully searchable database of all criminal trials held at London's Central Criminal Court between 1674 and 1913, providing valuable historical data on criminal trials during that period.
How does the graph of criminal trial lengths in the Old Bailey Online project reveal changes over time?
-The graph shows a bifurcation in trial lengths starting in the 1800s, with some trials becoming shorter and others longer, reflecting factors such as the rise of plea bargaining and the guilty plea.
What is the Mass Observation project discussed in the transcript?
-The Mass Observation project was an initiative launched in 1937 that recruited volunteers to write diaries and answer questionnaires, and paid investigators to record public conversations and behaviors, aiming to understand the lives of ordinary Britons.
How does distributional concept analysis (DCA) differ from other text analysis methods?
-DCA differs from other text analysis methods by relying on small n-gram windows positioned some distance before and after the word of interest, rather than n-gram windows centered on the word of interest.
What was the goal of the stage magic research project involving text and image mining?
-The goal of the stage magic research project was to develop an experimental approach to history by applying techniques like desktop fabrication, physical computing, and text and image mining to analyze historical materials related to stage magic.
What is the purpose of the circuit diagram analysis project mentioned in the transcript?
-The purpose of the circuit diagram analysis project is to develop tools that can understand and automatically label components in circuit diagrams, with the ultimate aim of identifying meaningful assemblies of components or design idioms in schematics.
How does the speaker use Mathematica to compile an archive of texts from the open web?
-The speaker uses Mathematica to write crawlers that compile an archive of millions of pages of text from the open web, focusing on texts related to the histories of electronics, computation, and scientific instrumentation.
What is the objective of the historical bridge images project?
-The objective of the historical bridge images project is to create a large database of bridge images with metadata, combine these records with linked open data, and develop a machine vision system to extract features of interest in the history of civil engineering.
How does the RAKE (Rapid Automatic Keyword Extraction) method assist in research as described in the transcript?
-The RAKE method assists in research by extracting phrasal keywords, particularly in shorter texts, and can be combined with other methods like TF-IDF to assess the relevance of a text to a query and measure the similarity between texts.
What insights can be gained from the random indexing method discussed in the transcript?
-The random indexing method can provide insights into the semantic characteristics of text by creating vectors of co-occurrence events and can be used to find words that pattern together, measure semantic density, study semantic evolution over time, and monitor ongoing discourse for anomalies.
Outlines
📜 Text and Image Mining in Humanities Research
The speaker discusses their experiences using Mathematica for text and image mining in humanities research, highlighting the differences between humanities and science/engineering disciplines. They emphasize the importance of close reading and interpretation of sources like texts, images, and artifacts. The speaker shares their involvement in the Old Bailey Online project, which digitized criminal trials from 1674 to 1913. They present a graph showing the length of trials over time, revealing a bifurcation in trial length during the 1800s, which was previously unnoticed and is attributed to the rise of plea bargaining. Another project mentioned is a collaboration with an expert on mid-20th century British cultural history, focusing on the Mass Observation Project, which collected diaries and observations of daily life to understand ordinary Britons. They employ distributional concept analysis (DCA) to study the data. The speaker also talks about a project on stage magic, using text and image mining to develop an experimental approach to history, including the extraction of images and identification of magic-related items.
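DCA is characterized in this summary as relying on small n-gram windows positioned some distance before and after the word of interest. The following is a minimal Python sketch of that windowing idea only. The function name `offset_windows`, the parameters `gap` and `size`, and the sample sentence are hypothetical; the summary does not give the actual window sizes or offsets used in the project.

```python
def offset_windows(tokens, target, gap=2, size=2):
    """Collect n-gram windows lying `gap` tokens before and after each hit of `target`.

    `gap` and `size` are illustrative parameters, not values from the talk.
    """
    contexts = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Window that ends `gap` tokens before the target word.
        before_end = i - gap
        before = tuple(tokens[max(0, before_end - size):max(0, before_end)])
        # Window that starts `gap` tokens after the target word.
        after_start = i + 1 + gap
        after = tuple(tokens[after_start:after_start + size])
        contexts.append((before, after))
    return contexts

# Made-up sample sentence, not from the Mass Observation archive.
tokens = "we sat in the shelter all night and the shelter was cold and damp".split()
contexts = offset_windows(tokens, "shelter", gap=1, size=2)
```

Each result pairs the window before a hit with the window after it, skipping the words immediately adjacent to the target, so the analysis captures the wider setting rather than fixed collocations.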
🔍 Advanced Text and Image Mining Techniques
The speaker elaborates on various text and image mining projects they have been involved in, starting with the creation of a searchable archive of millions of pages related to the history of electronics, computation, and scientific instrumentation. They discuss developing tools to understand circuit diagrams, including automated labeling of components and identifying meaningful assemblies of components in schematics. Another project involves building a database of historical bridge images with metadata, aiming to extract features of interest using machine vision and image processing techniques. The speaker also mentions using Mathematica to crawl related records from the WorldCat Identities API and to categorize and resolve entities from crawled texts to linked open data sources. They discuss the use of RAKE for rapid automatic keyword extraction and TF-IDF for assessing text relevance and similarity. Additionally, they introduce a clustering algorithm based on Kolmogorov complexity for discovering relationships between historical figures, and the use of random indexing for mining terminology and understanding semantic characteristics of text.
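Kolmogorov complexity cannot be computed directly, so clustering methods built on it usually substitute the output size of a real compressor. One widely used formulation is the normalized compression distance (NCD). The sketch below assumes NCD with `zlib` as the compressor; the summary does not say which distance or compressor the speaker used, and the sample strings are invented for illustration.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: near 0 for similar texts, near 1 for unrelated ones."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Invented sample texts, not real archive records.
trial_a = ("The prisoner was indicted for stealing a silver watch from the shop of "
           "John Smith. The jury found the prisoner guilty and he was sentenced "
           "to transportation.")
trial_b = ("The prisoner was indicted for stealing a gold ring from the shop of "
           "Mary Jones. The jury found the prisoner guilty and she was sentenced "
           "to transportation.")
bridge = ("Bridge engineers measured every iron truss along both river spans, "
          "noting rivets, cables, piers, and masonry arches in a careful survey.")

d_similar = ncd(trial_a, trial_b)
d_different = ncd(trial_a, bridge)
```

A matrix of such pairwise distances can then be handed to any standard hierarchical clustering routine to group related documents or people.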
🔎 Semantic Analysis and Text Clustering
The speaker delves into the use of random indexing, a method developed by Magnus Sahlgren and colleagues, for semantic analysis of text. This method creates vectors of co-occurrence events within a context window around a specific word, allowing for efficient analysis even in large text collections. The speaker provides an example of how random indexing can reveal authorial preferences in phrasing, contrasting how Lewis Carroll uses 'said' and 'replied' with character names. They also explain how random indexing can be used to identify words that pattern together and to measure semantic density and evolution over time. The speaker emphasizes the utility of this method for ongoing discourse analysis and anomaly detection.
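The core of random indexing can be sketched in a few lines of Python. Each word receives a sparse random index vector, and a word's context vector is the sum of the index vectors of its neighbours inside the window. The dimensionality, the number of non-zero entries, the window size, the seed, and the toy sentence below are all assumptions for illustration; the summary does not report the speaker's settings.

```python
import random
from collections import defaultdict

def index_vector(dim, nonzero, rng):
    """Sparse ternary index vector: a few random +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec

def random_indexing(tokens, dim=64, nonzero=4, window=2, seed=0):
    """Build one context vector per word by summing neighbours' index vectors."""
    rng = random.Random(seed)
    index = {w: index_vector(dim, nonzero, rng) for w in sorted(set(tokens))}
    context = defaultdict(lambda: [0] * dim)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # Record the co-occurrence by adding the neighbour's index vector.
                context[word] = [a + b for a, b in zip(context[word], index[tokens[j]])]
    return dict(context)

# Toy input: "cat" and "dog" occur in identical one-word-either-side contexts.
tokens = ["the", "cat", "sat", "the", "dog", "sat"]
vectors = random_indexing(tokens, window=1)
```

Words that occur in the same contexts end up with the same or similar context vectors, so a similarity measure such as cosine over these vectors surfaces words that pattern together. The vectors keep a fixed dimensionality however large the vocabulary grows.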
Keywords
💡Text and Image Mining
💡Humanities
💡Close Reading
💡Old Bailey Online
💡Mass Observation Project
💡Distributional Concept Analysis (DCA)
💡Stage Magic
💡Machine Vision
💡Linked Open Data
💡Rapid Automatic Keyword Extraction (RAKE)
Highlights
Experiences using Mathematica for text and image mining in humanities research.
Humanities research focuses on close reading and interpretation of sources like texts, images, artifacts, and media.
The Old Bailey Online project: a searchable database of criminal trials from 1674 to 1913.
Visualization of criminal trial length over time reveals a bifurcation in the 1800s.
The rise of plea bargaining and guilty pleas in the 19th century affected trial lengths.
Collaboration with Amy Bell on mid-20th century British cultural history.
Mass Observation project: analyzing diaries and questionnaires for understanding ordinary Britons' lives.
Distributional Concept Analysis (DCA) method for text analysis.
Stage magic research combining desktop fabrication, physical computing, and text/image mining.
Extraction of images and identification of magic apparatus from historical texts.
Creating wobble images from early 20th-century seance photography.
Crawling the open web to compile an archive of texts related to electronics and computation history.
Developing tools to understand and label circuit diagrams.
Semantic search for the history of electronics aiming to identify design idioms.
Creating a database of historical bridge images with metadata for civil engineering research.
Using Machine Vision to extract features of interest in historical bridge images.
Crawling WorldCat Identities API for related records and metadata.
Using RAKE for rapid automatic keyword extraction in texts.
Combining keyword extraction with entity recognition for text discovery.
Using compression clustering algorithms to discover relationships in historical data.
Random Indexing method for mining terminology and semantic characteristics of text.