After a long, exhausting battle, while sipping coffee the next morning, an intriguing thought popped into my head: "What a journey! I've developed a web app to transform text data into digital insights, and to invite humanity to re-question the notions of AI mind and intelligence. Am I not, in essence, striving to decipher the computer 'mind' through its own language system along the way?"
Language is one of the most powerful tools for expressing human thoughts and emotions, and it plays a crucial role in communication and the development of complex societies and cultures. Yet, the core role of language extends beyond communication. Noam Chomsky, a great mind and leading figure in linguistics, emphasized the profound internal aspect of language. He suggested that the primary function of language is related more to thought than to communication. This concept underscores the important role of internal cognition in forming mental models. Thus, while language alone may not be the sole indicator of consciousness and intelligence, it forms part of a broader array of cognitive and behavioral indicators that collectively reflect the presence of a mind.
Despite significant advancements in Artificial Neural Networks (ANNs), particularly in Large Language Models (LLMs), which excel in tasks such as generating human-like text and answering questions, these models fundamentally lack a true understanding of language in the human sense, especially in terms of deep semantic comprehension. They struggle with the complexities and subtleties of language that human beings navigate effortlessly. So, while LLMs are able to demonstrate sophisticated linguistic knowledge, this does not necessarily imply they possess their own 'mind'. What, then, would be required beyond sufficient language capabilities to build the 'internal models' that we associate with true self-awareness, a true intelligence capable of self-reflection and regulation?
This project aims to leverage machine learning and AI techniques—including Knowledge Graphs, Natural Language Processing (NLP), Distant Reading, Graph Neural Networks, and LLMs—to explore potential solutions for deciphering the concept of the "AI Mind" through domain specificity and computational cognitive dimensions. By analyzing text data with statistical algorithms and NLP techniques, it is hoped that this exploration can deepen our understanding of how AI transforms data into information, and information into insight, through analysis, visualization, and generation, and envision a possible future where AI possesses the higher-order cognitive ability to understand the deep semantic meaning within human language.
By integrating nuanced human insights, this interactive and artistic representation of the "AI mind" seeks to provide a novel and intriguing opportunity for humanity to collectively re-question the notions of 'Mind' and 'Intelligence'.
In today's data-driven era, despite their inherent limitations, data analysis and narrative visualization remain powerful tools for offering us both detailed and macroscopic insights. Powered by mathematical algorithms and computational linguistic techniques, these tools provide us with a more holistic interpretation of information that extends beyond what our highly sophisticated eyes can perceive. The idea of representing information and knowledge in graphical forms traces back to ancient times, to cave paintings and hieroglyphs. These early forms of visual communication laid the groundwork for the modern concept of representing knowledge in graph form, which we see today in artificial intelligence (AI) areas such as semantic networks, knowledge graphs, and even graph neural networks.
Semantic networks, first introduced in the 1960s, are a form of knowledge representation in AI and a foundational concept in computer science, particularly in the areas of knowledge representation and information processing. They are primarily used to visualize complex sets of relationships between entities and facilitate the understanding and analysis of specific domains. Semantic networks aid in reasoning about the connections and interactions between concepts through graph traversal techniques, as the sketch below illustrates.
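To make this concrete, here is a minimal sketch of a semantic network and a graph traversal over it, assuming the `networkx` library; the entities and relations are illustrative examples, not data from this project.

```python
# A minimal semantic network sketch; entities and relations are illustrative.
import networkx as nx

G = nx.DiGraph()
G.add_edge("Elizabeth", "Darcy", relation="acquainted_with")
G.add_edge("Darcy", "Pemberley", relation="owns")
G.add_edge("Elizabeth", "Bennet family", relation="member_of")

# Breadth-first traversal from one entity shows which concepts are
# reachable and through which relations -- the basis of graph reasoning.
for source, target in nx.bfs_edges(G, "Elizabeth"):
    print(source, "--", G.edges[source, target]["relation"], "->", target)
```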
Complementing semantic networks and other data-structuring techniques, the term "distant reading" was coined by Franco Moretti in 2002. Positioned within the context of literary studies, Moretti proposed this method as a way to analyze large volumes of literary texts using computational methods without focusing on individual texts in detail. Rather than closely analyzing specific texts, distant reading employs computational tools and techniques to identify patterns, trends, and structures across a broad corpus of literature. Through statistical analysis, knowledge graph visualization, and machine learning models, researchers can uncover insights into literary history, cultural trends, and thematic developments that are often obscured by traditional close reading methods.
Even though our current LLMs are not built for domain-specific tasks and therefore do not focus on deep semantic understanding, both semantic networks and distant reading have enhanced the structure and functionality of modern AI systems used by large-scale search engines, recommendation systems, and in semantic search. These methodologies leverage the principles of older models to link data, enabling more sophisticated information retrieval, natural language processing, and AI applications. They are indispensable in today’s machine learning and AI landscape for tasks that require an understanding of complex relationships and interdependencies in both textual and numerical data.
After an extensive review of related papers and technical research, the following techniques and methodologies were explored and employed in the Modeling Lab to conduct thorough experiments and better visualize the "cognitive process of the AI mind".
The datasets used in this application are categorized into three main types:
► Two large volumes of text data derived from classical books.
► Three small real-world entity datasets from Corpora.
► Social media data, both numerical and categorical, sourced from Twitter and Facebook platforms.
► User-generated text data through real-time interaction within the web app.
Data Dictionary
Dataset 1 ⏤ Large volumes of literary text data from the books "Pride and Prejudice" and "Alice's Adventures in Wonderland".
Size: 976 KB in total, containing 156,644 words in "Pride and Prejudice" and 26,432 words in "Alice's Adventures in Wonderland".
Source: Project Gutenberg: Free eBooks
Content: Original text data from each ebook.
App page: Explore - Social Network Visualization - Pride and Prejudice
➢ Three real-world entity datasets from Corpora: celebrities.json; books.json; president_quote.json.
Size: 35 kb in total.
Source: Corpora Github
Content: A collection of small corpuses of interesting data for the creation of bots and similar stuff.
App page: Explore - Generate your own - Digital Story
Dataset 2 ⏤ Social Media Datasets
➣ Twitter Posts
Size: 43MB, includes 416,124 pieces of real user-generated content from English-speaking Twitter users.
Source: Twitter posts
Content: This dataset is primarily used for sentiment analysis model training. Each entry consists of a text segment representing a Twitter message and a corresponding label indicating the predominant emotion conveyed. The emotion labels are classified into six numerically encoded categories: sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5).
App page: Explore - Sentiment Analyzer
➣ Facebook Social Circles
Size: 5.3MB, containing 34,791 real user-generated posts from Facebook.
Source: Facebook data was collected from survey participants using a Facebook app.
Content: Facebook data has been anonymized by replacing the Facebook-internal ids for each user with a new value. Also, while feature vectors from this dataset have been provided, the interpretation of those features has been obscured. For instance, where the original dataset may have contained a feature "political=Democratic Party", the new data would simply contain "political=anonymized feature 1". Thus, using the anonymized data it is possible to determine whether two users have the same political affiliations, but not what their individual political affiliations represent.
App page: Explore - Social Network Visualization - Facebook Social Network
Dataset 3 ⏤ User-generated data through interaction within app.
Size: 0. All data is discarded once the user closes the site, as noted in the application.
Content: The user-generated data will be used for social network graph creation and story generation.
App page: Explore - Generate your own
For demonstration purposes, I'll illustrate the knowledge graph visualization and analysis process using the book "Pride and Prejudice".
In order to perform the mathematical analysis and interactive graph representation, these Python libraries were used to build the pipeline:
The first preprocessing step removes numbers and special characters, then splits the book into chapters.
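A minimal sketch of this preprocessing step, using only Python's standard library; the chapter-heading pattern is an assumption based on Project Gutenberg's formatting, not the project's exact code:

```python
import re

def split_into_chapters(raw_text: str) -> list[str]:
    # Split on chapter headings such as "Chapter 1" or "CHAPTER IV" first,
    # so the numbering survives long enough to act as a delimiter.
    parts = re.split(r"chapter\s+[ivxlc\d]+\.?", raw_text, flags=re.IGNORECASE)
    chapters = []
    for part in parts:
        # Remove numbers and special characters, keeping letters,
        # whitespace, and basic punctuation.
        cleaned = re.sub(r"[^A-Za-z\s.,;:'\"!?-]", " ", part)
        if cleaned.strip():
            chapters.append(cleaned.strip())
    return chapters
```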
Circumplex model of affect (adapted from Posner et al., 2005)
In order to better understand the sentiment conveyed by words, we have to dive deeper into the semantic meaning of each word; consequently, lexicon-based models hold priority in this project. Such a model relies on a predefined dictionary, or lexicon, of words, each associated with a specific semantic label. These models evaluate the sentiment of a text by aggregating the sentiment scores of the words found in the text according to the lexicon. The overall sentiment of the text is determined by the presence and combinations of those words (a minimal sketch follows the lists below).
Strengths:
- Simple to understand and implement.
- Do not require training data, making them useful in scenarios where labeled data is scarce or unavailable.
- Transparent in how they calculate confidence scores, as the process is directly related to the lexicon.
Limitations:
- Heavily dependent on the quality and comprehensiveness of the lexicon.
- They may not effectively capture context, sarcasm, or nuanced expressions of sentiment.
- Lexicon-based models can struggle with domain-specific language unless the lexicon is specifically tailored for that domain.
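As a concrete illustration, here is a minimal lexicon-based scorer; the tiny lexicon is a made-up stand-in, not the one used in this project:

```python
# Illustrative mini-lexicon: word -> sentiment score in [-1, 1].
LEXICON = {"joy": 1.0, "love": 0.9, "happy": 0.8,
           "sad": -0.8, "fear": -0.7, "anger": -0.9}

def lexicon_score(text: str) -> float:
    # Aggregate the scores of every word found in the lexicon;
    # words outside the lexicon contribute nothing.
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(lexicon_score("I am so happy and full of joy"))   # 0.9 (positive)
```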
Deep learning models, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), learn to predict sentiments or other semantic properties directly from the text data. They do this by automatically extracting and learning complex features from the raw text during the training process, often using a large amount of labeled data. These models can capture contextual nuances, sequential dependencies in text, and even implicit sentiment expressions (see the sketch after the lists below).
Strengths:
- Can achieve high accuracy and are capable of understanding complex language patterns, context, and subtle nuances that lexicon-based models may miss.
- They are highly adaptable to different domains and languages, provided that sufficient training data is available.
Limitations:
- Require substantial amounts of labeled data to train effectively.
- More computationally intensive and less interpretable than lexicon-based models, making it challenging to understand how and why they arrive at specific predictions (the so-called "black box" problem).
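For comparison, a minimal sketch of such an RNN classifier in Keras, assuming TensorFlow is installed; the layer sizes and vocabulary are illustrative, not the exact configuration trained for this project:

```python
import tensorflow as tf

VOCAB_SIZE, NUM_CLASSES = 10_000, 6   # six emotion labels in the dataset

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),                # word embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # sequential context
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # emotion probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=5, validation_split=0.1)
```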
Data cleaning and preprocessing
While the Twitter dataset is relatively large compared to the other datasets in this project, analyzing its raw values showed that additional data cleaning and feature engineering were needed, such as the steps below (sketched in code after this list):
- Create an emotion_map dictionary and convert each numeric label into its corresponding sentiment;
- Define a dictionary of chat-word mappings and a replace_chat_words function;
- Remove special characters from the text strings;
- Convert all strings to lowercase;
- Remove non-alphanumeric characters;
- Remove URLs.
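A minimal sketch of these steps; the emotion_map follows the dataset's label scheme, while the chat-word map shown here is a small illustrative subset:

```python
import re

emotion_map = {0: "sadness", 1: "joy", 2: "love",
               3: "anger", 4: "fear", 5: "surprise"}

# Illustrative subset of chat-word mappings.
chat_words = {"u": "you", "idk": "i do not know", "omg": "oh my god"}

def replace_chat_words(text: str) -> str:
    return " ".join(chat_words.get(w, w) for w in text.split())

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                             # lowercase every string
    text = replace_chat_words(text)                 # expand chat abbreviations
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # drop special / non-alphanumeric chars
    return re.sub(r"\s+", " ", text).strip()        # collapse leftover whitespace
```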
Emotion distribution diagram
The initial attempt, applying a lexicon-based approach through a logistic regression model, completed in under one minute. Despite achieving an accuracy of only 0.80, the model's time efficiency is notably impressive. This highlights its rapid processing capability, although it also indicates the need for further refinement to improve accuracy.
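A sketch of how such a fast baseline might be assembled with scikit-learn; the project's exact features and hyperparameters may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF features feed a linear classifier: quick to train, easy to inspect.
baseline = make_pipeline(
    TfidfVectorizer(max_features=20_000),
    LogisticRegression(max_iter=1000),
)
# baseline.fit(train_texts, train_labels)            # finishes in well under a minute
# accuracy = baseline.score(test_texts, test_labels)
```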
The second attempt, leveraging deep learning through an RNN model, took over two hours to complete on a recent MacBook Pro. This approach resulted in a remarkable accuracy of 0.94, indicating very promising model performance. The significant improvement in accuracy over the earlier lexicon-based and simpler machine learning models points to the potential of RNNs in capturing the complexities and nuances of language.
Notably, after multiple prediction tests, the models exhibit strong biases toward certain words. For instance, they evaluate the words 'queen' and 'woman' as joy, while 'man' and 'king' are associated with anger and sadness. This indicates a significant bias present in the selected posts from the Twitter dataset.
Before the advent of advanced AI technologies, processing and analyzing text data was arguably the most challenging kind of machine learning project, and it remains a complex task today. Initially, I intended to reuse the pipeline I had built for 'Pride and Prejudice' as a knowledge graph generator, which would empower users to create their own dynamic knowledge graphs from any book they desired.
However, more procedures are needed to build a functional pipeline that can convert any incoming raw text data into the same format, enabling the model to handle "the uncertainty" more effectively.
Due to technical constraints and a desire for a more transparent demonstration, I then started to simplify the generator pipeline. This simplification allows users to gain a more intuitive and straightforward understanding of the statistical logic and techniques used in semantic network analysis in AI systems.
Graph generator code
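As a rough illustration of this simplified approach, a generator might build a co-occurrence network from any uploaded text, as in the sketch below (assuming `networkx`; the windowing rule is an illustrative choice, not the project's exact code):

```python
import re
from itertools import combinations

import networkx as nx

def build_cooccurrence_graph(text: str, window: int = 10) -> nx.Graph:
    words = re.findall(r"[a-z]+", text.lower())
    G = nx.Graph()
    for i in range(0, len(words), window):
        # Every pair of distinct words sharing a window gets an edge,
        # weighted by how often the pair co-occurs across windows.
        for a, b in combinations(set(words[i:i + window]), 2):
            weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
            G.add_edge(a, b, weight=weight + 1)
    return G
```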
In the story generator, the first two-thirds of each narrative is meticulously shaped using a tracery-like grammar with a rule-based structure that I hard-coded. It randomly weaves elements like real-world book titles, presidential quotes, and celebrity names into the fabric of the story. The remaining third leverages the power of the GPT-2 model to generate the rest based on the clues provided in the initial sentences (a simplified sketch follows below). Each narrative unfolds uniquely based on the input, tailoring an experience that invites us collectively to re-examine the human and AI mind, fostering innovative storytelling through the combination of electronic text and the mathematical algorithms within the AI language model.
Story generator code
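A minimal sketch of this two-stage design, assuming the Hugging Face `transformers` library; the grammar rules below are illustrative stand-ins for the hard-coded rules and Corpora data used in the project:

```python
import random
from transformers import pipeline

RULES = {
    "opening": ["Inspired by {book}, {celebrity} began the tale with the words {quote}"],
    "book": ["Pride and Prejudice", "Alice's Adventures in Wonderland"],
    "celebrity": ["a famous actor", "a renowned singer"],
    "quote": ["'The universe is wider than our views of it.'"],
}

def expand(symbol: str) -> str:
    # Pick a production at random and recursively fill its
    # {placeholders}, tracery-style.
    template = random.choice(RULES[symbol])
    for key in RULES:
        while "{" + key + "}" in template:
            template = template.replace("{" + key + "}", expand(key), 1)
    return template

seed = expand("opening")                          # rule-based first portion
generator = pipeline("text-generation", model="gpt2")
story = generator(seed, max_length=120)[0]["generated_text"]  # GPT-2 continues it
print(story)
```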
To better document every process and execute each step, I utilized version control and GitHub's project workflow interface to create a robust system for tracking changes. This method is a great way to facilitate any future collaboration and maintain a clear history of the project's development.
Version control in Terminal
Project workflow in GitHub
The term 'Data Science and Analysis' can be daunting, not only to the general public unfamiliar with its intricate processes but sometimes even to data scientists themselves. So for the UX/UI design in this project, I aimed to create an interface that is both concise and straightforward, leveraging the clarity provided by graphical visualizations of data to simplify the complex underlying computational processes. Through the lens of statistical analysis and knowledge graphs, I intend to represent the text data in a more visually artistic and engaging manner.
With this intention in mind, the information architecture of the website was designed with a minimalist approach, featuring two sidebars for easy navigation and four functional pages, each serving a unique purpose. This layout ensures that users can navigate the website intuitively without being overwhelmed by excessive options or complex structures.
Home Page - Pride and Prejudice Social Network Visualization
Facebook Social Community
AI Vision
Sentiment Analyzer
Network Visualization - Alice's Adventures in Wonderland
Generate your own - Electronic Story
Generate your own - Social Net Graph
For this project, I opted to utilize Streamlit over building a custom server from scratch. The user interface, developed with additional HTML, CSS, and JavaScript, receives text inputs and establishes a WebSocket connection for data transfer to the Streamlit-powered Python backend. Then, a Machine Learning pipeline processes the data, and the results are visualized through an NLP pipeline, dynamically displaying interactive graphs on the web interface. Python handles backend data acquisition, analysis, and classification, while the Vis-Network library enhances the visual output.
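A minimal sketch of that flow; `run_nlp_pipeline` is a hypothetical placeholder for the project's actual pipeline, while `st.text_area` and `components.html` are Streamlit's own APIs:

```python
import streamlit as st
import streamlit.components.v1 as components

st.title("Generate your own")
user_text = st.text_area("Paste your text here")

if user_text:
    # Hypothetical pipeline call: analyze the text and return
    # vis-network HTML for the interactive graph.
    graph_html = run_nlp_pipeline(user_text)
    components.html(graph_html, height=600)
```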
Throughout the development, I navigated challenges related to syntax differences, data formatting, and lengthy processing times. Through persistence, I learned how to combine different syntaxes on the same page, refined the app's infrastructure, and developed automated procedures for storing data in JSON format, enabling animated graph visualization on an HTML canvas without Python's typical constraints.
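One of those automated procedures might look like the following sketch, which exports a `networkx` graph into the `nodes`/`edges` JSON shape that vis-network consumes; the exact schema used in the project may differ:

```python
import json
import networkx as nx

def graph_to_vis_json(G: nx.Graph, path: str) -> None:
    # vis-network expects {"nodes": [{"id", "label"}], "edges": [{"from", "to"}]}.
    payload = {
        "nodes": [{"id": n, "label": str(n)} for n in G.nodes],
        "edges": [{"from": u, "to": v} for u, v in G.edges],
    }
    with open(path, "w") as f:
        json.dump(payload, f)
```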
Facebook Social Network - Visualization
Customized text data uploading system:
User Data Uploading system - generate graph & generate story
All user-inputted data is automatically discarded once the user closes the website. This behavior is intentional and clearly stated on the page to address any data privacy concerns. Since no data is saved, users are provided with a download option to save any interesting work they create with the system.
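A minimal sketch of that option using Streamlit's built-in `st.download_button`; the variable `story_text` and the filename are hypothetical:

```python
import streamlit as st

st.download_button(
    label="Download your story",
    data=story_text,            # hypothetical: the story generated this session
    file_name="my_story.txt",
    mime="text/plain",
)
```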
User Generated Text Representation
The deployment phase of any software or system, especially in fields like AI and software development, can be incredibly challenging. It often involves addressing numerous small details and resolving various bugs that weren't evident during earlier stages of development.
This was the first time I single-handedly deployed a full-stack web application, truly a one-man-army effort. The difficulties I encountered during the deployment process were overwhelming, with numerous errors, bugs, and package management issues, not to mention the subtle discrepancies in the UI.
I spent three consecutive nights diving into these formidable challenges, which eventually taught me how to decipher obscure error messages from the back-end terminal on both systems. This newfound skill in turn helped me tackle the problems one by one and successfully deploy the app in the end.
After a long, exhausting battle, while sipping coffee the next morning, an intriguing thought popped into my head: "What a journey! I've built a web app to transform text data into digital insights, and to invite humanity to re-question the notions of AI mind and intelligence. Am I not, in essence, striving to decipher the computer 'mind' through its own language system along the way?"
Successfully Deployed Web Application
Partial code of the web app
App Deployment back-end process
Data annotation, to me, is one of the most crucial elements in any kind of machine learning project.
I devoted a good amount of time to carefully writing the data annotations, both in the app guidelines and in the documentation for each page. Here's a breakdown of the process:
In the middle of this project, I began drafting a paper titled "The Evolution of Cognitive Representation from Cortex to Computing: The Visualization of Mind". This work mainly reviews how the mind has been portrayed throughout the history of neuroscience, from both scientific and artistic perspectives. The final section briefly discusses the visualization of mind in the artificial intelligence field and how our current AI models are akin to human cognitive functions in both theoretical and practical ways. The discussion then circles back to this project.
As we trace the evolution of cognitive representation from historical contexts and delve into recent advancements and setbacks in AI, the convergence of artificial intelligence with our deeper understanding of human cognition raises profound questions about the nature of 'intelligence', 'cognition', and 'perception'. Although our current AI systems are inspired by animal and human neural systems, and are inherently connected to human cognition, they have without doubt not yet achieved full parity with human capabilities. The human brain is, by nature, a highly sophisticated and complicated computational machine. Despite remarkable advancements in science, math, and technology, we have only begun to unravel the complexities of our remarkable brain. Whether AI will ever truly match human cognitive capabilities remains speculative and depends not only on technological advancements but also on deeper philosophical understanding and ethical considerations.
The intersection and convergence of cognitive science, technology, and art is, to me, a poetic way to illuminate the imperceptible aspects of human nature. It is my hope that through the technologies we've developed, the narratives we've crafted, and the data we've meticulously analyzed, both natural and artificial phenomena can be understood more intuitively and insightfully. AI, perhaps the most transformative tool in human history, holds the potential to fundamentally alter our world into a more sustainable and peaceful environment, provided it is steered with the right ethical and philosophical guidance.
As we all strive to understand our own daily experiences of joy and sorrow, what could a future AGI's motivation be? How can we collectively forge an environment that fosters more joy and beauty in this potential 'digital mind', instead of perpetuating the biases and hatred that already exist within our species?
- THE END -