Automated Knowledge Graph Construction using ChatGPT (2024)

In this post, we will cover constructing a knowledge graph from raw text data using OpenAI’s gpt-3.5-turbo. LLMs have shown superior performance in text generation and question-answering tasks, and retrieval-augmented generation (RAG) has further improved their performance by allowing them to access up-to-date and domain-specific knowledge. Our goal in this post is to use LLMs as information extraction tools that transform raw text into facts which can easily be queried to gain useful insights. But first, we need to define a few key concepts.

What is a knowledge graph?

A knowledge graph is a semantic network that represents and interlinks real-world entities. These entities often correspond to people, organizations, objects, events, and concepts. The knowledge graph consists of triplets with the following structure:

head → relation → tail

or in the terminology of the Semantic Web:

subject → predicate → object

The network representation allows us to extract and analyze the complex relationships that exist between these entities.
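To make the triplet structure concrete, here is a minimal sketch (the triples are toy data, not taken from the dataset) showing that a plain list of (head, relation, tail) tuples can already be queried:

```python
# A toy knowledge graph as a list of (head, relation, tail) triples.
triples = [
    ("YUVORA 3D Brick Wall Stickers", "hasColor", "White"),
    ("YUVORA 3D Brick Wall Stickers", "isProducedBy", "YUVORA"),
    ("YUVORA", "relatedTo", "wall decoration"),
]

def tails(head, relation):
    """Return every tail linked to `head` via `relation`."""
    return [t for h, r, t in triples if h == head and r == relation]

print(tails("YUVORA 3D Brick Wall Stickers", "hasColor"))  # ['White']
```

Dedicated graph stores index the same structure for efficient traversal, but the triplet itself remains the basic unit.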

A knowledge graph is often accompanied by a definition of the concepts, relations, and their properties: an ontology. The ontology is a formal specification that defines the concepts and their relations in the target domain, thus providing semantics for the network.

Ontologies are used by search engines and other automated agents on the Web to understand what the content of a specific Web page means in order to index it and display it correctly.

Case description

For this use-case, we are going to create a knowledge graph using OpenAI’s gpt-3.5-turbo from product descriptions in the Amazon Products Dataset.

Many ontologies are used on the Web to describe products, the most popular being the GoodRelations ontology and the Product Types Ontology. Both of these ontologies extend the Schema.org ontology.

Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet. Schema.org vocabulary can be used with many different encodings, including RDFa, Microdata and JSON-LD.

For the task at hand, we are going to use the Schema.org definitions for products and related concepts, including their relations, to extract triplets from product descriptions.

We are going to implement the solution in Python. First, we need to install and import the required libraries.

Import libraries and read the data

!pip install pandas openai sentence-transformers networkx matplotlib
import json
import logging
import matplotlib.pyplot as plt
import networkx as nx
from openai import OpenAI
import pandas as pd
from sentence_transformers import SentenceTransformer, util

Now, we are going to read the Amazon Products Dataset as a pandas dataframe.

data = pd.read_csv("amazon_products.csv")

We can see the contents of the dataset in the figure below. The dataset contains the following columns: ‘PRODUCT_ID’, ‘TITLE’, ‘BULLET_POINTS’, ‘DESCRIPTION’, ‘PRODUCT_TYPE_ID’, and ‘PRODUCT_LENGTH’. We are going to combine the ‘TITLE’, ‘BULLET_POINTS’, and ‘DESCRIPTION’ columns into a single ‘text’ column, which will serve as the product specification that we prompt ChatGPT to extract entities and relations from.

# Fill missing fields so a single NaN does not wipe out the whole combined text.
data['text'] = (
    data['TITLE'].fillna('') + ' '
    + data['BULLET_POINTS'].fillna('') + ' '
    + data['DESCRIPTION'].fillna('')
)

Information extraction

We are going to instruct ChatGPT to extract entities and relations from a provided product specification and return the result as an array of JSON objects. The JSON objects must contain the following keys: ‘head’, ‘head_type’, ‘relation’, ‘tail’, and ‘tail_type’.

The ‘head’ key must contain the text of the extracted entity with one of the types from the provided list in the user prompt. The ‘head_type’ key must contain the type of the extracted head entity which must be one of the types from the provided user list. The ‘relation’ key must contain the type of relation between the ‘head’ and the ‘tail’, the ‘tail’ key must represent the text of an extracted entity which is the object in the triple, and the ‘tail_type’ key must contain the type of the tail entity.
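Since the model returns free text that merely promises this JSON shape, it is worth validating each object before using it. The helper below is a hypothetical addition, not part of the original pipeline, sketching one way to keep only well-formed triplet objects:

```python
import json

# The five keys every extracted triplet object must carry.
REQUIRED_KEYS = {"head", "head_type", "relation", "tail", "tail_type"}

def valid_triplets(raw_json):
    """Parse model output, keeping only well-formed triplet objects."""
    try:
        objects = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    return [o for o in objects
            if isinstance(o, dict) and REQUIRED_KEYS <= o.keys()]

sample = ('[{"head": "X", "head_type": "product", "relation": "hasColor",'
          ' "tail": "red", "tail_type": "color"}, {"head": "Y"}]')
print(len(valid_triplets(sample)))  # 1
```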

We are going to use the entity types and relation types listed below to prompt ChatGPT for entity-relation extraction. We will map these entities and relations to the corresponding types and properties from the Schema.org ontology: the keys in the mapping are the entity and relation types provided to ChatGPT, and the values are the URLs of the corresponding classes and properties from Schema.org.

# ENTITY TYPES:
entity_types = {
    "product": "https://schema.org/Product",
    "rating": "https://schema.org/AggregateRating",
    "price": "https://schema.org/Offer",
    "characteristic": "https://schema.org/PropertyValue",
    "material": "https://schema.org/Text",
    "manufacturer": "https://schema.org/Organization",
    "brand": "https://schema.org/Brand",
    "measurement": "https://schema.org/QuantitativeValue",
    "organization": "https://schema.org/Organization",
    "color": "https://schema.org/Text",
}

# RELATION TYPES:
relation_types = {
    "hasCharacteristic": "https://schema.org/additionalProperty",
    "hasColor": "https://schema.org/color",
    "hasBrand": "https://schema.org/brand",
    "isProducedBy": "https://schema.org/manufacturer",
    "hasMeasurement": "https://schema.org/hasMeasurement",
    "isSimilarTo": "https://schema.org/isSimilarTo",
    "madeOfMaterial": "https://schema.org/material",
    "hasPrice": "https://schema.org/offers",
    "hasRating": "https://schema.org/aggregateRating",
    "relatedTo": "https://schema.org/isRelatedTo"
}

To perform the information extraction, we create an OpenAI client and use the chat completions API to generate the array of JSON objects for all relations identified in a raw product specification. The default model is gpt-3.5-turbo, since its performance is good enough for this simple demonstration.

client = OpenAI(api_key="<YOUR_API_KEY>")

def extract_information(text, model="gpt-3.5-turbo"):
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt.format(
                    entity_types=entity_types,
                    relation_types=relation_types,
                    specification=text
                )
            }
        ]
    )

    return completion.choices[0].message.content

Prompt engineering

The system_prompt variable contains the instructions guiding ChatGPT to extract entities and relations from the raw text, and return the result in the form of arrays of JSON objects, each having the keys: ‘head’, ‘head_type’, ‘relation’, ‘tail’, and ‘tail_type’.

system_prompt = """You are an expert agent specialized in analyzing product specifications in an online retail store.
Your task is to identify the entities and relations requested with the user prompt, from a given product specification.
You must generate the output in a JSON containing a list with JSON objects having the following keys: "head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt, the "head_type"
key must contain the type of the extracted head entity which must be one of the types from the provided user list,
the "relation" key must contain the type of relation between the "head" and the "tail", the "tail" key must represent the text of an
extracted entity which is the tail of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract as
many entities and relations as you can.
"""

The user_prompt variable contains a single example of the required output for one specification from the dataset and prompts ChatGPT to extract entities and relations from the provided specification in the same way. This is an example of one-shot prompting with ChatGPT.

user_prompt = """Based on the following example, extract entities and relations from the provided text.
Use the following entity types:

# ENTITY TYPES:
{entity_types}

Use the following relation types:
{relation_types}

--> Beginning of example

# Specification
"YUVORA 3D Brick Wall Stickers | PE Foam Fancy Wallpaper for Walls,
Waterproof & Self Adhesive, White Color 3D Latest Unique Design Wallpaper for Home (70*70 CMT) -40 Tiles
[Made of soft PE foam,Anti Children's Collision,take care of your family.Waterproof, moist-proof and sound insulated. Easy clean and maintenance with wet cloth,economic wall covering material.,Self adhesive peel and stick wallpaper,Easy paste And removement .Easy To cut DIY the shape according to your room area,The embossed 3d wall sticker offers stunning visual impact. the tiles are light, water proof, anti-collision, they can be installed in minutes over a clean and sleek surface without any mess or specialized tools, and never crack with time.,Peel and stick 3d wallpaper is also an economic wall covering material, they will remain on your walls for as long as you wish them to be. The tiles can also be easily installed directly over existing panels or smooth surface.,Usage range: Featured walls,Kitchen,bedroom,living room, dinning room,TV walls,sofa background,office wall decoration,etc. Don't use in shower and rugged wall surface]
Provide high quality foam 3D wall panels self adhesive peel and stick wallpaper, made of soft PE foam,children's collision, waterproof, moist-proof and sound insulated,easy cleaning and maintenance with wet cloth,economic wall covering material, the material of 3D foam wallpaper is SAFE, easy to paste and remove . Easy to cut DIY the shape according to your decor area. Offers best quality products. This wallpaper we are is a real wallpaper with factory done self adhesive backing. You would be glad that you it. Product features High-density foaming technology Total Three production processes Can be use of up to 10 years Surface Treatment: 3D Deep Embossing Damask Pattern."

################

# Output
[
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "isProducedBy",
        "tail": "YUVORA",
        "tail_type": "manufacturer"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasCharacteristic",
        "tail": "Waterproof",
        "tail_type": "characteristic"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasCharacteristic",
        "tail": "Self Adhesive",
        "tail_type": "characteristic"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasColor",
        "tail": "White",
        "tail_type": "color"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasMeasurement",
        "tail": "70*70 CMT",
        "tail_type": "measurement"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasMeasurement",
        "tail": "40 tiles",
        "tail_type": "measurement"
    }}
]

--> End of example

For the following specification, extract entities and relations as in the provided example.

# Specification
{specification}
################

# Output

"""

Now, we call the extract_information function for each specification in the dataset and create a list of all extracted triplets which will represent our knowledge graph. For this demonstration, we will generate a knowledge graph using a subset of only 100 product specifications.

kg = []
for content in data['text'].values[:100]:
    try:
        extracted_relations = extract_information(content)
        extracted_relations = json.loads(extracted_relations)
        kg.extend(extracted_relations)
    except Exception as e:
        logging.error(e)

kg_relations = pd.DataFrame(kg)
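Because each specification is processed independently, the combined list can contain exact duplicate triplets. A small pandas deduplication step, sketched here on toy rows with the same five keys, keeps the graph concise:

```python
import pandas as pd

# Toy triplets in the shape produced by the extraction loop;
# the first two rows are exact duplicates.
kg = [
    {"head": "A", "head_type": "product", "relation": "hasColor",
     "tail": "White", "tail_type": "color"},
    {"head": "A", "head_type": "product", "relation": "hasColor",
     "tail": "White", "tail_type": "color"},
    {"head": "A", "head_type": "product", "relation": "hasBrand",
     "tail": "B", "tail_type": "brand"},
]
kg_relations = pd.DataFrame(kg).drop_duplicates(ignore_index=True)
print(len(kg_relations))  # 2
```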

The results from the information extraction are displayed in the figure below.


Entity resolution

Entity resolution (ER) is the process of determining which entity mentions refer to the same real-world concept. In this case, we will perform basic entity resolution on the head and tail entities in the dataset, in order to obtain a more concise representation of the facts present in the texts.

We will perform entity resolution using NLP techniques: specifically, we will create an embedding for each head entity using the sentence-transformers library and calculate the cosine similarity between the head entities.

We will use the ‘all-MiniLM-L6-v2’ sentence transformer to create the embeddings, since it is a fast and relatively accurate model that is suitable for this use-case. For each pair of head entities, we check whether their similarity is larger than 0.95; if so, we consider them to be the same entity and normalize their text values to be equal. The same reasoning applies to the tail entities.

This process gives us the following result: if one entity has the value ‘Microsoft’ and another ‘Microsoft Inc.’, the two will be merged into one.

We load the embedding model and use it in the following way to calculate the similarity between, for example, the first and second head entities.

heads = kg_relations['head'].values
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(heads)
similarity = util.cos_sim(embeddings[0], embeddings[1])
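The full merge step can be sketched as follows. To keep the example self-contained and runnable without downloading a model, difflib’s character-based ratio stands in for the embedding cosine similarity, with a lower threshold tuned to that stand-in metric; with ‘all-MiniLM-L6-v2’ embeddings and `util.cos_sim`, the 0.95 threshold described above would apply instead:

```python
from difflib import SequenceMatcher

def resolve_entities(names, threshold, sim=None):
    """Greedy entity resolution: map each name onto the first previously
    seen representative whose similarity exceeds `threshold`."""
    if sim is None:
        # Character-based stand-in for cosine similarity between embeddings.
        sim = lambda a, b: SequenceMatcher(None, a.lower(), b.lower()).ratio()
    canonical, mapping = [], {}
    for name in names:
        match = next((c for c in canonical if sim(name, c) >= threshold), None)
        if match is None:
            canonical.append(name)   # new entity, becomes its own representative
            mapping[name] = name
        else:
            mapping[name] = match    # merged into an existing representative
    return mapping

mapping = resolve_entities(["Microsoft", "Microsoft Inc.", "YUVORA"],
                           threshold=0.75)
# 'Microsoft Inc.' is mapped onto 'Microsoft'; 'YUVORA' stays on its own.
```

The resulting mapping would then be applied with `kg_relations['head'] = kg_relations['head'].map(mapping)`, and analogously for the tails.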

In order to visualize the extracted knowledge graph after entity resolution, we use the networkx Python library. First, we create an empty graph, and add each extracted relation to the graph.

G = nx.Graph()
for _, row in kg_relations.iterrows():
    G.add_edge(row['head'], row['tail'], label=row['relation'])

To draw the graph we can use the following code:

pos = nx.spring_layout(G, seed=47, k=0.9)
labels = nx.get_edge_attributes(G, 'label')
plt.figure(figsize=(15, 15))
nx.draw(G, pos, with_labels=True, font_size=10, node_size=700, node_color='lightblue', edge_color='gray', alpha=0.6)
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=8, label_pos=0.3, verticalalignment='baseline')
plt.title('Product Knowledge Graph')
plt.show()

A subgraph from the generated knowledge graph is displayed in the figure below:

We can see that in this way, we can connect multiple different products based on characteristics they share. This is useful for learning common attributes between products, normalizing product specifications, describing resources on the Web by using a common schema such as Schema.org, and even making product recommendations based on the product specifications.
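As a sketch of the first of these uses, products that share a neighbour node (such as a characteristic) can be surfaced directly from the networkx graph; the edges below are toy data in the same shape as those added above:

```python
import networkx as nx

G = nx.Graph()
# (head, tail, relation) edges in the shape produced by the extraction step.
edges = [
    ("Wall Stickers", "Waterproof", "hasCharacteristic"),
    ("Wall Stickers", "White", "hasColor"),
    ("Shower Curtain", "Waterproof", "hasCharacteristic"),
    ("Shower Curtain", "Polyester", "madeOfMaterial"),
]
for head, tail, rel in edges:
    G.add_edge(head, tail, label=rel)

def related_products(product, graph):
    """Nodes reachable through a shared neighbour, e.g. a common characteristic."""
    related = set()
    for attr in graph.neighbors(product):
        related.update(graph.neighbors(attr))
    related.discard(product)
    return related

print(related_products("Wall Stickers", G))  # {'Shower Curtain'}
```

Since the graph is untyped here, a real implementation would also filter the two-hop neighbours by their entity type so that only product nodes are returned.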

Most corporations have vast amounts of unstructured data lying unused in data lakes. Building a knowledge graph over this data helps surface information that is trapped in unprocessed and unstructured text corpora, and use it to make more informed decisions.

So far, we have seen that LLMs can be used to extract triplets of entities and relations from raw text data and automatically construct a knowledge graph. In the next post, we will attempt to create a product recommendation system based on the extracted knowledge graph.
