Using LLMs to Query PubMed Knowledgebases for Biomedical Research
Jul 19, 2024

Part 1: Creating the Knowledgebase
In this article, we'll explore how to leverage large language models (LLMs) to search and query scientific papers from PubMed, a free resource for accessing biomedical and life sciences literature. We'll be using AWS Bedrock as our AI backend, PostgreSQL as the vector database for storing embeddings, and the LangChain library in Python to load papers and query the knowledge base.
If you only care about the results generated by querying the knowledge base, skip down to the end.
The specific use case we'll focus on is querying papers related to Rheumatoid Arthritis, a chronic inflammatory disorder affecting the joints. We'll use the query ((rheumatoid arthritis) AND gene) AND cell to retrieve around 10,000 relevant papers from PubMed, then sample those down to approximately 5,000 papers for our knowledge base.
Disclaimer
I'm not including all of the source code, because the AI libraries change so frequently and there are so many different ways to configure a knowledgebase backend, but I have included some helper functions so you can follow along.
PGVector: Storing Embeddings in a Vector Database
To make it easier for the LLM to process and understand the textual data from the research papers, we'll convert the text into numerical embeddings, which are dense vector representations of the text. These embeddings will be stored in a PostgreSQL database using the PGVector library. This step essentially simplifies the text data into a format that the LLM can more easily work with.
I'm running a local PostgreSQL database, which is fine for my datasets. Hosting AWS Bedrock Knowledgebases can get expensive, and I'm not trying to run up my AWS bill this month. It's summer, and I have kids' camp to pay for!
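One bit of setup before loading anything: PGVector needs the pgvector extension enabled in your database. A minimal sketch, assuming a default local connection (the credentials here are placeholders):

import psycopg

# Enable the pgvector extension so PostgreSQL can store and index embeddings.
# The connection string is a placeholder; adjust it to your setup.
with psycopg.connect("postgresql://postgres:password@localhost:5432/postgres") as conn:
    # the with-block commits the transaction on exit
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")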
AWS Bedrock: The AI Backend
AWS Bedrock is a managed service provided by Amazon Web Services (AWS), allowing you to easily deploy and operate large language models. In our setup, Bedrock will host the LLM that we'll use to query and retrieve relevant information from our knowledge base of research papers.
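If you haven't used Bedrock before, it's worth verifying which foundation models your account can access (model access is granted per model in the AWS console). A quick sketch:

import boto3

# List the foundation models available to this account and region.
bedrock = boto3.client("bedrock", region_name="us-east-1")
for summary in bedrock.list_foundation_models()["modelSummaries"]:
    print(summary["modelId"])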
LangChain: Loading and Querying the Knowledge Base
LangChain is a Python library that simplifies building applications with large language models. We'll use LangChain to load our research papers and their associated embeddings into a knowledge base and then query this knowledge base using the LLM hosted on AWS Bedrock.
Using PubGet for Data Acquisition
While this setup can work with research papers from any source, we're using PubMed because it's a convenient source for acquiring a large volume of papers based on specific search queries. We'll use the PubGet tool to retrieve the initial set of 10,000 papers matching our query on Rheumatoid Arthritis, genes, and cells.
pubget run -q "((rheumatoid arthritis) AND gene) AND cell" \
    pubget_data
This will get us articles in XML format.
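Earlier I mentioned sampling the ~10,000 results down to ~5,000 papers. A minimal sketch of that subsampling, where the glob pattern assumes pubget's default output layout:

import glob
import random

# Subsample the downloaded articles; the glob pattern assumes pubget's layout.
files = glob.glob("pubget_data/*/articles/*/*/article.xml")
random.seed(42)  # make the sample reproducible
files = random.sample(files, min(5000, len(files)))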
Structuring and Organizing the Dataset
Beyond the technical aspects, this article will focus on how to structure and organize your dataset of research papers effectively. This includes topics such as:
- Dataset Organization: Managing your datasets at a global level using collections.
- Metadata Management: Handling and incorporating metadata associated with the papers, such as author information, publication dates, and keywords.
You'll want to think about this upfront. When using LangChain, you query datasets based on their collections. Each collection has a name and a unique identifier.
When you load your data, whether it's PDF papers, XML downloads, Markdown files, codebases, PowerPoint slides, text documents, etc., you can attach additional metadata. You can later use this metadata to filter your results. The metadata is an open dictionary, and you can add tags, source, phenotype, or anything else you think may be relevant, as in the sketch below.
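As an example of what that looks like when constructing LangChain documents by hand (the metadata keys here are illustrative, not required):

from langchain_core.documents import Document

# Metadata is an open dictionary; these keys are just examples.
doc = Document(
    page_content="Abstract: IL-17 levels are elevated in RA synovial fluid...",
    metadata={
        "source": "pubget_data/.../article.xml",
        "tags": ["rheumatoid arthritis", "cytokines"],
        "phenotype": "RA",
    },
)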
Loading and Querying the Knowledge Base
The article will also cover best practices for loading your preprocessed and structured dataset into the knowledge base and provide examples of how to query the knowledge base effectively using the LLM hosted on AWS Bedrock.
By the end of this article, you should have a solid understanding of how to leverage LLMs to search and retrieve relevant information from a large corpus of research papers, as well as strategies for structuring and organizing your dataset to optimize the performance and accuracy of your knowledge base.
import boto3
import pprint
import os
import json
import hashlib
import funcy
import glob
import logging
from typing import Dict, Any, TypedDict, List
from langchain.llms.bedrock import Bedrock
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain_core.documents import Document
from langchain_aws import ChatBedrock
from langchain_community.embeddings import BedrockEmbeddings # to create embeddings for the documents.
from langchain_experimental.text_splitter import SemanticChunker # to split documents into smaller chunks.
from langchain_text_splitters import CharacterTextSplitter
from langchain_postgres import PGVector
from pydantic import BaseModel, Field
from langchain_community.document_loaders import (
WebBaseLoader,
TextLoader,
PyPDFLoader,
CSVLoader,
Docx2txtLoader,
UnstructuredEPubLoader,
UnstructuredMarkdownLoader,
UnstructuredXMLLoader,
UnstructuredRSTLoader,
UnstructuredExcelLoader,
DataFrameLoader,
)
import psycopg
import uuid
Set up your database connection
I'm running a local Supabase PostgreSQL database using their docker-compose setup. In a production setup, I'd recommend using a managed database, like AWS Aurora or Supabase, running someplace besides your laptop. Also, change your password to something besides password.
I didn't notice any difference in performance for smaller datasets between an AWS-hosted knowledgebase and my laptop, but your mileage may vary.
connection = f"postgresql+psycopg://{user}:{password}@{host}:{port}/{database}"
# Establish the connection to the database
conn = psycopg.connect(
conninfo = f"postgresql://{user}:{password}@{host}:{port}/{database}"
)
# Create a cursor to run queries
cur = conn.cursor()
Insert AWS Bedrock embeddings into the table using LangChain
We're using AWS Bedrock as our AI Knowledgebase. Most of the companies I work with have some kind of proprietary data, and Bedrock has a guarantee that your data will remain private. You could use any of the AI backends here.
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
bedrock_client = boto3.client("bedrock-runtime")
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",client=bedrock_client)
bedrock_embeddings_image = BedrockEmbeddings(model_id="amazon.titan-embed-image-v1",client=bedrock_client)
llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_client)
# Function to create a vector store backed by PGVector.
# Make sure to update this if you change collections!
def create_vectorstore(embeddings, collection_name, connection):
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=connection,
        use_jsonb=True,
    )
    return vectorstore
def load_and_split_pdf_semantic(file_path, embeddings):
    # Despite the name, this uses a fixed-size character splitter;
    # the embeddings argument is unused (see the SemanticChunker sketch below).
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(pages)
    return docs
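If you want actual semantic splitting, the SemanticChunker imported earlier can be swapped in. A sketch of that variant (note it calls the embeddings model while splitting, so it's slower and incurs extra Bedrock costs):

def load_and_split_pdf_chunker(file_path, embeddings):
    # SemanticChunker splits on embedding-similarity boundaries
    # instead of fixed character counts.
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    text_splitter = SemanticChunker(embeddings)
    docs = text_splitter.split_documents(pages)
    return docs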
def load_xml(file_path, embeddings):
    # UnstructuredXMLLoader pulls the text out of pubget's article.xml files;
    # the embeddings argument is unused here.
    loader = UnstructuredXMLLoader(file_path)
    docs = loader.load_and_split()
    return docs
def insert_embeddings(files, bedrock_embeddings, vectorstore):
    logging.info(f"Inserting {len(files)} files")
    x = 1
    y = len(files)
    for file_path in files:
        logging.info(f"Splitting {file_path} {x}/{y}")
        docs = []
        if '.pdf' in file_path:
            try:
                with funcy.print_durations('process pdf'):
                    docs = load_and_split_pdf_semantic(file_path, bedrock_embeddings)
            except Exception as e:
                logging.warning(e)
                logging.warning(f"Error loading {file_path}")
        if '.xml' in file_path:
            try:
                with funcy.print_durations('process xml'):
                    docs = load_xml(file_path, bedrock_embeddings)
            except Exception as e:
                logging.warning(e)
                logging.warning(f"Error loading {file_path}")

        # Drop empty chunks before inserting
        filtered_docs = []
        for d in docs:
            if len(d.page_content):
                filtered_docs.append(d)

        # Use a content hash as the ID so re-running the loader doesn't duplicate rows
        ids = []
        for d in filtered_docs:
            ids.append(hashlib.sha256(d.page_content.encode()).hexdigest())

        # Add documents to the vectorstore
        if len(filtered_docs):
            texts = [i.page_content for i in filtered_docs]
            # metadata is a dictionary. You can add to it!
            metadatas = [i.metadata for i in filtered_docs]
            try:
                with funcy.print_durations('load psql'):
                    vectorstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)
            except Exception as e:
                logging.warning(e)
                logging.warning(f"Error {x}/{y}")
        x = x + 1
collection_name_text = "MY_COLLECTION" #pubmed, smiles, etc
vectorstore = create_vectorstore(bedrock_embeddings,collection_name_text,connection)
Load and Process PubMed XML Papers
Most of our data was fetched using the pubget tool, and the articles are in XML format. We'll use the LangChain XML loader to process, split, and load the embeddings.
files = glob.glob("/home/jovyan/data/pubget_ra/pubget_data/*/articles/*/*/article.xml")
# I ran this previously
insert_embeddings(files[0:2], bedrock_embeddings, vectorstore)
Load and Process PubMed PDF Papers
PDFs are easier to read, and I grabbed some for doing QA against the knowledgebase.
files = glob.glob("/home/jovyan/data/pubget_ra/papers/*pdf")
insert_embeddings(files[0:2], bedrock_embeddings, vectorstore)
Part 2: Query the Knowledgebase
Now that we have our knowledgebase set up, we can use Retrieval Augmented Generation (RAG) to run queries with the LLM.
Our queries are:
- Tell me about T cell–derived cytokines in relation to rheumatoid arthritis and provide citations and article titles
- Tell me about single-cell research in rheumatoid arthritis.
- Tell me about protein-protein associations in rheumatoid arthritis.
- Tell me about the findings of GWAS studies in rheumatoid arthritis.
import hashlib
import logging
import os
from typing import Optional, List, Dict, Any
import glob
import boto3
from toolz.itertoolz import partition_all
import json
import funcy
import psycopg
from IPython.display import Markdown, display
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import PromptTemplate
from langchain.retrievers.bedrock import (
AmazonKnowledgeBasesRetriever,
RetrievalConfig,
VectorSearchConfig,
)
from aws_bedrock_utilities.models.base import BedrockBase, RAGResults
from aws_bedrock_utilities.models.pgvector_knowledgebase import BedrockPGWrapper
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from pprint import pprint
import time
from rich.logging import RichHandler
FORMAT = "%(message)s"
logging.basicConfig(
level="INFO", format=FORMAT, datefmt="[%X]", handlers=[RichHandler()]
)
os.environ['POSTGRES_USER'] = 'postgres'
You'll first need the collection name you're querying, along with your queries.
I always recommend running a few QA queries: ask the obvious questions in several different ways.
You'll also want to adjust MAX_DOCS_RETURNED based on your time constraints and how many articles are in your knowledgebase. The LLM will search until it hits that maximum and then stop, so you'll need to increase that number for an exhaustive search.
# Make sure to keep the collection name consistent!
COLLECTION_NAME = "MY_COLLECTION"
MAX_DOCS_RETURNED = 50
p = BedrockPGWrapper(collection_name=COLLECTION_NAME)
INFO Found credentials in environment variables. credentials.py:1147
#model = "anthropic.claude-3-sonnet-20240229-v1:0"
model = "anthropic.claude-3-haiku-20240307-v1:0"
queries = [
"Tell me about T cell–derived cytokines in relation to rheumatoid arthritis and provide citations and article titles",
"Tell me about single-cell research in rheumatoid arthritis.",
"Tell me about protein-protein associations in rheumatoid arthritis.",
"Tell me about the findings of GWAS studies in rheumatoid arthritis.",
]
ai_responses = []
for query in queries:
    answer = p.run_kb_chat(
        query=query,
        collection_name=COLLECTION_NAME,
        model_id=model,
        search_kwargs={'k': MAX_DOCS_RETURNED, 'fetch_k': 50000},
    )
    ai_responses.append(answer)
    time.sleep(1)  # brief pause between queries
answer.keys()
dict_keys(['source_documents', 'result', 'query'])
len(answer['source_documents'])
50
len(ai_responses)
4
for answer in ai_responses:
    t = Markdown(f"""
### Query
{answer['query']}

### Response
{answer['result']}
""")
    display(t)
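For reference, if you'd rather not use the aws_bedrock_utilities wrapper, a roughly equivalent chain can be built with the create_retrieval_chain imports above. This is a hedged sketch of what I assume run_kb_chat does internally, reusing the vectorstore and llm objects from Part 1; it is not the wrapper's actual source:

# Hedged sketch: a plain-LangChain RAG chain over the same PGVector store.
prompt = PromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {input}"
)
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
retriever = vectorstore.as_retriever(search_kwargs={"k": MAX_DOCS_RETURNED})
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
result = rag_chain.invoke({"input": queries[0]})
print(result["answer"])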
Part 3: Results!
We've built our knowledgebase, run some queries, and now we're ready to look at the results the LLM generated for us.
Each result is a dictionary with the original query, the response, and the relevant snippets of the source documents.
Query
Tell me about T cell–derived cytokines in relation to rheumatoid arthritis and provide citations and article titles
Response
T cell-derived cytokines play a key role in the pathogenesis of rheumatoid arthritis (RA). Some key findings include:
- Increased levels of IL-17, a cytokine produced by Th17 cells, have been found in the synovial fluid of RA patients. IL-17 can stimulate fibroblast-like synoviocytes (FLS) and macrophages to produce inflammatory mediators like VEGF, IL-1, IL-6, TNF-α, and prostaglandin E2, and promote osteoclast formation, contributing to joint inflammation and destruction (Honorati et al. 2006, Schurgers et al. 2011).
- Th1 cells, which produce IFN-γ, are also implicated in RA pathogenesis. IFN-γ can induce macrophages to polarize towards a pro-inflammatory M1 phenotype (Schurgers et al. 2011, Kebir et al. 2009, Boniface et al. 2010).
- CD161+ Th17 cells, which can produce both IL-17 and IFN-γ, are enriched in the synovium of RA patients and may contribute to the inflammatory environment (Afzali et al. 2013, Bovenschen et al. 2011, Koenen et al. 2008, Pesenacker et al. 2013).
- Regulatory T cells (Tregs), which normally suppress inflammation, show impaired function in RA, potentially contributing to the dysregulated immune response (Moradi et al. 2014, Samson et al. 2012, Zhang et al. 2018, Walter et al. 2013, Wang et al. 2018, Morita et al. 2016).
In summary, the imbalance between pro-inflammatory T cell subsets (Th1, Th17) and anti-inflammatory Tregs is a hallmark of RA pathogenesis, with cytokines like IL-17 and IFN-γ playing central roles in driving joint inflammation and destruction.
Query
Tell me about single-cell research in rheumatoid arthritis.
Response
Single-cell research has provided important insights into the pathogenesis of rheumatoid arthritis (RA):
- Single-cell RNA sequencing (scRNA-seq) studies have identified distinct cell states and subpopulations within the RA synovium, including pathogenic T cell subsets like T peripheral helper (Tph) cells and cytotoxic CD8+ T cells.
- Analyses of the T cell receptor (TCR) repertoire in the RA synovium have revealed clonal expansion of CD4+ and CD8+ T cell populations, suggesting antigen-driven responses.
- scRNA-seq has also characterized expanded populations of activated B cells, plasmablasts, and plasma cells in the RA synovium that demonstrate substantial clonal relationships.
- Receptor-ligand analyses from scRNA-seq data have predicted key cell-cell interactions, such as between Tph cells and B cells, that may drive synovial inflammation.
- Overall, single-cell studies have uncovered the cellular and molecular heterogeneity within the RA synovium, identifying specific immune cell subsets and pathways that could serve as targets for more personalized therapeutic approaches.
Query
Tell me about protein-protein associations in rheumatoid arthritis.
Response
Based on the information provided in the context, some key protein-protein associations in rheumatoid arthritis (RA) include:
- Rheumatoid factor (RF) and anti-citrullinated protein antibodies (ACPAs):
  - RF is found in about 80% of patients in the pre-articular phase of RA.
  - ACPAs are highly specific for RA and can be detected years before the onset of clinical symptoms.
- Peptidylarginine deiminase (PAD) enzymes and anti-PAD antibodies:
  - Anti-PAD2 antibodies are associated with a moderate disease course, while anti-PAD4 antibodies are linked to more severe and rapidly progressive RA.
  - Anti-PAD3/4 antibodies may signal the development of RA-associated interstitial lung disease.
- Anti-carbamylated protein (anti-CarP) antibodies:
  - Anti-CarP antibodies are present in 25-50% of RA patients, independent of RF or ACPA positivity.
  - Anti-CarP antibodies are associated with poor prognosis and increased morbidity, including RA-associated interstitial lung disease.
- Malondialdehyde-acetaldehyde (MAA) adducts and anti-MAA antibodies:
  - Anti-MAA antibodies are associated with radiological progression in seronegative RA.
- Protein-protein interactions in signaling pathways:
  - The JAK-STAT, MAPK, PI3K-AKT, and SYK signaling pathways are all implicated in the pathogenesis of RA and are potential targets for therapeutic intervention.
In summary, the context highlights several key protein-protein associations in RA, including autoantibodies (RF, ACPAs, anti-PAD, anti-CarP, anti-MAA) and signaling pathway components (JAK, STAT, MAPK, PI3K, SYK), which play important roles in the pathogenesis and progression of the disease.
Query
Tell me about the findings of GWAS studies in rheumatoid arthritis.
Response
Here are some key findings from GWAS studies in rheumatoid arthritis:
- Genome-wide association studies (GWAS) have identified over 100 genetic risk loci associated with rheumatoid arthritis susceptibility.
- The HLA-DRB1 gene is the strongest genetic risk factor, accounting for about 50% of the genetic component of rheumatoid arthritis. Specific HLA-DRB1 alleles containing the "shared epitope" sequence are strongly associated with increased RA risk.
- Other notable genetic risk factors identified through GWAS include PTPN22, STAT4, CCR6, PADI4, CTLA4, and CD40. These genes are involved in immune regulation and inflammation pathways.
- Genetic risk factors can differ between seropositive (ACPA-positive) and seronegative rheumatoid arthritis. For example, HLA-DRB1 alleles have a stronger association with seropositive RA.
- GWAS have also identified genetic variants associated with disease severity and response to treatment in rheumatoid arthritis. For example, variants in the FCGR3A and PTPRC genes have been linked to response to anti-TNF therapy.
- Overall, GWAS have provided important insights into the genetic architecture and pathogenesis of rheumatoid arthritis, which has implications for developing targeted therapies and personalized treatment approaches.
Investigate the Source Documents
Querying the knowledgebase returns relevant snippets of the source documents. Sometimes the formatting returned by LangChain can be a bit off, but you can always go back to the source.
# Print the metadata for the first several source documents
x = 0
y = 10
for answer in ai_responses:
    for s in answer['source_documents']:
        if x <= y:
            print(s.metadata)
        else:
            break
        x = x + 1
{'page': 6, 'source': '/home/jovyan/data/pubget_ra/papers/fimmu-12-790122.pdf'}
{'source': '/home/jovyan/data/pubget_ra/pubget_data/query_55c6003c0195b20fd4bdc411f67a8dcf/articles/d52/pmcid_11167034/article.xml'}
{'page': 7, 'source': '/home/jovyan/data/pubget_ra/papers/fimmu-12-790122.pdf'}
{'page': 17, 'source': '/home/jovyan/data/pubget_ra/papers/fimmu-12-790122.pdf'}
{'source': '/home/jovyan/data/pubget_ra/pubget_data/query_55c6003c0195b20fd4bdc411f67a8dcf/articles/657/pmcid_11151399/article.xml'}
{'page': 3, 'source': '/home/jovyan/data/pubget_ra/papers/41392_2023_Article_1331.pdf'}
{'page': 4, 'source': '/home/jovyan/data/pubget_ra/papers/41392_2023_Article_1331.pdf'}
{'page': 5, 'source': '/home/jovyan/data/pubget_ra/papers/fimmu-12-790122.pdf'}
{'source': '/home/jovyan/data/pubget_ra/pubget_data/query_55c6003c0195b20fd4bdc411f67a8dcf/articles/e02/pmcid_11219584/article.xml'}
{'page': 8, 'source': '/home/jovyan/data/pubget_ra/papers/fimmu-12-790122.pdf'}
{'source': '/home/jovyan/data/pubget_ra/pubget_data/query_55c6003c0195b20fd4bdc411f67a8dcf/articles/6fb/pmcid_11203675/article.xml'}
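Since the metadata travels with every chunk, you can also query the Part 1 vectorstore directly and filter on it. A sketch (the filter operators are langchain_postgres conventions and assume the use_jsonb=True setting from earlier):

# Hedged sketch: restrict a similarity search to a single source paper.
hits = vectorstore.similarity_search(
    "IL-17 in rheumatoid arthritis",
    k=5,
    filter={"source": {"$ilike": "%fimmu-12-790122%"}},
)
for hit in hits:
    print(hit.metadata["source"])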
Wrap Up
There you have it! We created a knowledgebase on the cheap, used AWS Bedrock to generate the embeddings, and then used a Claude LLM to run our queries. Here we used PubMed papers, but we could just as easily have used meeting notes, PowerPoint slides, crawled websites, or in-house databases.
If you have any questions, comments, or tutorial requests please don't hesitate to reach out to me by email at [email protected]