
Image by Author
# Introduction
Did you know that a large portion of valuable information still exists in unstructured text? For example, research papers, clinical notes, financial reports, etc. Extracting reliable and structured information from these texts has always been a challenge. LangExtract is an open-source Python library (released by Google) that solves this problem using large language models (LLMs). You define what to extract through simple prompts and a few examples, and it then uses LLMs (like Google's Gemini, OpenAI, or local models) to pull that information out of documents of any length. Another thing that makes it useful is its support for very long documents (via chunking and multi-pass processing) and interactive visualization of results. Let's explore this library in more detail.
# 1. Installing and Setting Up
To install LangExtract locally, first ensure you have Python 3.10+ installed. The library is available on PyPI. In a terminal or virtual environment, run:

```shell
pip install langextract
```

For an isolated environment, you may first create and activate a virtual environment:

```shell
python -m venv langextract_env
source langextract_env/bin/activate  # On Windows: .\langextract_env\Scripts\activate
pip install langextract
```
There are other installation options (from source, or using Docker) as well, which you can check in the project's documentation.
# 2. Setting Up API Keys (for Cloud Models)
LangExtract itself is free and open-source, but if you use cloud-hosted LLMs (like Google Gemini or OpenAI GPT models), you must supply an API key. You can set the LANGEXTRACT_API_KEY environment variable or store it in a .env file in your working directory. For example:

```shell
export LANGEXTRACT_API_KEY="YOUR_API_KEY_HERE"
```

or in a .env file:

```shell
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
echo '.env' >> .gitignore
```
On-device LLMs via Ollama or other local backends don't require an API key. To enable OpenAI, you'd run pip install langextract[openai], set your OPENAI_API_KEY, and use an OpenAI model_id. For Vertex AI (enterprise users), service account authentication is supported.
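If you keep the key in a .env file, you need it loaded into the process environment before calling the library. A common choice is the python-dotenv package, but a minimal stdlib-only loader looks like this (a sketch for illustration, not part of LangExtract):

```python
import os

def load_env(path: str = ".env") -> None:
    """Read KEY=VALUE lines from a .env file into the process environment."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Write a sample .env, load it, and read the key back.
with open(".env", "w") as f:
    f.write("LANGEXTRACT_API_KEY=your-api-key-here\n")

load_env()
print(os.environ["LANGEXTRACT_API_KEY"])  # your-api-key-here
```

Using `setdefault` means a key already exported in the shell takes precedence over the file, which is the usual convention for .env loaders.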
# 3. Defining an Extraction Task
LangExtract works by you telling it what information to extract. You do this by writing a clear prompt description and supplying a few ExampleData annotations that show what a correct extraction looks like on sample text. For instance, to extract characters, emotions, and relationships from a line of literature, you might write:
```python
import langextract as lx

prompt = """
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context."""

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? ...",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            )
        ]
    )
]
```
These examples (taken from LangExtract's README) show the model exactly what kind of structured output is expected. You can create similar examples for your own domain.
# 4. Running the Extraction
Once your prompt and examples are defined, you simply call the `lx.extract()` function. The key arguments are:
- `text_or_documents`: Your input text, or a list of texts, or even a URL string (LangExtract can fetch and process text from a Gutenberg or other URL).
- `prompt_description`: The extraction instructions (a string).
- `examples`: A list of `ExampleData` that illustrate the desired output.
- `model_id`: The identifier of the LLM to use (e.g. `"gemini-2.5-flash"` for Google Gemini Flash, or an Ollama model like `"gemma2:2b"`, or an OpenAI model like `"gpt-4o"`).
- Other optional parameters: `extraction_passes` (to re-run extraction for higher recall on long texts), `max_workers` (to process chunks in parallel), `fence_output`, `use_schema_constraints`, etc.
For example:

```python
input_text = """JULIET. O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO. Shall I hear more, or shall I speak at this?
JULIET. 'Tis but thy name that is my enemy;
Thou art thyself, though not a Montague.
What's in a name? That which we call a rose
By any other name would smell as sweet."""

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"
)
```
This sends the prompt and examples, together with the text, to the chosen LLM and returns a result object. LangExtract automatically handles tokenizing long texts into chunks, batching calls in parallel, and merging the outputs.
# 5. Handling Output and Visualization
The output of lx.extract() is a Python object (often called result) that contains the extracted entities and attributes. You can inspect it programmatically or save it for later. LangExtract also provides helper functions to save results: for example, you can write the results to a JSONL (JSON Lines) file (one document per line) and generate an interactive HTML review. For example:
```python
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

html = lx.visualize("extraction_results.jsonl")
with open("viz.html", "w") as f:
    f.write(html if isinstance(html, str) else html.data)
```
This writes an extraction_results.jsonl file and an interactive viz.html file. The JSONL format is convenient for large datasets and further processing, and the HTML file highlights each extracted span in context (color-coded by class) for easy human inspection.
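JSONL is simply one JSON object per line, so downstream processing needs nothing beyond the standard library. A minimal sketch of the round trip (the field names in these records are illustrative, not LangExtract's exact schema):

```python
import json

# Illustrative records mimicking extraction output; the keys here are
# assumptions for demonstration, not LangExtract's exact schema.
records = [
    {"extraction_class": "character", "extraction_text": "JULIET"},
    {"extraction_class": "emotion", "extraction_text": "O Romeo, Romeo!"},
]

# Write one JSON object per line (JSONL) ...
with open("extractions.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# ... then stream them back line by line.
with open("extractions.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["extraction_text"])  # JULIET
```

Because each line is independent, you can process millions of documents without loading the whole file into memory.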


# 6. Supported Input Formats
LangExtract is flexible about input. You can supply:

- Plain text strings: Any text you load into Python (e.g. from a file or database) can be processed.
- URLs: As shown above, you can pass a URL (e.g. a Project Gutenberg link) as `text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt"`. LangExtract will download and extract from that document.
- List of texts: Pass a Python list of strings to process multiple documents in one call.
- Rich text or Markdown: Since LangExtract works at the text level, you could also feed in Markdown or HTML if you pre-process it to raw text. (LangExtract itself doesn't parse PDFs or images; you need to extract the text first.)
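For the last case, stripping HTML down to raw text can be done with the standard library before handing the result to LangExtract. A minimal sketch using `html.parser` (this pre-processing step is your own code, not part of LangExtract):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

html_doc = "<html><body><h1>Act II</h1><p>JULIET. O Romeo, Romeo!</p></body></html>"
parser = TextExtractor()
parser.feed(html_doc)
raw_text = " ".join(p.strip() for p in parser.parts if p.strip())
print(raw_text)  # Act II JULIET. O Romeo, Romeo!
```

The resulting `raw_text` string can then be passed directly as `text_or_documents`.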
# 7. Conclusion
LangExtract makes it easy to turn unstructured text into structured data. With high accuracy, clear source mapping, and simple customization, it works well where rule-based methods fall short. It's especially useful for complex or domain-specific extractions. While there is room for improvement, LangExtract is already a strong tool for extracting grounded information in 2025.
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
