How to Analyse Documents Using AWS Services?

Written by Shubham Ambastha

May 26, 2026

7 Min Read

Processing thousands of invoices, contracts, or medical records manually isn't just slow; it doesn't scale. AWS solves this with a pair of managed AI services: Amazon Textract for extraction and Amazon Comprehend for understanding. Together, they turn unstructured documents into structured, queryable data with no ML expertise required.

The Two Core Services To Analyse Documents

1. Amazon Textract

Textract is a fully managed OCR service that goes beyond reading text. It understands document structure, returning six extraction types:

Feature	What it extracts
Text	Raw lines and words
Forms	Key-value pairs (e.g. Name: Jane Doe)
Tables	Rows, columns, and cell contents
Queries	Answers to natural language questions (e.g. "What is the invoice total?")
Signatures	Location and confidence of detected signatures
Layout	Headers, footers, paragraphs, figures in reading order

Supported inputs: PDF, JPEG, PNG, TIFF, from S3 or as direct byte payloads.
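
A minimal sketch of the two input styles, assuming the synchronous API (the asynchronous Start* operations accept only an S3 location). The helper name build_document_param is hypothetical; it only builds the Document argument and makes no AWS call.

```python
from typing import Optional


def build_document_param(bucket: Optional[str] = None,
                         key: Optional[str] = None,
                         data: Optional[bytes] = None) -> dict:
    """Build the Document argument for a synchronous Textract call.

    Pass either raw file bytes (data) or an S3 location (bucket and key).
    """
    if data is not None:
        # Direct byte payload: the raw bytes of the file.
        return {'Bytes': data}
    if bucket and key:
        # Reference to an object that is already stored in S3.
        return {'S3Object': {'Bucket': bucket, 'Name': key}}
    raise ValueError('Provide either data or both bucket and key')
```

The returned dict can be passed as the Document argument of analyze_document or detect_document_text.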

2. Amazon Comprehend

Comprehend is an NLP service that processes extracted text and adds semantic understanding. It doesn't read documents; it reads what Textract gives it.

Key capabilities:

Entity recognition - people, orgs, dates, locations
Sentiment analysis - positive, negative, neutral, mixed
Key phrase extraction - core topics and themes
PII detection - identify and redact sensitive data
Custom classification - train models on your own document categories
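
As an illustration of the PII capability, the sketch below redacts the spans that Comprehend's detect_pii_entities reports. It assumes the response entities carry Type, Score, BeginOffset, and EndOffset fields; the function names and the 0.8 score threshold are hypothetical choices, and the Comprehend client is created by the caller.

```python
def redact_pii(text: str, entities: list, min_score: float = 0.8) -> str:
    """Replace each detected PII span with its entity type, e.g. [EMAIL]."""
    # Work from the end of the string so earlier offsets stay valid.
    ordered = sorted(entities, key=lambda e: e['BeginOffset'], reverse=True)
    for ent in ordered:
        if ent.get('Score', 0.0) < min_score:
            continue  # Skip low-confidence detections.
        start, end = ent['BeginOffset'], ent['EndOffset']
        text = text[:start] + f"[{ent['Type']}]" + text[end:]
    return text


def detect_and_redact(comprehend, text: str) -> str:
    """comprehend is a boto3 Comprehend client supplied by the caller."""
    response = comprehend.detect_pii_entities(Text=text, LanguageCode='en')
    return redact_pii(text, response['Entities'])
```

Keeping the redaction step separate from the API call makes it easy to unit-test with hand-written entity lists.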

Amazon Comprehend Medical extends this for healthcare, extracting ICD-10 codes, medications (RxNorm), SNOMED-CT concepts, and PHI from clinical text. It is HIPAA eligible.

End-to-End Architecture

A production document analysis pipeline on AWS typically looks like this:

S3 (upload) → SQS → Lambda → Textract → SNS → Lambda → Comprehend → DynamoDB

Step by step:

1. A document is uploaded to the S3 input bucket.
2. The S3 event notification sends a message to the SQS queue.
3. The queue triggers the Lambda function start-textract-job, which calls StartDocumentAnalysis.
4. Textract completes the job and publishes a notification to the SNS topic.
5. SNS triggers the Lambda function process-textract-result, which saves the extracted text to S3.
6. Based on the document type, Lambda calls Comprehend or Comprehend Medical.
7. The NLP results are processed and stored in DynamoDB or DocumentDB.

The pipeline is fully serverless: there are no instances to manage, and it scales automatically with volume.
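
The two Lambda functions differ mainly in how they unpack their trigger event. The sketch below shows one way to do that, assuming the S3 notification is delivered straight to SQS and that Textract's completion message carries JobId and Status fields; the function names are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_event(event: dict) -> list:
    """Return (bucket, key) pairs from an SQS-triggered Lambda event
    whose message bodies are S3 event notifications."""
    objects = []
    for record in event.get('Records', []):
        body = json.loads(record['body'])
        # S3 sends a one-off s3:TestEvent without 'Records' when the
        # notification is first configured; .get() skips it.
        for s3_record in body.get('Records', []):
            bucket = s3_record['s3']['bucket']['name']
            # Object keys arrive URL-encoded (a space becomes '+').
            key = unquote_plus(s3_record['s3']['object']['key'])
            objects.append((bucket, key))
    return objects


def textract_job_from_sns_event(event: dict) -> dict:
    """Return the job id and status from an SNS-triggered Lambda event
    that carries a Textract completion message."""
    message = json.loads(event['Records'][0]['Sns']['Message'])
    return {'job_id': message['JobId'], 'status': message['Status']}
```

The first helper would feed start-textract-job, and the second would tell process-textract-result which job to fetch and whether it succeeded.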

Working Code

Extract text and forms with Textract (synchronous)

Use AnalyzeDocument for single-page documents or real-time use cases:

import boto3

textract = boto3.client('textract', region_name='ap-south-1')


def analyse_document(bucket: str, key: str) -> list:
    """Run synchronous analysis on a single-page document stored in S3."""
    response = textract.analyze_document(
        Document={
            'S3Object': {'Bucket': bucket, 'Name': key}
        },
        FeatureTypes=['FORMS', 'TABLES', 'SIGNATURES']
    )
    # Blocks is a flat list; blocks reference each other by Id.
    return response['Blocks']
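
The Blocks list returned above is flat, so even plain text needs a small amount of filtering. A minimal sketch, assuming LINE blocks carry Text and Confidence fields; the helper name and the confidence threshold are hypothetical.

```python
def lines_from_blocks(blocks: list, min_confidence: float = 0.0) -> list:
    """Return the text of every LINE block at or above min_confidence."""
    return [
        block['Text']
        for block in blocks
        if block['BlockType'] == 'LINE'
        and block.get('Confidence', 100.0) >= min_confidence
    ]
```

Joining the result with newlines gives a plain-text version of the page that can be passed on to Comprehend.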

Extract from multi-page PDFs (asynchronous)

For multi-page documents, use the async API with SNS notification:

import boto3

textract = boto3.client('textract', region_name='ap-south-1')

# Placeholder ARNs (AWS account IDs have 12 digits); replace with your own.
SNS_TOPIC_ARN = 'arn:aws:sns:ap-south-1:123456789012:textract-topic'
ROLE_ARN = 'arn:aws:iam::123456789012:role/TextractRole'


def start_document_analysis(bucket: str, key: str) -> str:
    """Start an asynchronous analysis job and return its JobId."""
    response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {'Bucket': bucket, 'Name': key}
        },
        FeatureTypes=['FORMS', 'TABLES'],
        NotificationChannel={
            'SNSTopicArn': SNS_TOPIC_ARN,
            'RoleArn': ROLE_ARN
        }
    )
    return response['JobId']  # Poll or wait for SNS notification


def get_analysis_results(job_id: str) -> list:
    """Collect the blocks from every result page of a finished job."""
    response = textract.get_document_analysis(JobId=job_id)
    # Results are only complete once the job has succeeded.
    if response['JobStatus'] != 'SUCCEEDED':
        raise RuntimeError(f"Textract job {job_id} is {response['JobStatus']}")
    blocks = list(response['Blocks'])
    while 'NextToken' in response:
        response = textract.get_document_analysis(
            JobId=job_id,
            NextToken=response['NextToken']
        )
        blocks.extend(response['Blocks'])
    return blocks
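
For local experiments where no SNS subscriber exists yet, the polling option mentioned in the comment above can be sketched as follows. This assumes get_document_analysis reports a JobStatus of IN_PROGRESS until the job finishes; the function name, delay, and attempt count are hypothetical, and the Textract client is passed in by the caller.

```python
import time


def wait_for_job(textract, job_id: str, delay_seconds: float = 5.0,
                 max_attempts: int = 60, sleep=time.sleep) -> str:
    """Poll a Textract analysis job until it leaves IN_PROGRESS.

    Returns the final JobStatus, or raises TimeoutError after
    max_attempts polls.
    """
    for _ in range(max_attempts):
        # MaxResults=1 keeps each status check small.
        response = textract.get_document_analysis(JobId=job_id, MaxResults=1)
        status = response['JobStatus']
        if status != 'IN_PROGRESS':
            return status
        sleep(delay_seconds)
    raise TimeoutError(f'Job {job_id} still running after {max_attempts} polls')
```

In production the SNS notification is the better signal, since polling consumes API quota while the job runs.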

Parse key-value pairs from forms

def extract_key_value_pairs(blocks: list) -> dict:
    """Map the text of each form key to the text of its value."""
    key_map, value_map, block_map = {}, {}, {}
    for block in blocks:
        block_map[block['Id']] = block
        if block['BlockType'] == 'KEY_VALUE_SET':
            if 'KEY' in block.get('EntityTypes', []):
                key_map[block['Id']] = block
            else:
                value_map[block['Id']] = block
    kvs = {}
    for key_block in key_map.values():
        key_text = get_text(key_block, block_map)
        val_block = find_value_block(key_block, value_map)
        val_text = get_text(val_block, block_map) if val_block else ''
        kvs[key_text] = val_text
    return kvs


def get_text(block: dict, block_map: dict) -> str:
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = block_map.get(child_id, {})
                if child.get('BlockType') == 'WORD':
                    words.append(child.get('Text', ''))
                elif child.get('BlockType') == 'SELECTION_ELEMENT':
                    # Checkboxes and radio buttons have a status, not text.
                    words.append(child.get('SelectionStatus', ''))
    return ' '.join(words).strip()


def find_value_block(key_block: dict, value_map: dict):
    """Follow the VALUE relationship from a key block to its value block."""
    for rel in key_block.get('Relationships', []):
        if rel['Type'] == 'VALUE':
            for val_id in rel['Ids']:
                if val_id in value_map:
                    return value_map[val_id]
    return None
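
Tables can be rebuilt from the same block list in a similar way. The sketch below assumes each TABLE block lists its CELL blocks as CHILD relationships and that cells carry 1-based RowIndex and ColumnIndex fields; merged cells are not handled, and the function names are hypothetical.

```python
def cell_text(cell: dict, block_map: dict) -> str:
    """Join the WORD children of a CELL block."""
    words = []
    for rel in cell.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = block_map.get(child_id, {})
                if child.get('BlockType') == 'WORD':
                    words.append(child.get('Text', ''))
    return ' '.join(words)


def extract_tables(blocks: list) -> list:
    """Return every table as a list of rows, each row a list of strings."""
    block_map = {block['Id']: block for block in blocks}
    tables = []
    for block in blocks:
        if block['BlockType'] != 'TABLE':
            continue
        cells = {}
        for rel in block.get('Relationships', []):
            if rel['Type'] != 'CHILD':
                continue
            for cell_id in rel['Ids']:
                cell = block_map.get(cell_id, {})
                if cell.get('BlockType') == 'CELL':
                    position = (cell['RowIndex'], cell['ColumnIndex'])
                    cells[position] = cell_text(cell, block_map)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        # Fill a dense grid; positions without a cell become ''.
        tables.append([
            [cells.get((row, col), '') for col in range(1, n_cols + 1)]
            for row in range(1, n_rows + 1)
        ])
    return tables
```

Each table comes back as a plain list of rows, which is straightforward to write out as CSV or load into a DataFrame.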

Analyse extracted text with Comprehend

import boto3

comprehend = boto3.client('comprehend', region_name='ap-south-1')


def analyse_text(text: str) -> dict:
    # Three independent calls, issued one after another. They could be run
    # concurrently (for example with a thread pool) if latency matters.
    entities = comprehend.detect_entities(Text=text, LanguageCode='en')
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode='en')
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
    return {
        'entities': entities['Entities'],
        'sentiment': sentiment['Sentiment'],
        'key_phrases': [kp['Text'] for kp in key_phrases['KeyPhrases']]
    }
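
If these results are written to DynamoDB through boto3's Table resource, the float confidence scores that Comprehend returns need converting first, because that serializer rejects Python floats and expects Decimal. A minimal sketch of the conversion; the helper name is hypothetical, and so are the table and attribute names in the usage comment.

```python
from decimal import Decimal


def to_dynamodb_item(value):
    """Recursively convert floats to Decimal so boto3 can store the item."""
    if isinstance(value, float):
        # Going through str() avoids binary floating-point noise.
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamodb_item(val) for key, val in value.items()}
    if isinstance(value, list):
        return [to_dynamodb_item(val) for val in value]
    return value


# Usage (hypothetical table and attribute names):
#   table = boto3.resource('dynamodb').Table('document-analysis')
#   table.put_item(Item=to_dynamodb_item({'doc_id': key, **analyse_text(text)}))
```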

Comprehend Medical for clinical documents

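import boto3

# Comprehend Medical is offered in fewer Regions than Textract and Comprehend.
# At the time of writing ap-south-1 is not one of them, so this example uses
# us-east-1; check the current Region list before deploying.
comprehend_medical = boto3.client('comprehendmedical', region_name='us-east-1')


def analyse_medical_text(text: str) -> dict:
    entities = comprehend_medical.detect_entities_v2(Text=text)
    phi = comprehend_medical.detect_phi(Text=text)
    return {
        'medical_entities': entities['Entities'],
        'phi_detected': phi['Entities']  # Protected Health Information
    }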

Estimated Cost: 1,000 Documents/Month

Service	Usage	Estimated Cost
Amazon S3	1 GB storage	~$0.02
Amazon SQS	2,000 requests	$0.0008
AWS Lambda	Within free tier	$0.00
Amazon Textract	Text + Forms (1,000 pages)	$51.50
Amazon SNS	1,000 notifications	$0.0005
Amazon DynamoDB	1,000 reads/writes	$0.0015
Amazon Comprehend	1M characters	$1.00
Amazon Comprehend Medical	1M characters	$12.50
Total		~$65.00/month

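Textract dominates the cost, specifically the Forms feature ($50 per 1,000 pages, versus $1.50 for text-only). AnalyzeDocument requires at least one entry in FeatureTypes, so if you only need raw text, call DetectDocumentText (or StartDocumentTextDetection for multi-page files) instead. Otherwise, request only the features you need, since features such as Tables are billed on top.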
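
The Textract row of the table can be reproduced with a few lines of arithmetic. The sketch below uses only the per-1,000-page rates quoted in this article, which are assumptions here and should be checked against the current AWS pricing page; the function name is hypothetical.

```python
# Rates as quoted in this article (USD per page); verify before budgeting.
TEXT_RATE_PER_PAGE = 1.50 / 1000
FORMS_RATE_PER_PAGE = 50.00 / 1000


def estimate_textract_cost(pages: int, with_forms: bool) -> float:
    """Estimate the Textract charge the way the cost table above does."""
    cost = pages * TEXT_RATE_PER_PAGE
    if with_forms:
        cost += pages * FORMS_RATE_PER_PAGE
    return round(cost, 2)
```

For 1,000 pages this gives 51.5 with forms and 1.5 without, which is the gap the paragraph above describes.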

Best Practices

Document quality matters most. High-resolution scans (300 DPI+), clear fonts, and consistent formatting directly improve Textract accuracy. Pre-process images to normalize contrast and orientation before uploading.

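Use async for anything multi-page. AnalyzeDocument is synchronous and limited to a single page. For multi-page PDFs and TIFFs, always use StartDocumentAnalysis and have Textract publish completion to an SNS topic (subscribe an SQS queue to the topic if you want buffering).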

Use Queries for targeted extraction. Instead of parsing all blocks to find a specific field, pass a natural-language query like "What is the invoice number?" and Textract returns a direct answer with a confidence score.
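
A sketch of how a query round trip might look, assuming the response contains QUERY blocks linked to QUERY_RESULT blocks through an ANSWER relationship. The helper names and the invoice_number alias are hypothetical.

```python
def build_queries_config(questions: dict) -> dict:
    """Turn {alias: question} into the QueriesConfig argument."""
    return {
        'Queries': [
            {'Text': question, 'Alias': alias}
            for alias, question in questions.items()
        ]
    }


def extract_query_answers(blocks: list) -> dict:
    """Map each query's alias (or its text) to (answer, confidence).

    Queries without an answer map to None.
    """
    block_map = {block['Id']: block for block in blocks}
    answers = {}
    for block in blocks:
        if block['BlockType'] != 'QUERY':
            continue
        label = block['Query'].get('Alias') or block['Query']['Text']
        answers[label] = None
        for rel in block.get('Relationships', []):
            if rel['Type'] != 'ANSWER':
                continue
            for answer_id in rel['Ids']:
                result = block_map.get(answer_id)
                if result and result['BlockType'] == 'QUERY_RESULT':
                    answers[label] = (result.get('Text', ''),
                                      result.get('Confidence'))
    return answers


# Usage with a boto3 Textract client created by the caller:
#   response = textract.analyze_document(
#       Document={'S3Object': {'Bucket': bucket, 'Name': key}},
#       FeatureTypes=['QUERIES'],
#       QueriesConfig=build_queries_config(
#           {'invoice_number': 'What is the invoice number?'}),
#   )
#   answers = extract_query_answers(response['Blocks'])
```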

Encrypt at rest and in transit. Use AWS KMS for S3 bucket encryption and enable server-side encryption on DynamoDB. For medical documents, apply strict IAM policies limiting Comprehend Medical access to specific roles.

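Chunk long texts for Comprehend. Comprehend's synchronous calls cap the input size per request: detect_sentiment accepts at most 5 KB of UTF-8 text, while detect_entities and detect_key_phrases accept larger inputs (100 KB at the time of writing). Split the extracted text into chunks that fit the smallest limit you call, then merge the results.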
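
The chunking step can be sketched as below. It measures size in UTF-8 bytes rather than characters, since that is how the limit is expressed, and splits on whitespace so words stay intact. The function name is hypothetical, and the 5,000-byte default is an assumption to adjust to the limit of the call you use.

```python
def chunk_text(text: str, max_bytes: int = 5000) -> list:
    """Split text into whitespace-delimited chunks of at most max_bytes.

    Runs of whitespace are collapsed to single spaces. A single word
    longer than max_bytes is kept whole, so that chunk will exceed the
    limit.
    """
    chunks, current, current_size = [], [], 0
    for word in text.split():
        word_size = len(word.encode('utf-8'))
        # One extra byte for the joining space when the chunk is non-empty.
        extra = word_size + (1 if current else 0)
        if current and current_size + extra > max_bytes:
            chunks.append(' '.join(current))
            current, current_size = [], 0
            extra = word_size
        current.append(word)
        current_size += extra
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Because whitespace is normalised, the entity offsets Comprehend returns refer to positions within each chunk, not within the original text.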

Frequently Asked Questions

What's the difference between Textract and traditional OCR?

Traditional OCR extracts raw text. Textract understands document structure; it knows that Invoice No: is a key and INV-2024-001 is its value, and returns them as a pair rather than separate lines of text.

When should I use synchronous vs asynchronous Textract?

Use AnalyzeDocument (sync) for single-page documents and real-time use cases. Use StartDocumentAnalysis (async) for multi-page PDFs or batch processing; it's non-blocking, and you fetch the paginated results with GetDocumentAnalysis by following NextToken.

Is this HIPAA compliant?

Amazon Comprehend Medical is HIPAA eligible when deployed correctly. Ensure PHI is encrypted with KMS, access is restricted via IAM, and the AWS Business Associate Agreement (BAA) is signed.

Can Textract handle handwritten text?

Yes, Textract supports printed and handwritten text detection. Accuracy depends on handwriting quality; printed documents consistently yield higher confidence scores.

How do I improve extraction accuracy for custom document layouts?

Use Textract Adapters, trainable components you attach to AnalyzeDocument. You annotate sample documents, train the adapter, and pass the AdapterId with future requests to customize model output for your specific forms.

Final Thoughts

The Textract → Comprehend pipeline covers the full document intelligence loop: extract structure, then extract meaning. With Lambda, SQS, and SNS tying the stages together, the entire workflow is serverless and event-driven, and it scales with volume, within your account's service quotas, without infrastructure changes. Start with AnalyzeDocument on a few sample documents to validate the extraction quality. Once that's solid, build the async pipeline for production; the architecture above handles the rest.

Shubham Ambastha

Sr Software Developer

I'm a Senior Software Developer with 4.5+ years of experience in building optimized, cost-effective web applications. Passionate about coding and innovation, I create impactful tech solutions.
