
How to Analyse Documents Using AWS Services?

Written by Shubham Ambastha
May 26, 2026
7 Min Read

Processing thousands of invoices, contracts, or medical records manually isn't just slow; it doesn't scale. AWS solves this with a pair of managed AI services: Amazon Textract for extraction and Amazon Comprehend for understanding. Together, they turn unstructured documents into structured, queryable data with no ML expertise required.

The Two Core Services To Analyse Documents

1. Amazon Textract

Textract is a fully managed OCR service that goes beyond reading text. It understands document structure and returns six extraction types:

| Feature | What it extracts |
|---|---|
| Text | Raw lines and words |
| Forms | Key-value pairs (e.g. Name: Jane Doe) |
| Tables | Rows, columns, and cell contents |
| Queries | Answers to natural language questions (e.g. "What is the invoice total?") |
| Signatures | Location and confidence of detected signatures |
| Layout | Headers, footers, paragraphs, figures in reading order |


Supported inputs: PDF, JPEG, PNG, TIFF, from S3 or as direct byte payloads.

2. Amazon Comprehend

Comprehend is an NLP service that processes extracted text and adds semantic understanding. It doesn't read documents; it reads what Textract gives it.

Key capabilities:

  • Entity recognition - people, orgs, dates, locations
  • Sentiment analysis - positive, negative, neutral, mixed
  • Key phrase extraction - core topics and themes
  • PII detection - identify and redact sensitive data
  • Custom classification - train models on your own document categories

Amazon Comprehend Medical extends this for healthcare, extracting ICD-10 codes, medications (RxNorm), SNOMED-CT concepts, and PHI from clinical text. It is HIPAA eligible.

End-to-End Architecture

A production document analysis pipeline on AWS typically looks like this:

S3 (upload) → SQS → Lambda → Textract → SNS → Lambda → Comprehend → DynamoDB

Step by step:

  1. Document uploaded to S3 input bucket
  2. S3 event notification sends a message to the SQS queue
  3. Lambda function start-textract-job calls StartDocumentAnalysis
  4. Textract completes and notifies via SNS topic
  5. SNS triggers Lambda process-textract-result → saves extracted text to S3
  6. Based on document type, Lambda calls Comprehend or Comprehend Medical
  7. NLP results processed and stored in DynamoDB or DocumentDB


The pipeline is fully serverless: there are no instances to manage, and it scales automatically with volume.
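Steps 2 and 3 above can be sketched as a small parsing helper for the start-textract-job Lambda. This is a minimal sketch, assuming the SQS message body is the raw S3 event notification JSON (no SNS wrapper in between); the function name extract_s3_objects is hypothetical and not part of any AWS SDK.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event: dict) -> list:
    """Return (bucket, key) pairs from an SQS-triggered Lambda event.

    Assumes each SQS message body is an S3 event notification.
    """
    objects = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])
        # S3 sends an "s3:TestEvent" message without "Records" when the
        # notification is first configured; it is skipped here.
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (a space becomes "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects
```

Inside the Lambda handler, each returned pair would be passed on to the function that calls StartDocumentAnalysis.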

Working Code

Extract text and forms with Textract (synchronous)

Use AnalyzeDocument for single-page documents or real-time use cases:

import boto3

textract = boto3.client('textract', region_name='ap-south-1')


def analyse_document(bucket: str, key: str) -> list:
    # Synchronous call: returns the list of blocks for a single-page document in S3
    response = textract.analyze_document(
        Document={
            'S3Object': {'Bucket': bucket, 'Name': key}
        },
        FeatureTypes=['FORMS', 'TABLES', 'SIGNATURES']
    )
    return response['Blocks']

Extract from multi-page PDFs (asynchronous)

For multi-page documents, use the async API with an SNS notification:

import boto3

textract = boto3.client('textract', region_name='ap-south-1')

# Placeholder ARNs: replace the 12-digit account ID and the names with your own
SNS_TOPIC_ARN = 'arn:aws:sns:ap-south-1:123456789012:textract-topic'
ROLE_ARN = 'arn:aws:iam::123456789012:role/TextractRole'


def start_document_analysis(bucket: str, key: str) -> str:
    response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {'Bucket': bucket, 'Name': key}
        },
        FeatureTypes=['FORMS', 'TABLES'],
        NotificationChannel={
            'SNSTopicArn': SNS_TOPIC_ARN,
            'RoleArn': ROLE_ARN
        }
    )
    return response['JobId']  # Poll or wait for the SNS notification


def get_analysis_results(job_id: str) -> list:
    blocks = []
    response = textract.get_document_analysis(JobId=job_id)
    # Results are only complete once the job has succeeded
    if response['JobStatus'] != 'SUCCEEDED':
        raise RuntimeError(f"Textract job {job_id} is {response['JobStatus']}")
    blocks.extend(response['Blocks'])
    # Follow NextToken until every page of results has been fetched
    while 'NextToken' in response:
        response = textract.get_document_analysis(
            JobId=job_id,
            NextToken=response['NextToken']
        )
        blocks.extend(response['Blocks'])
    return blocks
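When the job finishes, Textract publishes a JSON message to the SNS topic, and the second Lambda in the architecture reads the JobId from it. The sketch below shows one way to parse that event; it assumes the Lambda is subscribed directly to the topic and reads only the JobId and Status fields of the message. The helper names are hypothetical.

```python
import json


def parse_textract_notification(sns_event: dict) -> list:
    """Return (job_id, status) pairs from an SNS-triggered Lambda event."""
    results = []
    for record in sns_event.get("Records", []):
        # The Textract completion message is a JSON string inside the SNS envelope.
        message = json.loads(record["Sns"]["Message"])
        results.append((message["JobId"], message["Status"]))
    return results


def succeeded_job_ids(sns_event: dict) -> list:
    """Keep only the jobs whose results are ready to fetch."""
    return [
        job_id
        for job_id, status in parse_textract_notification(sns_event)
        if status == "SUCCEEDED"
    ]
```

Each job ID returned by succeeded_job_ids can then be passed to a result-fetching function such as get_analysis_results.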

Parse key-value pairs from forms

def extract_key_value_pairs(blocks: list) -> dict:
    """Map the text of each form key to the text of its value."""
    key_map, value_map, block_map = {}, {}, {}
    # Index every block by Id and split KEY_VALUE_SET blocks into keys and values
    for block in blocks:
        block_map[block['Id']] = block
        if block['BlockType'] == 'KEY_VALUE_SET':
            if 'KEY' in block.get('EntityTypes', []):
                key_map[block['Id']] = block
            else:
                value_map[block['Id']] = block
    kvs = {}
    # Resolve each key to its linked value block and read the text of both
    for key_id, key_block in key_map.items():
        key_text = get_text(key_block, block_map)
        val_block = find_value_block(key_block, value_map)
        val_text = get_text(val_block, block_map) if val_block else ''
        kvs[key_text] = val_text
    return kvs


def get_text(block: dict, block_map: dict) -> str:
    """Join the WORD children of a block into a single string."""
    text = ''
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = block_map.get(child_id, {})
                if child.get('BlockType') == 'WORD':
                    text += child.get('Text', '') + ' '
    return text.strip()


def find_value_block(key_block: dict, value_map: dict):
    """Follow the VALUE relationship from a key block to its value block."""
    for rel in key_block.get('Relationships', []):
        if rel['Type'] == 'VALUE':
            for val_id in rel['Ids']:
                if val_id in value_map:
                    return value_map[val_id]
    return None
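Tables follow the same block-and-relationship pattern as forms: a TABLE block lists its CELL blocks as children, and each CELL lists its WORD blocks. The sketch below rebuilds each table as a list of rows. It is a simplified illustration that assumes plain cells with RowIndex and ColumnIndex and ignores merged cells; the function names are hypothetical.

```python
def table_to_rows(table_block: dict, block_map: dict) -> list:
    """Rebuild one TABLE block as a list of rows of cell text."""
    cells = {}
    for rel in table_block.get('Relationships', []):
        if rel['Type'] != 'CHILD':
            continue
        for cell_id in rel['Ids']:
            cell = block_map.get(cell_id, {})
            if cell.get('BlockType') != 'CELL':
                continue
            # Collect the WORD children of this cell in the order listed.
            words = []
            for cell_rel in cell.get('Relationships', []):
                if cell_rel['Type'] == 'CHILD':
                    for word_id in cell_rel['Ids']:
                        word = block_map.get(word_id, {})
                        if word.get('BlockType') == 'WORD':
                            words.append(word.get('Text', ''))
            cells[(cell['RowIndex'], cell['ColumnIndex'])] = ' '.join(words)
    if not cells:
        return []
    # RowIndex and ColumnIndex are 1-based.
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    return [
        [cells.get((row, col), '') for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


def extract_tables(blocks: list) -> list:
    """Return every table in the block list as rows of strings."""
    block_map = {block['Id']: block for block in blocks}
    return [
        table_to_rows(block, block_map)
        for block in blocks
        if block.get('BlockType') == 'TABLE'
    ]
```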

Analyse extracted text with Comprehend

import boto3

comprehend = boto3.client('comprehend', region_name='ap-south-1')


def analyse_text(text: str) -> dict:
    # Three sequential calls; each one sends the full text to Comprehend
    entities = comprehend.detect_entities(Text=text, LanguageCode='en')
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode='en')
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
    return {
        'entities': entities['Entities'],
        'sentiment': sentiment['Sentiment'],
        'key_phrases': [kp['Text'] for kp in key_phrases['KeyPhrases']]
    }
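The PII detection capability listed earlier returns character offsets rather than redacted text, so the redaction itself is up to you. The sketch below shows one way to apply those offsets. It assumes the entity list comes from comprehend.detect_pii_entities(Text=text, LanguageCode='en')['Entities']; the redact_pii helper and the 0.8 score threshold are illustrative choices, not AWS defaults.

```python
def redact_pii(text: str, pii_entities: list, min_score: float = 0.8) -> str:
    """Replace each detected PII span with a [TYPE] placeholder."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(pii_entities, key=lambda e: e['BeginOffset'], reverse=True):
        if entity.get('Score', 0.0) < min_score:
            continue  # Leave low-confidence detections untouched.
        redacted = (
            redacted[:entity['BeginOffset']]
            + f"[{entity['Type']}]"
            + redacted[entity['EndOffset']:]
        )
    return redacted
```

Storing only the redacted text downstream keeps the raw values out of DynamoDB while preserving the surrounding context.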

Comprehend Medical for clinical documents

import boto3

# Comprehend Medical is offered in fewer regions than Comprehend; us-east-1 is
# used here as an example, so check availability before choosing a region
comprehend_medical = boto3.client('comprehendmedical', region_name='us-east-1')


def analyse_medical_text(text: str) -> dict:
    entities = comprehend_medical.detect_entities_v2(Text=text)
    phi = comprehend_medical.detect_phi(Text=text)
    return {
        'medical_entities': entities['Entities'],
        'phi_detected': phi['Entities']  # Protected Health Information
    }

Estimated Cost: 1,000 Documents/Month

| Service | Usage | Estimated Cost |
|---|---|---|
| Amazon S3 | 1 GB storage | ~$0.02 |
| Amazon SQS | 2,000 requests | $0.0008 |
| AWS Lambda | Within free tier | $0.00 |
| Amazon Textract | Text + Forms (1,000 pages) | $51.50 |
| Amazon SNS | 1,000 notifications | $0.0005 |
| Amazon DynamoDB | 1,000 reads/writes | $0.0015 |
| Amazon Comprehend | 1M characters | $1.00 |
| Amazon Comprehend Medical | 1M characters | $12.50 |
| Total | | ~$65.00/month |


Textract dominates the cost, specifically the structured-extraction features such as Forms ($50 per 1,000 pages versus $1.50 for text-only). If you only need raw text, call DetectDocumentText rather than AnalyzeDocument with FORMS and TABLES in FeatureTypes; this reduces the Textract cost significantly.
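To see how the estimate moves with volume and feature choice, the arithmetic can be wrapped in a small function. This sketch uses only the per-unit rates quoted in this article (they are not fetched from AWS and may differ by region or over time), and it ignores the sub-cent S3, SQS, SNS, and DynamoDB lines. The function name is hypothetical.

```python
# Rates taken from the cost table and note above, in USD.
RATES = {
    'textract_text_per_1k_pages': 1.50,
    'textract_forms_per_1k_pages': 50.00,
    'comprehend_per_1m_chars': 1.00,
    'comprehend_medical_per_1m_chars': 12.50,
}


def estimate_monthly_cost(pages: int, chars: int,
                          use_forms: bool = True,
                          use_medical: bool = False) -> float:
    """Estimate the monthly Textract and Comprehend spend in USD."""
    per_1k_pages = RATES['textract_text_per_1k_pages']
    if use_forms:
        per_1k_pages += RATES['textract_forms_per_1k_pages']
    cost = pages / 1000 * per_1k_pages
    cost += chars / 1_000_000 * RATES['comprehend_per_1m_chars']
    if use_medical:
        cost += chars / 1_000_000 * RATES['comprehend_medical_per_1m_chars']
    return round(cost, 2)
```

With 1,000 pages and 1M characters, enabling Forms and Comprehend Medical gives the article's figure of roughly $65; dropping both leaves only the text-detection and Comprehend charges.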

Best Practices

Document quality matters most. High-resolution scans (300 DPI+), clear fonts, and consistent formatting directly improve Textract accuracy. Pre-process images to normalize contrast and orientation before uploading.

Use async for anything multi-page. AnalyzeDocument is synchronous and limited to a single page. For multi-page PDFs, always use StartDocumentAnalysis with an SNS notification channel.

Use Queries for targeted extraction. Instead of parsing all blocks to find a specific field, pass a natural-language query like "What is the invoice number?" and Textract returns a direct answer with a confidence score.
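A query-based request and its response parsing might look like the sketch below. It assumes the request is sent with FeatureTypes=['QUERIES'] and that each QUERY block links to its QUERY_RESULT blocks through an ANSWER relationship; both helper names are hypothetical.

```python
def build_queries_config(questions: dict) -> dict:
    """Build the QueriesConfig argument from an {alias: question} mapping."""
    return {
        'Queries': [
            {'Text': question, 'Alias': alias}
            for alias, question in questions.items()
        ]
    }


def extract_query_answers(blocks: list) -> dict:
    """Map each query alias (or question text) to (answer, confidence) or None."""
    block_map = {block['Id']: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get('BlockType') != 'QUERY':
            continue
        query = block['Query']
        name = query.get('Alias') or query['Text']
        answers[name] = None  # Stays None when Textract finds no answer.
        for rel in block.get('Relationships', []):
            if rel['Type'] != 'ANSWER':
                continue
            for result_id in rel['Ids']:
                result = block_map.get(result_id, {})
                # Keep the first answer listed for this query.
                if result.get('BlockType') == 'QUERY_RESULT' and answers[name] is None:
                    answers[name] = (result.get('Text', ''), result.get('Confidence'))
    return answers
```

The config would be passed as textract.analyze_document(Document=..., FeatureTypes=['QUERIES'], QueriesConfig=build_queries_config({...})), and the returned Blocks handed to extract_query_answers.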

Encrypt at rest and in transit. Use AWS KMS for S3 bucket encryption and enable server-side encryption on DynamoDB. For medical documents, apply strict IAM policies limiting Comprehend Medical access to specific roles.


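Chunk long texts for Comprehend. Comprehend's synchronous operations cap the size of the Text parameter, measured in UTF-8 bytes rather than characters. detect_sentiment, for example, accepts at most 5 KB per call, and the limits differ by operation, so check the current quota for each API you use. Split the extracted text into chunks before sending, then merge the results.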
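The chunking step can be sketched as follows. This is a minimal version that splits on whitespace and measures UTF-8 bytes; the default budget of 5,000 bytes is an assumption to adjust to the limit of the operation you call, and the chunk_text name is hypothetical.

```python
def chunk_text(text: str, max_bytes: int = 5000) -> list:
    """Split text on whitespace into chunks of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is emitted as its own oversized
    chunk; handle that case separately if it can occur in your data.
    """
    chunks, current, current_size = [], [], 0
    for word in text.split():
        word_size = len(word.encode('utf-8'))
        # One extra byte for the space that joins this word to the chunk.
        extra = word_size + (1 if current else 0)
        if current and current_size + extra > max_bytes:
            chunks.append(' '.join(current))
            current, current_size = [word], word_size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Because the chunks are rejoined with single spaces, entity offsets returned by Comprehend refer to positions within each chunk, not within the original text; keep a running offset per chunk if you need to map them back.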

Frequently Asked Questions

What's the difference between Textract and traditional OCR? 

Traditional OCR extracts raw text. Textract understands document structure; it knows that Invoice No: is a key and INV-2024-001 is its value, and returns them as a pair rather than separate lines of text.

When should I use synchronous vs asynchronous Textract? 

Use AnalyzeDocument (sync) for single-page documents and real-time use cases. Use StartDocumentAnalysis (async) for multi-page PDFs or batch processing; it's non-blocking, and once the job completes you fetch the results page by page with GetDocumentAnalysis and NextToken.

Is this HIPAA compliant? 

Amazon Comprehend Medical is HIPAA eligible when deployed correctly. Ensure PHI is encrypted with KMS, access is restricted via IAM, and the AWS Business Associate Agreement (BAA) is signed.

Can Textract handle handwritten text? 

Yes, Textract supports printed and handwritten text detection. Accuracy depends on handwriting quality; printed documents consistently yield higher confidence scores.
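One way to act on those confidence scores is to route low-confidence words to manual review. The sketch below assumes WORD blocks carry a Confidence value between 0 and 100 and a TextType of PRINTED or HANDWRITING; the threshold of 90 and the function name are illustrative choices.

```python
def words_needing_review(blocks: list, threshold: float = 90.0) -> list:
    """Return WORD blocks whose confidence falls below the threshold."""
    flagged = []
    for block in blocks:
        if block.get('BlockType') != 'WORD':
            continue
        confidence = block.get('Confidence', 0.0)
        if confidence < threshold:
            flagged.append({
                'text': block.get('Text', ''),
                'confidence': confidence,
                'handwritten': block.get('TextType') == 'HANDWRITING',
            })
    return flagged
```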

How do I improve extraction accuracy for custom document layouts? 

Use Textract Adapters: trainable components you attach to AnalyzeDocument. You annotate sample documents, train the adapter, and pass its AdapterId and version with future requests to customize model output for your specific forms.
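A request that attaches an adapter might be assembled as in the sketch below. It assumes the adapter was trained for the QUERIES feature and that the adapter ID and version shown in the usage are placeholders for values from your own account; build_adapter_request is a hypothetical helper.

```python
def build_adapter_request(bucket: str, key: str, adapter_id: str,
                          adapter_version: str, questions: list) -> dict:
    """Build keyword arguments for analyze_document with an adapter attached."""
    return {
        'Document': {'S3Object': {'Bucket': bucket, 'Name': key}},
        'FeatureTypes': ['QUERIES'],
        'QueriesConfig': {'Queries': [{'Text': question} for question in questions]},
        'AdaptersConfig': {
            'Adapters': [{'AdapterId': adapter_id, 'Version': adapter_version}]
        },
    }
```

The result would be unpacked into the call, for example textract.analyze_document(**build_adapter_request('my-bucket', 'form.png', 'my-adapter-id', '1', ['What is the policy number?'])).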

Final Thoughts

The Textract → Comprehend pipeline covers the full document intelligence loop: extract structure, then extract meaning. With Lambda and SQS tying the stages together, the entire workflow is serverless, event-driven, and scales to any volume without infrastructure changes. Start with AnalyzeDocument on a few sample documents to validate the extraction quality. Once that's solid, build the async pipeline for production; the architecture above handles the rest.

Shubham Ambastha

I'm a Senior Software Developer with 4.5+ years of experience in building optimized, cost-effective web applications. Passionate about coding and innovation, I create impactful tech solutions.
