VAERS Processing Script

v1.25


VAERS Complete - Enhanced Data Processing Script

Overview

vaers_complete.py is a Python script for processing VAERS (Vaccine Adverse Event Reporting System) data. It supports multi-core parallel processing, memory-efficient chunked data handling, and change tracking across CDC data releases.

Original Author: Gary Hawkins - http://univaers.com/download/
Enhanced Version: 2025 by Jason Page

Features

  • ✓ Multi-core parallel processing for faster execution
  • ✓ Memory-efficient chunked data handling for large datasets
  • ✓ Command-line dataset selection (COVID-19 era or full historical data)
  • ✓ Progress bars for all major operations
  • ✓ Comprehensive error tracking and reporting
  • ✓ Fixed statistics functionality
  • ✓ Change detection and tracking across data releases
  • ✓ Deduplication and data consolidation
  • ✓ Complete audit trail of modifications to VAERS reports

Requirements

Python Dependencies

pip install pandas numpy tqdm zipfile-deflate64
  • pandas: Data manipulation and analysis
  • numpy: Numerical operations
  • tqdm: Progress bars (optional but recommended)
  • zipfile-deflate64: Enhanced ZIP file handling (optional, falls back to standard zipfile)
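
Since tqdm and zipfile-deflate64 are described as optional, the script presumably guards their imports. The sketch below shows one common way to do that; it is not taken from vaers_complete.py. It assumes the zipfile-deflate64 package is imported as `zipfile_deflate64`, and the `progress` helper is a hypothetical name used only for illustration.

```python
# Optional dependencies: fall back gracefully when they are not installed.
try:
    import zipfile_deflate64 as zipfile  # drop-in module with Deflate64 support
except ImportError:
    import zipfile  # standard library fallback

try:
    from tqdm import tqdm
except ImportError:
    tqdm = None  # progress bars are simply skipped


def progress(iterable, enabled=True, **kwargs):
    """Wrap an iterable in a tqdm bar when possible (hypothetical helper)."""
    if enabled and tqdm is not None:
        return tqdm(iterable, **kwargs)
    return iterable


# Works the same whether or not tqdm is installed.
for release in progress(["2023-01-06", "2023-01-13"], enabled=False):
    pass
```

With this pattern, a flag such as --no-progress only needs to pass `enabled=False`.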

System Requirements

  • Python 3.x
  • Multi-core CPU recommended for parallel processing
  • Sufficient RAM for large dataset processing (16GB+ recommended for full dataset)

Command-Line Options

Basic Syntax

python vaers_complete.py [OPTIONS]

Options Reference

--dataset {covid,full}

Default: covid

Selects which dataset to process:

  • covid: Process COVID-19 era data only (from 2020-12-13 onwards by default)
  • full: Process full historical VAERS dataset (from 1990-01-01 onwards by default)

Examples:

python vaers_complete.py --dataset covid
python vaers_complete.py --dataset full

--cores NUMBER

Default: Number of CPU cores available on system

Specifies the number of CPU cores to use for parallel processing.

Examples:

python vaers_complete.py --cores 8
python vaers_complete.py --cores 16
python vaers_complete.py --dataset full --cores 4

--chunk-size NUMBER

Default: 50000

Sets the chunk size for processing large datasets. Larger chunks use more memory but may be faster. Smaller chunks are more memory-efficient.

Examples:

python vaers_complete.py --chunk-size 100000
python vaers_complete.py --chunk-size 25000
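
To illustrate the memory trade-off, here is a minimal sketch of chunked iteration using only the standard library. The function name `iter_chunks` is hypothetical; the script itself may chunk its data differently (for example through pandas).

```python
from itertools import islice


def iter_chunks(rows, chunk_size=50000):
    """Yield lists of at most chunk_size rows from any iterable.

    Only one chunk is held in memory at a time, so a smaller
    chunk_size lowers peak memory at the cost of more iterations.
    """
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Seven rows in chunks of three: two full chunks and one partial chunk.
sizes = [len(chunk) for chunk in iter_chunks(range(7), chunk_size=3)]
```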

--date-floor DATE

Default: 2020-12-13 for COVID dataset, 1990-01-01 for full dataset

Sets the earliest date to process (format: YYYY-MM-DD). Records before this date will be excluded.

Examples:

python vaers_complete.py --date-floor 2021-01-01
python vaers_complete.py --dataset full --date-floor 2000-01-01

--date-ceiling DATE

Default: 2025-01-01

Sets the latest date to process (format: YYYY-MM-DD). Records after this date will be excluded.

Examples:

python vaers_complete.py --date-ceiling 2024-12-31
python vaers_complete.py --date-floor 2020-01-01 --date-ceiling 2023-12-31

--test

Default: Not set

Uses the test cases directory (z_test_cases) instead of the main working directory. Useful for development and testing.

Example:

python vaers_complete.py --test

--no-progress

Default: Not set

Disables progress bars. Useful for logging output to files or when running in environments without terminal support.

Example:

python vaers_complete.py --no-progress > output.log

--merge-only

Default: Not set

Skips all processing and only creates the final merged file from existing processed data. Useful when you want to regenerate the final output without reprocessing everything.

Example:

python vaers_complete.py --merge-only
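
The options above map naturally onto Python's argparse module. The following is a sketch of how such a parser could be declared, using the flag names and defaults documented in this section; the function names and the `DEFAULT_FLOORS` table are assumptions, and the actual script may define its parser differently.

```python
import argparse
import os

# Dataset-dependent default for --date-floor, as documented above.
DEFAULT_FLOORS = {"covid": "2020-12-13", "full": "1990-01-01"}


def build_parser():
    parser = argparse.ArgumentParser(description="Process VAERS data releases")
    parser.add_argument("--dataset", choices=["covid", "full"], default="covid")
    parser.add_argument("--cores", type=int, default=os.cpu_count())
    parser.add_argument("--chunk-size", type=int, default=50000)
    parser.add_argument("--date-floor", default=None, help="YYYY-MM-DD")
    parser.add_argument("--date-ceiling", default="2025-01-01", help="YYYY-MM-DD")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--no-progress", action="store_true")
    parser.add_argument("--merge-only", action="store_true")
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    # Resolve the floor only after the dataset choice is known.
    if args.date_floor is None:
        args.date_floor = DEFAULT_FLOORS[args.dataset]
    return args


args = parse_args(["--dataset", "full", "--cores", "4"])
```

Leaving `--date-floor` as `None` until after parsing is what lets its default depend on `--dataset`.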

Usage Examples

Process COVID-19 data with 8 cores

python vaers_complete.py --dataset covid --cores 8

Process full historical dataset with 16 cores and larger chunks

python vaers_complete.py --dataset full --cores 16 --chunk-size 100000

Process COVID data from a specific start date

python vaers_complete.py --dataset covid --date-floor 2021-01-01

Process data for a specific date range

python vaers_complete.py --date-floor 2021-01-01 --date-ceiling 2023-12-31 --cores 8

Process with smaller chunks for memory-constrained systems

python vaers_complete.py --dataset covid --chunk-size 25000 --cores 4

Create final merged file only

python vaers_complete.py --merge-only

Run with test data

python vaers_complete.py --test --cores 4

Process without progress bars (for logging)

python vaers_complete.py --dataset covid --no-progress > processing.log 2>&1

Directory Structure

The script expects and creates the following directory structure:

.
├── 0_VAERS_Downloads/          # Input: Raw VAERS ZIP files from CDC
├── 1_vaers_working/            # Intermediate: Extracted CSV files
├── 1_vaers_consolidated/       # Intermediate: Consolidated data files
├── 2_vaers_full_compared/      # Output: Comparison results with change tracking
├── 3_vaers_flattened/          # Intermediate: Flattened data (one row per VAERS_ID)
├── stats.csv                   # Output: Processing statistics
├── never_published_any.txt     # Output: VAERS IDs never published
├── ever_published_any.txt      # Output: All VAERS IDs ever published
├── ever_published_covid.txt    # Output: COVID-related VAERS IDs
├── writeups_deduped.txt        # Output: Deduplicated symptom descriptions
└── VAERS_FINAL_MERGED.csv      # Final output: Complete merged dataset

Test Mode Directory Structure

When using the --test flag:

z_test_cases/
├── drops/                      # Input: Test VAERS data
├── 1_vaers_working/
├── 1_vaers_consolidated/
├── 2_vaers_full_compared/
├── 3_vaers_flattened/
└── [output files]

Processing Workflow

The script performs the following main steps:

1. Consolidation

Combines the three VAERS data files for each data release:

  • *VAERSDATA.csv - Main report data
  • *VAERSVAX.csv - Vaccination details
  • *VAERSSYMPTOMS.csv - Symptom entries

Output: Consolidated files in 1_vaers_consolidated/

2. Flattening

Aggregates multiple vaccine entries per report into single rows:

  • Groups vaccine records by VAERS_ID
  • Merges all related data into one row per report
  • Joins symptom entries

Output: Flattened files in 3_vaers_flattened/
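
The flattening step can be pictured with a small standard-library sketch. It treats each CSV row as a dict and joins repeated values with a "|" separator; the separator, the function name `flatten`, and the in-memory approach are all assumptions made for illustration, not the script's actual implementation. The SYMPTOM1 to SYMPTOM5 columns are those of the CDC symptoms file.

```python
from collections import defaultdict


def flatten(data_rows, vax_rows, symptom_rows, sep="|"):
    """Return one dict per VAERS_ID, joining repeated vaccine and symptom values."""
    vax_by_id = defaultdict(list)
    for row in vax_rows:
        vax_by_id[row["VAERS_ID"]].append(row)

    symptoms_by_id = defaultdict(list)
    for row in symptom_rows:
        # Each symptoms row holds up to five coded terms; keep the non-empty ones.
        for i in range(1, 6):
            value = row.get(f"SYMPTOM{i}")
            if value:
                symptoms_by_id[row["VAERS_ID"]].append(value)

    flattened = []
    for report in data_rows:
        vid = report["VAERS_ID"]
        merged = dict(report)
        for column in ("VAX_TYPE", "VAX_MANU", "VAX_LOT"):
            merged[column] = sep.join(v.get(column, "") for v in vax_by_id[vid])
        merged["symptom_entries"] = sep.join(symptoms_by_id[vid])
        flattened.append(merged)
    return flattened


data = [{"VAERS_ID": "1", "SEX": "F"}]
vax = [
    {"VAERS_ID": "1", "VAX_TYPE": "COVID19", "VAX_MANU": "A", "VAX_LOT": "x1"},
    {"VAERS_ID": "1", "VAX_TYPE": "FLU4", "VAX_MANU": "B", "VAX_LOT": "y2"},
]
symptoms = [{"VAERS_ID": "1", "SYMPTOM1": "Headache", "SYMPTOM2": "Pyrexia"}]
rows = flatten(data, vax, symptoms)
```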

3. Comparison

Compares current data release with previous releases to detect changes:

  • Identifies new reports
  • Detects modifications to existing reports
  • Tracks deletions
  • Records all changes in the changes column
  • Counts cell edits

Output: Comparison files in 2_vaers_full_compared/
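
As a rough illustration of the comparison step, the sketch below diffs two releases held as `{VAERS_ID: {column: value}}` dictionaries. That data structure, the function name `compare_releases`, and the "unchanged" status are assumptions; only the statuses new, modified, and deleted and the log format come from this document (see the example entry under Change Tracking).

```python
def compare_releases(previous, current, release_date):
    """Diff two {VAERS_ID: {column: value}} snapshots (hypothetical structure)."""
    results = {}
    for vid, row in current.items():
        old = previous.get(vid)
        if old is None:
            results[vid] = {"status": "new", "cell_edits": 0, "changes": []}
            continue
        changes = []
        for column, value in row.items():
            old_value = old.get(column, "")
            if old_value != value:
                changes.append(
                    f'{release_date}: {column} changed from "{old_value}" to "{value}"'
                )
        results[vid] = {
            # "unchanged" is an assumption; the document lists only three statuses.
            "status": "modified" if changes else "unchanged",
            "cell_edits": len(changes),  # one edit per changed cell
            "changes": changes,
        }
    # Reports present before but missing now count as deletions.
    for vid in previous.keys() - current.keys():
        results[vid] = {
            "status": "deleted",
            "cell_edits": 0,
            "changes": [f"{release_date}: report deleted"],
        }
    return results


prev = {"1": {"AGE_YRS": "45"}, "2": {"AGE_YRS": "30"}}
curr = {"1": {"AGE_YRS": "46"}, "3": {"AGE_YRS": "20"}}
diff = compare_releases(prev, curr, "2023-01-15")
```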

4. Final Merge

Creates the final consolidated output file containing:

  • All reports with complete change history
  • Cell edit counts
  • Status indicators (new, modified, deleted)
  • Complete audit trail

Output: VAERS_FINAL_MERGED.csv

Output Files

Primary Output

VAERS_FINAL_MERGED.csv

  • Complete dataset with all VAERS reports
  • Includes all historical changes tracked across data releases
  • Contains columns: cell_edits, status, changes
  • One row per VAERS_ID with complete information

Statistics and Tracking Files

stats.csv

  • Processing statistics for each data release
  • Counts of new reports, modifications, deletions
  • Date ranges and record counts

never_published_any.txt

  • VAERS IDs that were never published in any release
  • Identifies gaps in the VAERS ID sequence
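
One plausible way to find such gaps is to compare the published IDs against the full integer range between the lowest and highest ID seen. This is a sketch under that assumption; the function name is hypothetical and the script may define "never published" differently.

```python
def never_published(published_ids):
    """Return the IDs missing between the smallest and largest published ID."""
    ids = set(published_ids)
    if not ids:
        return []
    return [i for i in range(min(ids), max(ids) + 1) if i not in ids]


gaps = never_published([896636, 896637, 896640])
```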

ever_published_any.txt

  • Complete list of all VAERS IDs ever published
  • Includes all vaccine types

ever_published_covid.txt

  • List of COVID-19 vaccine-related VAERS IDs
  • Filtered by VAX_TYPE containing 'covid'

writeups_deduped.txt

  • Deduplicated symptom text descriptions
  • Useful for analysis of unique symptom patterns

Key Columns in Output

The final merged file contains all standard VAERS columns plus the tracking columns added by the script:

Standard VAERS Columns

  • VAERS_ID - Unique report identifier
  • AGE_YRS, SEX, STATE - Demographic information
  • DIED, L_THREAT, ER_VISIT, HOSPITAL, DISABLE - Serious outcomes
  • VAX_TYPE, VAX_MANU, VAX_LOT - Vaccine information
  • VAX_DATE, ONSET_DATE, RPT_DATE - Date information
  • SYMPTOM_TEXT - Symptom description
  • And many more...

Enhanced Tracking Columns

  • cell_edits - Count of cells modified across all releases
  • status - Report status (new, modified, deleted)
  • changes - Detailed log of all changes made to the report
  • symptom_entries - Aggregated symptom entries

Performance Tuning

For Fast Processing (High RAM)

python vaers_complete.py --dataset covid --cores 16 --chunk-size 100000

For Memory-Constrained Systems

python vaers_complete.py --dataset covid --cores 4 --chunk-size 25000

For Very Large Full Dataset

python vaers_complete.py --dataset full --cores 16 --chunk-size 50000

Error Handling

The script includes comprehensive error handling:

  • All errors are collected and displayed at the end of processing
  • Errors include timestamps for tracking
  • Processing continues when possible, skipping problematic files
  • Final error summary shows total errors encountered
  • Exit code 0 = success, 1 = errors occurred
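
The behavior described above follows a collect-and-continue pattern, sketched below. The `ErrorLog` class and `process_all` function are hypothetical names used for illustration; the script's own error tracking may be organized differently.

```python
import sys
from datetime import datetime


class ErrorLog:
    """Collects timestamped error messages for a summary at the end of a run."""

    def __init__(self):
        self.errors = []

    def record(self, message):
        stamp = datetime.now().isoformat(timespec="seconds")
        self.errors.append(f"[{stamp}] {message}")

    def summarize(self, stream=sys.stderr):
        for line in self.errors:
            print(line, file=stream)
        print(f"Total errors: {len(self.errors)}", file=stream)

    def exit_code(self):
        # 0 = success, 1 = errors occurred
        return 1 if self.errors else 0


def process_all(paths, handler, log):
    """Run handler on each path, logging and skipping any that fail."""
    for path in paths:
        try:
            handler(path)
        except Exception as exc:  # one bad file should not stop the run
            log.record(f"{path}: {exc}")


def fake_handler(path):
    # Stand-in for real processing; fails on one file to show the pattern.
    if path == "bad.zip":
        raise ValueError("corrupt archive")


log = ErrorLog()
process_all(["a.zip", "bad.zip", "c.zip"], fake_handler, log)
```

A script built this way would typically end with `log.summarize()` followed by `sys.exit(log.exit_code())`.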

Data Filtering

COVID Dataset Mode

By default, filters to COVID-19 era data:

  • Automatically detects the earliest COVID VAERS_ID
  • Removes all reports prior to first COVID vaccine report
  • Typically starts from VAERS_ID ~896636 (first trial report)
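
The filtering rule above can be sketched as follows. The sketch assumes that "prior" means a lower VAERS_ID and that a COVID report is one whose VAX_TYPE contains "covid" (case-insensitive); the function name and the choice to return nothing when no COVID report exists are assumptions.

```python
def filter_covid_era(rows):
    """Keep reports at or after the earliest COVID VAERS_ID (rows are dicts)."""
    covid_ids = [
        int(row["VAERS_ID"])
        for row in rows
        if "covid" in row.get("VAX_TYPE", "").lower()
    ]
    if not covid_ids:
        return []  # assumption: no COVID reports means nothing to keep
    first_covid_id = min(covid_ids)
    return [row for row in rows if int(row["VAERS_ID"]) >= first_covid_id]


reports = [
    {"VAERS_ID": "896000", "VAX_TYPE": "FLU4"},
    {"VAERS_ID": "896636", "VAX_TYPE": "COVID19"},
    {"VAERS_ID": "896700", "VAX_TYPE": "FLU4"},
]
kept = filter_covid_era(reports)
```

Note that non-COVID reports filed after the first COVID report remain in the result, which matches the "COVID-19 era" wording.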

Full Dataset Mode

Processes complete historical VAERS data:

  • Includes all vaccine types from 1990 onwards (or specified date-floor)
  • Significantly larger processing time and storage requirements

Change Tracking

The script tracks modifications to VAERS reports across CDC data releases:

  • New reports: First appearance in a data release
  • Modifications: Changes to any field in existing reports
  • Deletions: Reports removed from later releases
  • Cell edits: Count of individual cell changes
  • Change log: Detailed description of what changed

Example change tracking entry:

2023-01-15: AGE_YRS changed from "45" to "46"
2023-01-15: SYMPTOM_TEXT appended with "Patient recovered"

Troubleshooting

Out of Memory Errors

  • Reduce --chunk-size to 25000 or lower
  • Reduce --cores to use fewer parallel processes
  • Process smaller date ranges using --date-floor and --date-ceiling

Progress Bars Not Showing

  • Install tqdm: pip install tqdm
  • Or disable with --no-progress if not needed

ZIP File Errors

  • Install zipfile-deflate64: pip install zipfile-deflate64
  • Script falls back to standard zipfile if not available

Missing Input Files

  • Ensure VAERS data files are in 0_VAERS_Downloads/ directory
  • Check that files are in correct ZIP format from CDC

License and Attribution

Original script by Gary Hawkins (http://univaers.com/download/). Enhanced version with performance improvements and additional features by Jason Page.

Notes

  • The script automatically normalizes mixed date formats to YYYY-MM-DD (for example, MM/DD/YYYY input)
  • Duplicate records are automatically identified and removed
  • String type handling is optimized for memory efficiency
  • All CSV files use utf-8-sig encoding (UTF-8 with a byte order mark) for compatibility
  • Progress tracking can be disabled for automated/batch processing
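
The date handling mentioned above could look like the following sketch, which also shows how --date-floor and --date-ceiling style bounds can be applied once dates are normalized. The function names are hypothetical, and the script may rely on pandas for this instead.

```python
from datetime import datetime


def normalize_date(value):
    """Return YYYY-MM-DD for MM/DD/YYYY or ISO input; pass anything else through."""
    text = value.strip()
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return text  # unparseable or empty values are left unchanged


def in_range(date_text, floor, ceiling):
    """ISO dates sort lexicographically, so plain string comparison works."""
    return floor <= date_text <= ceiling


normalized = normalize_date("01/15/2023")
inside = in_range(normalized, "2020-12-13", "2025-01-01")
```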

Support

For issues, questions, or contributions, refer to the original source or the repository where this script is maintained.
