VAERS Processing Script

v1.25

Original Article URL: https://chart.vaersdata.org/?page=vaers-processing-script

VAERS Complete - Enhanced Data Processing Script

Overview

vaers_complete.py is a Python script for processing VAERS (Vaccine Adverse Event Reporting System) data. It supports multi-core parallel processing, memory-efficient chunked data handling, and change tracking across CDC data releases.

Original Author: Gary Hawkins - http://univaers.com/download/
Enhanced Version: 2025 by Jason Page

Features

✓ Multi-core parallel processing for faster execution
✓ Memory-efficient chunked data handling for large datasets
✓ Command-line dataset selection (COVID-19 era or full historical data)
✓ Progress bars for all major operations
✓ Comprehensive error tracking and reporting
✓ Fixed statistics functionality
✓ Change detection and tracking across data releases
✓ Deduplication and data consolidation
✓ Complete audit trail of modifications to VAERS reports

Requirements

Python Dependencies

pip install pandas numpy tqdm zipfile-deflate64

pandas: Data manipulation and analysis
numpy: Numerical operations
tqdm: Progress bars (optional but recommended)
zipfile-deflate64: Enhanced ZIP file handling (optional, falls back to standard zipfile)
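The two optional dependencies can be handled with guarded imports. The following is a minimal sketch of that fallback pattern, not the code from vaers_complete.py; the `progress` helper is a hypothetical name introduced for illustration.

```python
# Optional dependencies: fall back gracefully when they are not installed.
try:
    # The zipfile-deflate64 package is imported under this module name and
    # exposes the standard zipfile API with Deflate64 support added.
    import zipfile_deflate64 as zipfile
except ImportError:
    import zipfile  # standard library fallback

try:
    from tqdm import tqdm
except ImportError:
    tqdm = None  # progress bars unavailable


def progress(iterable, enabled=True, **kwargs):
    """Wrap an iterable in a progress bar when tqdm is installed and enabled.

    Hypothetical helper: otherwise the iterable is returned unchanged.
    """
    if enabled and tqdm is not None:
        return tqdm(iterable, **kwargs)
    return iterable
```

With this pattern the rest of the code can call `zipfile.ZipFile(...)` and `progress(...)` without checking which packages are present.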

System Requirements

Python 3.x
Multi-core CPU recommended for parallel processing
Sufficient RAM for large dataset processing (16GB+ recommended for full dataset)

Command-Line Options

Basic Syntax

python vaers_complete.py [OPTIONS]

Options Reference

`--dataset {covid,full}`

Default: covid

Selects which dataset to process:

covid: Process COVID-19 era data only (from 2020-12-13 onwards by default)
full: Process full historical VAERS dataset (from 1990-01-01 onwards by default)

Examples:

python vaers_complete.py --dataset covid
python vaers_complete.py --dataset full

`--cores NUMBER`

Default: Number of CPU cores available on system

Specifies the number of CPU cores to use for parallel processing.

Examples:

python vaers_complete.py --cores 8
python vaers_complete.py --cores 16
python vaers_complete.py --dataset full --cores 4

`--chunk-size NUMBER`

Default: 50000

Sets the chunk size for processing large datasets. Larger chunks use more memory but may be faster. Smaller chunks are more memory-efficient.

Examples:

python vaers_complete.py --chunk-size 100000
python vaers_complete.py --chunk-size 25000
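The memory trade-off comes from holding only one chunk of rows in memory at a time. The sketch below illustrates the idea with the standard library only; it is an assumption about the general technique, not the script's own implementation, and `iter_chunks` is a hypothetical name. pandas offers the same pattern through the `chunksize` argument of `read_csv`.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size=50000):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory CSV standing in for a large VAERS file.
sample = io.StringIO("VAERS_ID,STATE\n1,CA\n2,NY\n3,TX\n4,FL\n5,WA\n")
chunk_sizes = [len(chunk) for chunk in iter_chunks(csv.DictReader(sample), chunk_size=2)]
print(chunk_sizes)  # [2, 2, 1]
```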

`--date-floor DATE`

Default: 2020-12-13 for COVID dataset, 1990-01-01 for full dataset

Sets the earliest date to process (format: YYYY-MM-DD). Records before this date will be excluded.

Examples:

python vaers_complete.py --date-floor 2021-01-01
python vaers_complete.py --dataset full --date-floor 2000-01-01

`--date-ceiling DATE`

Default: 2025-01-01

Sets the latest date to process (format: YYYY-MM-DD). Records after this date will be excluded.

Examples:

python vaers_complete.py --date-ceiling 2024-12-31
python vaers_complete.py --date-floor 2020-01-01 --date-ceiling 2023-12-31
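Taken together, the floor and ceiling define an inclusive date window. The sketch below shows one way to apply such a window. It assumes dates are already in YYYY-MM-DD form and that the filter runs on the RECVDATE column; the documentation above does not say which date column the script uses, and `in_date_window` is a hypothetical name.

```python
from datetime import date


def in_date_window(value, floor="2020-12-13", ceiling="2025-01-01"):
    """Return True when an ISO date string lies within [floor, ceiling]."""
    day = date.fromisoformat(value)
    return date.fromisoformat(floor) <= day <= date.fromisoformat(ceiling)


# Illustrative rows; the choice of RECVDATE is an assumption.
rows = [
    {"VAERS_ID": "1", "RECVDATE": "2020-11-30"},
    {"VAERS_ID": "2", "RECVDATE": "2020-12-13"},
    {"VAERS_ID": "3", "RECVDATE": "2023-06-01"},
    {"VAERS_ID": "4", "RECVDATE": "2025-01-01"},
    {"VAERS_ID": "5", "RECVDATE": "2025-01-02"},
]
kept = [row["VAERS_ID"] for row in rows if in_date_window(row["RECVDATE"])]
print(kept)  # ['2', '3', '4']
```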

`--test`

Default: Not set

Uses the test cases directory (z_test_cases) instead of the main working directory. Useful for development and testing.

Example:

python vaers_complete.py --test

`--no-progress`

Default: Not set

Disables progress bars. Useful for logging output to files or when running in environments without terminal support.

Example:

python vaers_complete.py --no-progress > output.log

`--merge-only`

Default: Not set

Skips all processing and only creates the final merged file from existing processed data. Useful when you want to regenerate the final output without reprocessing everything.

Example:

python vaers_complete.py --merge-only
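The options above map naturally onto Python's argparse module. The following sketch reproduces the documented flags and defaults; it is a reconstruction from this reference rather than the script's actual parser, and `build_parser`, `parse_args`, and `DATE_FLOORS` are hypothetical names.

```python
import argparse
import os

# Default date floors per dataset, as listed in the options reference.
DATE_FLOORS = {"covid": "2020-12-13", "full": "1990-01-01"}


def build_parser():
    """Build a parser with the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="vaers_complete.py")
    parser.add_argument("--dataset", choices=["covid", "full"], default="covid")
    parser.add_argument("--cores", type=int, default=os.cpu_count())
    parser.add_argument("--chunk-size", type=int, default=50000)
    parser.add_argument("--date-floor", default=None)
    parser.add_argument("--date-ceiling", default="2025-01-01")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--no-progress", action="store_true")
    parser.add_argument("--merge-only", action="store_true")
    return parser


def parse_args(argv=None):
    """Parse arguments and fill in the dataset-dependent date floor."""
    args = build_parser().parse_args(argv)
    if args.date_floor is None:
        args.date_floor = DATE_FLOORS[args.dataset]
    return args
```

Leaving `--date-floor` unset until after parsing is what lets its default depend on the value of `--dataset`.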

Usage Examples

Process COVID-19 data with 8 cores

python vaers_complete.py --dataset covid --cores 8

Process full historical dataset with 16 cores and larger chunks

python vaers_complete.py --dataset full --cores 16 --chunk-size 100000

Process COVID data from a specific start date

python vaers_complete.py --dataset covid --date-floor 2021-01-01

Process data for a specific date range

python vaers_complete.py --date-floor 2021-01-01 --date-ceiling 2023-12-31 --cores 8

Process with smaller chunks for memory-constrained systems

python vaers_complete.py --dataset covid --chunk-size 25000 --cores 4

Create final merged file only

python vaers_complete.py --merge-only

Run with test data

python vaers_complete.py --test --cores 4

Process without progress bars (for logging)

python vaers_complete.py --dataset covid --no-progress > processing.log 2>&1

Directory Structure

The script expects and creates the following directory structure:

.
├── 0_VAERS_Downloads/          # Input: Raw VAERS ZIP files from CDC
├── 1_vaers_working/            # Intermediate: Extracted CSV files
├── 1_vaers_consolidated/       # Intermediate: Consolidated data files
├── 2_vaers_full_compared/      # Output: Comparison results with change tracking
├── 3_vaers_flattened/          # Intermediate: Flattened data (one row per VAERS_ID)
├── stats.csv                   # Output: Processing statistics
├── never_published_any.txt     # Output: VAERS IDs never published
├── ever_published_any.txt      # Output: All VAERS IDs ever published
├── ever_published_covid.txt    # Output: COVID-related VAERS IDs
├── writeups_deduped.txt        # Output: Deduplicated symptom descriptions
└── VAERS_FINAL_MERGED.csv      # Final output: Complete merged dataset

Test Mode Directory Structure

When using the --test flag:

z_test_cases/
├── drops/                      # Input: Test VAERS data
├── 1_vaers_working/
├── 1_vaers_consolidated/
├── 2_vaers_full_compared/
├── 3_vaers_flattened/
└── [output files]

Processing Workflow

The script performs the following main steps:

1. Consolidation

Combines the three VAERS data files for each data release:

*VAERSDATA.csv - Main report data
*VAERSVAX.csv - Vaccination details
*VAERSSYMPTOMS.csv - Symptom entries

Output: Consolidated files in 1_vaers_consolidated/
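Consolidation amounts to joining the three files on VAERS_ID. Below is a minimal standard-library sketch of such a join. The function name `consolidate` and the nested `vax` and `symptoms` keys are assumptions made for illustration; the script itself may merge the files differently, for example with pandas.

```python
from collections import defaultdict


def consolidate(data_rows, vax_rows, symptom_rows):
    """Attach vaccine and symptom rows to each main report by VAERS_ID."""
    vax_by_id = defaultdict(list)
    for row in vax_rows:
        vax_by_id[row["VAERS_ID"]].append(row)

    symptoms_by_id = defaultdict(list)
    for row in symptom_rows:
        symptoms_by_id[row["VAERS_ID"]].append(row)

    consolidated = []
    for report in data_rows:
        vid = report["VAERS_ID"]
        consolidated.append({
            **report,
            "vax": vax_by_id.get(vid, []),            # zero or more vaccine rows
            "symptoms": symptoms_by_id.get(vid, []),  # zero or more symptom rows
        })
    return consolidated


# Illustrative rows only, not real VAERS data.
data = [{"VAERS_ID": "100", "AGE_YRS": "45"}, {"VAERS_ID": "101", "AGE_YRS": "30"}]
vax = [
    {"VAERS_ID": "100", "VAX_TYPE": "COVID19"},
    {"VAERS_ID": "100", "VAX_TYPE": "FLU4"},
]
symptoms = [{"VAERS_ID": "101", "SYMPTOM1": "Headache"}]
merged = consolidate(data, vax, symptoms)
```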

2. Flattening

Aggregates multiple vaccine entries per report into single rows:

Groups vaccine records by VAERS_ID
Merges all related data into one row per report
Joins symptom entries

Output: Flattened files in 3_vaers_flattened/
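Flattening can be pictured as grouping vaccine rows by VAERS_ID and joining each field's values into one cell. The sketch below assumes a `|` separator and a fixed field list; both are illustrative choices, and `flatten_vax` is a hypothetical name rather than a function from the script.

```python
def flatten_vax(vax_rows, fields=("VAX_TYPE", "VAX_MANU", "VAX_LOT"), sep="|"):
    """Collapse multiple vaccine rows per VAERS_ID into a single row.

    Values of each field are joined with sep in their original order.
    """
    grouped = {}
    for row in vax_rows:
        grouped.setdefault(row["VAERS_ID"], []).append(row)

    flat = []
    for vid, rows in grouped.items():
        merged = {"VAERS_ID": vid}
        for field in fields:
            merged[field] = sep.join(r.get(field, "") for r in rows)
        flat.append(merged)
    return flat


# Illustrative rows only, not real VAERS data.
vax_rows = [
    {"VAERS_ID": "100", "VAX_TYPE": "COVID19", "VAX_MANU": "MFR_A", "VAX_LOT": "L1"},
    {"VAERS_ID": "100", "VAX_TYPE": "FLU4", "VAX_MANU": "MFR_B", "VAX_LOT": "L2"},
    {"VAERS_ID": "101", "VAX_TYPE": "COVID19", "VAX_MANU": "MFR_A", "VAX_LOT": "L3"},
]
flat_rows = flatten_vax(vax_rows)
```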

3. Comparison

Compares current data release with previous releases to detect changes:

Identifies new reports
Detects modifications to existing reports
Tracks deletions
Records all changes in the changes column
Counts cell edits

Output: Comparison files in 2_vaers_full_compared/
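The comparison step can be sketched as a diff between two releases keyed by VAERS_ID. The code below is an illustration under stated assumptions, not the script's algorithm: each release is modelled as a dict of rows, `compare_releases` is a hypothetical name, and the wording of the deletion entry is invented for the example.

```python
def compare_releases(previous, current, release_date):
    """Compare two releases keyed by VAERS_ID and describe what changed.

    Returns a dict of VAERS_ID -> {status, cell_edits, changes}.
    Reports that are unchanged do not appear in the result.
    """
    results = {}
    for vid, row in current.items():
        if vid not in previous:
            results[vid] = {"status": "new", "cell_edits": 0, "changes": []}
            continue
        changes = []
        for column, new_value in row.items():
            old_value = previous[vid].get(column, "")
            if old_value != new_value:
                changes.append(
                    f'{release_date}: {column} changed from "{old_value}" to "{new_value}"'
                )
        if changes:
            results[vid] = {
                "status": "modified",
                "cell_edits": len(changes),  # one edit per changed cell
                "changes": changes,
            }
    # Reports present earlier but missing from the current release.
    for vid in sorted(previous.keys() - current.keys()):
        results[vid] = {
            "status": "deleted",
            "cell_edits": 0,
            "changes": [f"{release_date}: report removed"],
        }
    return results
```

Repeating this for each consecutive pair of releases and appending the `changes` entries would build up a per-report history.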

4. Final Merge

Creates the final consolidated output file containing:

All reports with complete change history
Cell edit counts
Status indicators (new, modified, deleted)
Complete audit trail

Output: VAERS_FINAL_MERGED.csv

Output Files

Primary Output

VAERS_FINAL_MERGED.csv

Complete dataset with all VAERS reports
Includes all historical changes tracked across data releases
Contains columns: cell_edits, status, changes
One row per VAERS_ID with complete information

Statistics and Tracking Files

stats.csv

Processing statistics for each data release
Counts of new reports, modifications, deletions
Date ranges and record counts

never_published_any.txt

VAERS IDs that were never published in any release
Identifies gaps in the VAERS ID sequence

ever_published_any.txt

Complete list of all VAERS IDs ever published
Includes all vaccine types

ever_published_covid.txt

List of COVID-19 vaccine-related VAERS IDs
Filtered by VAX_TYPE containing 'covid'

writeups_deduped.txt

Deduplicated symptom text descriptions
Useful for analysis of unique symptom patterns
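The gap detection behind never_published_any.txt can be illustrated with a short function. This sketch assumes gaps are measured between the smallest and largest published ID; how the script bounds the range is not stated above, and `find_unpublished_ids` is a hypothetical name.

```python
def find_unpublished_ids(published_ids):
    """Return IDs missing between the smallest and largest published VAERS_ID."""
    published = set(published_ids)
    if not published:
        return []
    return [i for i in range(min(published), max(published) + 1) if i not in published]


print(find_unpublished_ids([5, 6, 8, 11]))  # [7, 9, 10]
```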

Key Columns in Output

The final merged file contains all standard VAERS columns plus:

Standard VAERS Columns

VAERS_ID - Unique report identifier
AGE_YRS, SEX, STATE - Demographic information
DIED, L_THREAT, ER_VISIT, HOSPITAL, DISABLE - Serious outcomes
VAX_TYPE, VAX_MANU, VAX_LOT - Vaccine information
VAX_DATE, ONSET_DATE, RPT_DATE - Date information
SYMPTOM_TEXT - Symptom description
And many more...

Enhanced Tracking Columns

cell_edits - Count of cells modified across all releases
status - Report status (new, modified, deleted)
changes - Detailed log of all changes made to the report
symptom_entries - Aggregated symptom entries

Performance Tuning

For Fast Processing (High RAM)

python vaers_complete.py --dataset covid --cores 16 --chunk-size 100000

For Memory-Constrained Systems

python vaers_complete.py --dataset covid --cores 4 --chunk-size 25000

For Very Large Full Dataset

python vaers_complete.py --dataset full --cores 16 --chunk-size 50000

Error Handling

The script includes comprehensive error handling:

All errors are collected and displayed at the end of processing
Errors include timestamps for tracking
Processing continues when possible, skipping problematic files
Final error summary shows total errors encountered
Exit code 0 = success, 1 = errors occurred
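The collect-and-continue behaviour described above follows a common pattern. The sketch below is one way to express it. `ErrorLog` and `process_all` are hypothetical names, and the log line format is an assumption.

```python
from datetime import datetime


class ErrorLog:
    """Collects timestamped error messages for a summary at the end."""

    def __init__(self):
        self.entries = []

    def record(self, context, exc):
        stamp = datetime.now().isoformat(timespec="seconds")
        self.entries.append(f"[{stamp}] {context}: {exc}")

    def exit_code(self):
        # 0 = success, 1 = errors occurred
        return 1 if self.entries else 0


def process_all(files, handler, log):
    """Run handler on every file, recording failures and moving on."""
    done = []
    for name in files:
        try:
            done.append(handler(name))
        except Exception as exc:  # keep going; report at the end
            log.record(name, exc)
    return done
```

A caller would print `log.entries` after the run and pass `log.exit_code()` to `sys.exit`.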

Data Filtering

COVID Dataset Mode

By default, filters to COVID-19 era data:

Automatically detects the earliest COVID VAERS_ID
Removes all reports prior to first COVID vaccine report
Typically starts from VAERS_ID ~896636 (first trial report)
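The filtering rule above can be sketched as follows. The example assumes COVID reports are recognised by a case-insensitive "COVID" substring in VAX_TYPE; `filter_covid_era` is a hypothetical name, and the script's actual detection logic may differ.

```python
def filter_covid_era(rows):
    """Drop reports whose VAERS_ID precedes the earliest COVID-19 report."""
    covid_ids = [
        int(row["VAERS_ID"])
        for row in rows
        if "COVID" in row["VAX_TYPE"].upper()
    ]
    if not covid_ids:
        return []  # no COVID reports, so nothing falls in the COVID era
    first_covid_id = min(covid_ids)
    return [row for row in rows if int(row["VAERS_ID"]) >= first_covid_id]


# Illustrative rows only, not real VAERS data.
rows = [
    {"VAERS_ID": "10", "VAX_TYPE": "FLU4"},
    {"VAERS_ID": "20", "VAX_TYPE": "COVID19"},
    {"VAERS_ID": "30", "VAX_TYPE": "FLU4"},
]
covid_era = filter_covid_era(rows)
```

Non-COVID reports with a later VAERS_ID are kept, since the rule removes only reports that precede the first COVID report.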

Full Dataset Mode

Processes complete historical VAERS data:

Includes all vaccine types from 1990 onwards (or from the date set with --date-floor)
Significantly larger processing time and storage requirements

Change Tracking

The script tracks modifications to VAERS reports across CDC data releases:

New reports: First appearance in a data release
Modifications: Changes to any field in existing reports
Deletions: Reports removed from later releases
Cell edits: Count of individual cell changes
Change log: Detailed description of what changed

Example change tracking entry:

2023-01-15: AGE_YRS changed from "45" to "46"
2023-01-15: SYMPTOM_TEXT appended with "Patient recovered"

Troubleshooting

Out of Memory Errors

Reduce --chunk-size to 25000 or lower
Reduce --cores to use fewer parallel processes
Process smaller date ranges using --date-floor and --date-ceiling

Progress Bars Not Showing

Install tqdm: pip install tqdm
Or disable with --no-progress if not needed

ZIP File Errors

Install zipfile-deflate64: pip install zipfile-deflate64
Script falls back to standard zipfile if not available

Missing Input Files

Ensure VAERS data files are in 0_VAERS_Downloads/ directory
Check that files are in correct ZIP format from CDC

License and Attribution

Original script by Gary Hawkins (http://univaers.com/download/). Enhanced version with performance improvements and additional features by Jason Page.

Notes

The script automatically handles mixed date formats (MM/DD/YYYY → YYYY-MM-DD)
Duplicate records are automatically identified and removed
String type handling is optimized for memory efficiency
All CSV files use UTF-8-sig encoding for compatibility
Progress tracking can be disabled for automated/batch processing
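The date normalisation mentioned in the notes can be illustrated with the standard library. This is a sketch under the assumption that only the two listed formats occur and that unrecognised values are left untouched; `normalize_date` is a hypothetical name.

```python
from datetime import datetime


def normalize_date(value):
    """Convert MM/DD/YYYY to YYYY-MM-DD; pass ISO dates and blanks through."""
    value = value.strip()
    if not value:
        return ""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # leave unrecognised values untouched


print(normalize_date("01/15/2023"))  # 2023-01-15
```

When reading or writing the CSV files, passing `encoding="utf-8-sig"` to `open` handles the byte-order mark that this encoding adds.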

Support

For issues, questions, or contributions, refer to the original source or the repository where this script is maintained.
