v1.25
Original Article URL: https://chart.vaersdata.org/?page=vaers-processing-script
VAERS Complete - Enhanced Data Processing Script
Overview
vaers_complete.py is a comprehensive Python script for processing VAERS (Vaccine Adverse Event Reporting System) data with advanced features including multi-core parallel processing, memory-efficient chunked data handling, and comprehensive change tracking across CDC data releases.
Original Author: Gary Hawkins - http://univaers.com/download/
Enhanced Version: 2025 by Jason Page
Features
- ✓ Multi-core parallel processing for faster execution
- ✓ Memory-efficient chunked data handling for large datasets
- ✓ Command-line dataset selection (COVID-19 era or full historical data)
- ✓ Progress bars for all major operations
- ✓ Comprehensive error tracking and reporting
- ✓ Fixed statistics functionality
- ✓ Change detection and tracking across data releases
- ✓ Deduplication and data consolidation
- ✓ Complete audit trail of modifications to VAERS reports
Requirements
Python Dependencies
pip install pandas numpy tqdm zipfile-deflate64
- pandas: Data manipulation and analysis
- numpy: Numerical operations
- tqdm: Progress bars (optional but recommended)
- zipfile-deflate64: Enhanced ZIP file handling (optional, falls back to standard zipfile)
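The optional dependencies above can be handled with guarded imports. The following is a minimal sketch of that pattern, not the script's actual code: the stand-in `tqdm` function and the `total` variable are illustrative assumptions. The `zipfile-deflate64` package installs a module named `zipfile_deflate64` that mirrors the standard `zipfile` API.

```python
# Optional dependencies: fall back gracefully when they are not installed.
try:
    # Adds Deflate64 support while exposing the same API as zipfile.
    import zipfile_deflate64 as zipfile
except ImportError:
    import zipfile  # standard library fallback

try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        """No-op stand-in (assumption): return the iterable unchanged."""
        return iterable

# Either way, later code can use zipfile.ZipFile and tqdm(...) unchanged.
total = sum(tqdm(range(5), desc="demo"))
```

With this pattern the rest of the code never needs to check which variant was imported.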
System Requirements
- Python 3.x
- Multi-core CPU recommended for parallel processing
- Sufficient RAM for large dataset processing (16GB+ recommended for full dataset)
Command-Line Options
Basic Syntax
python vaers_complete.py [OPTIONS]
Options Reference
--dataset {covid,full}
Default: covid
Selects which dataset to process:
- covid: Process COVID-19 era data only (from 2020-12-13 onwards by default)
- full: Process the full historical VAERS dataset (from 1990-01-01 onwards by default)
Examples:
python vaers_complete.py --dataset covid
python vaers_complete.py --dataset full
--cores NUMBER
Default: Number of CPU cores available on system
Specifies the number of CPU cores to use for parallel processing.
Examples:
python vaers_complete.py --cores 8
python vaers_complete.py --cores 16
python vaers_complete.py --dataset full --cores 4
--chunk-size NUMBER
Default: 50000
Sets the chunk size for processing large datasets. Larger chunks use more memory but may be faster. Smaller chunks are more memory-efficient.
Examples:
python vaers_complete.py --chunk-size 100000
python vaers_complete.py --chunk-size 25000
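To illustrate the memory trade-off, here is a minimal sketch of chunked reading using only the standard library. The helper name `iter_chunks` and the sample data are assumptions for illustration; with pandas, `pandas.read_csv(..., chunksize=N)` provides the same idea, though whether the script uses it is not stated here.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, chunk_size=50000):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

# Tiny in-memory stand-in for a VAERS CSV file (made-up values).
sample = io.StringIO("VAERS_ID,AGE_YRS\n1,30\n2,41\n3,55\n4,62\n5,19\n")
reader = csv.DictReader(sample)

# Only one chunk is held in memory at a time.
chunk_lengths = [len(chunk) for chunk in iter_chunks(reader, chunk_size=2)]
```

Peak memory scales with the chunk size rather than with the file size, which is why smaller chunks suit memory-constrained systems.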
--date-floor DATE
Default: 2020-12-13 for COVID dataset, 1990-01-01 for full dataset
Sets the earliest date to process (format: YYYY-MM-DD). Records before this date will be excluded.
Examples:
python vaers_complete.py --date-floor 2021-01-01
python vaers_complete.py --dataset full --date-floor 2000-01-01
--date-ceiling DATE
Default: 2025-01-01
Sets the latest date to process (format: YYYY-MM-DD). Records after this date will be excluded.
Examples:
python vaers_complete.py --date-ceiling 2024-12-31
python vaers_complete.py --date-floor 2020-01-01 --date-ceiling 2023-12-31
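The floor and ceiling act together as a date window. The sketch below assumes both bounds are inclusive (records before the floor or after the ceiling are dropped) and that the filter is applied to an ISO-formatted date column. The column name `RECVDATE` and the helper `in_date_window` are assumptions for illustration; which column the script actually filters on is not stated here.

```python
from datetime import date

def in_date_window(value, floor, ceiling):
    """True when an ISO date string lies within [floor, ceiling], inclusive."""
    return floor <= date.fromisoformat(value) <= ceiling

# Made-up rows; RECVDATE is an assumed choice of date column.
rows = [
    {"VAERS_ID": "1", "RECVDATE": "2020-11-30"},
    {"VAERS_ID": "2", "RECVDATE": "2021-03-02"},
    {"VAERS_ID": "3", "RECVDATE": "2024-02-10"},
]
floor = date.fromisoformat("2021-01-01")
ceiling = date.fromisoformat("2023-12-31")

kept = [r["VAERS_ID"] for r in rows if in_date_window(r["RECVDATE"], floor, ceiling)]
```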
--test
Default: Not set
Uses test cases directory (z_test_cases) instead of the main working directory. Useful for development and testing.
Example:
python vaers_complete.py --test
--no-progress
Default: Not set
Disables progress bars. Useful for logging output to files or when running in environments without terminal support.
Example:
python vaers_complete.py --no-progress > output.log
--merge-only
Default: Not set
Skips all processing and only creates the final merged file from existing processed data. Useful when you want to regenerate the final output without reprocessing everything.
Example:
python vaers_complete.py --merge-only
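For readers who want to see how the options above fit together, the following is a hedged sketch of an equivalent `argparse` setup. It reproduces only the flags and defaults documented in this section; the function names `build_parser` and `parse_options` and the `DATE_FLOORS` table are assumptions, and the real script may define its parser differently.

```python
import argparse
import os

# Dataset-dependent default for --date-floor, as documented above.
DATE_FLOORS = {"covid": "2020-12-13", "full": "1990-01-01"}

def build_parser():
    parser = argparse.ArgumentParser(prog="vaers_complete.py")
    parser.add_argument("--dataset", choices=["covid", "full"], default="covid")
    parser.add_argument("--cores", type=int, default=os.cpu_count())
    parser.add_argument("--chunk-size", type=int, default=50000)
    parser.add_argument("--date-floor", default=None)
    parser.add_argument("--date-ceiling", default="2025-01-01")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--no-progress", action="store_true")
    parser.add_argument("--merge-only", action="store_true")
    return parser

def parse_options(argv):
    """Parse argv and fill in the dataset-specific date floor."""
    args = build_parser().parse_args(argv)
    if args.date_floor is None:
        args.date_floor = DATE_FLOORS[args.dataset]
    return args

args = parse_options(["--dataset", "full", "--cores", "4"])
```

Note that `argparse` converts dashes to underscores, so `--chunk-size` is read as `args.chunk_size`.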
Usage Examples
Process COVID-19 data with 8 cores
python vaers_complete.py --dataset covid --cores 8
Process full historical dataset with 16 cores and larger chunks
python vaers_complete.py --dataset full --cores 16 --chunk-size 100000
Process COVID data from a specific start date
python vaers_complete.py --dataset covid --date-floor 2021-01-01
Process data for a specific date range
python vaers_complete.py --date-floor 2021-01-01 --date-ceiling 2023-12-31 --cores 8
Process with smaller chunks for memory-constrained systems
python vaers_complete.py --dataset covid --chunk-size 25000 --cores 4
Create final merged file only
python vaers_complete.py --merge-only
Run with test data
python vaers_complete.py --test --cores 4
Process without progress bars (for logging)
python vaers_complete.py --dataset covid --no-progress > processing.log 2>&1
Directory Structure
The script expects and creates the following directory structure:
.
├── 0_VAERS_Downloads/ # Input: Raw VAERS ZIP files from CDC
├── 1_vaers_working/ # Intermediate: Extracted CSV files
├── 1_vaers_consolidated/ # Intermediate: Consolidated data files
├── 2_vaers_full_compared/ # Output: Comparison results with change tracking
├── 3_vaers_flattened/ # Intermediate: Flattened data (one row per VAERS_ID)
├── stats.csv # Output: Processing statistics
├── never_published_any.txt # Output: VAERS IDs never published
├── ever_published_any.txt # Output: All VAERS IDs ever published
├── ever_published_covid.txt # Output: COVID-related VAERS IDs
├── writeups_deduped.txt # Output: Deduplicated symptom descriptions
└── VAERS_FINAL_MERGED.csv # Final output: Complete merged dataset
Test Mode Directory Structure
When using the --test flag:
z_test_cases/
├── drops/ # Input: Test VAERS data
├── 1_vaers_working/
├── 1_vaers_consolidated/
├── 2_vaers_full_compared/
├── 3_vaers_flattened/
└── [output files]
Processing Workflow
The script performs the following main steps:
1. Consolidation
Combines the three VAERS data files for each data release:
- *VAERSDATA.csv - Main report data
- *VAERSVAX.csv - Vaccination details
- *VAERSSYMPTOMS.csv - Symptom entries
Output: Consolidated files in 1_vaers_consolidated/
2. Flattening
Aggregates multiple vaccine entries per report into single rows:
- Groups vaccine records by VAERS_ID
- Merges all related data into one row per report
- Joins symptom entries
Output: Flattened files in 3_vaers_flattened/
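The flattening step can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the helper name `flatten`, the `|` separator used to join multiple values, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def flatten(vax_rows, sep="|"):
    """Collapse several vaccine rows per report into one row per VAERS_ID."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in vax_rows:
        for column, value in row.items():
            if column != "VAERS_ID":
                grouped[row["VAERS_ID"]][column].append(value)
    # Join the collected values so each report ends up as a single row.
    return {
        vaers_id: {col: sep.join(values) for col, values in columns.items()}
        for vaers_id, columns in grouped.items()
    }

# Made-up example: report 100 lists two vaccines, report 101 lists one.
vax_rows = [
    {"VAERS_ID": "100", "VAX_TYPE": "COVID19", "VAX_MANU": "MODERNA"},
    {"VAERS_ID": "100", "VAX_TYPE": "FLU3", "VAX_MANU": "SANOFI PASTEUR"},
    {"VAERS_ID": "101", "VAX_TYPE": "COVID19", "VAX_MANU": "PFIZER\\BIONTECH"},
]
flat = flatten(vax_rows)
```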
3. Comparison
Compares current data release with previous releases to detect changes:
- Identifies new reports
- Detects modifications to existing reports
- Tracks deletions
- Records all changes in the changes column
- Counts cell edits
Output: Comparison files in 2_vaers_full_compared/
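Conceptually, the comparison reduces to set operations on the VAERS_IDs of two releases. The sketch below assumes each release is a dictionary mapping VAERS_ID to a row dictionary; the function name `classify_reports` and the sample data are hypothetical, and unchanged reports are simply left out of the result.

```python
def classify_reports(previous, current):
    """Label each VAERS_ID as new, deleted, or modified between two releases."""
    statuses = {}
    for vaers_id in current.keys() - previous.keys():
        statuses[vaers_id] = "new"
    for vaers_id in previous.keys() - current.keys():
        statuses[vaers_id] = "deleted"
    for vaers_id in previous.keys() & current.keys():
        if previous[vaers_id] != current[vaers_id]:
            statuses[vaers_id] = "modified"
    return statuses

# Made-up releases for illustration.
previous = {
    "1": {"AGE_YRS": "45", "SEX": "F"},
    "2": {"AGE_YRS": "30", "SEX": "M"},
    "3": {"AGE_YRS": "61", "SEX": "F"},
}
current = {
    "1": {"AGE_YRS": "46", "SEX": "F"},  # edited
    "3": {"AGE_YRS": "61", "SEX": "F"},  # unchanged
    "4": {"AGE_YRS": "52", "SEX": "M"},  # first appearance
}
statuses = classify_reports(previous, current)
```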
4. Final Merge
Creates the final consolidated output file containing:
- All reports with complete change history
- Cell edit counts
- Status indicators (new, modified, deleted)
- Complete audit trail
Output: VAERS_FINAL_MERGED.csv
Output Files
Primary Output
VAERS_FINAL_MERGED.csv
- Complete dataset with all VAERS reports
- Includes all historical changes tracked across data releases
- Contains columns: cell_edits, status, changes
- One row per VAERS_ID with complete information
Statistics and Tracking Files
stats.csv
- Processing statistics for each data release
- Counts of new reports, modifications, deletions
- Date ranges and record counts
never_published_any.txt
- VAERS IDs that were never published in any release
- Identifies gaps in the VAERS ID sequence
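One way to find such gaps is to compare the published IDs against the full integer range they span. This is a sketch that assumes a gap means any integer between the smallest and largest published VAERS_ID that never appeared; the function name `never_published` is hypothetical and the script's exact definition may differ.

```python
def never_published(published_ids):
    """Return IDs missing between the lowest and highest published VAERS_ID."""
    ids = set(published_ids)
    if not ids:
        return []
    return [i for i in range(min(ids), max(ids) + 1) if i not in ids]

# Made-up IDs for illustration.
gaps = never_published([5, 6, 8, 11])
```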
ever_published_any.txt
- Complete list of all VAERS IDs ever published
- Includes all vaccine types
ever_published_covid.txt
- List of COVID-19 vaccine-related VAERS IDs
- Filtered by VAX_TYPE containing 'covid'
writeups_deduped.txt
- Deduplicated symptom text descriptions
- Useful for analysis of unique symptom patterns
Key Columns in Output
The final merged file contains all standard VAERS columns plus:
Standard VAERS Columns
- VAERS_ID - Unique report identifier
- AGE_YRS, SEX, STATE - Demographic information
- DIED, L_THREAT, ER_VISIT, HOSPITAL, DISABLE - Serious outcomes
- VAX_TYPE, VAX_MANU, VAX_LOT - Vaccine information
- VAX_DATE, ONSET_DATE, RPT_DATE - Date information
- SYMPTOM_TEXT - Symptom description
- And many more...
Enhanced Tracking Columns
- cell_edits - Count of cells modified across all releases
- status - Report status (new, modified, deleted)
- changes - Detailed log of all changes made to the report
- symptom_entries - Aggregated symptom entries
Performance Tuning
For Fast Processing (High RAM)
python vaers_complete.py --dataset covid --cores 16 --chunk-size 100000
For Memory-Constrained Systems
python vaers_complete.py --dataset covid --cores 4 --chunk-size 25000
For Very Large Full Dataset
python vaers_complete.py --dataset full --cores 16 --chunk-size 50000
Error Handling
The script includes comprehensive error handling:
- All errors are collected and displayed at the end of processing
- Errors include timestamps for tracking
- Processing continues when possible, skipping problematic files
- Final error summary shows total errors encountered
- Exit code 0 = success, 1 = errors occurred
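The collect-and-continue behaviour described above can be sketched like this. The class `ErrorLog` and the functions `process_all` and `parse_release` are hypothetical names used for illustration only, and the file names are made up.

```python
from datetime import datetime

class ErrorLog:
    """Collects timestamped errors so they can be reported at the end."""

    def __init__(self):
        self.entries = []

    def record(self, message):
        stamp = datetime.now().isoformat(timespec="seconds")
        self.entries.append((stamp, message))

    def exit_code(self):
        # 0 = success, 1 = at least one error occurred
        return 1 if self.entries else 0

def process_all(names, handler, log):
    """Run handler on every item, skipping the ones that raise."""
    done = []
    for name in names:
        try:
            done.append(handler(name))
        except Exception as exc:
            log.record(f"{name}: {exc}")
    return done

def parse_release(name):
    # Stand-in for real work; fails on one made-up file name.
    if "corrupt" in name:
        raise ValueError("not a valid ZIP archive")
    return name.upper()

log = ErrorLog()
done = process_all(["a.zip", "corrupt.zip", "b.zip"], parse_release, log)
code = log.exit_code()
```

A caller would typically print `log.entries` as the final summary and pass `code` to `sys.exit`.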
Data Filtering
COVID Dataset Mode
By default, filters to COVID-19 era data:
- Automatically detects the earliest COVID VAERS_ID
- Removes all reports prior to first COVID vaccine report
- Typically starts from VAERS_ID ~896636 (first trial report)
Full Dataset Mode
Processes complete historical VAERS data:
- Includes all vaccine types from 1990 onwards (or from the specified --date-floor)
- Significantly larger processing time and storage requirements
Change Tracking
The script tracks modifications to VAERS reports across CDC data releases:
- New reports: First appearance in a data release
- Modifications: Changes to any field in existing reports
- Deletions: Reports removed from later releases
- Cell edits: Count of individual cell changes
- Change log: Detailed description of what changed
Example change tracking entry:
2023-01-15: AGE_YRS changed from "45" to "46"
2023-01-15: SYMPTOM_TEXT appended with "Patient recovered"
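Entries in this style could be produced by a per-field diff of two versions of the same report. The sketch below is an assumption about how that might work, not the script's actual logic: the function `describe_changes` is hypothetical, and treating a value that merely grew at the end as "appended" is a guess based on the example above.

```python
def describe_changes(release_date, old, new):
    """Return one log line per changed field; the count is the cell-edit total."""
    entries = []
    for column in sorted(old.keys() | new.keys()):
        before, after = old.get(column, ""), new.get(column, "")
        if before == after:
            continue
        if before and after.startswith(before):
            # Assumed rule: text that only grew at the end counts as an append.
            added = after[len(before):].strip()
            entries.append(f'{release_date}: {column} appended with "{added}"')
        else:
            entries.append(
                f'{release_date}: {column} changed from "{before}" to "{after}"'
            )
    return entries

# Made-up report versions for illustration.
old = {"AGE_YRS": "45", "SEX": "F", "SYMPTOM_TEXT": "Headache."}
new = {"AGE_YRS": "46", "SEX": "F", "SYMPTOM_TEXT": "Headache. Patient recovered"}

entries = describe_changes("2023-01-15", old, new)
cell_edits = len(entries)
```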
Troubleshooting
Out of Memory Errors
- Reduce --chunk-size to 25000 or lower
- Reduce --cores to use fewer parallel processes
- Process smaller date ranges using --date-floor and --date-ceiling
Progress Bars Not Showing
- Install tqdm: pip install tqdm
- Or disable with --no-progress if not needed
ZIP File Errors
- Install zipfile-deflate64: pip install zipfile-deflate64
- Script falls back to standard zipfile if not available
Missing Input Files
- Ensure VAERS data files are in the 0_VAERS_Downloads/ directory
- Check that files are in correct ZIP format from CDC
License and Attribution
Original script by Gary Hawkins (http://univaers.com/download/). Enhanced version with performance improvements and additional features by Jason Page.
Notes
- The script automatically handles mixed date formats (MM/DD/YYYY → YYYY-MM-DD)
- Duplicate records are automatically identified and removed
- String type handling is optimized for memory efficiency
- All CSV files use UTF-8-sig encoding for compatibility
- Progress tracking can be disabled for automated/batch processing
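Two of the notes above, date normalization and UTF-8-sig encoding, can be illustrated with the standard library. This is a sketch under assumptions: the function `normalize_date` is hypothetical, and leaving unrecognized values untouched is an illustrative choice rather than documented behaviour.

```python
from datetime import datetime

def normalize_date(value):
    """Convert MM/DD/YYYY (or already-ISO) dates to YYYY-MM-DD."""
    value = value.strip()
    if not value:
        return ""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # assumed behaviour: leave unrecognized values untouched

normalized = [normalize_date(v) for v in ["01/15/2023", "2023-01-15", ""]]

# UTF-8-sig writes a byte order mark (BOM) first and strips it again on read.
encoded = "VAERS_ID\n".encode("utf-8-sig")
decoded = encoded.decode("utf-8-sig")
```

The BOM is what lets some spreadsheet applications detect UTF-8 automatically, which is the usual reason for choosing this encoding.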
Support
For issues, questions, or contributions, refer to the original source or the repository where this script is maintained.