CSV Column Analyzer
Analyze CSV data for null counts, unique values, and inferred data types
Understanding CSV Data Analysis
CSV (Comma-Separated Values) is one of the most common data exchange formats. Analyzing CSV data helps you understand data quality, identify issues, and plan data transformations before loading into databases or analytics systems.
Key Data Quality Metrics
Null Count & Percentage
Null (missing) values indicate incomplete data. High null percentages may indicate:
- Data collection problems
- Optional fields that users skip
- Integration issues between systems
- Recently added columns without backfill
Rule of thumb: Columns with >50% nulls are often candidates for removal or special handling.
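As a sketch, null counting can be done with Python's standard csv module. The NULL_TOKENS set is an assumption; extend it to match the null spellings in your own data.

```python
import csv
import io

# Spellings treated as null (an assumption; extend for your data).
NULL_TOKENS = {"", "null", "na", "n/a"}

def null_stats(rows, column):
    """Return (null_count, null_percentage) for one column of dict rows."""
    values = [row.get(column) for row in rows]
    nulls = sum(1 for v in values
                if v is None or v.strip().lower() in NULL_TOKENS)
    pct = 100.0 * nulls / len(values) if values else 0.0
    return nulls, pct

# Tiny inline example: the email column is null in 2 of 3 rows.
sample = "id,email\n1,a@example.com\n2,\n3,NULL\n"
rows = list(csv.DictReader(io.StringIO(sample)))
nulls, pct = null_stats(rows, "email")
```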
Unique Count & Cardinality
The number of unique values reveals the cardinality of a column:
- High cardinality (near 100% unique): Likely IDs or other unique identifiers
- Medium cardinality (10-80% unique): Names, dates, fine-grained categories
- Low cardinality (<10% unique): Status flags, types, boolean values
Cardinality affects indexing strategies and database performance.
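A minimal sketch of the ratio computation, with band cutoffs taken from the rough percentages above (the exact thresholds are an assumption; tune them for your data):

```python
def cardinality(values):
    """Return (unique_count, unique_ratio) over the non-empty values."""
    non_null = [v for v in values if v not in ("", None)]
    uniques = len(set(non_null))
    ratio = uniques / len(non_null) if non_null else 0.0
    return uniques, ratio

def cardinality_band(ratio):
    # Cutoffs follow the rough bands above; adjust as needed.
    if ratio >= 0.8:
        return "high"
    if ratio >= 0.1:
        return "medium"
    return "low"

statuses = ["active", "inactive", "active", "active"]
count, ratio = cardinality(statuses)   # 2 unique out of 4 -> ratio 0.5
```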
Inferred Data Types
The analyzer attempts to infer the data type based on the values:
- Integer: Numeric values with no decimal point
- Float: Numeric values with a decimal point
- Boolean: true/false, yes/no, or 1/0 values
- Date/String: Values containing date separators (- or /)
- String: All other text values
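One way to sketch that inference in Python. Booleans are checked before integers so that 1/0 columns are not swallowed by the integer rule; that ordering is a design choice, not the only option.

```python
def infer_type(values):
    """Infer a column type from its non-empty string values."""
    vals = [v.strip() for v in values if v and v.strip()]
    if not vals:
        return "string"
    # Boolean first, so 1/0 columns are not classified as integers.
    if all(v.lower() in ("true", "false", "yes", "no", "1", "0") for v in vals):
        return "boolean"
    try:
        if all("." not in v for v in vals):
            [int(v) for v in vals]
            return "integer"
        [float(v) for v in vals]
        return "float"
    except ValueError:
        pass
    if all("-" in v or "/" in v for v in vals):
        return "date/string"
    return "string"
```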
Data Quality Scoring
Quality is assessed based on null percentage:
- Good: <20% nulls - High quality, ready for use
- Fair: 20-50% nulls - Usable but needs attention
- Poor: >50% nulls - Significant data quality issues
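The banding reduces to a small function. Treating exactly 20% as "fair" and exactly 50% as "fair" is an assumption, since the source ranges meet at those edges.

```python
def quality_label(null_pct):
    """Map a column's null percentage to a quality band."""
    if null_pct < 20:
        return "good"
    if null_pct <= 50:
        return "fair"
    return "poor"
```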
Common Data Issues
Missing Values
Missing values can appear as empty strings, whitespace-only cells, or tokens such as "NULL", "NA", and "N/A". Always normalize these representations to a single sentinel when cleaning data.
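A sketch of that normalization, collapsing the common null spellings to a single None sentinel:

```python
# Spellings to treat as null (extend for your data).
NULL_SPELLINGS = {"", "null", "na", "n/a"}

def normalize_null(value):
    """Collapse common null spellings to None; strip other values."""
    if value is None or value.strip().lower() in NULL_SPELLINGS:
        return None
    return value.strip()

cleaned = [normalize_null(v) for v in ["N/A", " Bob ", "", "NULL"]]
```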
Type Inconsistencies
A column might contain mixed types (e.g., numbers and text). This causes problems when importing to databases that require consistent types.
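Mixed-type columns can be flagged by parsing each value and collecting the kinds that succeed; more than one kind in the result means the column is inconsistent. A sketch (function names are illustrative):

```python
def value_kind(v):
    """Classify a single string as integer, float, or string."""
    for kind, cast in (("integer", int), ("float", float)):
        try:
            cast(v)
            return kind
        except ValueError:
            pass
    return "string"

def column_kinds(values):
    """Return the set of kinds present; len > 1 means mixed types."""
    return {value_kind(v) for v in values if v.strip()}
```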
Encoding Issues
Special characters may display incorrectly if the file encoding doesn't match the reader. Use UTF-8 encoding when possible.
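A small demonstration of why the declared encoding matters: the same bytes decode to different text under different codecs, and the wrong codec often garbles silently instead of failing.

```python
raw = b"caf\xc3\xa9,42\n"          # the UTF-8 bytes for "café,42"

ok = raw.decode("utf-8")           # correct codec: "café,42\n"
garbled = raw.decode("latin-1")    # wrong codec:   "cafÃ©,42\n"

# When reading files, state the encoding explicitly, e.g.:
#   open(path, encoding="utf-8", newline="")
```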
Best Practices
Before Loading Data
- Analyze a sample to understand data structure
- Check for null patterns and decide on handling strategy
- Verify inferred types match expectations
- Look for outliers in sample values
- Identify high-cardinality columns for indexing
Handling Nulls
- Drop: Remove rows/columns with too many nulls
- Impute: Fill with mean, median, or mode
- Forward/Backward fill: Use previous/next value
- Flag: Add boolean column indicating null presence
- Keep: Some nulls are meaningful (e.g., optional fields)
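The first four strategies map directly onto pandas calls. A sketch assuming pandas is installed; the column name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 35.0, None, 40.0]})

dropped = df.dropna(subset=["age"])           # Drop rows with nulls
imputed = df["age"].fillna(df["age"].mean())  # Impute with the mean
filled = df["age"].ffill()                    # Forward fill from prior row
flags = df["age"].isna()                      # Flag which values were null
```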
Sample CSV Format
id,name,age,email,status
1,John,25,john@example.com,active
2,Jane,30,,inactive
3,Bob,35,bob@example.com,active
4,Alice,,alice@example.com,active
5,Charlie,40,charlie@example.com,
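Putting the pieces together on the sample above, a minimal end-to-end analysis sketch using only the standard library:

```python
import csv
import io

SAMPLE = """\
id,name,age,email,status
1,John,25,john@example.com,active
2,Jane,30,,inactive
3,Bob,35,bob@example.com,active
4,Alice,,alice@example.com,active
5,Charlie,40,charlie@example.com,
"""

def analyze(text):
    """Per-column null count, null percentage, and unique count."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        nulls = sum(1 for v in values if not v.strip())
        report[col] = {
            "nulls": nulls,
            "null_pct": 100.0 * nulls / len(values),
            "unique": len({v for v in values if v.strip()}),
        }
    return report

report = analyze(SAMPLE)
# id has no nulls; age, email, and status each have one blank value.
```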
When to Use
- Before importing to database
- Data quality assessment
- Schema design planning
- ETL pipeline validation
- Identifying data issues
- Index strategy planning