Common Data File Formats Every Data Scientist Should Know

Essential Data File Formats for Modern Data Science

Data comes in many shapes and sizes. To work with it, a data scientist must understand the file formats in which data is stored, exchanged, and processed. Each format has its own strengths, weaknesses, and best use cases.

In this article, we’ll explore five of the most common data file formats you’ll encounter in data science: CSV/TSV, Excel (.xlsx), XML, PDF, and JSON.


1. CSV / TSV (Comma-Separated Values / Tab-Separated Values)

  • Structure: Plain text files where each line is a row, and values are separated by commas (CSV) or tabs (TSV).

  • Strengths:

    • Lightweight and simple.

    • Easy to open in text editors, Excel, or programming libraries like Pandas.

    • Widely supported across platforms.

  • Limitations:

    • Cannot store formatting, formulas, or data types; every value is plain text.

    • Fields that contain commas, quotes, or line breaks are ambiguous unless they are quoted.

  • Use Case: Data exchange between systems, simple tabular datasets.

Example (CSV):

Name,Age,Score
Alice,23,89
Bob,21,92
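
To make the quoting and typing points concrete, here is a minimal sketch using Python's standard csv module. It reuses the sample rows (written without spaces after the commas) and adds one hypothetical row, "Lee, Carol", that is not part of the original data, purely to show a quoted field that contains a comma.

```python
import csv
import io

# Sample rows plus one hypothetical row whose name contains a comma.
# The quotes around "Lee, Carol" keep the comma from splitting the field.
raw = 'Name,Age,Score\nAlice,23,89\nBob,21,92\n"Lee, Carol",22,95\n'

# DictReader uses the header row as keys; every value arrives as a string.
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV carries no type information, so numeric columns are converted by hand.
for row in rows:
    row["Age"] = int(row["Age"])
    row["Score"] = int(row["Score"])

# Writing back: the csv module re-quotes the comma-containing field itself.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Name", "Age", "Score"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Reading from an in-memory string keeps the sketch self-contained; for a file on disk, the csv documentation recommends opening it with newline="" before passing it to the reader or writer.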

2. MS Excel Open XML Spreadsheet (.xlsx)

  • Structure: XML-based zipped files created by Microsoft Excel. Each workbook can contain multiple sheets, formulas, charts, and metadata.

  • Strengths:

    • User-friendly and widely used in business.

    • Supports advanced features (formulas, pivot tables, charts).

    • Can store sizable structured datasets: each sheet holds up to 1,048,576 rows and 16,384 columns.

  • Limitations:

    • Heavier and slower to process than CSV.

    • Requires a library such as openpyxl (or an API) for automation; note that xlrd 2.0 and later reads only the legacy .xls format, not .xlsx.

  • Use Case: Business reporting, financial modeling, structured datasets with formulas.
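
Because an .xlsx file is a zip archive of XML parts, its structure can be inspected with nothing more than Python's standard zipfile module. The sketch below is only an illustration: the helper name list_xml_parts is made up for this example, and the archive it inspects is an in-memory stand-in with placeholder contents, not a workbook Excel could open.

```python
import io
import zipfile


def list_xml_parts(source):
    """Return the sorted names of the XML parts inside a zip-based file."""
    with zipfile.ZipFile(source) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(".xml"))


# Stand-in archive built in memory so the sketch runs without a real file.
# The part names mimic a typical workbook layout, but the contents are
# placeholders, so this is NOT a valid workbook.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
buffer.seek(0)

parts = list_xml_parts(buffer)
print(parts)
```

The same helper accepts a file path, so pointing it at a real workbook on disk would list that workbook's parts. For reading cell values rather than inspecting structure, a dedicated library such as openpyxl is the more practical route.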


3. XML (eXtensible Markup Language)

  • Structure: Hierarchical, tag-based format. Data is enclosed within user-defined tags.

  • Strengths:

    • Human-readable and machine-readable.

    • Good for hierarchical or nested data.

    • Widely used in web services, configuration files, and legacy systems.

  • Limitations:

    • Verbose: every value is wrapped in opening and closing tags, so files grow large quickly.

    • Higher parsing overhead than simpler formats such as CSV or JSON.

  • Use Case: Metadata storage, configuration, web APIs (SOAP), document exchange.

Example:

<Student>
    <Name>Alice</Name>
    <Age>23</Age>
    <Score>89</Score>
</Student>
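
As a rough sketch of how such a record might be read in Python, the standard library's xml.etree.ElementTree can parse the Student example and turn its child elements into a dictionary. The variable names are illustrative, and the integer conversion assumes that Age and Score always hold whole numbers.

```python
import xml.etree.ElementTree as ET

xml_text = """
<Student>
    <Name>Alice</Name>
    <Age>23</Age>
    <Score>89</Score>
</Student>
""".strip()

# Parse the document; root is the <Student> element.
root = ET.fromstring(xml_text)

# Each child element becomes one key-value pair; element text is a string.
student = {child.tag: child.text for child in root}

# XML text has no types, so numeric fields are converted explicitly.
student["Age"] = int(student["Age"])
student["Score"] = int(student["Score"])

print(root.tag, student)
```

This flat mapping works only because each tag appears once; repeated or deeply nested elements would need lists or a recursive walk instead.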

4. PDF (Portable Document Format)

  • Structure: Binary/encoded format developed by Adobe to preserve document layout. Can contain text, images, tables, and multimedia.

  • Strengths:

    • Universally readable.

    • Preserves formatting across devices.

    • Widely used for official reports, research papers, scanned documents.

  • Limitations:

    • Not designed for raw data storage.

    • Extracting structured data is complex: it requires libraries like pypdf (the successor to PyPDF2) or pdfplumber, and OCR for scanned PDFs.

  • Use Case: Reports, academic papers, financial statements, scanned data sources.


5. JSON (JavaScript Object Notation)

  • Structure: Lightweight text format for representing structured data as key–value pairs. Supports nesting and arrays.

  • Strengths:

    • Human- and machine-readable.

    • Native to web APIs (especially REST).

    • Easily handled in Python, R, and almost every modern language.

  • Limitations:

    • Unwieldy for very large datasets: keys repeat in every record, and standard parsers load the whole document into memory.

    • No built-in schema enforcement (but can be combined with standards like JSON Schema).

  • Use Case: Web APIs, configuration files, hierarchical/nested data.

Example:

{
  "Name": "Alice",
  "Age": 23,
  "Score": 89
}
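
A short sketch with Python's standard json module shows how this record is parsed and how nesting works. The "Courses" field is a hypothetical addition, not part of the example above, included only to illustrate an array of nested objects.

```python
import json

text = '{"Name": "Alice", "Age": 23, "Score": 89}'

# Unlike CSV or XML, JSON distinguishes numbers from strings,
# so Age and Score arrive as Python ints without manual conversion.
student = json.loads(text)

# Hypothetical nested field: an array of objects inside the record.
student["Courses"] = [{"Title": "Statistics", "Grade": "A"}]

# Serialize with indentation for readability, then parse it back.
encoded = json.dumps(student, indent=2)
round_trip = json.loads(encoded)

print(encoded)
print(round_trip["Courses"][0]["Title"])
```

The round trip returns a structure equal to the original, which is part of why JSON is convenient for exchanging nested data between services.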

Conclusion

  • CSV/TSV → Simple tabular data, great for portability.

  • Excel (.xlsx) → Business-friendly, formula-rich spreadsheets.

  • XML → Structured, tag-based, useful for hierarchical data.

  • PDF → Perfect for documents, harder for raw data.

  • JSON → Lightweight, flexible, and the backbone of modern web data.