Data Sourcing¶
Creating your own dataset is always recommended, but there are many reasons why you may prefer to work with existing material. If you are not interested in creating your own dataset, many data files already exist online that you might explore with visualization.
This section will offer a few places you might find previously compiled data. This is definitely not an exhaustive list, but more of a launching-off point to get you started.
Locating Existing Datasets¶
There are seemingly innumerable places from which to source data, depending on your field, interests, and skill level. The list below provides just a sampling of places you might look.
National Level and Governmental Data¶
Datasets Produced by Specific Governmental Agencies or Departments, e.g. EPA, NOAA
Census Data: US Census
City and/or State-Level Data, e.g. NYC Open Data, New York State Open Data (Note: many states and cities provide datasets in this way.)
Popular Large Data Repositories¶
Academic Research Projects and Repositories¶
UN Peace Agreements Database at University of Edinburgh
NULab for Texts, Maps, and Networks dataset list created at Northeastern University
Datasets and Lists Created by Tools¶
Gephi’s Sample Datasets
Tableau’s “Datasets for Everyone”
Datasets by Journalists¶
Humanities and Heritage Data Collections¶
Many (likely most) major museums, libraries, and archives across the world have released their data in various formats. The examples below illustrate the types of institutions to look into on your own. Also included are datasets relevant to digital humanists.
Post45 Data Collective contains peer-reviewed datasets by digital humanists and social scientists. Note: this is a new resource, and so datasets will be continuously added.
Project Gutenberg contains literary and historical texts out of copyright. Note: if you are interested in working with material from Project Gutenberg, you should refer to Allison Parrish’s “Gutenberg, dammit” project.
Library of Congress Labs: the “LC for Robots” and “Web Archive Datasets” pages include relevant LOC dataset information.
Note: If anyone is aware of free, awesome (and esp. inclusive/feminist/ethical) datasets not included above, please do reach out. I’d love to include them! I realize that there are many large data repositories, but those that have been particularly useful for learning (or are particularly extensive and wide-ranging) would be ideal.