Resources
Scientific Communication
Open Source Tools
Apache Hadoop is a project developing open-source software for reliable, scalable, distributed computing
Apache Flink is an open-source platform for distributed stream and batch data processing
Apache Spark is a general engine for large-scale data processing
Data Privacy
Enabling Big Data through Europe’s New Data Protection Regulation. Viktor Mayer-Schönberger & Yann Padova
Privacy in the Age of Big Data, The Stanford Law Review
Ethics and Big Data
Perspectives on Big Data, Ethics, and Society. May 23, 2016 / By Jacob Metcalf, Emily F. Keller, and Danah Boyd
The Social, Cultural, & Ethical Dimensions of “Big Data”, March 17, 2014 – New York, NY
Tutorials
Rules of Machine Learning: Best Practices for ML Engineering
Machine Learning Tutorial for Beginners (kaggle.com/kanncaa1)
Deep Learning and Convolutional Neural Networks
Introducing convolutional networks
Natural Language Processing (almost) from Scratch (arxiv.org)
Deep Learning applied to NLP (arxiv.org)
Machine Learning Crash Course (google.com)
Machine Learning with Python (tutorialspoint.com)
Python Machine Learning (2nd Ed.) Code Repository (github.com/rasbt)
Advanced AI Tools
TensorFlow - an open source software library for numerical computation using data flow graphs.
The Microsoft Cognitive Toolkit: A free, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.
Geographic Datasets
Global Map: A set of consistent GIS layers covering the whole globe at 1km resolution including: transportation, elevation, drainage, vegetation, administrative boundaries, land cover, land use and population centres. Produced by the International Steering Committee on Global Mapping.
Koordinates: GIS data aggregation site including data in a number of categories such as elevation, environment, climate etc. Some global datasets, some based on continents, some for specific countries. Registration required.
European Environment Agency: Maps and datasets from the European Environment Agency, covering a huge range of physical geography and environmental topics. Europe only.
Satellite Application Facility on Climate Monitoring: Provides near real-time and retroactively-generated datasets of cloud cover, type and temperature, surface radiation budget and temperatures, among others.
Gridded climatic data for North America, South America and Europe: A huge range of climatic data at 1km and 4km resolution, derived from various models, including temperature, precipitation, snow and derived variables such as water deficit.
Natural Disaster Hazards: Hazard Frequency, Mortality and Economic Loss Risk as gridded data for the globe. Covers cyclones, drought, earthquakes, flood, landslide, volcano and a combination of them all.
Natural Disaster Hotspots: A wide range of geographic data on natural disasters (including volcanoes, earthquakes, landslide, flood and 'multihazards') with hazard frequency, economic loss etc.
Open Flights: Airport, airline and route data across the globe. Data is provided as CSV files which can be easily processed to produce GIS outputs. Data includes all known airports, and a large number of routes between airports.
Global Roads Open Access Data Set: A vector dataset of roads across the world, using a globally consistent data model, and suitable for mapping at the 1:250,000 level. Only roads between settlements are included, not residential streets, and the dataset is accurate to approximately 50m.
Earth Engine’s public data catalog includes a variety of standard Earth science raster datasets.
Capitaine European Train Stations: Metadata for all train stations in Europe including latitude and longitude.
GAR15: UN dataset for Global Assessment of Risk, showing the amount of capital invested in infrastructure at a 5km resolution. Useful for assessment of infrastructure risk and cost of natural disasters.
MODIS provides continuous global coverage every one to two days, and collects data from 36 spectral bands. Resolution: 250-1000m. 1999 Wide range of different datasets.
DATASETS for Data Science, Machine Learning and AI courses
The following datasets have been filtered and refined from a social media (Twitter) dataset and can be used in courses on Big Data, Machine Learning, Data Science and AI.
DTdata has a header row with four attributes (Topic, TWDate, RTNumber and Demand) and 27 rows of training data. Demand is the output variable (the predicted class); the other three attributes are the input variables.
NBdata is the same as DTdata above, except for the retweet count: the numerical RTNumber column is converted to categorical values so that the probabilities are easier to calculate. In addition, the data set contains one record as test data.
KMdata contains 161 tweets with location data (i.e. GPS coordinates) for clustering. Note that the latitude and longitude were obtained by geocoding the physical addresses extracted from the collected tweets, and that negative (west) longitude values were changed to positive values for the k-means clustering.
SVMdata1 is generated by grouping KMdata into two or three clusters. It contains four columns: TWNumber, Latitude, Longitude and ClusterValue. The ClusterValue column gives the group number produced by k-means clustering.
ANNdata is derived from an original data set and contains five columns: TWDate, RTNumber (as an integer), Latitude, Longitude and Demand. TWDate was converted to the day of generation (i.e. 27, 28 and 29), and Demand takes one of three values (i.e. 0, 0.5 and 1). The values denote the degree of relevance of a tweet to demand: "0" represents "no relevance for demand" and "1" represents "related to demand."
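The preprocessing described for NBdata and KMdata can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build the datasets: the retweet thresholds, the labels "low", "medium" and "high", the sample coordinates and the helper names bin_retweets, flip_west_longitude and kmeans are all hypothetical, and the clustering is a plain Lloyd's k-means written with the standard library.

```python
import random
from math import dist


def bin_retweets(rt_number, low=10, high=100):
    """Map a numeric retweet count to a categorical label.

    The thresholds are hypothetical; the bins used for NBdata are not stated.
    """
    if rt_number < low:
        return "low"
    if rt_number < high:
        return "medium"
    return "high"


def flip_west_longitude(lon):
    """Turn a negative (west) longitude into a positive value, as described for KMdata."""
    return abs(lon)


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on (latitude, longitude) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical geocoded tweets: (latitude, longitude), all with west longitudes.
coords = [(40.7, -74.0), (40.8, -73.9), (34.0, -118.2), (34.1, -118.3)]
points = [(lat, flip_west_longitude(lon)) for lat, lon in coords]
labels, centroids = kmeans(points, k=2)

# Hypothetical retweet counts converted to categories.
categories = [bin_retweets(n) for n in (3, 50, 500)]
```

Here labels plays the role of the ClusterValue column in SVMdata1. Note that taking the absolute value of the longitude maps an east and a west longitude of the same magnitude to the same number, so this sketch only makes sense when all points lie on one side of the prime meridian.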
Data Github Repository. In this repository you can find direct links to all the Public datasets, and you can find datasets for all the domains.
UCI (University of California, Irvine) datasets. Here you can get access to free data sets.
Open ML https://openml.org You can find more than 20,000 datasets here.
Datasets in Norway
Norwegian Mapping Agency Open Data: Open data from the Norwegian Mapping Agency, including topographical maps, road networks, elevation data, place names etc.
An API with ready-made datasets from SSB
Floods datasets in Norway
Transport datasets
Norwegian Land Cover: Various datasets concerning land resources in Norway provided by the Norwegian Landscape and Forest Institute, including land type, forest, tree species and site index.
Open and free geospatial data from Norway
Geological Survey of Norway: Geological data for Norway
Norwegian Petroleum Directorate: Data on licensed extraction areas, wells, fields, pipelines and survey data
HSDPA-bandwidth logs for mobile HTTP streaming scenarios (source: UiO)
Soccer Video and Player Position Dataset
Other Video/Audio Datasets
Berkeley DeepDrive BDD100k: A dataset for self-driving AI. It has over 100,000 videos covering more than 1,100 hours of driving across different times of day and weather conditions. The annotated images come from the New York and San Francisco areas.
Europeana Data contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana, the trusted and comprehensive resource for European cultural heritage content.
Pouring Dataset: Videos of people pouring a variety of liquids from and into a variety of receptacles, used for research on unsupervised imitation learning (This data is licensed by Google Inc. under a Creative Commons Attribution 4.0 International License.)
An autonomous driving dataset and benchmark for optical flow: HD1K Benchmark Suite.
Multi-view video datasets based on 360° cameras.
The Cityscapes Dataset focuses on semantic understanding of urban street scenes.
DriveU Traffic Light Dataset: a dataset aimed at researchers in the field of traffic light recognition/detection.
Related Websites, Datasets and Software
Sahana (Open and free system)
Ushahidi (Open and free system)
GeoNames (Geo-tagging software)
OpenStreetMap (Geographical information, important for gazetteers)
PyBossa (Crowdsourcing software)
Data visualization tools
GATE (Text processing)
WEKA (Open-source data mining software in Java)
ArkNLP (Twitter specific Natural Language Processing)
HDX (Humanitarian Data eXchange, datasets of humanitarian variables by UN OCHA)
TREC Temporal Summarization Track (Corpus for social media update summarization)
Twitter Events Corpus: 120 million tweets, with relevance judgments for over 500 events
Disaster Risk - Datasets
TREC Microblog Corpus (Corpus of social media messages)
TREC Temporal Summarization – crisis events from 2012 aligned with TREC KBA Corpus
CrisisLex (Corpora of disaster-related social media messages)
CredBank (Corpus for credibility research)
Japan Radiation Map (derived from the SPEEDI data set)
Open Source Projects (Machine Learning, AI)
The TensorFlow system is designed to facilitate research in machine learning, and to make it quick and easy to transition from research prototype to production system. Github URL: Tensorflow
Scikit-learn provides simple and efficient tools for data mining and data analysis, accessible to anyone and reusable in several contexts. It is built on NumPy, SciPy, and matplotlib, and is open source and commercially usable under the BSD license. Github URL: Scikit-learn
Keras, a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
PyBrain is a modular Machine Learning Library for Python. Github URL: PyBrain
Fuel is a data pipeline framework which provides your machine learning models with the data they need. It is planned to be used by both the Blocks and Pylearn2 neural network libraries. Github URL: Fuel
PyTorch, Tensors and Dynamic neural networks in Python with strong GPU acceleration. Github URL: pytorch
Theano allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Github URL: Theano
Gensim is a free Python library for scalable statistical semantics: it analyses plain-text documents for semantic structure and retrieves semantically similar documents. Github URL: Gensim
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. Github URL: Caffe
Chainer is a Python-based, standalone open source framework for deep learning models. Github URL: Chainer
Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. Github URL: Statsmodels
Shogun is a machine learning toolbox which provides a wide range of unified and efficient Machine Learning (ML) methods. Github URL: Shogun
Neon is Nervana's Python-based deep learning library. It provides ease of use while delivering the highest performance. Github URL: Neon
Nilearn is a Python module for fast and easy statistical learning on NeuroImaging data. It leverages the scikit-learn Python toolbox for multivariate statistics with applications such as predictive modelling, classification, decoding, or connectivity analysis. Github URL: Nilearn
Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. Github URL: Pylearn2
NuPIC is an open source project based on a theory of neocortex called Hierarchical Temporal Memory (HTM). Github URL: NuPIC
Orange3 is open source machine learning and data visualization for novice and expert. Interactive data analysis workflows with a large toolbox. Github URL: Orange3
Pymc is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Github URL: Pymc
Deap is a novel evolutionary computation framework for rapid prototyping and testing of ideas. Github URL: Deap
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. Github URL: Annoy