Part One: Collecting Live News Data Using the News API
This project was completed for Cloud Computing @ George Washington University, with Daniel Anderson, Michael Arango, and Megan Foler. The full python code can be found here.
Introduction
Many of us choose the news we read, with or without knowing it, whether it's by the news sources we visit, articles we click, or personalities we follow. But what about the mood of the articles we want to read, from any source? This project creates a program to provide a user the choice to receive positive or negative news, from news sources around the world.
There are three parts to design this program:
1) Collecting Live News Data Using the News API
First, we will receive news article metadata and perform data preprocessing.
2) Using the Google Translate API to Translate International News
Next, we will translate any articles not in English.
3) Using the Google Natural Language API to Analyze News Sentiment
Finally, we will select positive or negative news, based on the Natural Language analysis and visualize the results.
The News API
The News API is a free to use API to pull live news articles from 70 international sources.
Set Up
Using this API takes few steps to set up. We must first request a key, here.
Then, install the newsapi
python library with pip install newsapi
Once those steps are complete, we can write the following to connect our new API key:
import newsapi
apikey = 'insertkeynumber'
Articles
The News API offers two classes, Articles
and Sources
, that redirect to two endpoints, https://newsapi.org/v1/articles and https://newsapi.org/v1/sources, respectively.
Starting with articles, we make our first request using .get()
#instantiate an article object
from newsapi.articles import Articles
a = Articles(API_KEY=apikey)
data = a.get(source="bbc-news", sort_by='top')
The .get()
method takes three arguments: source (required), sort_by (optional), and attributes_format (optional Default:True). The source
is the identifer for the news source or blog you want headlines from. The sort_by
argument lets the user specify which type of list they want. The possible options are top, latest, and popular. Note: not all options are available for all sources (default: top).
For each article request, the api returns:
author
- The author of the article
description
- A description or preface for the article.
title
- The headline or title of the article.
url
- The direct URL to the content page of the article.
urlToImage
- The URL to a relevant image for the article.
publishedAt
- The best attempt at finding a date for the article, in UTC (+0).1
A snippet of output for BBC News articles only. You can choose any source that you prefer:
{'articles': [{'author': 'BBC News',
'description': 'Five people are reported dead and 1,000 have been rescued, the US National Weather Service says.',
'publishedAt': '2017-08-27T13:54:29Z',
'title': "Storm Harvey: 1,000 rescued as Houston hit by 'catastrophic floods'",
'url': 'http://www.bbc.co.uk/news/world-us-canada-41067315',
'urlToImage': 'https://ichef.bbci.co.uk/images/ic/1024x576/p05dg1pb.jpg'},
{'author': 'BBC News',
'description': 'Rockport residents describe the moment the great storm hit their homes in Texas.',
'publishedAt': '2017-08-27T07:08:39Z',
'title': 'Harvey: Too poor to flee the hurricane',
'url': 'http://www.bbc.co.uk/news/world-us-canada-41065335',
'urlToImage': 'https://ichef.bbci.co.uk/news/1024/cpsprodpb/B42E/production/_97562164_judy.jpg'},
{'author': 'BBC News',
'description': 'The sister of two terrorist suspects has condemned the attacks in Spain which left 15 people dead.',
'publishedAt': '2017-08-26T23:48:54Z',
'title': "Spain attacks: Suspects' sister condemns violence",
'url': 'http://www.bbc.co.uk/news/av/world-europe-41064715/spain-attacks-suspects-sister-condemns-violence',
'urlToImage': 'https://ichef-1.bbci.co.uk/news/1024/cpsprodpb/4728/production/_97561281_p05df89t.jpg'},
Data Processing
A pandas DataFrame is much easer to work with down the road, so we will convert from JSON to DataFrame using pd.DataFrame.from_dict
and .appy([pd.Series])
:
import pandas
data = pd.DataFrame.from_dict(data)
data = pd.concat([data.drop(['articles'], axis=1), data['articles'].apply(pd.Series)], axis=1)
data.head()
Sources
Next, let's work with Sources
The .get()
method for Sources
takes three arguments: category (optional), language (optional), and country (optional).
First set up the API connection
from newsapi.sources import Sources
s = Sources(API_KEY=apikey)
Next, we make a request for all available sources and convert to pandas dataframe
# return all available sources
sources = s.get()
# convert to dataframe and drop status column
sources = data = pd.DataFrame.from_dict(sources).drop('status', axis=1)
# take the array column 'sources' and spread it across multiple columns
sources = pd.concat([sources.drop(['sources'], axis=1),
sources['sources'].apply(pd.Series)], axis=1).drop('urlsToLogos', axis=1)
sources.tail()
Combining the News API Data
As you will see in the next post, we want both article and source information when analyzing the content of these articles.
We now call all articles and transform the dataset into a DataFrame:
results = []
for i in range(len(sources)):
results.append(a.get(sources['id'][i]))
results = pd.DataFrame.from_dict(results)
results = pd.concat([results.drop(['articles'], axis=1),
results['articles'].apply(pd.Series)], axis=1)
We use the pd.melt()
panads function so that our data is organized to reflect one article per row.
results = pd.melt(frame = results, id_vars=['source','status','sortBy'],
var_name='new').drop(['status','sortBy','new'],axis=1)
We want a column for each piece of metadata, so we use apply(pd.Series)
again, and we drop the columns with empty values:
results = pd.concat([results.drop(['value'], axis=1),
results['value'].apply(pd.Series)], axis=1)
results = results.drop([0], axis=1)
results = results.dropna()
Conclusion
We have completed Part 1 of this project. Now, we will use our preprocessed data to perform translation and sentiment analysis.
Click here to continue to Part 2: Calling the Google Translate API to Translate International News
Go Top