Scraping Blogs With Python | Newspaper3k CheatSheet
Newspaper3k is a Python library for extracting and curating articles from online newspapers. It can scrape articles from a wide range of news sources and pull out information such as the article's title, authors, text, and more, and it is designed to integrate easily into other applications and scripts.
Getting Started
pip install newspaper3k
Once you have the library installed, you can start using it in your Python script.
To start, you will need to import the library:
from newspaper import Article
Once you have imported the library, you can start using it by creating an instance of the Article class and passing in the URL of the article you want to scrape:
article = Article('https://sid.black/blogs/gpt3-crash-course')
To download the article's content, you'll need to call the download method:
article.download()
To parse the article's HTML, you'll need to call the parse method:
article.parse()
You can then access properties such as the article's title, text, and authors:
title = article.title
text = article.text
authors = article.authors
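Putting these steps together, here is a minimal end-to-end sketch. It assumes the ArticleException import path from the library's article module; parse() raises that exception when the download has failed:
from newspaper import Article
from newspaper.article import ArticleException
url = 'https://sid.black/blogs/gpt3-crash-course'
article = Article(url)
try:
    article.download()
    article.parse()  # raises ArticleException if the download failed
except ArticleException as exc:
    print(f'Could not fetch {url}: {exc}')
else:
    print(article.title)
    print(article.authors)
    print(article.text[:200])  # first 200 characters of the extracted text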
If you want to extract the article's images, you can use the article.images property.
You can also extract the article's keywords and summary by calling the nlp method:
article.nlp()
keywords = article.keywords
summary = article.summary
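Note that nlp() relies on NLTK tokenizer data under the hood. If the first call fails with a missing-resource error, downloading the punkt tokenizer is usually the fix (a one-time setup step):
import nltk
# One-time download of the sentence tokenizer that nlp() depends on
nltk.download('punkt')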
This should be enough to get you started with this amazing library.
Advanced Usage Of Newspaper3K
Extracting Article Metadata: The Article class exposes several properties for metadata such as the article's publish date, top image, and meta keywords. For example, you can use the publish_date property to get the date the article was published, and the top_image property to get the URL of the article's top image.
article.download()
article.parse()
publish_date = article.publish_date
top_image = article.top_image
meta_keywords = article.meta_keywords
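A few more metadata attributes are populated by parse() and worth knowing about:
meta_description = article.meta_description  # content of the page's description meta tag
meta_lang = article.meta_lang                # detected language code, e.g. 'en'
canonical_link = article.canonical_link      # the page's canonical URL, if declared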
Article Crawling: In addition to scraping individual articles, you can use the build function to crawl a news website and collect all of its articles. This is useful if you want to scrape multiple articles from the same website without having to specify each article's URL by hand.
from newspaper import build
source = build('https://www.example.com/')
for article in source.articles:
    print(article.url)
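Each entry in source.articles is an Article object whose URL is set but whose content has not been downloaded yet, so you still call download() and parse() on the ones you want. A small sketch, limited to the first five articles for illustration:
print(source.size())  # number of article URLs discovered
for article in source.articles[:5]:
    article.download()
    article.parse()
    print(article.title)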
Multilingual Article Support: Newspaper3k supports multiple languages and can automatically detect the language of an article. You can also specify the language explicitly when initializing the Article class.
article = Article(url='https://sid.black', language='es')
article.download()
article.parse()
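To see which language codes are available, the library ships a helper that prints the full table of supported languages:
import newspaper
newspaper.languages()  # prints supported codes such as 'en', 'es', 'zh'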
Custom Configuration: The newspaper.Config class lets you create a custom configuration for your scraping tasks. You can set the request timeout, the user agent string, the number of threads used for downloading, and other settings, then pass the configuration to Article or build:
from newspaper import Config, Article
config = Config()
config.request_timeout = 10  # seconds to wait for each download
config.number_threads = 2    # threads used when downloading in bulk
article = Article('https://sid.black/blogs/gpt3-crash-course', config=config)
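For downloading several sources concurrently, the library also provides the news_pool helper (a NewsPool instance). The sketch below follows the pattern from the project's README; the two URLs are placeholders:
import newspaper
from newspaper import news_pool
papers = [newspaper.build(url, memoize_articles=False)
          for url in ['https://www.example.com/', 'https://www.example.org/']]
news_pool.set(papers, threads_per_source=2)  # two download threads per source
news_pool.join()                             # blocks until every article is downloaded
# After join(), each article's HTML is in memory; parse as usual
papers[0].articles[0].parse()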
Article Extraction from RSS Feeds: Newspaper3k does not expose a standalone public Feed class, but a Source created with build() discovers the site's RSS feeds during the build step, and you can list their URLs with the feed_urls method:
from newspaper import build
source = build('https://www.example.com/')
for feed_url in source.feed_urls():
    print(feed_url)
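To pull individual entries out of a feed, one common approach (using the third-party feedparser package rather than anything built into Newspaper3k) is to parse the feed yourself and hand each entry's link to Article:
import feedparser
from newspaper import Article
feed = feedparser.parse('https://www.example.com/rss')  # placeholder feed URL
for entry in feed.entries:
    article = Article(entry.link)
    article.download()
    article.parse()
    print(article.title)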
Article Text Cleaning: Newspaper3k runs a built-in document cleaner as part of parse(), removing unwanted elements such as ads, related links, and other noise before the text is extracted, so article.text is already cleaned. The cleaned DOM is exposed on the article as clean_top_node (an lxml element), which is an attribute rather than a method you call.
article.download()
article.parse()
print(article.text)  # already cleaned during parse()
Custom Article Extraction: If you need to extract information from an article that is not directly supported by the Article class, you can use the article.html property to access the raw HTML of the article, and then use a tool like BeautifulSoup to extract the information you need.
from bs4 import BeautifulSoup
article.download()
html = article.html
soup = BeautifulSoup(html, 'html.parser')
# Example: extract every hyperlink on the page
for link in soup.find_all('a'):
    print(link.get('href'))
Article Caching: If you are scraping a large number of articles, it helps to avoid re-downloading articles you have already seen. Newspaper3k handles this at the source level: by default, build() caches the article URLs it has already encountered (the memoize_articles option, enabled by default), so repeated builds of the same site only return articles it has not seen before. Disable it when you want the complete article list on every run:
import newspaper
# Default behavior: previously seen article URLs are cached and skipped
paper = newspaper.build('https://www.example.com/')
# Disable caching to always fetch the full article list
fresh = newspaper.build('https://www.example.com/', memoize_articles=False)
In conclusion, Newspaper3k is a powerful and easy-to-use Python library for scraping articles from online newspapers. It provides a simple API for extracting information such as the article's title, authors, text, and more, and it supports advanced features like article crawling, multilingual support, and custom configurations. It also has a built-in text cleaner and URL-level caching of already-seen articles.
The library is well-documented and actively maintained, making it a great choice for developers looking to scrape news articles in their projects. It can also discover a site's RSS feeds, which makes it a versatile tool.
It's important to keep in mind that web scraping can be against some websites' terms of service, so you should always check a site's terms before scraping its data.