Scraping Blogs With Python | Newspaper3k CheatSheet

Newspaper3k is a Python library that allows for easy access to articles from online newspapers. It can be used to scrape articles from a wide range of news sources and extract information such as the article's title, author, text, and more. It is designed to be easy to use and allows for simple integration into other applications or scripts.

Getting Started

To install the library, run:

pip install newspaper3k

Once you have the library installed, you can start using it in your Python script.

To start, you will need to import the library:

from newspaper import Article

Once you have imported the library, you can start using it by creating an instance of the Article class and passing in the URL of the article you want to scrape:

article = Article('https://sid.black/blogs/gpt3-crash-course')

To download the article's content, you'll need to call the download method:

article.download()

To parse the article's HTML, you'll need to call the parse method:

article.parse()

You can then access the article's properties, such as the title, text, and authors:

title = article.title
text = article.text
authors = article.authors

If you want to extract the article's images, you can use the article.images property:
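
images = article.images  # collection of image URLs found in the article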

You can also extract the article's keywords and summary by calling the nlp method:

article.nlp()
keywords = article.keywords
summary = article.summary
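
Note: nlp() relies on NLTK tokenizer data under the hood. If it raises a LookupError the first time you call it, download the punkt tokenizer once:

import nltk
nltk.download('punkt')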

This should be enough to get you started with this amazing library.

Advanced Usage Of Newspaper3k

Extracting Article Metadata: The Article class exposes several properties for metadata such as the article's publish date, top image, and meta keywords. For example, you can use the publish_date property to get the date the article was published, and the top_image property to get the URL of the article's lead image.

article.download()
article.parse()

publish_date = article.publish_date
top_image = article.top_image
meta_keywords = article.meta_keywords
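
A few other properties pulled from the page's meta tags are available as well:

meta_description = article.meta_description
meta_lang = article.meta_lang
meta_data = article.meta_data  # dict of all meta tags found on the page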

Article Crawling: In addition to scraping individual articles, you can use the build function to crawl a news website and discover all of its article URLs. This is useful if you want to scrape multiple articles from the same website without having to specify each article's URL manually.

from newspaper import build

source = build('https://www.example.com/')
for article in source.articles:
    print(article.url)
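
Note that build() only discovers the articles: each one still has to be downloaded and parsed. By default newspaper also memoizes the article URLs it has already seen, so pass memoize_articles=False if you want the full list on every run:

import newspaper

source = newspaper.build('https://www.example.com/', memoize_articles=False)
for article in source.articles[:5]:  # limit to the first few for a quick test
    article.download()
    article.parse()
    print(article.title)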

Multilingual Article Support: Newspaper3k supports multiple languages and can automatically detect an article's language. You can also specify the language explicitly when initializing the Article class.

article = Article(url='https://sid.black', language='es')
article.download()
article.parse()
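
To see which language codes are supported, call newspaper.languages(), which prints the full list:

import newspaper

newspaper.languages()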

Custom Configuration: Scraping behavior is configured through the newspaper.Config class, which lets you set the request timeout, the user agent, the number of download threads, and other options. For downloading several sources concurrently, the library also provides a shared news_pool object (an instance of the NewsPool class), whose set method takes a list of built sources and a threads_per_source count:

import newspaper
from newspaper import news_pool

papers = [newspaper.build('https://www.example.com/')]
news_pool.set(papers, threads_per_source=2)
news_pool.join()  # blocks until every article in every source is downloaded
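
For per-request settings, here is a minimal sketch using Config (the user-agent string is just a placeholder; substitute your own):

from newspaper import Article, Config

config = Config()
config.request_timeout = 10                    # seconds before a download times out
config.browser_user_agent = 'my-scraper/1.0'   # placeholder UA string
config.memoize_articles = False                # don't cache seen article URLs between runs

article = Article('https://sid.black/blogs/gpt3-crash-course', config=config)
article.download()
article.parse()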

Finding RSS Feeds: Newspaper3k does not expose a public class for parsing XML or RSS feeds directly, but when you build a source it discovers the site's feeds for you, and you can list them with feed_urls():

import newspaper

source = newspaper.build('https://www.example.com/')
for feed_url in source.feed_urls():
    print(feed_url)
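
If you already have a feed URL, a common pattern is to pair the third-party feedparser library (installed separately with pip install feedparser; not part of newspaper3k) with the Article class:

import feedparser
from newspaper import Article

feed = feedparser.parse('https://www.example.com/rss')
for entry in feed.entries:
    article = Article(entry.link)
    article.download()
    article.parse()
    print(article.title)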

Article Text Cleaning: Newspaper3k includes a built-in document cleaner that strips unwanted elements from the article, such as ads, related links, and other noise. You don't have to call it yourself: it runs automatically during parse(), so article.text is already cleaned. The cleaned DOM subtree is also available as the clean_top_node attribute (an lxml element) if you want to inspect it:

article.download()
article.parse()

print(article.text)            # cleaned article body
node = article.clean_top_node  # lxml element with the cleaned markup

Custom Article Extraction: If you need to extract information that the Article class doesn't expose directly, you can use the article.html property to access the raw HTML, and then use a tool like BeautifulSoup to pull out what you need.

from bs4 import BeautifulSoup
from newspaper import Article

article = Article('https://sid.black/blogs/gpt3-crash-course')
article.download()

soup = BeautifulSoup(article.html, 'html.parser')

# Example: extract every subheading from the raw HTML
for h2 in soup.find_all('h2'):
    print(h2.get_text(strip=True))

Article caching: If you are scraping a large number of articles, it helps to avoid re-downloading the same article multiple times. Newspaper3k handles part of this for you: with the memoize_articles option enabled (the default), build() caches the article URLs it has already seen and skips them on later runs. At the article level, the is_downloaded and is_parsed attributes (plain booleans, not methods) tell you whether download() and parse() have already run:

if not article.is_downloaded:
    article.download()

if not article.is_parsed:
    article.parse()
    article.nlp()

In conclusion, Newspaper3k is a powerful and easy-to-use Python library for scraping articles from online newspapers. It provides a simple API for extracting information such as an article's title, author, text, and more, and it supports advanced features like article crawling, multilingual support, and custom configuration. It also cleans article text automatically and caches article URLs to avoid repeat downloads.

The library is well-documented and actively maintained, making it a great choice for developers looking to scrape news articles in their projects. It can also discover a site's RSS feeds, which makes it a versatile tool.

It's important to keep in mind that web scraping can be against some websites' terms of service, so you should always check a site's terms before scraping its data.