WebScrapeSummarizer: Turning Web Content into Concise Summaries

Zeeshan Ahmad
4 min read · Sep 6, 2023

1. Introduction 🚀

In today's digital age, the internet is overflowing with content. Between lengthy articles and verbose web pages, it is easy to feel overwhelmed by information. What if a tool could distill this ocean of content into concise summaries? Enter WebScrapeSummarizer, a tool designed to fetch, process, and summarize web content with the power of OpenAI.

2. Setting the Stage 🎬

WebScrapeSummarizer is built upon a robust tech stack:

  • PHP for server-side processing.
  • OpenAI to harness the power of state-of-the-art natural language processing.
  • cURL-based web scraping to fetch data from the web.

3. Diving into the Architecture 🏗️

Our project is structured into distinct modules, each responsible for a specific function: scraper.php fetches web content, openai.php interfaces with OpenAI, csv_handler.php writes the results, and index.php ties them together. This keeps the architecture modular and easy to extend.

Figure: Directory Structure

4. Web Scraping Magic 🕸️

Web scraping is the backbone of our tool. It's the process of fetching content from the web, and it acts as our data source. The scraper.php module is where this magic happens.

// Snippet from scraper.php

function scrapeContent($domain) {
    // Initialize cURL session
    $ch = curl_init();

    // Set cURL options
    curl_setopt($ch, CURLOPT_URL, $domain);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // Follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'); // Set a user agent to simulate a browser request

    // Execute cURL session and fetch the content
    $content = curl_exec($ch);

    // Check for any cURL errors, closing the session before returning
    if (curl_errno($ch)) {
        $error = curl_error($ch);
        curl_close($ch);
        return "Error: " . $error;
    }

    // Close cURL session
    curl_close($ch);

    // Return the scraped content
    return $content;
}
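
The scraper returns raw HTML, which usually carries markup, scripts, and styles that add nothing to a summary. As a minimal sketch (the helper name extractReadableText and the 4000-character budget are assumptions, not part of the original scraper.php), the page could be reduced to plain text before it is passed on:

```php
<?php
// Hypothetical helper (not in the original scraper.php): reduce raw HTML to
// plain text and cap its length before it is sent for summarization.
function extractReadableText(string $html, int $maxChars = 4000): string
{
    // Drop <script> and <style> blocks entirely, including their contents
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;

    // Put a space before every tag so adjacent elements do not fuse into one
    // word, then strip the tags and decode entities such as &amp;
    $text = strip_tags(str_replace('<', ' <', $html));
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces
    $text = trim(preg_replace('/\s+/', ' ', $text) ?? $text);

    // Truncate to the assumed character budget (multibyte-safe when possible)
    return function_exists('mb_substr')
        ? mb_substr($text, 0, $maxChars)
        : substr($text, 0, $maxChars);
}
```

A call such as extractReadableText(scrapeContent($url)) would then hand the summarizer text rather than markup. A character cap is only a rough stand-in for a token limit.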

5. Powering Up with OpenAI 🤖

OpenAI, with its advanced natural language processing capabilities, transforms the scraped content into concise summaries. The openai.php module integrates our tool with OpenAI's API.

// Snippet from openai.php

function processWithOpenAI($content) {
    // OpenAI API endpoint (for GPT-3, for instance)
    $apiEndpoint = 'https://api.openai.com/v1/engines/davinci/completions';

    // Your OpenAI API Key
    $apiKey = 'YOUR_OPENAI_API_KEY'; // Replace with your actual API key

    // Initialize cURL session
    $ch = curl_init();

    // Set cURL options
    curl_setopt($ch, CURLOPT_URL, $apiEndpoint);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'Authorization: Bearer ' . $apiKey,
        'Content-Type: application/json'
    ]);
    curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode([
        'prompt' => $content,
        'max_tokens' => 150 // Limit the response to 150 tokens
    ]));

    // Execute cURL session and get the response
    $response = curl_exec($ch);

    // Check for any cURL errors, closing the session before returning
    if (curl_errno($ch)) {
        $error = curl_error($ch);
        curl_close($ch);
        return "Error: " . $error;
    }

    // Close cURL session
    curl_close($ch);

    // Decode the JSON response
    $responseData = json_decode($response, true);

    // Return the summarized content from OpenAI's response
    return $responseData['choices'][0]['text'] ?? 'Error processing content';
}
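
The snippet above sends the page content as a bare prompt to an engine-based completions route. As a hedged sketch of an alternative, the request could carry an explicit summarization instruction and target OpenAI's Chat Completions endpoint (POST https://api.openai.com/v1/chat/completions). The helper names, the prompt wording, the model name gpt-3.5-turbo, and the OPENAI_API_KEY environment variable are all assumptions for illustration, not taken from the project:

```php
<?php
// Hypothetical helper: build the JSON body for a Chat Completions request.
// The model name and prompt wording are assumptions, not project settings.
function buildSummaryRequest(string $text, int $maxTokens = 150, string $model = 'gpt-3.5-turbo'): string
{
    $payload = [
        'model' => $model,
        'messages' => [
            ['role' => 'system', 'content' => 'You summarize web pages in a few sentences.'],
            ['role' => 'user', 'content' => "Summarize the following page:\n\n" . $text],
        ],
        'max_tokens' => $maxTokens, // Caps the length of the generated summary
    ];

    return json_encode($payload, JSON_THROW_ON_ERROR);
}

// Hypothetical helper: pull the summary text out of a decoded chat response,
// falling back to the same error string the original snippet uses.
function extractSummary(array $responseData): string
{
    return trim($responseData['choices'][0]['message']['content'] ?? 'Error processing content');
}

// Read the key from the environment instead of hard-coding it in the source
$apiKey = getenv('OPENAI_API_KEY') ?: 'YOUR_OPENAI_API_KEY';
```

The string returned by buildSummaryRequest would be passed to CURLOPT_POSTFIELDS, and the decoded response to extractSummary. The rest of the cURL setup stays as in processWithOpenAI.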

6. Serving the Output: CSV Files 📊

Why CSV? It's a widely supported format that is easy to integrate with other tools and platforms. Our csv_handler.php module handles this, turning processed data into downloadable CSV files.

// Snippet from csv_handler.php

function writeToCSV($data, $filename = "output.csv") {
    // Open or create the CSV file in write mode
    $fileHandle = fopen($filename, 'w');

    if (!$fileHandle) {
        return "Error: Unable to open or create the CSV file.";
    }

    // Loop through the data and write each line to the CSV
    foreach ($data as $row) {
        fputcsv($fileHandle, $row);
    }

    // Close the file handle
    fclose($fileHandle);

    // Return the name of the file that was written to
    return $filename;
}
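
writeToCSV expects $data to be an array of rows, each row itself an array of fields. The sketch below shows one way such rows might be shaped and how fputcsv quotes fields that contain commas or quotation marks. The helper names and the url/summary column names are assumptions, and the example writes to an in-memory stream rather than a file so it leaves nothing on disk:

```php
<?php
// Hypothetical helper: shape one CSV row per summarized page, keyed by URL.
function buildCsvRows(array $summaries): array
{
    $rows = [['url', 'summary']]; // Header row (column names are assumptions)
    foreach ($summaries as $url => $summary) {
        $rows[] = [$url, $summary];
    }
    return $rows;
}

// Render rows to a CSV string in memory, using fputcsv as writeToCSV does.
function rowsToCsvString(array $rows): string
{
    $handle = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        // Delimiter, enclosure, and escape are passed explicitly (the defaults)
        fputcsv($handle, $row, ',', '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

$rows = buildCsvRows(['https://example.com' => 'A short, "quoted" summary.']);
echo rowsToCsvString($rows);
```

The same $rows array could be passed directly to writeToCSV($rows) to produce output.csv.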

7. Bringing It All Together: The Interface 🖥️

The user interface is where all these components come together. The index.php page provides a simple form where users enter a domain and get the summarized content back.

Figure: Interface in Action
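
How index.php wires the modules together is not shown in this post, so the following is only a sketch of one plausible flow: normalize what the user typed into a full URL, scrape it, and summarize the result. The function names normalizeUrl and summarizeDomain are hypothetical, and the scraper and summarizer are passed in as callables so the flow can be tried without network access:

```php
<?php
// Hypothetical helper: turn user input such as "example.com" into a full URL,
// or return null when the result does not validate.
function normalizeUrl(string $input): ?string
{
    $input = trim($input);
    if ($input === '') {
        return null;
    }

    // Assume HTTPS when the user typed a bare domain
    if (!preg_match('#^https?://#i', $input)) {
        $input = 'https://' . $input;
    }

    return filter_var($input, FILTER_VALIDATE_URL) !== false ? $input : null;
}

// Hypothetical glue code: returns one [url, summary] row, ready for a CSV.
function summarizeDomain(string $input, callable $scrape, callable $summarize): array
{
    $url = normalizeUrl($input);
    if ($url === null) {
        throw new InvalidArgumentException('Please enter a valid domain or URL.');
    }

    return [$url, $summarize($scrape($url))];
}
```

With the modules from the earlier sections loaded, a call might look like summarizeDomain($_POST['domain'], 'scrapeContent', 'processWithOpenAI'), where the form field name domain is likewise an assumption.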

8. Setting Up and Running WebScrapeSummarizer 🛠️

Ready to try it out? Follow the Setup and Installation Instructions from our GitHub README to get started.

Don't forget to check out the repository and star it if you like the project!

9. Conclusion and Call to Action 🎉

Building WebScrapeSummarizer has been an enlightening journey. From concept to code, the project stands as a testament to the power of combining web scraping with AI. We invite you to try WebScrapeSummarizer, star our repo, contribute, or simply share this post to spread the word!

10. Additional Resources 📚

Building a Google Chatbot

Discord Bot for Physics Students: An article I penned earlier.

EtherTeleBot: Telegram Crypto Bot

Zeeshan Ahmad

AI/ML/DL enthusiast | Python/Web Automation expert | Passionate Problem Solver