[C#] Asynchronously Retrieving Web Page HTML with HttpClient


Overview

This is the most basic way to access a specified URL (web page) and retrieve the response body (HTML or text) as a string. The GetStringAsync method of System.Net.Http.HttpClient issues the HTTP GET request and converts the response body into a string in a single call.


Specifications (Input/Output)

  • Input: Target URL (e.g., the home page of a news site).
  • Output: The character count of the retrieved HTML source code and its first 500 characters.
  • Prerequisite: Uses the standard .NET library (System.Net.Http). The full example reads HttpRequestException.StatusCode, which requires .NET 5 or later. Requires an active internet connection.

Basic Usage

Pass the URL to the GetStringAsync method of an HttpClient instance (singleton usage is recommended).

// Share the instance via a static field
private static readonly HttpClient sharedClient = new HttpClient();

public async Task PrintHtmlAsync()
{
    // Retrieve text asynchronously
    string html = await sharedClient.GetStringAsync("https://www.example.com");
    Console.WriteLine(html);
}
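
To make clear what the single call above covers, the following sketch writes out roughly equivalent steps by hand. The class name GetStringExpanded and the method name FetchAsync are hypothetical names chosen for illustration; the HttpClient calls themselves are standard.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class GetStringExpanded
{
    // Shared instance, as in the snippet above
    private static readonly HttpClient Client = new HttpClient();

    // Roughly what GetStringAsync does in one call, written out step by step
    public static async Task<string> FetchAsync(string url)
    {
        // 1. Send the GET request and wait for the response
        using HttpResponseMessage response = await Client.GetAsync(url);

        // 2. Throw HttpRequestException for non-success (non-2xx) status codes
        response.EnsureSuccessStatusCode();

        // 3. Read the body and decode it into a string
        return await response.Content.ReadAsStringAsync();
    }
}
```

The expanded form is useful when you need something GetStringAsync hides, such as inspecting response headers or the status code before reading the body.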

Full Code Example

The following implementation simulates “retrieving a company’s Terms of Service page to verify content.” It includes exception handling suitable for production use.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    // [Important] Do not instantiate HttpClient for every request. 
    // Share it across the application to prevent "Socket Exhaustion."
    private static readonly HttpClient _httpClient = new HttpClient();

    static async Task Main()
    {
        // Target URL (using example.com as a placeholder)
        string targetUrl = "https://www.example.com";

        Console.WriteLine($"Sending request to: {targetUrl}");

        try
        {
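            // Explicitly set a timeout (default is 100 seconds).
            // Timeout must be set before the first request is sent;
            // changing it afterwards throws InvalidOperationException.
            _httpClient.Timeout = TimeSpan.FromSeconds(10);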

            // Retrieve HTML from the web server as a string
            string content = await _httpClient.GetStringAsync(targetUrl);

            Console.WriteLine("--- Retrieval Successful ---");
            Console.WriteLine($"Data Size: {content.Length} characters");
            Console.WriteLine("--- First 500 Characters ---");
            
            // Truncate if the content is too long
            string preview = content.Length > 500 
                ? content.Substring(0, 500) + "..." 
                : content;
                
            Console.WriteLine(preview);
        }
        catch (HttpRequestException ex)
        {
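            // Handles non-success status codes (e.g., 404 Not Found), DNS failures, etc.
            Console.WriteLine($"[Communication Error] {ex.Message}");
            // StatusCode is null when no HTTP response was received (e.g., DNS failure)
            if (ex.StatusCode.HasValue)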
            {
                Console.WriteLine($"HTTP Status: {ex.StatusCode}");
            }
        }
        catch (TaskCanceledException)
        {
            // Handles timeouts
            Console.WriteLine("[Timeout] No response within the time limit.");
        }
        catch (Exception ex)
        {
            // Handles other unexpected errors
            Console.WriteLine($"[System Error] {ex.Message}");
        }
    }
}
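
Because HttpClient.Timeout applies to every request made through the shared instance, a per-request limit can be useful. The sketch below is one possible approach, assuming .NET 5 or later (for the GetStringAsync overload that accepts a CancellationToken). The class name PerRequestTimeout and the method name TryGetStringAsync are hypothetical.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class PerRequestTimeout
{
    private static readonly HttpClient Client = new HttpClient();

    // Returns the body, or null if the request did not finish within 'timeout'.
    // Other failures (HttpRequestException, etc.) still propagate to the caller.
    public static async Task<string?> TryGetStringAsync(string url, TimeSpan timeout)
    {
        // The token source cancels itself once 'timeout' has elapsed
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            return await Client.GetStringAsync(url, cts.Token);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // Our own per-request limit fired
            return null;
        }
    }
}
```

A call such as await PerRequestTimeout.TryGetStringAsync("https://www.example.com", TimeSpan.FromSeconds(3)) then limits only that request, while the client-wide Timeout still acts as an upper bound.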

Customization Points

  • Adding Headers: When accessing APIs or specific sites, you might need an Authorization token or a User-Agent.
    _httpClient.DefaultRequestHeaders.Add("User-Agent", "MyApp/1.0");
    _httpClient.DefaultRequestHeaders.Add("Authorization", "Bearer my_token");
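  • Retrieving as Byte Array: Use GetByteArrayAsync instead of GetStringAsync if you need to download binary data such as images, or if a site uses a specific encoding (like Shift-JIS) that is not declared correctly in its response headers. You can then convert the bytes manually using the Encoding class. On .NET Core and .NET 5 or later, Shift-JIS is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).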
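
The byte-array approach can be sketched as follows. This is a minimal illustration, assuming .NET Core 3.0 or later (where CodePagesEncodingProvider ships with the framework). The class name EncodedPageFetcher and its method names are hypothetical, and "shift_jis" is just an example encoding name.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class EncodedPageFetcher
{
    private static readonly HttpClient Client = new HttpClient();

    // Decodes raw bytes using an explicitly named encoding (e.g., "shift_jis")
    public static string Decode(byte[] bytes, string encodingName)
    {
        // Makes legacy code pages such as Shift-JIS available to Encoding.GetEncoding
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    // Downloads the raw response body and decodes it with the given encoding
    public static async Task<string> FetchAsync(string url, string encodingName)
    {
        byte[] bytes = await Client.GetByteArrayAsync(url);
        return Decode(bytes, encodingName);
    }
}
```

Keeping Decode separate from the download makes it easy to check the conversion on known bytes before pointing it at a real site.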

Important Notes

  • Instance Lifecycle: Creating and disposing a client for every request, as in using (var client = new HttpClient()), is an anti-pattern. Each disposed client leaves its sockets in a TIME_WAIT state, which leads to port exhaustion under high load. Share a single static instance or use IHttpClientFactory.
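  • DNS Updates: A long-lived static instance keeps its pooled connections open, so it may not notice DNS changes. On .NET Core 2.1 and later (including .NET 5/6/8), you can address this by setting SocketsHttpHandler.PooledConnectionLifetime so that connections are recycled periodically; IHttpClientFactory achieves the same effect by rotating handlers. With either in place, singleton usage is safe.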
  • Strictly Asynchronous: Calling .Result or .Wait() on asynchronous methods can cause deadlocks in GUI or classic ASP.NET applications, and it blocks a thread in any application type. Always use await.
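
One way to keep a shared instance while still picking up DNS changes is to cap the lifetime of pooled connections. The sketch below assumes .NET Core 2.1 or later. The class name SharedHttp is hypothetical, and the two-minute value is an illustrative assumption, not a figure from this article.

```csharp
using System;
using System.Net.Http;

public static class SharedHttp
{
    // Assumption: two minutes is an illustrative value; tune it for your environment
    public static readonly TimeSpan ConnectionLifetime = TimeSpan.FromMinutes(2);

    // Connections older than ConnectionLifetime are closed and re-established,
    // which forces a fresh DNS lookup for the next connection
    public static SocketsHttpHandler CreateHandler() => new SocketsHttpHandler
    {
        PooledConnectionLifetime = ConnectionLifetime
    };

    // Single instance shared across the application
    public static readonly HttpClient Client = new HttpClient(CreateHandler());
}
```

Callers then use SharedHttp.Client everywhere instead of creating their own HttpClient.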

Advanced Application

Processing as a Stream (Memory Efficiency)

When retrieving massive HTML or text files, expanding the entire content into memory with GetStringAsync is inefficient. Use GetStreamAsync to process the data as it is being read.

using (var stream = await _httpClient.GetStreamAsync(targetUrl))
using (var reader = new System.IO.StreamReader(stream))
{
    // Process the data line by line to keep memory consumption low.
    // ReadLineAsync returns null at the end of the stream; checking
    // reader.EndOfStream instead could block synchronously on network I/O.
    string? line;
    while ((line = await reader.ReadLineAsync()) != null)
    {
        if (line.Contains("<title>"))
        {
            Console.WriteLine($"Title tag found: {line}");
        }
    }
}

Conclusion

  • HttpClient is a class designed to be “reused,” not “disposed” after a single use.
  • In production environments, exception handling (try-catch) for network errors and timeouts is mandatory.
  • HttpClient.GetStringAsync is the simplest way to retrieve text data from the web.
