How to Use OpenAI Whisper API for Audio Transcription in Your App (2025 Guide)

09/07/2025

This guide walks you through using OpenAI’s Whisper API to transcribe audio into text inside your app. From uploading audio files to handling responses, you’ll learn how to build seamless voice input features with high accuracy and broad language support.

Use OpenAI’s Whisper API for Transcription in Your App

Convert audio to accurate text with a powerful AI model!

In today's multimedia-rich world, converting spoken words into written text is a fundamental need for many applications – from meeting summaries and voice assistants to content creation and accessibility tools. OpenAI's **Whisper API** offers a highly accurate and robust solution for **speech-to-text (STT) transcription**, leveraging a powerful AI model trained on a vast dataset.

This tutorial will guide you through integrating the Whisper API into your web application. We'll set up a Node.js backend to securely handle audio file uploads and API calls to OpenAI, and a simple HTML/JavaScript frontend to record audio and display the transcription.

What is OpenAI Whisper?

OpenAI Whisper is an open-source automatic speech recognition (ASR) system. It was trained on 680,000 hours of multilingual and multitask supervised data, resulting in a highly robust and accurate model. The Whisper API provides access to this powerful model, offering features like:

  • High Accuracy: Excellent performance even with background noise or varied accents.
  • Multilingual Support: Can transcribe in multiple languages and translate those languages into English.
  • Language Detection: Automatically detects the spoken language.
  • Robustness: Handles different audio formats and quality levels.
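
To see the translation feature in isolation, here is a minimal sketch using the official `openai` Node.js SDK (installed in Step 2 below); the file name `interview_fr.mp3` is just a placeholder for any non-English recording:

// Sketch: send non-English speech to the translations endpoint and get English text back.
const fs = require('fs');
const { OpenAI } = require('openai');

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function translateToEnglish() {
    const result = await openai.audio.translations.create({
        file: fs.createReadStream('interview_fr.mp3'), // placeholder file name
        model: 'whisper-1',
    });
    console.log(result.text); // English translation of the spoken audio
}

translateToEnglish();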

Prerequisites

  • An OpenAI Account and API Key: Sign up at platform.openai.com and generate your API key.
  • Node.js and npm (or yarn) installed on your machine.
  • Basic knowledge of HTML, CSS, and JavaScript.
  • A modern web browser with microphone access.

Step-by-Step Integration Guide

Step 1: Get Your OpenAI API Key

Generate your secret API key from the OpenAI dashboard. Keep it confidential!

🚨 Security Warning: Your API key must NEVER be exposed in client-side code. We will use a Node.js backend to protect it.

Step 2: Set Up the Node.js Backend

This backend will handle audio uploads from the frontend and make secure calls to the Whisper API.

a. Initialize Project and Install Dependencies:


mkdir whisper-transcription-app
cd whisper-transcription-app
mkdir backend
cd backend
npm init -y
npm install express cors dotenv openai multer

`multer` is a middleware for handling `multipart/form-data`, which is primarily used for uploading files.
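
If you want to reject oversized uploads before they ever reach OpenAI, multer accepts a `limits` option. A small optional sketch you could use in place of the `upload` line in `server.js` below (25 MB matches Whisper's current per-file cap):

// Optional: cap uploads at 25 MB so oversized files fail fast on your backend.
const multer = require('multer');

const upload = multer({
    dest: 'uploads/',                        // temporary storage for uploaded files
    limits: { fileSize: 25 * 1024 * 1024 },  // maximum file size in bytes
});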

b. Create .env File:

In the `backend` directory, create a file named `.env` and add your OpenAI API key:

OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY_HERE

(Replace `sk-YOUR_OPENAI_API_KEY_HERE` with your actual key, and add `.env` to your `.gitignore` so the key is never committed to version control.)

c. Create server.js:

In the `backend` directory, create `server.js` with the following code:


// backend/server.js
require('dotenv').config();
const express = require('express');
const cors = require('cors');
const { OpenAI } = require('openai');
const multer = require('multer');
const fs = require('fs'); // Node.js File System module, used to stream and delete uploads

const app = express();
const port = process.env.PORT || 3001;

// Initialize OpenAI client
const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
});

// Configure Multer for file uploads
const upload = multer({ dest: 'uploads/' }); // Temporary storage for uploaded files

// Middleware
app.use(cors());
app.use(express.json());

// Transcription endpoint
app.post('/transcribe', upload.single('audio'), async (req, res) => {
    if (!req.file) {
        return res.status(400).json({ error: 'No audio file uploaded.' });
    }

    const audioFilePath = req.file.path;

    try {
        // Open a read stream for the uploaded audio file
        const audioFile = fs.createReadStream(audioFilePath);

        // Call OpenAI Whisper API
        const transcription = await openai.audio.transcriptions.create({
            file: audioFile,
            model: "whisper-1", // The Whisper model
            response_format: "json", // Or "text", "srt", "vtt"
        });

        // Clean up the temporary file
        fs.unlinkSync(audioFilePath);

        res.json({ transcription: transcription.text });

    } catch (error) {
        console.error('Error during transcription:', error);
        // Clean up the temporary file even if an error occurs
        if (fs.existsSync(audioFilePath)) {
            fs.unlinkSync(audioFilePath);
        }
        if (error.status) {
            // Errors thrown by the openai SDK expose the HTTP status and message from the API
            console.error('OpenAI API error:', error.status, error.message);
            res.status(error.status).json({ error: error.message });
        } else {
            res.status(500).json({ error: 'Failed to transcribe audio.' });
        }
    }
});

// Start the server
app.listen(port, () => {
    console.log(`Whisper backend listening at http://localhost:${port}`);
});

d. Start the Backend Server:

node server.js

Your backend will now be running on `http://localhost:3001`.
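
You can sanity-check the endpoint from the command line before building the frontend; `sample.webm` below is a placeholder for any short audio file you have on disk (the form field name must be `audio` to match `upload.single('audio')`):

curl -X POST http://localhost:3001/transcribe -F "audio=@sample.webm"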

Step 3: Build the Frontend (HTML, CSS, JavaScript)

Create an `index.html` file in the `whisper-transcription-app` root directory (next to your `backend` folder):


<!-- index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Whisper API Transcription Demo</title>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700&display=swap" rel="stylesheet">
    <style>
        body {
            font-family: 'Inter', sans-serif;
            display: flex;
            justify-content: center;
            align-items: center;
            min-height: 100vh;
            background-color: #f7fcfc; /* Light blue-green background */
            margin: 0;
            padding: 20px;
            box-sizing: border-box;
        }
        .container {
            background-color: #ffffff;
            border-radius: 12px;
            box-shadow: 0 6px 20px rgba(0, 0, 0, 0.1);
            width: 100%;
            max-width: 600px;
            padding: 30px;
            text-align: center;
            border: 1px solid #dcdcdc;
        }
        h1 {
            color: #20c997;
            margin-bottom: 25px;
            font-size: 2em;
        }
        button {
            background-color: #20c997;
            color: #fff;
            border: none;
            padding: 15px 30px;
            border-radius: 30px;
            cursor: pointer;
            font-size: 1.1em;
            font-weight: bold;
            transition: background-color 0.3s ease, transform 0.1s ease;
            box-shadow: 0 4px 10px rgba(32, 201, 151, 0.3);
            margin-bottom: 20px;
        }
        button:hover {
            background-color: #1aa67e;
            transform: translateY(-2px);
        }
        button:disabled {
            background-color: #a7e9d9;
            cursor: not-allowed;
            box-shadow: none;
        }
        #transcription-output {
            background-color: #e6f7f7;
            border: 1px solid #b2dfdb;
            border-radius: 8px;
            padding: 20px;
            min-height: 100px;
            text-align: left;
            font-size: 1.1em;
            color: #00796b;
            word-wrap: break-word;
            white-space: pre-wrap; /* Preserves whitespace and line breaks */
            margin-top: 20px;
        }
        .status-message {
            color: #555;
            font-style: italic;
            margin-top: 15px;
        }
        .error-message {
            color: #dc3545;
            font-weight: bold;
            margin-top: 15px;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1>Audio Transcription with Whisper API</h1>
        <button id="recordButton">Start Recording</button>
        <p class="status-message" id="statusMessage">Press "Start Recording" to begin.</p>
        <div id="transcription-output">Your transcription will appear here...</div>
    </div>

    <script>
        const recordButton = document.getElementById('recordButton');
        const statusMessage = document.getElementById('statusMessage');
        const transcriptionOutput = document.getElementById('transcription-output');

        // IMPORTANT: Replace with your backend proxy URL
        const BACKEND_URL = 'http://localhost:3001/transcribe';

        let mediaRecorder;
        let audioChunks = [];
        let isRecording = false;

        recordButton.addEventListener('click', () => {
            if (isRecording) {
                stopRecording();
            } else {
                startRecording();
            }
        });

        async function startRecording() {
            try {
                const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
                mediaRecorder = new MediaRecorder(stream);
                audioChunks = [];

                mediaRecorder.ondataavailable = (event) => {
                    audioChunks.push(event.data);
                };

                mediaRecorder.onstop = async () => {
                    const audioBlob = new Blob(audioChunks, { type: 'audio/webm' }); // Use webm for broader compatibility
                    transcribeAudio(audioBlob);
                    // Stop stream tracks to release microphone
                    stream.getTracks().forEach(track => track.stop());
                };

                mediaRecorder.start();
                isRecording = true;
                recordButton.textContent = 'Stop Recording';
                recordButton.style.backgroundColor = '#dc3545'; // Red for stop
                recordButton.style.boxShadow = '0 4px 10px rgba(220, 53, 69, 0.3)';
                statusMessage.textContent = 'Recording... Click again to stop.';
                transcriptionOutput.textContent = 'Recording started...';
            } catch (error) {
                console.error('Error accessing microphone:', error);
                statusMessage.className = 'error-message';
                statusMessage.textContent = 'Error: Could not access microphone. Please allow permissions.';
                recordButton.disabled = false;
            }
        }

        function stopRecording() {
            if (mediaRecorder && isRecording) {
                mediaRecorder.stop();
                isRecording = false;
                recordButton.textContent = 'Processing...';
                recordButton.disabled = true;
                recordButton.style.backgroundColor = '#6c757d'; // Grey for processing
                recordButton.style.boxShadow = 'none';
                statusMessage.textContent = 'Sending audio for transcription...';
            }
        }

        async function transcribeAudio(audioBlob) {
            const formData = new FormData();
            formData.append('audio', audioBlob, 'audio.webm'); // 'audio' matches upload.single('audio') in backend

            try {
                const response = await fetch(BACKEND_URL, {
                    method: 'POST',
                    body: formData,
                });

                if (!response.ok) {
                    const errorData = await response.json();
                    throw new Error(errorData.error || `HTTP error! status: ${response.status}`);
                }

                const data = await response.json();
                transcriptionOutput.textContent = data.transcription;
                statusMessage.className = 'status-message';
                statusMessage.textContent = 'Transcription complete!';

            } catch (error) {
                console.error('Error during transcription:', error);
                transcriptionOutput.textContent = 'Error: Failed to transcribe audio.';
                statusMessage.className = 'error-message';
                statusMessage.textContent = 'Transcription failed. Please try again.';
            } finally {
                recordButton.textContent = 'Start Recording';
                recordButton.style.backgroundColor = '#20c997'; // Green again
                recordButton.style.boxShadow = '0 4px 10px rgba(32, 201, 151, 0.3)';
                recordButton.disabled = false;
            }
        }
    </script>
</body>
</html>

Running the Application

  1. Start the Backend: Navigate to the `backend` directory in your terminal and run `node server.js`.
  2. Open the Frontend: Open the `index.html` file in your web browser. Some browsers only grant microphone access (`getUserMedia`) to pages served over HTTP(S), so if recording is blocked when the file is opened directly from disk, serve it from a local web server instead (see the commands after this list).
  3. Grant Microphone Access: Your browser will likely ask for permission to access your microphone. Grant it.
  4. Record and Transcribe:
    • Click the "**Start Recording**" button.
    • Speak clearly into your microphone.
    • Click the "**Stop Recording**" button.
    • The recorded audio will be sent to your backend, then to OpenAI Whisper, and the transcription will appear on the page.
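
If your browser refuses microphone access for a page opened from the file system, either of these commands (run from the project root, assuming Node.js or Python is installed) serves the frontend over `http://localhost`; open the URL it prints and try again:

npx serve .
python3 -m http.server 8080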

Error Handling and Best Practices

  • API Key Security: Always keep your OpenAI API key on the server-side. Never expose it in client-side code.
  • User Feedback: Provide clear status messages to the user (e.g., "Recording...", "Processing...", "Transcription complete!", "Error: ...").
  • File Size Limits: The Whisper API currently accepts audio files up to 25 MB. For longer audio, consider splitting the recording into smaller segments before sending them for transcription (see the ffmpeg sketch after this list).
  • Error Logging: Implement robust error logging on your backend to diagnose issues with API calls or file processing.
  • Audio Quality: While Whisper is robust, better audio quality generally leads to more accurate transcriptions.
  • Cross-Browser Compatibility: Test microphone access and MediaRecorder API behavior across different browsers.
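
For the splitting approach mentioned above, a command-line tool such as ffmpeg can cut a long recording into fixed-length chunks; a minimal sketch (the file names and 10-minute segment length are arbitrary):

ffmpeg -i long-recording.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3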

Further Enhancements

This is a basic implementation. Here are some ideas to enhance your transcription app:

  • Display Audio Waveform: Use the Web Audio API (e.g., an AnalyserNode) or a library such as wavesurfer.js to visualize the audio input.
  • Progress Indicator: Add a loading spinner or progress bar during transcription.
  • Download Transcription: Provide an option to download the transcribed text as a `.txt` or `.srt` file.
  • Select Language: Allow users to specify the input language for more accurate transcription, if they know it (the sketch after this list shows the relevant request parameters).
  • Transcription Editing: Integrate a text editor for users to correct or refine the transcribed text.
  • Integration with other OpenAI models: Pass the transcribed text to GPT for summarization, translation, or other NLP tasks.
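
Several of these ideas only need extra parameters on the same API call. As a minimal sketch, the backend request from `server.js` could pass an explicit input language and ask for SubRip subtitles instead of JSON; `language` and `response_format` are documented options of the transcription endpoint, and the values shown here are just examples:

// Sketch: tell Whisper the audio is English and request an SRT transcript.
// `audioFile` is the same read stream created in server.js above.
const transcription = await openai.audio.transcriptions.create({
    file: audioFile,
    model: 'whisper-1',
    language: 'en',          // ISO-639-1 code of the spoken language (optional)
    response_format: 'srt',  // returns subtitle text instead of a JSON object
});
// With "srt" the API returns the subtitle text directly, ready to be saved
// as a .srt file and offered to the user as a download.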