Transcription data prep
The transcription data prep scripts download YouTube video transcripts and prepare them for use with the Semantic Search with OpenAI Embeddings and Functions sample.
The transcription data prep scripts have been tested on the latest releases of Windows 11, macOS Ventura, and Ubuntu 22.04 (and above).
Create required Azure OpenAI Service resources
[!IMPORTANT] We suggest you update the Azure CLI to the latest version to ensure compatibility with OpenAI. See Documentation.
- Create a resource group
[!NOTE] For these instructions we're using the resource group named "semantic-video-search" in East US. You can change the name of the resource group, but when changing the location for the resources, check the model availability table.
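A minimal sketch of this step, assuming the resource group name and region from the note above and an Azure CLI session that is already signed in with `az login`. The variable and function names are illustrative only.

```shell
# Values taken from the note above; change them to suit your subscription.
RESOURCE_GROUP="semantic-video-search"
LOCATION="eastus"

# Wrapped in a function so nothing is created until you call it explicitly.
create_resource_group() {
  az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
}
```

Run `create_resource_group` once you are signed in to the subscription you want to use.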
- Create an Azure OpenAI Service resource.
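One way to create the resource from the command line is `az cognitiveservices account create` with `--kind OpenAI`. This is a sketch, not necessarily the exact command the sample uses: the resource name `semantic-video-openai` is a hypothetical placeholder, and the `s0` SKU is an assumption.

```shell
RESOURCE_GROUP="semantic-video-search"
LOCATION="eastus"
# Hypothetical resource name; the source does not specify one.
OPENAI_RESOURCE="semantic-video-openai"

# Creates an Azure OpenAI Service resource in the resource group above.
create_openai_resource() {
  az cognitiveservices account create \
    --name "$OPENAI_RESOURCE" \
    --resource-group "$RESOURCE_GROUP" \
    --location "$LOCATION" \
    --kind OpenAI \
    --sku s0
}
```

Call `create_openai_resource` after the resource group exists.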
- Get the endpoint and keys for use in this application.
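The endpoint and keys can be read back with the Azure CLI. The sketch below assumes the hypothetical resource name `semantic-video-openai` and uses `--query` with `--output tsv` to print only the value of interest; the function names are illustrative.

```shell
RESOURCE_GROUP="semantic-video-search"
# Hypothetical resource name; use the name you chose when creating the resource.
OPENAI_RESOURCE="semantic-video-openai"

# Prints the endpoint URL of the Azure OpenAI Service resource.
get_openai_endpoint() {
  az cognitiveservices account show \
    --name "$OPENAI_RESOURCE" \
    --resource-group "$RESOURCE_GROUP" \
    --query properties.endpoint \
    --output tsv
}

# Prints the first of the two access keys.
get_openai_key() {
  az cognitiveservices account keys list \
    --name "$OPENAI_RESOURCE" \
    --resource-group "$RESOURCE_GROUP" \
    --query key1 \
    --output tsv
}
```

Treat the key as a secret: avoid echoing it into logs or committing it to source control.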
- Deploy the following models:
  - text-embedding-ada-002 version 2 or greater, named text-embedding-ada-002
  - gpt-35-turbo version 0613 or greater, named gpt-35-turbo
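The two deployments listed above could be created with `az cognitiveservices account deployment create`. This is a sketch under assumptions: the resource name is a hypothetical placeholder, the `Standard` SKU with a capacity of 1 is an illustrative choice, and model versions available to you depend on the region (check the model availability table).

```shell
RESOURCE_GROUP="semantic-video-search"
# Hypothetical resource name; use the name you chose when creating the resource.
OPENAI_RESOURCE="semantic-video-openai"

# Deploys one model. $1 is used as both the deployment name and the model name,
# $2 is the model version.
deploy_model() {
  az cognitiveservices account deployment create \
    --name "$OPENAI_RESOURCE" \
    --resource-group "$RESOURCE_GROUP" \
    --deployment-name "$1" \
    --model-name "$1" \
    --model-version "$2" \
    --model-format OpenAI \
    --sku-name "Standard" \
    --sku-capacity 1
}

# Deploys both models named in the list above, stopping at the first failure.
deploy_required_models() {
  deploy_model "text-embedding-ada-002" "2" &&
    deploy_model "gpt-35-turbo" "0613"
}
```

Keeping the deployment name equal to the model name matches the naming the list above asks for.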
Required software
- Python 3.9 or greater
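To confirm the installed interpreter meets this requirement, a quick check along these lines may help. It assumes the interpreter is on the `PATH` as `python3` or `python`; the variable and function names are illustrative.

```shell
# Prefer python3 if present, otherwise fall back to python.
PYTHON_BIN="$(command -v python3 || command -v python || true)"

# Succeeds only if the interpreter reports version 3.9 or greater.
check_python_version() {
  "$PYTHON_BIN" -c 'import sys; sys.exit(0 if sys.version_info >= (3, 9) else 1)'
}

if check_python_version; then
  echo "Python version OK"
else
  echo "Python 3.9 or greater is required" >&2
fi
```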
Environment variables
The following environment variables are required to run the YouTube transcription data prep scripts.
On Windows
We recommend adding the variables to your user environment variables.
Windows Start > Edit the system environment variables > Environment Variables > User variables for [USER] > New.
<!-- You can add the environment variables to your PowerShell profile.
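On macOS or Linux, it can be useful to verify that the required variables are set before running the scripts. The helper below is a generic sketch: `require_env` is a hypothetical function, and the variable names you pass to it should be the ones this sample requires.

```shell
# Reports every listed variable that is unset or empty.
# Returns 0 if all are set, 1 otherwise. Pass only trusted variable names.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh as well as bash.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_env EXAMPLE_API_KEY EXAMPLE_ENDPOINT` (placeholder names, not the sample's real variables) prints one line per missing variable and returns a non-zero status if any are absent.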
Install the required Python libraries
- Install the git client if it's not already installed.
- From a Terminal window, clone the sample to your preferred repo folder.
```bash
git clone https://github.com/gloveboxes/semanic-search-openai-embeddings-functions.git
```
- Navigate to the data_prep folder.
- Create a Python virtual environment.
On Windows:
```powershell
python -m venv .venv
```
On macOS and Linux:
```bash
python3 -m venv .venv
```
- Activate the Python virtual environment.
On Windows:
On macOS and Linux:
- Install the required libraries.
On Windows:
On macOS and Linux:
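On macOS and Linux, the create, activate, and install steps above can be sketched as a single function. It assumes you are already in the data_prep folder and that the dependencies are listed in a file named `requirements.txt`; that file name is an assumption, so adjust it if the sample uses a different one.

```shell
# Creates the virtual environment, activates it in the current shell,
# and installs the dependencies. Stops at the first step that fails.
# Assumption: dependencies are listed in requirements.txt in the current folder.
setup_venv() {
  python3 -m venv .venv &&
    . .venv/bin/activate &&
    pip install -r requirements.txt
}
```

Because activation only affects the shell that runs it, call `setup_venv` directly in your terminal session, not from a separate script process.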
Run the YouTube transcription data prep scripts
On Windows
On macOS and Linux