Ryo Shimizu

Automatic tagging of video files with Pixtral 12B

Pixtral 12B is a vision-language model (VLM; Vision-Language Model) with extremely high performance.


To train CogVideoX on your own data, you first need video files of roughly 6 seconds each, plus a text file describing the content of each video.
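For reference, this is roughly the pairing we will end up with by the end of this article: each short clip gets a caption file with the same base name. The directory and file names below simply follow the examples used later in this article; they are not a layout mandated by CogVideoX itself.

out/clip_0000.mp4  ->  txt/clip_0000.txt
out/clip_0001.mp4  ->  txt/clip_0001.txt
out/clip_0002.mp4  ->  txt/clip_0002.txt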


First, let's write the code that splits a video into 6-second clips and stores them.


import subprocess
import os
import argparse

def split_video(video_path, output_folder, interval=6):
    # Create the output folder if it does not exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Build the ffmpeg command: split the input into clips of `interval` seconds each.
    # Stream copy (-c copy) is fast, but cuts can only happen at keyframes,
    # so clip lengths are only approximately `interval` seconds.
    command = [
        'ffmpeg',
        '-i', video_path,
        '-c', 'copy',
        '-f', 'segment',
        '-segment_time', str(interval),
        '-reset_timestamps', '1',
        f'{output_folder}/clip_%04d.mp4'
    ]

    # Run ffmpeg
    subprocess.run(command, check=True)

def main():
    parser = argparse.ArgumentParser(description='Split a video into fixed-length clips')
    parser.add_argument('video_path', help='Path to the video file')
    parser.add_argument('output_folder', help='Path to the output folder')
    parser.add_argument('-i', '--interval', type=int, default=6, help='Clip length in seconds (default: 6)')

    args = parser.parse_args()

    try:
        split_video(args.video_path, args.output_folder, args.interval)
        print(f"Done splitting the video. Check the {args.output_folder} folder.")
    except subprocess.CalledProcessError:
        print("An error occurred. Make sure ffmpeg is installed.")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    main()

Now we can split a video, for example, like this:


$ python cutmov.py src.mp4 out

The out directory will then contain the video broken up into short clips.
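With the clip naming pattern used in the script above, the contents of out should look something like this (the exact file names depend on the output pattern you pass to ffmpeg):

$ ls out
clip_0000.mp4  clip_0001.mp4  clip_0002.mp4  ...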


We then load these video files one by one, have Pixtral analyze each of them, ideally at three points (the beginning, middle, and end of the clip), and let it describe what is happening in the video.


import os
import base64
from io import BytesIO
from pathlib import Path

import cv2
from PIL import Image
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, TextChunk, ImageURLChunk
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from huggingface_hub import snapshot_download


# Download and prepare the model
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistral-community/pixtral-12b-240910",
                  allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
                  local_dir=mistral_models_path)

# Load the tokenizer and the model
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path)


def image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()
    return f"data:image/png;base64,{img_str}"

def recognize(base64_image):
    prompt="""Possible objects in the photo include the main character, a man, a young girl, air pirates, an airplane, a river, the sea, a mature woman, a building, a car, and blueprints. Based on this, explain in detail what is shown on the screen. In particular, explain the positional relationship, such as what pose the person is in, what is on the right and left, top and bottom of the screen, and estimate the camera angle."""
    completion_request = ChatCompletionRequest(
        messages=[UserMessage(content=[ImageURLChunk(image_url=base64_image), TextChunk(text=prompt)])]
    )
    
    encoded = tokenizer.encode_chat_completion(completion_request)
    images = encoded.images
    tokens = encoded.tokens

    out_tokens, _ = generate([tokens], model, images=[images], max_tokens=1024, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
    result = tokenizer.decode(out_tokens[0])
    print(result)
    return result

def extract_frames(video_path):
    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []

    # First frame
    video.set(cv2.CAP_PROP_POS_FRAMES, 0)
    ret, frame = video.read()
    if ret:
        frames.append(frame)

    # Middle frame
    middle_frame = total_frames // 2
    video.set(cv2.CAP_PROP_POS_FRAMES, middle_frame)
    ret, frame = video.read()
    if ret:
        frames.append(frame)

    # Last frame
    video.set(cv2.CAP_PROP_POS_FRAMES, total_frames - 1)
    ret, frame = video.read()
    if ret:
        frames.append(frame)

    video.release()
    return frames

def process_videos(input_dir, output_dir):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for filename in os.listdir(input_dir):
        if filename.endswith('.mp4'):
            video_path = os.path.join(input_dir, filename)
            
            frames = extract_frames(video_path)
            
            recognized_texts = []
            for i, frame in enumerate(frames):
                image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                base64_image = image_to_base64(image)
                recognized_text = recognize(base64_image)
                recognized_texts.append(f"Frame {i+1}: {recognized_text}")

            output_filename = os.path.splitext(filename)[0] + '.txt'
            output_path = os.path.join(output_dir, output_filename)

            with open(output_path, 'w', encoding='utf-8') as f:
                f.write("\n\n".join(recognized_texts))

            print(f"Processed: {filename} -> {output_filename}")

# Usage example
input_directory = 'out'
output_directory = 'txt'
process_videos(input_directory, output_directory)

When you run this code, the txt directory is automatically populated with a description text file corresponding to each video file.


Ideally these texts should be screened more carefully, but for now this gives us the minimum data needed for fine-tuning.
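As a very minimal form of screening, you could at least check that every clip got a non-empty caption before training. Here is a rough sketch; the out and txt directory names follow the examples above, and what else you might filter on is left as an assumption.

import os

input_directory = 'out'    # clips produced by the splitting script
caption_directory = 'txt'  # captions produced by the Pixtral script

for filename in sorted(os.listdir(input_directory)):
    if not filename.endswith('.mp4'):
        continue
    caption_path = os.path.join(caption_directory, os.path.splitext(filename)[0] + '.txt')
    # Flag clips whose caption is missing or empty so they can be re-captioned or dropped
    if not os.path.exists(caption_path):
        print(f"Missing caption: {filename}")
    elif os.path.getsize(caption_path) == 0:
        print(f"Empty caption: {filename}")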


The instruction in the prompt to explicitly describe the camera angle and the positional relationships between the characters seems to be working nicely.





For example, for an image like the one shown above, we get a caption like this:


The image depicts a modern urban scene with a focus on public transportation and historic architecture. Here is a detailed breakdown of the elements present and their positional relationships:
1. **Main Character and People**:
   - **Main Character**: A man is seen walking towards the tram. He appears to be in motion, possibly in a hurry or heading to catch the tram.
   - **Young Girl**: A young girl is also visible, standing near the tram. She seems to be waiting or about to board the tram.
   - **Other People**: There are a few other individuals scattered around the scene, some near the tram and others in the background.
2. **Transportation**:
   - **Tram**: A modern tram is prominently featured in the foreground, running along a track. It has a sleek, futuristic design with multiple connected cars.
   - **Car**: A small car is parked or driving on the right side of the image, near the edge of the cobblestone square.
3. **Architecture**:
   - **Building**: A large, historic building with classical architecture dominates the background. It has multiple windows, ornate details, and domed towers on either side.
   - **Statue**: In front of the building, there is a statue, which appears to be a significant landmark or monument.
4. **Environment**:
   - **Cobblestone Square**: The ground is covered with cobblestones, indicating an old town or historic area.
   - **Sky**: The sky is clear with a soft light, suggesting either early morning or late afternoon.
5. **Camera Angle**:
   - The image is taken from a low angle, looking up at the tram and the building. This angle emphasizes the height of the building and the size of the tram.
   - The perspective also allows for a clear view of the people and their activities in the foreground.
6. **Positional Relationships**:
   - **Foreground**: The tram and the people (main character and young girl) are in the foreground.
   - **Middle Ground**: The cobblestone square and the statue are in the middle ground.
   - **Background**: The large historic building with domed towers is in the background.
In summary, the image captures a moment in a historic urban square where a modern tram is in motion, with people going about their daily routines against the backdrop of an impressive historic building. The low-angle shot emphasizes the grandeur of the architecture and the scale of the tram.

