IP로 학원에 배포하며 테스트를 진행하던 중, 업로드 과정에서 다음과 같은 버그가 발생한 것을 확인했습니다. bug

Job 테이블 데이터를 확인해 보니, 다음처럼 statusfailed이지만 completed_at 값이 존재해 서비스 프로세스상 예상한 데이터 상태가 아니었습니다. completed_at 값이 있다면 statuscompleted여야 했습니다.

                  id                  |   status   |       course_name       |                                             error_message                                   |          created_at           |          updated_at           |         completed_at          
--------------------------------------+------------+-------------------------+-------------------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+-------------------------------
 29fc0ec8-8f54-43fc-8dc9-2c92d0e877e0 | failed     | 배열                    | An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist. | 2026-06-15 08:39:37.532818+00 | 2026-06-15 14:32:11.78985+00  | 2026-06-15 12:31:10.671766+00
 0e460663-1de7-4813-a128-5895d3265d70 | processing | DDL - 2                 |                                   | 2026-06-15 08:38:17.577126+00 | 2026-06-15 12:31:10.869183+00 | 
 76dcdfc5-54a6-46da-88c8-093ec68b34f6 | completed  | DDL                     |                                   | 2026-06-15 08:26:43.584661+00 | 2026-06-15 10:13:57.687732+00 | 2026-06-15 10:13:57.687653+00
 3eff8d62-19cb-4de4-a282-127bc4b3130f | failed     | Limit                   | An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist. | 2026-06-15 08:18:12.511962+00 | 2026-06-15 14:32:11.914926+00 | 2026-06-15 09:22:42.636486+00
 da51db02-dec2-4c48-98eb-3f1ce814360a | completed  | Sub Query 연습문제 풀이 |                                   | 2026-06-15 08:14:15.792211+00 | 2026-06-15 09:20:24.200712+00 | 2026-06-15 09:20:24.200621+00
 7274d1d0-cb3b-4968-b2be-b1a3fecadc81 | completed  | test                    |                                   | 2026-06-13 13:14:25.077061+00 | 2026-06-13 13:24:46.968948+00 | 2026-06-13 13:24:46.968876+00
 174733b5-1daf-45a1-8917-418512249c99 | completed  | 업로드 테스트           |                                   | 2026-06-13 12:54:14.423268+00 | 2026-06-13 13:04:50.440667+00 | 2026-06-13 13:04:50.440582+00
 42c62df1-1a68-4df5-8596-a2ef0cc83987 | completed  | Sub Query               |                                   | 2026-06-12 07:21:23.330349+00 | 2026-06-12 08:02:24.979543+00 | 2026-06-12 08:02:24.979464+00
 c63f7271-8b41-4524-a91b-f8838b25aec5 | completed  | Join                    |                                   | 2026-06-12 03:16:45.313736+00 | 2026-06-12 04:16:24.737723+00 | 2026-06-12 04:16:24.737644+00
 75341055-d81f-418b-98ef-8d8dd33ddc19 | completed  | 반복문                  |                                   | 2026-06-11 06:58:51.113466+00 | 2026-06-11 07:25:07.850604+00 | 2026-06-11 07:25:07.850508+00

문제 1: status와 completed_at 조합

Celery가 실패 시 run_publish_job을 재시도하도록 다음처럼 코드가 되어 있으므로, 이와 관련된 버그로 방향을 잡았습니다.

@shared_task(bind=True, max_retries=2, default_retry_delay=30, queue="encode")
def publish_course_task(self, job_id: str) -> None:
    from media.services.publish_worker import run_publish_job

    try:
        run_publish_job(UUID(job_id))
    except PublishJob.DoesNotExist as exc:
        ...

확실한 에러 분석을 위해 로그 파일을 확인했습니다.

celery-1  | [2026-06-15 17:18:12,517: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | [2026-06-15 18:20:24,317: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | {"timestamp": "2026-06-15T18:20:24.326794+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f"}
celery-1  | {"timestamp": "2026-06-15T18:20:24.508392+09:00", "level": "INFO", "logger": "media.publish", "message": "Staged inputs prepared", "event": "staging_prepare_completed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "video_count": 1, "thumbnail_count": 0, "staging_dir": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f"}
celery-1  | {"timestamp": "2026-06-15T18:22:39.790782+09:00", "level": "INFO", "logger": "media.publish", "message": "Thumbnail saved", "event": "thumbnail_saved", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "video_index": 0, "thumbnail_source": "frame", "thumbnail_url": "https://d111111abcdef8.cloudfront.net/thumbnails/2026/06/15/ee3054e6c9d740e8b5150af8854b7e15.webp"}
celery-1  | {"timestamp": "2026-06-15T18:22:39.792062+09:00", "level": "INFO", "logger": "media.storage", "message": "HLS upload started", "event": "hls_upload_started", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "s3_prefix": "hls/2026/06/15/e4e7fa07c90e402fbb5aacd929aee16d", "use_s3": true, "file_count": 36, "total_bytes": 12472171}
celery-1  | {"timestamp": "2026-06-15T18:22:42.630271+09:00", "level": "INFO", "logger": "media.storage", "message": "HLS upload completed", "event": "hls_upload_completed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "s3_prefix": "hls/2026/06/15/e4e7fa07c90e402fbb5aacd929aee16d", "use_s3": true, "file_count": 36, "total_bytes": 12472171, "duration_ms": 2838}
celery-1  | {"timestamp": "2026-06-15T18:22:42.639548+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job completed", "event": "publish_job_completed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "email": "user@example.com", "course_id": 24, "duration_ms": 138321}
celery-1  | [2026-06-15 18:22:42,732: INFO/ForkPoolWorker-1] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] succeeded in 138.41680589399766s: None
celery-1  | [2026-06-15 19:20:34,143: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | [2026-06-15 20:20:34,729: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | [2026-06-15 21:20:35,357: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | {"timestamp": "2026-06-15T21:31:10.801677+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f"}
celery-1  | {"timestamp": "2026-06-15T21:31:10.838364+09:00", "level": "ERROR", "logger": "media.storage", "message": "Failed to read json key", "event": "storage_read_failed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "key": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f/manifest.json", "bucket": "class-s3-bucket-example", "use_s3": true, "error_type": "NoSuchKey", "exception": "Traceback (most recent call last):\n  File \"/app/media/services/storage.py\", line 244, in read_json_key\n    resp = _s3_client().get_object(Bucket=_bucket(), Key=key)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 606, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.13/site-packages/botocore/context.py\", line 123, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 1094, in _make_api_call\n    raise error_class(parsed_response, operation_name)\nbotocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist."}
celery-1  | {"timestamp": "2026-06-15T21:31:10.844181+09:00", "level": "ERROR", "logger": "media.publish", "message": "Publish job failed", "event": "publish_job_failed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "email": "user@example.com", "error_type": "NoSuchKey", "exception": "Traceback (most recent call last):\n  File \"/app/media/services/publish_worker.py\", line 171, in run_publish_job\n    manifest = load_manifest(prefix)\n  File \"/app/media/services/staging.py\", line 105, in load_manifest\n    manifest = read_json_key(key=_manifest_key(staging_prefix))\n  File \"/app/media/services/storage.py\", line 244, in read_json_key\n    resp = _s3_client().get_object(Bucket=_bucket(), Key=key)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 606, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.13/site-packages/botocore/context.py\", line 123, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 1094, in _make_api_call\n    raise error_class(parsed_response, operation_name)\nbotocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist."}
celery-1  | {"timestamp": "2026-06-15T21:31:10.847119+09:00", "level": "WARNING", "logger": "media.publish", "message": "Celery will retry publish task", "event": "publish_task_retry", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "error_type": "NoSuchKey", "task_name": "media.tasks.publish_course_task", "attempt": 1, "max_retries": 2}
celery-1  | [2026-06-15 21:31:10,854: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | [2026-06-15 21:31:10,856: INFO/ForkPoolWorker-2] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] retry: Retry in 30s: NoSuchKey('An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.')
celery-1  | [2026-06-15 22:32:16,202: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | {"timestamp": "2026-06-15T23:31:11.232661+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f"}
celery-1  | {"timestamp": "2026-06-15T23:31:11.283739+09:00", "level": "ERROR", "logger": "media.storage", "message": "Failed to read json key", "event": "storage_read_failed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "key": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f/manifest.json", "bucket": "class-s3-bucket-example", "use_s3": true, "error_type": "NoSuchKey", "exception": "Traceback (most recent call last):\n  File \"/app/media/services/storage.py\", line 244, in read_json_key\n    resp = _s3_client().get_object(Bucket=_bucket(), Key=key)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 606, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.13/site-packages/botocore/context.py\", line 123, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 1094, in _make_api_call\n    raise error_class(parsed_response, operation_name)\nbotocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist."}
celery-1  | {"timestamp": "2026-06-15T23:31:11.288211+09:00", "level": "ERROR", "logger": "media.publish", "message": "Publish job failed", "event": "publish_job_failed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "email": "user@example.com", "error_type": "NoSuchKey", "exception": "Traceback (most recent call last):\n  File \"/app/media/services/publish_worker.py\", line 171, in run_publish_job\n    manifest = load_manifest(prefix)\n  File \"/app/media/services/staging.py\", line 105, in load_manifest\n    manifest = read_json_key(key=_manifest_key(staging_prefix))\n  File \"/app/media/services/storage.py\", line 244, in read_json_key\n    resp = _s3_client().get_object(Bucket=_bucket(), Key=key)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 606, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.13/site-packages/botocore/context.py\", line 123, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 1094, in _make_api_call\n    raise error_class(parsed_response, operation_name)\nbotocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist."}
celery-1  | {"timestamp": "2026-06-15T23:31:11.290839+09:00", "level": "WARNING", "logger": "media.publish", "message": "Celery will retry publish task", "event": "publish_task_retry", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "error_type": "NoSuchKey", "task_name": "media.tasks.publish_course_task", "attempt": 1, "max_retries": 2}
celery-1  | [2026-06-15 23:31:11,292: INFO/MainProcess] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] received
celery-1  | [2026-06-15 23:31:11,294: INFO/ForkPoolWorker-3] Task media.tasks.publish_course_task[3eff8d62-19cb-4de4-a282-127bc4b3130f] retry: Retry in 30s: NoSuchKey('An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.')
celery-1  | {"timestamp": "2026-06-15T23:31:11.388251+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f"}
celery-1  | {"timestamp": "2026-06-15T23:31:11.440223+09:00", "level": "ERROR", "logger": "media.storage", "message": "Failed to read json key", "event": "storage_read_failed", "job_id": "3eff8d62-19cb-4de4-a282-127bc4b3130f", "key": "staging/3eff8d62-19cb-4de4-a282-127bc4b3130f/manifest.json", "bucket": "class-s3-bucket-example", "use_s3": true, "error_type": "NoSuchKey", "exception": "Traceback (most recent call last):\n  File \"/app/media/services/storage.py\", line 244, in read_json_key\n    resp = _s3_client().get_object(Bucket=_bucket(), Key=key)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 606, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.13/site-packages/botocore/context.py\", line 123, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.13/site-packages/botocore/client.py\", line 1094, in _make_api_call\n    raise error_class(parsed_response, operation_name)\nbotocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist."}
...

위 로그는 일부만 발췌한 것이며, 로그를 보면 같은 task_id로 여러 번 요청이 메시지 브로커에 들어왔고, 워커가 동일한 요청을 여러 번 실행하면서 문제가 발생했습니다.

즉 문제는 재시도가 아닌, 브로커가 같은 작업을 재배달하면서 프로세스가 정상적으로 끝난 경우에도 다시 작업을 하게 되면서 버그가 발생하게 된 것입니다.

정상적으로 프로세스가 끝난 경우에는 S3 스테이징의 해당 영상이 이미 삭제된 이후이기 때문에, 이전과 같이 접근하면 GetObject 에러가 발생했던 것입니다.

쉬운 비교를 위해 워커가 메시지를 받은 시각을 정리하면 아래와 같습니다.

한 시간 정도의 간격이 발생하는 것을 볼 수 있었습니다. 여기까지 보면 확실한 것은 같은 메시지가 여러 번 적재된다는 점입니다.

#시각이전 received와 간격
117:18:12.517
218:20:24.317+62분 12초
319:20:34.143+60분 10초
420:20:34.729+60분 0초
521:20:35.357+60분 1초
621:31:10.854+10분 35초
722:32:16.202+61분 5초
823:31:11.292+58분 55초
923:31:11.449+0.16초
1023:31:11.594+0.15초
1123:31:11.665+0.07초
1223:31:11.741+0.08초
1323:31:41.358+30초
1423:31:41.526+0.17초
1523:31:41.678+0.15초

백엔드에서 메시지를 적재하는 코드는 강좌 등록 시 한 곳뿐이었습니다. 즉 인프라 설정 문제로 의심이 갔습니다.

        job = PublishJob.objects.create(
            id=job_id,
            user=request.user,
            status=PublishJobStatus.QUEUED,
            staging_dir=prefix,
            course_name=data["name"],
            curriculum_id=data["curriculum"].id,
            curriculum_name=data["curriculum"].name,
        )
        publish_course_task.apply_async(
            args=[str(job.id)],
            task_id=str(job.id),
            queue="encode",
        )

        log_event(
            logger,
            logging.INFO,
            LogEvent.PUBLISH_JOB_QUEUED,
            "Publish job queued",
            **publish_fields(
                job_id=str(job.id),
                user_id=request.user.id,
                email=request.user.email,
                video_count=len(data["videos"]),
                curriculum_id=data["curriculum"].id,
            ),
        )

기존 Celery 브로커 설정은 다음과 같았습니다.

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "global_keyprefix": os.environ.get(
        "CELERY_BROKER_KEY_PREFIX",
        "class_s:celery:",
    ),
}

여기서 문제는 visibility_timeout이 없다는 점이었습니다. visibility_timeout은 워커가 작업을 가져간 뒤 ack하기 전까지 브로커가 그 메시지를 unacked로 유지하는 시간입니다. 이 시간 안에 ack가 오지 않으면 브로커는 메시지가 유실되었다고 판단하고 다시 요청을 보냅니다. 이 설정의 기본값이 1시간이었습니다. docs.celery#visibility_timeout

이 1시간이라는 수치가 위에서 received가 여러 번 발생한 시간 차이와 같다는 점을 보면, 원인임을 알 수 있었습니다. 하지만 워커가 한 번에 하나만 점유하는데 비슷한 시간대로 여러 번 찍힌 것은 설명이 되지 않았습니다.

이 지점은 worker_prefetch_multiplier가 기본값 4인 점으로 설명이 가능했습니다. worker_prefetch_multiplier는 브로커가 미리 워커에 작업을 적재한 순간부터 visibility_timeout 시간이 흐르기 때문에, 위 현상을 설명할 수 있었습니다. docs.celery#worker_prefetch_multiplier

해결

다음과 같이 설정을 수정했습니다.

  • visibility_timeout: 4시간
  • prefetch: 1개
  • task time limit: 4시간 (visibility_timeout과 논리적으로 동일하게)

그리고 서비스 로직 측면에서는 다음을 추가해 안정성을 강화했습니다.

  • COMPLETED 상태를 FAILED로 덮어쓰지 않도록 방지
  • 이미 완료된 작업은 재시도되지 않도록 함 (run_publish_job 시작 시 status == COMPLETED이면 즉시 return)
  • NoSuchKey 등은 retry 대상에서 제외하고 예외 처리 메시지를 추가함

문제 2: 영원히 상태가 처리 중

업로드가 다음 날이 되어도 “처리 중"으로 고정되어 있었습니다.
하지만 서버 상황을 보면 Celery가 CPU를 거의 점유하지 않아, 인코딩 작업이 진행되지 않는 것을 확인할 수 있었습니다.

CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O       PIDS
73640d6083e6   class_s-nginx-1      0.00%     3.156MiB / 7.601GiB   0.04%     1.62GB / 1.6GB    4.1kB / 775MB   3
23677574c8d7   class_s-celery-1     0.06%     145.8MiB / 7.601GiB   1.87%     1.31GB / 677MB    0B / 846MB      2
03d4f9c444c6   class_s-backend-1    0.04%     526.8MiB / 7.601GiB   6.77%     1.59GB / 1.58GB   0B / 7.44MB     5
a4d13fa09ce8   class_s-frontend-1   0.00%     58.67MiB / 7.601GiB   0.75%     3.02MB / 8.95MB   0B / 0B         11
f626fa3a9812   class_s-postgres-1   0.00%     28.59MiB / 7.601GiB   0.37%     11.5MB / 16.4MB   0B / 2.34MB     9
2b252bc54deb   class_s-redis-1      0.41%     3.477MiB / 7.601GiB   0.04%     62.9MB / 47.6MB   4.1kB / 889kB   6

processing 상태에서 멈췄지만 에러 로그가 발생하지 않는, 의도 밖의 상황이 발생했습니다.
현재 정황상으로는 다음 로직까지는 실행된 것을 알 수 있었습니다.


# 워커 실행
def run_publish_job(job_id: UUID) -> None:


# 인코딩 작업 전 처리 시작을 알림
job.status = PublishJobStatus.PROCESSINGjob.save(update_fields=["status", "updated_at"])set_publish_progress(    job_id,    progress_percent=0,    current_step="",    step_label="처리 시작...",)

로그를 보면 스테이징에 올라가고 인코딩 직전까지만 실행된 것을 확인할 수 있습니다. (여러 번 찍힌 것은 문제 1과 동일한 이슈입니다.)

/app/logs/app.log:{"timestamp": "2026-06-15T17:38:17.576518+09:00", "level": "INFO", "logger": "media.publish", "message": "Staging payload saved", "event": "staging_save_completed", "request_id": "47144c7f58c04d8f", "curriculum_id": 7, "video_count": 1, "thumbnail_count": 0, "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/app.log:{"timestamp": "2026-06-15T17:38:17.580829+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job queued", "event": "publish_job_queued", "request_id": "47144c7f58c04d8f", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "user_id": 2, "email": "user@example.com", "curriculum_id": 7, "video_count": 1}
/app/logs/app.log:{"timestamp": "2026-06-15T19:13:57.826072+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/app.log:{"timestamp": "2026-06-15T19:13:57.987056+09:00", "level": "INFO", "logger": "media.publish", "message": "Staged inputs prepared", "event": "staging_prepare_completed", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "video_count": 1, "thumbnail_count": 0, "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/app.log:{"timestamp": "2026-06-15T21:31:10.868951+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/app.log:{"timestamp": "2026-06-15T21:31:11.060713+09:00", "level": "INFO", "logger": "media.publish", "message": "Staged inputs prepared", "event": "staging_prepare_completed", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "video_count": 1, "thumbnail_count": 0, "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/celery.log:{"timestamp": "2026-06-15T17:38:17.576617+09:00", "level": "INFO", "logger": "media.publish", "message": "Staging payload saved", "event": "staging_save_completed", "request_id": "47144c7f58c04d8f", "curriculum_id": 7, "video_count": 1, "thumbnail_count": 0, "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/celery.log:{"timestamp": "2026-06-15T17:38:17.580914+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job queued", "event": "publish_job_queued", "request_id": "47144c7f58c04d8f", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "user_id": 2, "email": "user@example.com", "curriculum_id": 7, "video_count": 1}
/app/logs/celery.log:{"timestamp": "2026-06-15T19:13:57.826186+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/celery.log:{"timestamp": "2026-06-15T19:13:57.987138+09:00", "level": "INFO", "logger": "media.publish", "message": "Staged inputs prepared", "event": "staging_prepare_completed", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "video_count": 1, "thumbnail_count": 0, "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/celery.log:{"timestamp": "2026-06-15T21:31:10.869064+09:00", "level": "INFO", "logger": "media.publish", "message": "Publish job started", "event": "publish_job_started", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "user_id": 2, "email": "user@example.com", "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}
/app/logs/celery.log:{"timestamp": "2026-06-15T21:31:11.060837+09:00", "level": "INFO", "logger": "media.publish", "message": "Staged inputs prepared", "event": "staging_prepare_completed", "job_id": "0e460663-1de7-4813-a128-5895d3265d70", "video_count": 1, "thumbnail_count": 0, "staging_dir": "staging/0e460663-1de7-4813-a128-5895d3265d70"}

여기서 의심할 수 있는 점은 인코딩 자체의 타임아웃 때문에 서비스 로직이 실행되지 않았을 가능성입니다. 로그를 보면 알 수 있듯이, 1차 시도 후 2시간이 지나 타임아웃이 발생하여 인코딩이 중지된 것입니다.

  • 19:13:57: 1차 publish_job_started 0
  • 21:13:57: time_limit 2h 도달 +2시간
  • 21:31:10: 2차 publish_job_started +2h 17m

해결

타임 리미트를 4시간으로 변경했습니다.

CELERY_TASK_TIME_LIMIT = 60 * 60 * 2  # 2시간 (인코딩 대비)
=> 
CELERY_PUBLISH_MAX_SECONDS = 60 * 60 * 4  # encode 상한 4h
CELERY_TASK_TIME_LIMIT = CELERY_PUBLISH_MAX_SECONDS

정리

두 이슈 모두 긴 HLS 인코딩에 비해 visibility_timeout, task_time_limit, prefetch 설정이 짧았던 것에서 발생했습니다. 문제 1은 브로커 재배달과 Celery retry가 겹치며 DB 상태가 깨졌고, 문제 2는 time_limit 강제 종료로 processing이 남았습니다.