
AWS SageMaker Inference in Practice


Preparation

  • Environment: AWS SageMaker Notebook (ml.t3.medium)
  • Service: AWS SageMaker Inference (Models, Endpoint configurations, Endpoints)
  • Model repository: Hugging Face Hub models (alternatives include an S3 bucket or fine-tuned models)

Abbreviations

  • TGI: Text Generation Inference

Inference Code

1. Package and environment preparation

!pip install sagemaker --upgrade
!pip install boto3 --upgrade

import sagemaker
import boto3

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models, and logs.
# SageMaker will automatically create this bucket if it does not exist.
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # Fall back to the default bucket if no bucket name is given.
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Configure model info

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker.serverless import ServerlessInferenceConfig

# Hub model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
    'SM_NUM_GPUS': json.dumps(1)
}

# Create the Hugging Face model class.
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.2.3"),
    env=hub,
    role=role,
)

print("huggingface_model", huggingface_model)

3.1 Create the Model, Endpoint configuration, and Endpoint

# # Specify MemorySizeInMB and MaxConcurrency in the serverless config object.
# serverless_config = ServerlessInferenceConfig(
#     memory_size_in_mb=6144, max_concurrency=10,
# )

# # Deploy to a serverless endpoint instead of an instance-backed one.
# predictor = huggingface_model.deploy(
#     serverless_inference_config=serverless_config
# )

# Deploy the model to a SageMaker real-time endpoint backed by an instance.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
print(res)
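
Beyond the bare inputs string, TGI also accepts a parameters object that controls generation. A minimal sketch; the parameter values below are illustrative, not tuned:

# Sketch: pass TGI generation parameters alongside the prompt.
data_with_params = {
    "inputs": "Summarize the review in one sentence: the mesmerizing performances of the leads keep the film grounded.",
    "parameters": {
        "max_new_tokens": 128,  # cap the length of the generated continuation
        "temperature": 0.7,     # sampling temperature; lower is more deterministic
        "do_sample": True
    }
}
print(predictor.predict(data=data_with_params))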

or

3.2 Use an already-created endpoint

# Attach to an existing endpoint to make predictions.
predictor_2 = sagemaker.predictor.Predictor(
    endpoint_name='huggingface-pytorch-tgi-inference-2025-06-26-09-59-09-507',
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),       # encode requests as JSON
    deserializer=sagemaker.deserializers.JSONDeserializer()  # decode JSON responses
)

res_2 = predictor_2.predict(data=data)
print(res_2)
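
The same endpoint can also be invoked with plain boto3, without the SageMaker Python SDK, which is handy from Lambda or other services. A minimal sketch reusing the endpoint name above:

import json
import boto3

# Call the endpoint through the low-level sagemaker-runtime client.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName='huggingface-pytorch-tgi-inference-2025-06-26-09-59-09-507',
    ContentType='application/json',
    Body=json.dumps(data),
)
print(json.loads(response['Body'].read()))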

4. Clean up resources

predictor.delete_model()
predictor.delete_endpoint()
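
If the notebook kernel was restarted and the predictor object is gone, the same resources can be removed by name with boto3. A minimal sketch; the resource names are placeholders to copy from the SageMaker console:

import boto3

sm = boto3.client("sagemaker")
# Delete the endpoint first, then its configuration, then the model.
sm.delete_endpoint(EndpointName='<endpoint-name>')
sm.delete_endpoint_config(EndpointConfigName='<endpoint-config-name>')
sm.delete_model(ModelName='<model-name>')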

Issues

  • Some models are so large that downloading them over the network takes too long, and the endpoint fails to start properly (one workaround is sketched below).
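
One workaround is to stage the model artifacts in S3 and point the container at them with model_data, so the endpoint does not have to pull weights from the Hub at startup. A minimal sketch; the S3 path is a placeholder and assumes a model.tar.gz uploaded beforehand:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Hypothetical S3 location holding a pre-packaged model.tar.gz.
model_from_s3 = HuggingFaceModel(
    model_data='s3://<your-bucket>/models/deepseek-r1-distill-llama-8b/model.tar.gz',
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.2.3"),
    role=role,
)
predictor_3 = model_from_s3.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")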

Resources