Skip to content

SAMZA-2804: Concurrency issues identified in run-class.sh on samza-yarn #1716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 2, 2025

Conversation

Image for: Conversation
Copy link
Contributor

bringhurst commented May 1, 2025

This PR contains fixes to several concurrency related issues when run-class.sh and run-framework-class.sh is used on samza-yarn.

BUG=SAMZA-2804

Details

Image for: Details

Race condition in pathing jar manifest creation

A race condition exists when setting up the classpath during container launch.

During container launch using samza-yarn, run-class.sh creates a pathing jar file (which holds the classpath for the container launch). However, during the creation of this pathing jar, temporary files, as well as the pathing jar itself is not placed in a location unique to the container. This results in multiple containers writing to the same pathing jar location and temporary file location, which results in a race condition.

This race condition may show up in several ways, such as when Yarn removes jars from a finished container (other containers will point to a classpath which no longer exists) or when multiple run-class.sh scripts attempt to write the manifest.txt or pathing jar at the same time.

Note that host affinity being enabled will make this problem worse. The pathing.jar is written to the usercache, so when the container which created the pathing.jar is finished and removed, any new container which launches on that host will point to jar files which do not exist anymore. When host affinity is enabled, it will not move to a new host and just keep failing.

This has been modified to use a path unique to the container.

Container logging directory fallback is not unique for each container

The fallback log directory is the same among all containers running on the same host.

This has been modified to use a path unique to the container.

Container tmp dir is not unique per-container

The JAVA_TMP_DIR directory is the same for all containers.

This has been modified to use a path unique to the container.

Testing done

Image for: Testing done

Validated that the manifest.txt and pathing.jar are now located in a per-container directory:

user1@host1 [ /<hadoop dir>/usercache/user1/appcache/application_1745988849220_0005/container_e4225_1745988849220_0005_01_000002/classpath_workspace ]$ readlink -f manifest.txt
/<hadoop dir>/usercache/user1/appcache/application_1745988849220_0005/container_e4225_1745988849220_0005_01_000002/classpath_workspace/manifest.txt
user1@host1 [ /<hadoop dir>/usercache/user1/appcache/application_1745988849220_0005/container_e4225_1745988849220_0005_01_000002/classpath_workspace ]$ readlink -f pathing.jar
/<hadoop dir>/usercache/user1/appcache/application_1745988849220_0005/container_e4225_1745988849220_0005_01_000002/classpath_workspace/pathing.jar
vapp0167@lor1-0005517 [ /<hadoop dir>/usercache/user1/appcache/application_1745988849220_0005/container_e4225_1745988849220_0005_01_000002/classpath_workspace ]$

Compared to before the patch, which shows them being written to an application-level package cache directory:

user2@host2 [ /<hadoop dir>/usercache/user2/appcache/application_1742409027443_0250/container_e32_1742409027443_0250_03_000005/__package/classpath_workspace ]$ readlink -f manifest.txt
/<hadoop dir>/usercache/user2/appcache/application_1742409027443_0250/filecache/10/app-1.2.3.tgz/classpath_workspace/manifest.txt
user2@host2 [ /<hadoop dir>/usercache/user2/appcache/application_1742409027443_0250/container_e32_1742409027443_0250_03_000005/__package/classpath_workspace ]$ readlink -f pathing.jar
/<hadoop dir>/usercache/user2/appcache/application_1742409027443_0250/filecache/10/app-1.2.3.tgz/classpath_workspace/pathing.jar
user2@host2 [ /<hadoop dir>/usercache/user2/appcache/application_1742409027443_0250/container_e32_1742409027443_0250_03_000005/__package/classpath_workspace ]$

See also

Image for: See also

An older attempt to solve this issue can be found at #1498, however, that change overlooks that $base_dir leads to __package, which is a symlink to the application level directory (note the use of readlink in the test example above).

bringhurst changed the title Add annotations for each line identified as having a potential issue. May 1, 2025
bringhurst changed the title SAMZA-2804: Add annotations for each line identified as having a potential issue. May 1, 2025
bringhurst force-pushed the bringhurst/SAMZA-2804 branch 5 times, most recently from 0f7f6a8 to 5e611eb Compare May 2, 2025 16:07
## Race condition in pathing jar manifest creation

A race condition exists when setting up the classpath during container launch.

During container launch using samza-yarn, run-class.sh creates a pathing jar file (which holds the classpath for the container launch). However, during the creation of this pathing jar, temporary files, as well as the pathing jar itself is not placed in a location unique to the container. This results in multiple containers writing to the same pathing jar location and temporary file location, which results in a race condition.

This race condition may show up in several ways, such as when Yarn removes jars from a finished container (other containers will point to a classpath which no longer exists) or when multiple run-class.sh scripts attempt to write the manifest.txt or pathing jar at the same time.

Note that host affinity being enabled will make this problem worse. The pathing.jar is written to the usercache, so when the container which created the pathing.jar is finished and removed, any new container which launches on that host will point to jar files which do not exist anymore. When host affinity is enabled, it will not move to a new host and just keep failing.

## Container logging directory fallback is not unique for each container

The fallback log directory is the same among all containers running on the same host. It should be unique per-container.

## Container tmp dir is not unique per-container

The JAVA_TMP_DIR directory is the same for all containers. We should make sure that it's safe to use the same directory for all containers.
bringhurst force-pushed the bringhurst/SAMZA-2804 branch from 5e611eb to 942e0d1 Compare May 2, 2025 16:10
bringhurst marked this pull request as ready for review May 2, 2025 19:56
Copy link
Contributor Author

@@ -97,12 +108,18 @@ else
fi

if [ -z "$SAMZA_LOG_DIR" ]; then
SAMZA_LOG_DIR="$base_dir"
# When on samza-yarn, SAMZA_LOG_DIR will point to the symlink located at:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the comments. Please remove when on samza-yarn prefix, since this script is applicable only for yarn deployments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to simpify.

# file containing the classpath string; used to avoid passing long classpaths directly to the jar command
PATHING_MANIFEST_FILE=$CLASSPATH_WORKSPACE_DIR/manifest.txt

# jar file to include on the classpath for running the main class
PATHING_JAR_FILE=$CLASSPATH_WORKSPACE_DIR/pathing.jar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great if we can log the pathing_jar_file path and the manifest_file path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

# file containing the classpath string; used to avoid passing long classpaths directly to the jar command
PATHING_MANIFEST_FILE=$CLASSPATH_WORKSPACE_DIR/manifest.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to log the location of manifest-file and the pathing-jar-file path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Contributor

shanthoosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for making the changes

shanthoosh merged commit 82c53aa into apache:master May 2, 2025
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants