ci: Run EUnit tests in parallel on CI workers#5914

Draft
big-r81 wants to merge 1 commit into main from ci/default-parallel-eunit

Contributor

big-r81 commented Mar 7, 2026

No description provided.

@big-r81 big-r81 force-pushed the ci/default-parallel-eunit branch 3 times, most recently from 09d9eb3 to 5b42dd3 Compare March 7, 2026 09:43
@big-r81 big-r81 marked this pull request as ready for review March 7, 2026 11:25
Contributor

nickva commented Mar 7, 2026

Looks great, much faster.

I saw a few retries and gathered them all below. In one case a run failed twice in a row, so we had 2 retries and the 3rd one passed (on noble):

[2026-03-07T09:58:43.470Z]     couch_replicator_scheduler_job_tests:79: -scheduler_job_main_db_test_/0-fun-0- (t_replicator_with_checkpoint_and_since_seq)...*failed*
[2026-03-07T09:58:43.470Z] in function couch_replicator_scheduler_job_tests:scheduler_docs_id/2 (test/eunit/couch_replicator_scheduler_job_tests.erl, line 314)
[2026-03-07T09:58:43.470Z] in call from couch_replicator_scheduler_job_tests:persistent_replicate/2 (test/eunit/couch_replicator_scheduler_job_tests.erl, line 270)
[2026-03-07T09:58:43.470Z] in call from couch_replicator_scheduler_job_tests:t_replicator_with_checkpoint_and_since_seq/1 (test/eunit/couch_replicator_scheduler_job_tests.erl, line 205)
[2026-03-07T09:58:43.470Z] in call from eunit_test:run_testfun/1 (eunit_test.erl, line 71)
[2026-03-07T09:58:43.470Z] in call from eunit_proc:run_test/1 (eunit_proc.erl, line 543)
[2026-03-07T09:58:43.470Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 368)
[2026-03-07T09:58:43.470Z] in call from eunit_proc:handle_test/2 (eunit_proc.erl, line 526)
[2026-03-07T09:58:43.470Z] in call from eunit_proc:tests_inorder/3 (eunit_proc.erl, line 468)
[2026-03-07T09:58:43.470Z] **error:{badmatch,[]}
[2026-03-07T09:58:43.470Z]   output:<<"">>
[2026-03-07T09:54:02.738Z]     chttpd_db_test:71: -all_test_/0-fun-18- (t_not_change_db_proper_after_rewriting_shardmap)...*failed*
[2026-03-07T09:54:02.738Z] in function chttpd_db_test:t_not_change_db_proper_after_rewriting_shardmap/1 (test/eunit/chttpd_db_test.erl, line 227)
[2026-03-07T09:54:02.738Z] in call from eunit_test:run_testfun/1 (eunit_test.erl, line 71)
[2026-03-07T09:54:02.738Z] in call from eunit_proc:run_test/1 (eunit_proc.erl, line 543)
[2026-03-07T09:54:02.738Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 368)
[2026-03-07T09:54:02.738Z] in call from eunit_proc:handle_test/2 (eunit_proc.erl, line 526)
[2026-03-07T09:54:02.738Z] in call from eunit_proc:tests_inorder/3 (eunit_proc.erl, line 468)
[2026-03-07T09:54:02.738Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 358)
[2026-03-07T09:54:02.738Z] in call from eunit_proc:run_group/2 (eunit_proc.erl, line 582)
[2026-03-07T09:54:02.738Z] **error:{badmatch,{412,
[2026-03-07T09:54:02.738Z]            #{<<"error">> => <<"file_exists">>,
[2026-03-07T09:54:02.738Z]              <<"reason">> =>
[2026-03-07T09:54:02.738Z]                  <<"The database could not be created, the file already exis"...>>}}}
[2026-03-07T09:54:02.738Z]   output:<<"">>
[2026-03-07T09:54:02.738Z] 
[2026-03-07T09:54:02.738Z]     undefined
[2026-03-07T09:54:02.738Z]     *** context cleanup failed ***
[2026-03-07T09:54:02.738Z] **in function chttpd_db_test:delete_db/1 (test/eunit/chttpd_db_test.erl, line 333)
[2026-03-07T09:54:02.738Z] in call from chttpd_db_test:teardown/1 (test/eunit/chttpd_db_test.erl, line 37)
[2026-03-07T09:54:02.738Z] **error:{failed_to_delete_test_db,<<"eunit-test-db-96f8f9da86c153bab3340f084b1f4642">>,
[2026-03-07T09:54:02.738Z]                           {404,
[2026-03-07T09:54:02.738Z]                            #{<<"error">> => <<"not_found">>,
[2026-03-07T09:54:02.738Z]                              <<"reason">> => <<"Database does not exist.">>}}}
[2026-03-07T10:02:39.711Z] module 'couch_index_crash_tests'
[2026-03-07T10:02:39.711Z]   couch_index_crash_tests: db_event_crash_test...ok
[2026-03-07T10:02:39.711Z]   Simulate index crashing
[2026-03-07T10:02:39.711Z]     couch_index_crash_tests:72: -index_crash_test_/0-fun-8- (t_can_open_mock_index)...[0.112 s] ok
[2026-03-07T10:02:39.711Z]     couch_index_crash_tests:73: -index_crash_test_/0-fun-6- (t_index_open_returns_error)...*failed*
[2026-03-07T10:02:39.711Z] in function couch_index_crash_tests:t_index_open_returns_error/1 (test/eunit/couch_index_crash_tests.erl, line 128)
[2026-03-07T10:02:39.711Z] in call from eunit_test:run_testfun/1 (eunit_test.erl, line 71)
[2026-03-07T10:02:39.711Z] in call from eunit_proc:run_test/1 (eunit_proc.erl, line 543)
[2026-03-07T10:02:39.711Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 368)
[2026-03-07T10:02:39.711Z] in call from eunit_proc:handle_test/2 (eunit_proc.erl, line 526)
[2026-03-07T10:02:39.711Z] in call from eunit_proc:tests_inorder/3 (eunit_proc.erl, line 468)
[2026-03-07T10:02:39.711Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 358)
[2026-03-07T10:02:39.711Z] in call from eunit_proc:run_group/2 (eunit_proc.erl, line 582)
[2026-03-07T10:02:39.711Z] **error:{assert,[{module,couch_index_crash_tests},
[2026-03-07T10:02:39.711Z]          {line,128},
[2026-03-07T10:02:39.711Z]          {expression,"meck : called ( couch_index_server , handle_call , [ { async_error , '_' , '_' } , '_' , '_' ] )"},
[2026-03-07T10:02:39.711Z]          {expected,true},
[2026-03-07T10:02:39.711Z]          {value,false}]}
[2026-03-07T10:02:39.711Z]   output:<<"">>
[2026-03-07T10:00:27.781Z]     rexi_tests:35: -rexi_buffer_test_/0-fun-6- (t_kill)...*failed*
[2026-03-07T10:00:27.781Z] in function rexi_tests:t_kill/1 (test/rexi_tests.erl, line 200)
[2026-03-07T10:00:27.781Z] in call from eunit_test:run_testfun/1 (eunit_test.erl, line 71)
[2026-03-07T10:00:27.781Z] in call from eunit_proc:run_test/1 (eunit_proc.erl, line 543)
[2026-03-07T10:00:27.781Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 368)
[2026-03-07T10:00:27.781Z] in call from eunit_proc:handle_test/2 (eunit_proc.erl, line 526)
[2026-03-07T10:00:27.781Z] in call from eunit_proc:tests_inorder/3 (eunit_proc.erl, line 468)
[2026-03-07T10:00:27.781Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 358)
[2026-03-07T10:00:27.781Z] in call from eunit_proc:run_group/2 (eunit_proc.erl, line 582)
[2026-03-07T10:00:27.781Z] **error:{assertEqual,[{module,rexi_tests},
[2026-03-07T10:00:27.781Z]               {line,200},
[2026-03-07T10:00:27.781Z]               {expression,"KillReason"},
[2026-03-07T10:00:27.781Z]               {expected,killed},
[2026-03-07T10:00:27.781Z]               {value,noproc}]}
[2026-03-07T10:00:27.781Z]   output:<<"">>
[2026-03-07T09:56:20.066Z]     couch_replicator_scheduler_job_tests:91: -scheduler_job_prefixed_db_test_/0-fun-0- (t_replicator_with_checkpoint_and_since_seq)...*failed*
[2026-03-07T09:56:20.066Z] in function couch_replicator_scheduler_job_tests:t_replicator_with_checkpoint_and_since_seq/1 (test/eunit/couch_replicator_scheduler_job_tests.erl, line 194)
[2026-03-07T09:56:20.066Z] in call from eunit_test:run_testfun/1 (eunit_test.erl, line 71)
[2026-03-07T09:56:20.066Z] in call from eunit_proc:run_test/1 (eunit_proc.erl, line 543)
[2026-03-07T09:56:20.066Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 368)
[2026-03-07T09:56:20.066Z] in call from eunit_proc:handle_test/2 (eunit_proc.erl, line 526)
[2026-03-07T09:56:20.066Z] in call from eunit_proc:tests_inorder/3 (eunit_proc.erl, line 468)
[2026-03-07T09:56:20.066Z] in call from eunit_proc:with_timeout/3 (eunit_proc.erl, line 358)
[2026-03-07T09:56:20.066Z] in call from eunit_proc:run_group/2 (eunit_proc.erl, line 582)
[2026-03-07T09:56:20.066Z] **error:{assertEqual,[{module,couch_replicator_scheduler_job_tests},
[2026-03-07T09:56:20.066Z]               {line,194},
[2026-03-07T09:56:20.066Z]               {expression,"RepId2"},
[2026-03-07T09:56:20.066Z]               {expected,<<"bc40164a602ecc5225f7674db55d74fa">>},
[2026-03-07T09:56:20.066Z]               {value,null}]}
[2026-03-07T09:56:20.066Z]   output:<<"">>

They don't seem to share any repeated pattern; they are all in different tests.

What I haven't looked at is whether these happen about as often in non-parallel runs. If they are worse with parallel tests, we could perhaps try -j2 first and see if we get fewer of them.
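To compare how often tests fail between a parallel and a non-parallel run, it may help to tally the failing tests from each console log instead of reading the traces by eye. The following is a minimal sketch, assuming the log lines keep the shape shown in the excerpt above (`module:line: -generator/0-fun-N- (test_name)...*failed*`); the function names and the regular expression are illustrative and not part of the CouchDB tooling.

```python
import re
from collections import Counter

# Matches EUnit result lines such as
#   some_module:71: -all_test_/0-fun-18- (t_some_test)...*failed*
# The optional Jenkins timestamp prefix is simply skipped by the search.
FAILED_RE = re.compile(
    r"(?P<module>\w+):(?P<line>\d+): "
    r"-(?P<generator>\w+)/\d+-fun-\d+- "
    r"\((?P<test>\w+)\)\.\.\.\*failed\*"
)


def failed_tests(log_text):
    """Return (module, generator, test) for every line reported as *failed*."""
    return [
        (m["module"], m["generator"], m["test"])
        for m in FAILED_RE.finditer(log_text)
    ]


def tally_failures(log_text):
    """Count failures per (module, test), ignoring which generator ran them."""
    return Counter(
        (module, test) for module, _generator, test in failed_tests(log_text)
    )


# Two lines taken from the excerpt above: one failure and one passing test.
sample = """\
[2026-03-07T10:00:27.781Z]     rexi_tests:35: -rexi_buffer_test_/0-fun-6- (t_kill)...*failed*
[2026-03-07T10:02:39.711Z]     couch_index_crash_tests:72: -index_crash_test_/0-fun-8- (t_can_open_mock_index)...[0.112 s] ok
"""

for (module, test), count in tally_failures(sample).most_common():
    print(f"{count}  {module}:{test}")
```

Running the same tally over the console output of a parallel and a non-parallel run would give two counters that can be compared directly.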

@big-r81 big-r81 marked this pull request as draft March 8, 2026 12:32
@big-r81 big-r81 force-pushed the ci/default-parallel-eunit branch from 5b42dd3 to c6bf8c9 Compare March 8, 2026 12:39
Contributor Author

big-r81 commented Mar 8, 2026

> What I haven't looked at is whether these happen about as often in non-parallel runs. If they are worse with parallel tests, we could perhaps try -j2 first and see if we get fewer of them.

I cloned the noble entry (jammy excluded) with a non-parallel EUnit setting, so that we have a direct comparison of the two runs. Maybe that gives us some hints ...

@big-r81 big-r81 force-pushed the ci/default-parallel-eunit branch from c6bf8c9 to 56e966d Compare March 9, 2026 00:02
INTERMEDIATE_ERLANG_VERSION = '27.3.4.8'

// Default GNU Make Eunit Options for supported platforms
DEFAULT_GNU_MAKE_EUNIT_OPTS = '-j4 --output-sync=target'
Contributor

What do you think about just trying -j2 first? If that gives us a good enough boost without too many retries, that can be a good first step. Then we can try moving on to -j3 and -j4 later.
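A rough way to weigh "boost versus retries" is a back-of-the-envelope model of the expected stage time. The sketch below assumes that each attempt fails independently with a fixed probability and that every attempt costs a full run; the function name, the retry cap of 3 and all numbers in the usage lines are hypothetical and are not measurements from this PR.

```python
def expected_stage_minutes(run_minutes, fail_prob, max_attempts=3):
    """Expected wall time of a stage that is retried until it passes.

    Assumes attempts fail independently with probability fail_prob and
    that the stage gives up after max_attempts attempts.
    """
    if not 0.0 <= fail_prob <= 1.0:
        raise ValueError("fail_prob must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 is only executed if the previous k attempts all failed,
    # which happens with probability fail_prob ** k.
    expected_attempts = sum(fail_prob ** k for k in range(max_attempts))
    return run_minutes * expected_attempts


# Hypothetical comparison: a faster but flakier setting against a slower,
# more stable one. Neither pair of numbers comes from the CI results.
print(expected_stage_minutes(run_minutes=10, fail_prob=0.5))
print(expected_stage_minutes(run_minutes=14, fail_prob=0.1))
```

Under this model a higher -j value only pays off if the time saved per run is larger than the extra time spent on the additional retries it causes.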

@big-r81 big-r81 force-pushed the ci/default-parallel-eunit branch 3 times, most recently from b6d2da8 to 61be08e Compare March 12, 2026 19:03
@big-r81 big-r81 force-pushed the ci/default-parallel-eunit branch from 61be08e to 4fea8a6 Compare March 12, 2026 19:57
Contributor

nickva commented Mar 12, 2026

Thanks for taking a look at this, Ronny.

I am looking at the results and I wonder if it's just some hardware nodes that are the problem:

https://ci-couchdb.apache.org/job/jenkins-cm1/job/PullRequests/job/PR-5914/14/stages/?selected-node=952

In this case it's ubuntu-fra1-10, where there are 2 failures in a row and then a pass, for a total of 50m, and it becomes the longest pole in the tent, so to speak. The EUnit runs there take 15m with the -j flag and only 9-10m on the other nodes. I guess maybe we should just disable that node, but only if we have enough nodes to run the CI in one go (without having to wait for nodes to free up).
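The numbers above are roughly consistent with each other: three attempts of about 15m each come to about 45m, which accounts for most of the 50m total. To spot such nodes without clicking through every stage, one could flag nodes whose EUnit time is well above the median. This is only a sketch: the `slow_nodes` helper and the 1.4 threshold are assumptions, and apart from ubuntu-fra1-10 and its 15m figure the node names and timings are made up for illustration.

```python
from statistics import median


def slow_nodes(eunit_minutes, factor=1.4):
    """Return the nodes whose EUnit run time exceeds factor x the median."""
    typical = median(eunit_minutes.values())
    return sorted(
        node
        for node, minutes in eunit_minutes.items()
        if minutes > factor * typical
    )


# Only ubuntu-fra1-10 and its 15m run time come from the comment above;
# the other entries are placeholders in the reported 9-10m range.
timings = {
    "ubuntu-fra1-10": 15.0,
    "node-a": 9.0,
    "node-b": 9.5,
    "node-c": 10.0,
}

print(slow_nodes(timings))
```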
